SDAccel Introduction and Overview

The SDAccel™ environment provides a framework for developing and delivering FPGA accelerated data center applications using standard programming languages. The SDAccel environment includes a familiar software development flow with an Eclipse-based integrated development environment (IDE), and an architecturally optimizing compiler that makes efficient use of FPGA resources. Developers of accelerated applications use a familiar software programming workflow to take advantage of FPGA acceleration with little or no prior FPGA or hardware design experience. Acceleration kernel developers can use a hardware-centric approach, working through the HLS compiler with standard programming languages to produce a heterogeneous application with both software and hardware components. The software component, or application, is developed in C/C++ with OpenCL™ API calls; the hardware component, or kernel, is developed in C/C++, OpenCL, or RTL. The SDAccel environment accommodates various methodologies, allowing developers to start from either the software component or the hardware component.

Xilinx® FPGAs offer many advantages over traditional CPU/GPU acceleration, including a custom architecture capable of implementing any function that can run on a processor, resulting in better performance at lower power dissipation. To realize the advantages of software acceleration on a Xilinx device, you should look to accelerate large compute-intensive portions of your application in hardware. Implementing these functions in custom hardware gives you an ideal balance between performance and power. The SDAccel environment provides tools and reports to profile the performance of your host application and determine where the opportunities for acceleration are. The tools also provide automated runtime instrumentation of cache, memory, and bus usage to track real-time performance on the hardware.

The SDAccel environment targets acceleration hardware platforms such as the Xilinx Alveo™ U200 and U250 Data Center accelerator cards. These acceleration platforms are designed for computationally intensive applications, such as live video transcoding, data analytics, and artificial intelligence (AI) applications using machine learning. A number of third-party acceleration platforms compatible with the SDAccel environment are also available.

A growing number of FPGA-accelerated libraries are available through the SDAccel environment, such as the Xilinx Machine Learning (ML) suite to optimize and deploy accelerated ML inference applications. Predefined accelerator functions cover targeted applications such as artificial intelligence, with support for many common machine learning frameworks (Caffe, MxNet, and TensorFlow), as well as video processing, encryption, and big data analysis. These predefined accelerator libraries, offered by Xilinx and third-party developers, can be quickly integrated into your accelerated application project to speed development.

Software Acceleration with SDAccel

When compared with processor architectures, the structures that comprise the programmable logic (PL) fabric in a Xilinx FPGA enable a high degree of parallelism in application execution. The custom processing architecture generated by SDAccel for a kernel presents a different execution paradigm from CPU execution, and provides opportunity for significant performance gains. While you can re-target an existing application for acceleration on an FPGA, understanding the FPGA architecture and revising your host and kernel code appropriately will significantly improve performance. Refer to the SDAccel Environment Programmers Guide for more information on writing your host and kernel code, and managing data transfers between them.

CPUs have fixed resources and offer limited opportunities for parallelization of tasks or operations. A processor, regardless of its type, executes a program as a sequence of instructions generated by processor compiler tools, which transform an algorithm expressed in C/C++ into assembly language constructs that are native to the target processor. Even a simple operation, like the addition of two values, results in multiple assembly instructions that must be executed across multiple clock cycles. This is why software engineers spend so much time restructuring their algorithms to increase the cache hit rate and decrease the processor cycles used per instruction.

On the other hand, the FPGA is an inherently parallel processing device capable of implementing any function that can run on a processor. Xilinx FPGAs have an abundance of resources that can be programmed and configured to implement any custom architecture and achieve virtually any level of parallelism. Unlike a processor, where all computations share the same ALU, operations in an FPGA are distributed and executed across a configurable array of processing resources. The FPGA compiler creates a unique circuit optimized for each application or algorithm. The FPGA programming fabric acts as a blank canvas to define and implement your acceleration functions.

The SDAccel compiler exercises the capabilities of the FPGA fabric through the processes of scheduling, pipelining, and dataflow.

Scheduling
The process of identifying the data and control dependencies between different operations to determine when each will execute. The compiler analyzes dependencies between adjacent operations as well as across time, and groups operations to execute in the same clock cycle when possible, or to overlap the function calls as permitted by the dataflow dependencies.
Pipelining
A technique to increase instruction-level parallelism in the hardware implementation of an algorithm by overlapping independent stages of operations or functions. The data dependence in the original software implementation is preserved for functional equivalence, but the required circuit is divided into a chain of independent stages. All stages in the chain run in parallel on the same clock cycle. Pipelining is a fine-grain optimization that eliminates CPU restrictions requiring the current function call or operation to fully complete before the next can begin.
Dataflow
Enables multiple functions implemented in the FPGA to execute in a parallel and pipelined manner instead of sequentially, implementing task-level parallelism. The compiler extracts this level of parallelism by evaluating the interactions between different functions of a program based on their inputs and outputs. In terms of software execution, this transformation applies to parallel execution of functions within a single kernel.
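
A minimal kernel sketch of how these mechanisms can be expressed is shown below, assuming a simple C++ kernel compiled through the Vivado HLS flow. The kernel and function names are hypothetical; the DATAFLOW and PIPELINE pragmas are the standard Vivado HLS directives referenced later in this document, and interface pragmas are omitted for brevity.

    // Hypothetical vector-scale kernel split into three functions connected
    // by streams. DATAFLOW lets read/compute/write run concurrently
    // (task-level parallelism); PIPELINE overlaps successive loop iterations
    // (instruction-level parallelism). Interface pragmas omitted for brevity.
    #include <hls_stream.h>

    static void read_input(const int *in, hls::stream<int> &s, int size) {
      for (int i = 0; i < size; ++i) {
    #pragma HLS PIPELINE II=1
        s.write(in[i]);
      }
    }

    static void compute(hls::stream<int> &in, hls::stream<int> &out, int size) {
      for (int i = 0; i < size; ++i) {
    #pragma HLS PIPELINE II=1
        out.write(in.read() * 2);
      }
    }

    static void write_output(hls::stream<int> &s, int *out, int size) {
      for (int i = 0; i < size; ++i) {
    #pragma HLS PIPELINE II=1
        out[i] = s.read();
      }
    }

    extern "C" void vscale(const int *in, int *out, int size) {
    #pragma HLS DATAFLOW
      hls::stream<int> s_in, s_out;
      read_input(in, s_in, size);
      compute(s_in, s_out, size);
      write_output(s_out, out, size);
    }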

Another advantage of a Xilinx FPGA is the ability to be dynamically reconfigured. Much like loading a compiled program into a processor, the FPGA can be reconfigured during runtime to re-purpose its resources and implement additional kernels as the accelerated application runs. This allows a single SDAccel accelerator board to provide acceleration for multiple functions within an application, either sequentially or concurrently.

Execution Model of an SDAccel Application

The SDAccel environment is designed to provide a simplified development experience for FPGA-based software acceleration platforms. The general structure of the acceleration platform is shown in the following figure.

Figure: Architecture of an SDAccel Application



The custom application runs on the host x86 server and is written in C/C++ using OpenCL API calls to interact with the FPGA accelerators. The custom kernels run within a Xilinx FPGA, with the Xilinx runtime (XRT) managing the interactions between the host application and the accelerator. Communication between the host x86 machine and the accelerator board occurs across the PCIe bus.

The SDAccel hardware platform contains global memory banks. The data transfer between the host machine and kernels, in either direction, occurs through these global memory banks. The kernels running on the FPGA can have one or more memory interfaces. The connection from the memory banks to those memory interfaces is programmable and determined by linking options of the compiler.

The SDAccel execution model follows these steps (a host-code sketch illustrating them appears after the list):

  1. The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface.
  2. The host application programs the kernel with its input parameters.
  3. The host application triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading and writing data from global memory, as necessary.
  5. The kernel writes data back to the memory banks, and notifies the host that it has completed its task.
  6. The host application reads data back from global memory into the host memory space, and continues processing as needed.
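
The host-code fragment below sketches these steps using standard OpenCL API calls. It assumes the command queue, kernel, and device buffers were created by the usual OpenCL setup code; all names and sizes are hypothetical and error checking is omitted.

    // Sketch of the SDAccel execution model from the host side (OpenCL C API).
    // Assumes cl_command_queue q, cl_kernel krnl, and cl_mem in_buf/out_buf
    // were created earlier; host_in/host_out are ordinary host arrays.
    #include <CL/cl.h>

    void run_once(cl_command_queue q, cl_kernel krnl,
                  cl_mem in_buf, cl_mem out_buf,
                  const int *host_in, int *host_out, size_t bytes, int size) {
      // 1. Write the input data into device global memory (over PCIe).
      clEnqueueWriteBuffer(q, in_buf, CL_TRUE, 0, bytes, host_in, 0, NULL, NULL);

      // 2. Program the kernel with its input parameters.
      clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf);
      clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf);
      clSetKernelArg(krnl, 2, sizeof(int), &size);

      // 3. Trigger execution of the kernel function on the FPGA.
      clEnqueueTask(q, krnl, 0, NULL, NULL);

      // 4./5. The kernel reads and writes global memory and signals completion.
      // 6. Read the results back into host memory once the kernel has finished.
      clEnqueueReadBuffer(q, out_buf, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);
      clFinish(q);
    }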

The FPGA can accommodate multiple kernel instances at one time; these can be different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.

SDAccel Build Process

The SDAccel environment offers all of the features of a standard software development environment:

  • Optimized compiler for host applications
  • Cross-compilers for the FPGA
  • Robust debugging environment to help identify and resolve issues in the code
  • Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software and hardware elements of the project. As shown in the following figure, the host application is built through one process using the standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler.

Figure: Software/Hardware Build Process



  1. Host application build process using GCC:
    • Each host application source file is compiled to an object file (.o).
    • The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe).
  2. The FPGA build process is highlighted in the following figure:
    • Each kernel is independently compiled to a Xilinx object (.xo) file.
      • C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
      • RTL kernels are compiled using the package_xo utility. The RTL kernel wizard in the SDAccel environment can be used to simplify this process.
    • The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
      • When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents.
      • When the build target is the system (actual hardware), xocc generates the FPGA binary for the device, leveraging the Vivado Design Suite to run synthesis and implementation.

Figure: FPGA Build Process



Note: The xocc compiler automatically uses the Vivado HLS and Vivado Design Suite tools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccel environment and the xocc compiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.

Build Targets

The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of the FPGA binary generated by the build process.

The SDAccel tool provides three different build targets: two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary:

Software Emulation (sw_emu)
Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system.
Hardware Emulation (hw_emu)
The kernel code is compiled into a hardware model (RTL) which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
System (hw)
The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.
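
Regardless of the target, the host application loads the resulting FPGA binary at run time through the standard OpenCL API, typically with clCreateProgramWithBinary. The sketch below shows that step; the file path, kernel name, and error handling are simplified and hypothetical. When running an emulation target, the runtime is normally directed to it through the XCL_EMULATION_MODE environment variable rather than by changing the host code.

    // Load the .xclbin produced by the build and create a kernel from it.
    // Assumes a context and device were already obtained for the Xilinx
    // platform; the file path and kernel name "vscale" are hypothetical.
    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    cl_kernel load_kernel(cl_context ctx, cl_device_id dev, const char *xclbin_path) {
      // Read the FPGA binary from disk.
      FILE *f = fopen(xclbin_path, "rb");
      fseek(f, 0, SEEK_END);
      size_t size = (size_t)ftell(f);
      rewind(f);
      unsigned char *binary = (unsigned char *)malloc(size);
      fread(binary, 1, size, f);
      fclose(f);

      // Create the program from the precompiled binary (not from source).
      cl_int err;
      const unsigned char *bins[1] = { binary };
      cl_program program = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                     bins, NULL, &err);
      clBuildProgram(program, 1, &dev, NULL, NULL, NULL);

      // Create the kernel object used to enqueue work on the FPGA.
      cl_kernel krnl = clCreateKernel(program, "vscale", &err);
      free(binary);
      return krnl;
    }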

SDAccel Design Methodology

The SDAccel environment supports two primary use cases:

Software-Centric Design
This software-centric approach focuses on improving the performance of an application written by software programmers, by accelerating compute intensive functions or bottlenecks identified while profiling the application.
Hardware-Centric Design
The acceleration kernel developer creates an optimized kernel that may be called as a library element by the application developer. Kernel languages are not specific to the methodology: a software-centric flow can also use C/C++, OpenCL, or RTL for kernels. The main differences between the two approaches are the starting point (software application or kernels) and the emphasis that comes with it.

The two use cases can be combined, allowing teams of software and hardware developers to define accelerator kernels and develop applications to use them. This combined methodology involves different components of the application, developed by different people, potentially from different companies. You can leverage predefined kernel libraries available for use in your accelerated application, or develop all the acceleration functions within your own team.

Software-Centric Design

The software-centric approach to accelerated application development, or acceleration kernel development, uses code written as a standard software program, with some attention to the specific architecture of the code. For more information, see the SDAccel Environment Profiling and Optimization Guide. The software development flow typically uses the following steps:

  1. Profile application: Baseline the application in terms of functionalities and performance and isolate functions to be accelerated in hardware.

    Functions that consume the most execution time are good candidates to be offloaded and accelerated onto FPGAs.

  2. Code the desired kernel(s): Convert functions to OpenCL C or C/C++ kernels without any optimization.

    The application code calling these kernels will also need to be converted to use OpenCL APIs for data movement and task scheduling.

  3. Verify functionality, iterate as needed: Run software emulation to ensure functional correctness. Run hardware emulation to generate host and kernel profiling data including:
    • Estimated FPGA resource usage (non-RTL)
    • Overall application performance
    • Visual timeline showing host calls and kernel start/stop times
  4. Optimize for performance, iterate as needed: Use the various compilation reports and profiling data generated during hardware emulation and system runs to guide your optimization effort. Common optimization objectives include:
    • Optimize data movement from the host to/from global memory, and data movement from global memory to/from the kernel (a host-code sketch illustrating overlapped transfers follows this list).
    • Maximize parallelism across software requests.
    • Maximize parallelism across multiple kernels.
    • Maximize task and instruction level parallelism within kernels.
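
As an illustration of the first two objectives, the fragment below sketches how non-blocking transfers and events can overlap the data movement of one request with the computation of another. It assumes an out-of-order command queue and two independent buffer pairs created earlier; all names are hypothetical and error checking is omitted.

    // Two independent requests on an out-of-order queue. Each request chains
    // input transfer -> kernel -> output transfer through events, so the two
    // chains can overlap (e.g. request 1's transfer during request 0's compute).
    #include <CL/cl.h>

    void run_two_requests(cl_command_queue ooo_q, cl_kernel krnl,
                          cl_mem in_buf[2], cl_mem out_buf[2]) {
      for (int r = 0; r < 2; ++r) {
        cl_event in_done, krnl_done;

        // Non-blocking migration of this request's input to device memory.
        clEnqueueMigrateMemObjects(ooo_q, 1, &in_buf[r], 0, 0, NULL, &in_done);

        // Kernel arguments are captured at enqueue time, so the same kernel
        // object can be re-used for the second request.
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf[r]);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf[r]);
        clEnqueueTask(ooo_q, krnl, 1, &in_done, &krnl_done);

        // Non-blocking migration of the results back to host memory.
        clEnqueueMigrateMemObjects(ooo_q, 1, &out_buf[r],
                                   CL_MIGRATE_MEM_OBJECT_HOST,
                                   1, &krnl_done, NULL);

        clReleaseEvent(in_done);
        clReleaseEvent(krnl_done);
      }
      clFinish(ooo_q);  // wait for both requests to complete
    }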

Hardware-Centric Design

The hardware-centric flow focuses first on developing and optimizing the kernel(s) and typically leverages advanced FPGA design techniques. For more information, see the SDAccel Environment Profiling and Optimization Guide. The hardware-centric development flow typically uses the following steps:

  1. Baseline the application in terms of functionalities and performance and isolate functions to be accelerated in hardware.
  2. Estimate cycle budgets and performance requirements to define accelerator architecture and interfaces.
  3. Develop accelerator.
  4. Verify functionality and performance. Iterate as needed.
  5. Optimize timing and resource utilization. Iterate as needed.
  6. Import kernel into SDAccel.
  7. Develop sample host code to test with a dummy kernel having the same interfaces as the actual kernel.
  8. Verify kernel works correctly with host code using hardware emulation, or running on actual hardware. Iterate as needed.
  9. Use the Activity Timeline, Profile Summary, and timers in the source code to measure performance and optimize the host code. Iterate as needed.

Best Practices for Acceleration with SDAccel

Below are some specific things to keep in mind when developing your application code and hardware function in the SDAccel environment. You can find additional information in the SDAccel Environment Profiling and Optimization Guide.

  • Look to accelerate functions that have a high ratio of compute time to input and output data volume. Compute time can be greatly reduced using FPGA kernels, but data volume adds transfer latency.
  • Accelerate functions that have a self-contained control structure and do not require regular synchronization with the host.
  • Transfer large blocks of data from host to global device memory. One large transfer is more efficient than several smaller transfers. Run a bandwidth test to find the optimal transfer size.
  • Only copy data back to host when necessary. Data written to global memory by a kernel can be directly read by another kernel. Memory resources include PLRAM (small size but fast access with lowest latency), HBM (moderate size and access speed with some latency), and DDR (large size but slow access with high latency).
  • Take advantage of the multiple global memory resources to evenly distribute bandwidth across kernels.
  • Maximize bandwidth usage between kernel and global memory by performing 512-bit wide bursts.
  • Cache data in local memory within the kernels. Accessing local memories is much faster than accessing global memory (a kernel sketch illustrating wide accesses and local caching follows this list).
  • In the host application, use events and non-blocking transactions to launch multiple requests in a parallel and overlapping manner.
  • In the FPGA, use different kernels to take advantage of task-level parallelism and use multiple compute units (CUs) to take advantage of data-level parallelism to execute multiple tasks in parallel and further increase performance.
  • Within the kernels, take advantage of task-level parallelism with dataflow, and instruction-level parallelism with loop unrolling and loop pipelining, to maximize throughput.
  • Some Xilinx FPGAs contain multiple partitions called super logic regions (SLRs). Keep the kernel in the same SLR as the global memory bank that it accesses.
  • Use software and hardware emulation to validate your code frequently to make sure it is functionally correct.
  • Frequently review the SDAccel Guidance report as it provides clear and actionable feedback regarding deficiencies in your project.
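
As an illustration of the wide-burst and local-caching guidelines above, the kernel sketch below reads 512-bit words from global memory into a local buffer, operates on the cached data, and writes the results back. The kernel name and sizes are hypothetical; ap_uint<512> is the arbitrary-precision integer type provided by Vivado HLS, and interface pragmas are again omitted for brevity.

    // Hypothetical kernel: 512-bit accesses to global memory plus on-chip caching.
    #include <ap_int.h>

    #define WORDS 256   // number of 512-bit words processed per call

    extern "C" void wide_copy(const ap_uint<512> *in, ap_uint<512> *out) {
      ap_uint<512> local_buf[WORDS];   // local (on-chip) cache of the input

      // Burst-read 512-bit words from global memory into local memory.
      read_loop: for (int i = 0; i < WORDS; ++i) {
    #pragma HLS PIPELINE II=1
        local_buf[i] = in[i];
      }

      // Operate on the locally cached data (here, a trivial bit-wise transform).
      compute_loop: for (int i = 0; i < WORDS; ++i) {
    #pragma HLS PIPELINE II=1
        local_buf[i] = ~local_buf[i];
      }

      // Burst-write the results back to global memory in 512-bit words.
      write_loop: for (int i = 0; i < WORDS; ++i) {
    #pragma HLS PIPELINE II=1
        out[i] = local_buf[i];
      }
    }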