SDAccel Introduction and Overview

The SDAccel™ environment provides a framework for developing and delivering FPGA-accelerated data center applications using standard programming languages. The SDAccel environment includes a familiar software development flow with an Eclipse-based integrated development environment (IDE), and an architecturally optimizing compiler that makes efficient use of FPGA resources. Developers of accelerated applications use a familiar software programming workflow to take advantage of FPGA acceleration with little or no prior FPGA or hardware design experience. Acceleration kernel developers can use a hardware-centric approach, working through the HLS compiler with standard programming languages to produce a heterogeneous application with both software and hardware components. The software component, or application, is developed in C/C++ with OpenCL™ API calls; the hardware component, or kernel, is developed in C/C++, OpenCL, or RTL. The SDAccel environment accommodates various methodologies, allowing developers to start from either the software component or the hardware component.

Xilinx® FPGAs offer many advantages over traditional CPU/GPU acceleration, including a custom architecture capable of implementing any function that can run on a processor, resulting in better performance at lower power dissipation. To realize the advantages of software acceleration on a Xilinx device, you should look to accelerate large compute-intensive portions of your application in hardware. Implementing these functions in custom hardware gives you an ideal balance between performance and power. The SDAccel environment provides tools and reports to profile the performance of your host application and determine where the opportunities for acceleration are. The tools also provide automated runtime instrumentation of cache, memory, and bus usage to track real-time performance on the hardware.

The SDAccel environment targets acceleration hardware platforms such as the Xilinx Alveo™ Data Center accelerator cards. These acceleration platforms are designed for computationally intensive workloads, such as live video transcoding, data analytics, and artificial intelligence (AI) applications using machine learning. A number of third-party acceleration platforms compatible with the SDAccel environment are also available.

A growing number of FPGA-accelerated libraries are available through the SDAccel environment, such as the Xilinx Machine Learning (ML) suite for optimizing and deploying accelerated ML inference applications. Predefined accelerator functions target applications such as artificial intelligence, with support for common machine learning frameworks including Caffe, MxNet, and TensorFlow, as well as video processing, encryption, and big data analysis. These predefined accelerator libraries, offered by Xilinx and third-party developers, can be quickly integrated into your accelerated application project to speed development.

Software Acceleration with SDAccel

When compared with processor architectures, the structures that comprise the programmable logic (PL) fabric in a Xilinx FPGA enable a high degree of parallelism in application execution. The custom processing architecture generated by SDAccel for a kernel presents a different execution paradigm from CPU execution, and provides opportunity for significant performance gains. While you can re-target an existing application for acceleration on an FPGA, understanding the FPGA architecture and revising your host and kernel code appropriately will significantly improve performance. Refer to the SDAccel Environment Programmers Guide for more information on writing your host and kernel code, and managing data transfers between them.

CPUs have fixed resources and offer limited opportunities for parallelization of tasks or operations. A processor, regardless of its type, executes a program as a sequence of instructions generated by processor compiler tools, which transform an algorithm expressed in C/C++ into assembly language constructs that are native to the target processor. Even a simple operation, like the addition of two values, results in multiple assembly instructions that must be executed across multiple clock cycles. This is why software engineers spend so much time restructuring their algorithms to increase the cache hit rate and decrease the processor cycles used per instruction.

On the other hand, the FPGA is an inherently parallel processing device capable of implementing any function that can run on a processor. Xilinx FPGAs have an abundance of resources that can be programmed and configured to implement any custom architecture and achieve virtually any level of parallelism. Unlike a processor, where all computations share the same ALU, operations in an FPGA are distributed and executed across a configurable array of processing resources. The FPGA compiler creates a unique circuit optimized for each application or algorithm. The FPGA programming fabric acts as a blank canvas to define and implement your acceleration functions.

The SDAccel compiler exercises the capabilities of the FPGA fabric through the processes of scheduling, pipelining, and dataflow.

Scheduling
The process of identifying the data and control dependencies between different operations to determine when each will execute. The compiler analyzes dependencies between adjacent operations as well as across time, and groups operations to execute in the same clock cycle when possible, or to overlap the function calls as permitted by the dataflow dependencies.
Pipelining
A technique to increase instruction-level parallelism in the hardware implementation of an algorithm by overlapping independent stages of operations or functions. The data dependence in the original software implementation is preserved for functional equivalence, but the required circuit is divided into a chain of independent stages. All stages in the chain run in parallel on the same clock cycle. Pipelining is a fine-grain optimization that eliminates CPU restrictions requiring the current function call or operation to fully complete before the next can begin.
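As a sketch of how pipelining is expressed in C/C++ kernel code, the loop below uses the Vivado HLS PIPELINE pragma to request that a new iteration start every clock cycle. The function name and operation are illustrative, not taken from this guide; a standard C++ compiler ignores the unknown pragma, so the same source also runs on the host for functional checking.

```cpp
#include <cstddef>

// Illustrative kernel loop: element-wise multiply-accumulate.
// Under Vivado HLS, the PIPELINE pragma overlaps loop iterations
// so a new iteration can begin each clock cycle (II=1). A plain
// C++ compiler simply ignores the pragma and runs sequentially,
// producing the same results.
void vmac(const int *a, const int *b, int *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] * b[i] + out[i];
    }
}
```

Because the pragma only changes the hardware micro-architecture, not the function's behavior, the pipelined and non-pipelined versions are functionally equivalent, as the definition above requires.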
Dataflow
Enables multiple functions implemented in the FPGA to execute in a parallel and pipelined manner instead of sequentially, implementing task-level parallelism. The compiler extracts this level of parallelism by evaluating the interactions between different functions of a program based on their inputs and outputs. In terms of software execution, this transformation applies to parallel execution of functions within a single kernel.
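A minimal sketch of the dataflow transformation in C/C++ kernel code is shown below. The two stage functions and the top-level name are illustrative; under Vivado HLS, the DATAFLOW pragma lets the stages run concurrently, streaming data through the intermediate buffer, while a plain C++ compiler ignores the pragma and executes them sequentially with identical results.

```cpp
#include <cstddef>

// Illustrative two-stage kernel (names are hypothetical).
static void scale(const int *in, int *tmp, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tmp[i] = in[i] * 2;
}

static void offset(const int *tmp, int *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] + 1;
}

// Top-level function: with DATAFLOW, scale() and offset() form a
// task-level pipeline communicating through tmp instead of waiting
// for each other to finish. Requires n <= 256 in this sketch.
void pipeline_top(const int *in, int *out, std::size_t n) {
#pragma HLS DATAFLOW
    int tmp[256];  // intermediate channel (fixed illustrative depth)
    scale(in, tmp, n);
    offset(tmp, out, n);
}
```

The compiler extracts this parallelism from the producer-consumer relationship between the two functions' inputs and outputs, as described above.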

Another advantage of a Xilinx FPGA is its ability to be dynamically reconfigured. Just as loading a new compiled program changes what a processor executes, reconfiguring the FPGA during runtime can re-purpose its resources to implement additional kernels as the accelerated application runs. This allows a single SDAccel accelerator board to provide acceleration for multiple functions within an application, either sequentially or concurrently.

SDAccel Execution Model

In the SDAccel framework, an application program is split between a host application and hardware accelerated kernels, with a communication channel between them. The host application, written in C/C++ and using API abstractions like OpenCL, runs on an x86 server, while hardware accelerated kernels run within the Xilinx FPGA. The API calls, managed by the Xilinx Runtime (XRT), are used to communicate with the hardware accelerators. Communication between the host x86 machine and the accelerator board, including control and data transfers, occurs across the PCIe bus. While control information is transferred between specific memory locations in hardware, global memory is used to transfer data between the host application and the kernels. Global memory is accessible by both the host processor and hardware accelerators, while host memory is only accessible by the host application.

For instance, in a typical application, the host first transfers the data to be operated on by the kernel from host memory into global memory. The kernel subsequently operates on the data, storing results back to global memory. Upon kernel completion, the host transfers the results back into host memory. Data transfers between the host and global memory introduce latency, which can be costly to the overall acceleration. To achieve acceleration in a real system, the benefits delivered by the hardware acceleration kernels must outweigh the extra latency of the data transfers. The general structure of this acceleration platform is shown in the following figure.
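The latency trade-off just described can be quantified with a back-of-the-envelope model. The helper names and the idea of comparing CPU time against transfer-plus-kernel time are illustrative, not part of this guide; in practice these timings would come from profiling.

```cpp
// Illustrative break-even model for offloading a function:
// acceleration only pays off when the CPU execution time exceeds
// the sum of data-transfer time and accelerated compute time.
double accelerated_time_ms(double transfer_ms, double kernel_ms) {
    return transfer_ms + kernel_ms;
}

bool worth_accelerating(double cpu_ms, double transfer_ms, double kernel_ms) {
    return cpu_ms > accelerated_time_ms(transfer_ms, kernel_ms);
}
```

For example, a function taking 100 ms on the CPU is worth offloading if transfer plus kernel time is 15 ms, but a function taking 8 ms on the CPU is not: the transfer latency alone would dominate.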

Figure: Architecture of an SDAccel Application



The FPGA hardware platform, on the right-hand side, contains the hardware accelerated kernels, global memory, and the DMA engine used for memory transfers. Kernels can have one or more global memory interfaces and are programmable. The SDAccel execution model can be broken down into these steps:

  1. The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface.
  2. The host application sets up the kernel with its input parameters.
  3. The host application triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading data from global memory, as necessary.
  5. The kernel writes data back to global memory and notifies the host that it has completed its task.
  6. The host application reads data back from global memory into the host memory and continues processing as needed.
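As a sketch, the steps above map onto OpenCL host-API calls roughly as follows. This fragment assumes the context, command queue, kernel, and cl_mem buffers have already been created from the loaded .xclbin (setup omitted); all variable names are illustrative, and it is not runnable without an accelerator card and the OpenCL runtime.

```c
#include <CL/cl.h>

/* Hypothetical helper mirroring the six execution-model steps. */
void run_kernel_once(cl_command_queue q, cl_kernel krnl,
                     cl_mem in_buf, cl_mem out_buf,
                     const int *host_in, int *host_out, size_t bytes) {
    /* 1. Write input data into the device's global memory. */
    clEnqueueWriteBuffer(q, in_buf, CL_TRUE, 0, bytes, host_in, 0, NULL, NULL);

    /* 2. Set up the kernel's input parameters. */
    clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf);

    /* 3. Trigger execution of the kernel on the FPGA. */
    clEnqueueTask(q, krnl, 0, NULL, NULL);

    /* 4-5. The kernel reads from and writes back to global memory;
     * wait here until it signals completion. */
    clFinish(q);

    /* 6. Read the results back into host memory. */
    clEnqueueReadBuffer(q, out_buf, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);
}
```

Production host code would typically use non-blocking transfers and events instead of clFinish to overlap transfers with computation, as discussed in the best-practices section.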

The FPGA can accommodate multiple kernel instances at one time; these can be different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.

SDAccel Build Process

The SDAccel environment offers all of the features of a standard software development environment:

  • Optimized compiler for host applications
  • Cross-compilers for the FPGA
  • Robust debugging environment to help identify and resolve issues in the code
  • Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software elements and the hardware elements of the project. As shown in the following figure, the host application is built through one process using the standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler.

Figure: Software/Hardware Build Process



  1. Host application build process using GCC:
    • Each host application source file is compiled to an object file (.o).
    • The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe).
  2. FPGA build process, highlighted in the following figure:
    • Each kernel is independently compiled to a Xilinx object (.xo) file.
      • C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
      • RTL kernels are compiled using the package_xo utility. The RTL kernel wizard in the SDAccel environment can be used to simplify this process.
    • The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
      • When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents.
      • When the build target is the system (actual hardware), xocc generates the FPGA binary for the device, leveraging the Vivado Design Suite to run synthesis and implementation.
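As a concrete sketch, the compile and link steps above correspond to xocc invocations along these lines. The kernel name vadd, the source file names, the compute-unit count, and the platform string are illustrative placeholders, not values from this guide:

```shell
# Compile a C/C++ or OpenCL C kernel to a Xilinx object file (.xo).
# -t selects the build target (sw_emu, hw_emu, or hw); -k names the kernel.
xocc -c -t hw_emu --platform xilinx_u200_xdma_201830_1 \
     -k vadd -o vadd.xo vadd.cl

# Link the kernel object(s) against the shell to produce the FPGA
# binary (.xclbin). --nk requests two compute units of the vadd kernel.
xocc -l -t hw_emu --platform xilinx_u200_xdma_201830_1 \
     --nk vadd:2 -o vadd.xclbin vadd.xo
```

The link step is also where kernel ports are mapped to specific global memory banks, via additional xocc options.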

Figure: FPGA Build Process



Note: The xocc compiler automatically uses the Vivado HLS and Vivado Design Suite tools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccel environment and the xocc compiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.

Build Targets

The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of the FPGA binary generated by the build process.

The SDAccel tool provides three different build targets: two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary:

Software Emulation (sw_emu)
Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system.
Hardware Emulation (hw_emu)
The kernel code is compiled into a hardware model (RTL), which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
System (hw)
The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.

SDAccel Design Methodology

The SDAccel environment supports two primary use cases:

Software-Centric Design
This software-centric approach focuses on improving the performance of an application written by software programmers, by accelerating compute intensive functions or bottlenecks identified while profiling the application.
Hardware-Centric Design
The acceleration kernel developer creates an optimized kernel that may be called as a library element by the application developer. Kernel languages are not specific to either methodology; a software-centric flow can also use C/C++, OpenCL, or RTL for its kernels. The main differences between the two approaches are the starting point (software application or kernels) and the emphasis that comes with it.

The two use cases can be combined, allowing teams of software and hardware developers to define accelerator kernels and develop applications to use them. This combined methodology involves different components of the application, developed by different people, potentially from different companies. You can leverage predefined kernel libraries available for use in your accelerated application, or develop all the acceleration functions within your own team.

Software-Centric Design

The methodology consists of two major phases:
  1. Architecting the application
  2. Developing the C/C++ kernels

In the first phase, the developer makes key decisions about the application architecture by determining which software functions should be mapped to FPGA kernels, how much parallelism is needed, and how it should be delivered.

In the second phase, the developer implements the kernels. This primarily involves structuring source code and applying the desired compiler pragmas to create the desired kernel architecture and meet the performance target.

Figure:Methodology Overview

For more information on the SDAccel software design methodology, see the SDAccel Methodology Guide (UG1346).

Hardware-Centric Design

The hardware-centric flow focuses first on developing and optimizing the kernel(s), and typically leverages advanced FPGA design techniques. For more information, see the SDAccel Environment Profiling and Optimization Guide. The hardware-centric development flow typically uses the following steps:

  1. Baseline the application in terms of functionalities and performance and isolate functions to be accelerated in hardware.
  2. Estimate cycle budgets and performance requirements to define accelerator architecture and interfaces.
  3. Develop accelerator.
  4. Verify functionality and performance. Iterate as needed.
  5. Optimize timing and resource utilization. Iterate as needed.
  6. Import the kernel into the SDAccel environment.
  7. Develop sample host code to test with a dummy kernel having the same interfaces as the actual kernel.
  8. Verify kernel works correctly with host code using hardware emulation, or running on actual hardware. Iterate as needed.
  9. Use the Activity Timeline, the Profile Summary, and timers in the source code to measure performance and optimize the host code. Iterate as needed.

Best Practices for Acceleration with SDAccel

Below are some specific things to keep in mind when developing your application code and hardware function in the SDAccel environment. You can find additional information in the SDAccel Environment Profiling and Optimization Guide.

  • Look to accelerate functions that have a high ratio of compute time to input and output data volume. Compute time can be greatly reduced using FPGA kernels, but data volume adds transfer latency.
  • Accelerate functions that have a self-contained control structure and do not require regular synchronization with the host.
  • Transfer large blocks of data from host to global device memory. One large transfer is more efficient than several smaller transfers. Run a bandwidth test to find the optimal transfer size.
  • Only copy data back to host when necessary. Data written to global memory by a kernel can be directly read by another kernel. Memory resources include PLRAM (small size but fast access with lowest latency), HBM (moderate size and access speed with some latency), and DDR (large size but slow access with high latency).
  • Take advantage of the multiple global memory resources to evenly distribute bandwidth across kernels.
  • Maximize bandwidth usage between kernel and global memory by performing 512-bit wide bursts.
  • Cache data in local memory within the kernels. Accessing local memories is much faster than accessing global memory.
  • In the host application, use events and non-blocking transactions to launch multiple requests in a parallel and overlapping manner.
  • In the FPGA, use different kernels to take advantage of task-level parallelism and use multiple CUs to take advantage of data-level parallelism to execute multiple tasks in parallel and further increase performance.
  • Within the kernels, take advantage of task-level parallelism with dataflow and instruction-level parallelism with loop unrolling and loop pipelining to maximize throughput.
  • Some Xilinx FPGAs contain multiple partitions called super logic regions (SLRs). Keep the kernel in the same SLR as the global memory bank that it accesses.
  • Use software and hardware emulation to validate your code frequently to make sure it is functionally correct.
  • Frequently review the SDAccel Guidance report, as it provides clear and actionable feedback regarding deficiencies in your project.
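Several of the practices above (bursting large blocks from global memory and caching data in fast local memory) can be sketched in a single kernel pattern. The function name, tile size, and computation below are illustrative; in an HLS kernel, the local array would map to on-chip BRAM and the bulk copies would become wide bursts to global memory.

```cpp
#include <cstddef>
#include <cstring>

// Illustrative tiled kernel: copy a block of global memory into a
// local buffer, compute on it, then write the block back.
constexpr std::size_t TILE = 64;  // illustrative tile size

void scale_tiled(const int *global_in, int *global_out, std::size_t n) {
    int local_buf[TILE];  // on-chip cache in an HLS kernel
    for (std::size_t base = 0; base < n; base += TILE) {
        std::size_t chunk = (n - base < TILE) ? n - base : TILE;
        // Bulk read from global memory (a burst in hardware).
        std::memcpy(local_buf, global_in + base, chunk * sizeof(int));
        // Compute on fast local data.
        for (std::size_t i = 0; i < chunk; ++i)
            local_buf[i] *= 3;
        // Bulk write back to global memory (a burst in hardware).
        std::memcpy(global_out + base, local_buf, chunk * sizeof(int));
    }
}
```

Accessing the local buffer avoids repeated trips to global memory, and the block-at-a-time copies give the compiler the contiguous access pattern it needs to infer wide bursts.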