SDAccel Compilation Flow and Execution Model

The SDAccel™development environment is a heterogeneous system architecture platform to accelerate compute intensive tasks using Xilinx®FPGA devices. The SDAccelenvironment contains a Host x86 machine that is connected to one or more XilinxFPGA devices through a PCIe®bus, as shown below.

Figure:SDAccel Architecture

Programming Model

TheSDAccelenvironment supports heterogeneous computing using the industry standardOpenCLprotocol (https://www.khronos.org/opencl/). The host program executes on the Host CPU and offloads compute intensive tasks to execute onXilinxFPGA devices using theOpenCLprogramming paradigm. Unlike a CPU (or GPU), the FPGA can be thought of as a blank canvas which does not have predefined instruction sets or fixed word size, and is capable of running the same or different instructions in parallel to greatly improve the performance of your applications.

Device Topology

In theSDAccelenvironment, devices are one or more FPGAs connected to a host x86 machine through aPCIebus. The FPGA contains a programmable region that implements and executes kernels. The FPGA platform contains one or more global memory banks. The data transfer from the host machine to kernels and from kernels to the host happens through these global memory banks. The kernels running on the FPGA can have one or more memory interfaces. The connection from the global memory banks to those memory interfaces are configurable, as their features are determined by the kernel compilation options.

The programmable logic ofXilinxdevices can implement multiple kernels at the same time, allowing for significant task parallelism. A single kernel can also be instantiated multiple times. The number of instances of a kernel is programmable and determined by the kernel compilation options.

Figure:SDAccel Architecture

The diagram above illustrates the flexible connections from the host application to multiple kernels through the global memory banks. The FPGA board device shown above contains four DDR memory banks. The programmable logic of the FPGA is running two kernels, Kernel A and Kernel B. Each kernel has two memory interfaces one for reading the data and another for writing. Also, note that there are two instances of Kernel A, totaling three simultaneous kernel instances on the FPGA.

In the diagram, the first instance of Kernel A: CU1 uses a single memory interface for reading and writing. Kernel B and the second instance of Kernel A: CU2 use different memory interfaces for reading, and writing, with Kernel B essentially passing data directly to Kernel A: CU2 through the global memory.

Note:To achieve the best performance, the global memory banks to kernel interface connections should carefully be defined as discussed in Customization of DDR Bank to Kernel Connection.

SDAccel Build Process

TheSDAccelenvironment offers all of the features of a standard software development environment:

  • Optimized compiler for host applications
  • Cross-compilers for the FPGA
  • Robust debugging environment to help identify and resolve issues in the code
  • Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software elements, and the hardware elements of the project. As shown in the following figure, the host application is built through one process using standard GCC compiler, and the FPGA binary is built through a separate process using theXilinxxocccompiler.

Figure:Software/Hardware Build Process



  1. Host application build process using GCC:
    • Each host application source file is compiled to an object file (.o).
    • The object files (.o) are linked with theXilinxSDAccelruntime shared library to create the executable (.exe).
  2. FPGA build process is highlighted in the following figure:
    • Each kernel is independently compiled to aXilinxobject (.xo) file.
      • C/C++ andOpenCLC kernels are compiled for implementation on an FPGA using thexocccompiler. This step leverages theVivado®HLS compiler. Pragmas and attributes supported byVivadoHLS can be used in C/C++ andOpenCLC kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
      • RTL kernels are compiled using thepackage_xoutility. The RTL kernel wizard in theSDAccelenvironment can be used to simplify this process.
    • The kernel.xofiles are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
      • When the build target is software or hardware emulation, as described below,xoccgenerates simulation models of the device contents.
      • When the build target is the system (actual hardware),xoccgenerates the FPGA binary for the device leveraging theVivado Design Suiteto run synthesis and implementation.

Figure:FPGA Build Process



Note:The xocccompiler automatically uses the VivadoHLS and Vivado Design Suitetools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccelenvironment and the xocccompiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.

Build Targets

TheSDAcceltool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). TheSDAccelbuild target defines the nature of FPGA binary generated by the build process.

TheSDAcceltool provides three different build targets, two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary:

Software Emulation ( sw_emu)
Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with application, and verifying the behavior of the system.
Hardware Emulation ( hw_emu)
The kernel code is compiled into a hardware model (RTL) which is run in a dedicated simulator. This build and run loop takes longer but provides a detailed, cycle-accurate, view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
System ( hw)
The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.

Execution Model of an SDAccel Application

TheSDAccelenvironment is designed to provide a simplified development experience for FPGA-based software acceleration platforms. The general structure of the acceleration platform is shown in the following figure.

Figure:Architecture of anSDAccelApplication



The custom application is running on the host x86 server and usesOpenCLAPI calls to interact with the FPGA accelerators. TheXilinxruntime (XRT) manages those interactions. The application is written in C/C++ usingOpenCLAPIs. The custom kernels are running within aXilinxFPGA with the XRT managing interactions between the host application and the accelerator. Communication between the host x86 machine and the accelerator board occurs across thePCIebus.

TheSDAccelhardware platform contains global memory banks. The data transfer between the host machine and kernels, in either direction, occurs through these global memory banks. The kernels running on the FPGA can have one or more memory interfaces. The connection from the memory banks to those memory interfaces is programmable and determined by linking options of the compiler.

TheSDAccelexecution model follows these steps:

  1. The host application writes the data needed by a kernel into the global memory of the attached device through thePCIeinterface.
  2. The host application programs the kernel with its input parameters.
  3. The host application triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading and writing data from global memory, as necessary.
  5. The kernels write data back to the memory banks, and notify the host that it has completed its task.
  6. The host application reads data back from global memory into the host memory space, and continues processing as needed.

The FPGA can accommodate multiple kernel instances at one time; this can occur between different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.

SDAccel Emulation Flows

TheSDAccelenvironment development flow can be divided into two distinct steps. The first step is to compile the host and kernel code to generate the executables. The second step is to run the executables in a heterogeneous system comprised of the Host CPU andSDAccelenvironment accelerator platform. However, the kernel compilation process is long and can take several hours depending on the size of the kernels and the architecture of the target FPGA. Therefore, to save time and shorten the debug cycle before the kernel compilation process, theSDAccelenvironment provides two other build targets for testing purposes: software emulation and hardware emulation. The compilation and execution of these emulation targets are significantly faster, and do not require the actual FPGA board to be run. These emulation flows abstract the FPGA board, and its connection to the host machine, into software models to validate the combined functionality of the host and kernel code, as well as providing some performance estimates early in the design process. These performance estimates are just estimates, but they can greatly help debugging and identifying performance bottlenecks. Refer to theSDAccel Environment Debugging Guidefor more information on debugging using software and hardware emulation flows.

Software Emulation Flow

Compilation of the software emulation target is the fastest. It is mainly used for checking the functional correctness when the host and kernel code are running together. Thexocccompiler does the minimum transformation of the kernel code in order to run it in conjunction with the host code, so this software emulation flow helps to check functional correctness at the very early stage of the final binary creation. The software emulation flow can be used for algorithm refinement, debugging functional issues, and letting developers iterate quickly through the code to make improvements.

Hardware Emulation Flow

In the hardware emulation flow, thexocccompiler generates a model of the kernel in a hardware description language (RTL Verilog). The hardware emulation flow helps to check the functional correctness of the final binary creation after synthesis of the RTL from the C, C++, orOpenCLkernel code. The hardware emulation flow offers great debug capability with the waveform viewer if the system does not work as expected.

Table 1.Comparison of Emulation Flows with Hardware Execution
Software Emulation Hardware Emulation Hardware Execution
Host application runs with a C/C++ or OpenCL model of the kernels. Host application runs with a simulated RTL model of the kernels. Host application runs with actual hardware implementation of the kernels.
Confirm functional correctness of the system. Test the host / kernel integration, get performance estimates. Confirm that the system runs correctly and with desired performance.
Fastest turnaround time. Best debug capabilities, moderate compilation time. Final FPGA implementation and run provides accurate performance result with long build time.

SDAccel Example Designs

SDAccelExamples on GitHub

Xilinxprovides many examples for programming with theSDAccelenvironment in theGitHub repositoryto help beginning users get familiar with the coding style of host and kernel code, and for more experienced users to use as a source of coding examples. All examples are available with host code, kernel code, and Makefile associated with the compilation flow and runtime flow. The following is one such example to get a basic understanding of the file structure of a standard example.

Inverse Discrete Cosine Transform (IDCT) Example

Look at theIDCT exampledesign that demonstrates the key coding organization required for theSDAccelenvironment.

TheReadme.mdfile discusses in detail how to run this example in both emulation and FPGA execution flows using the providedMakefile.

Inside the./srcdirectory, you can find host codeidct.cpp, and kernel codekrnl_idct.cpp.

In the following chapters you will learn the basic required knowledge to program the host code and kernel code for theSDAccelenvironment. During this process you might refer to the above design as an example.

Organization of this Guide

The remaining chapters are organized as follows:

  • Programming the Host Application: Describes writing host code in C or C++ using theOpenCLAPI targeted forXilinxFPGA devices. This chapter assumes the user has prior knowledge ofOpenCL. It discusses coding practices to follow when writing an effective host application interfacing with acceleration kernels running onXilinxFPGA devices.
  • Programming C/C++ Kernels: Describes different elements of writing high-performance, compute-intensive kernel code to implement on an FPGA device.
  • Configuring the System Architecture: Discusses integrating and connecting the host application to one or more kernel instances during the linking process.