Introduction

This guide presents allSDx™development environment features related to performance analysis of the design. It is also logically structured to assist in the actual performance improvement effort. Dedicated sections are available for the main components of theSDAccel™environment performance bottlenecks, namely Accelerator,PCIe®bus transfer, and Host code. Each of these sections is structured to guide the developer from recognizing bottlenecks all the way to solution approaches to increase overall system performance.

Note:Performance optimization assumes, as a starting point, a working design intended for performance improvement. If erroneous behavior is encountered, look for guidance in SDAccel Environment Debugging Guide.

Similarly, the general concepts regarding coding of host code or accelerator kernels are not explained here; these concepts are introduced in theSDAccel Environment Programmers Guide.

Execution Model of an SDAccel Application

TheSDAccelenvironment is designed to provide a simplified development experience for FPGA-based software acceleration platforms. The general structure of the acceleration platform is shown in the following figure.

Figure:Architecture of anSDAccelApplication



The custom application is running on the host x86 server and usesOpenCLAPI calls to interact with the FPGA accelerators. TheXilinxruntime (XRT) manages those interactions. The application is written in C/C++ usingOpenCLAPIs. The custom kernels are running within aXilinxFPGA with the XRT managing interactions between the host application and the accelerator. Communication between the host x86 machine and the accelerator board occurs across thePCIebus.

TheSDAccelhardware platform contains global memory banks. The data transfer between the host machine and kernels, in either direction, occurs through these global memory banks. The kernels running on the FPGA can have one or more memory interfaces. The connection from the memory banks to those memory interfaces is programmable and determined by linking options of the compiler.

TheSDAccelexecution model follows these steps:

  1. The host application writes the data needed by a kernel into the global memory of the attached device through thePCIeinterface.
  2. The host application programs the kernel with its input parameters.
  3. The host application triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading and writing data from global memory, as necessary.
  5. The kernels write data back to the memory banks, and notify the host that it has completed its task.
  6. The host application reads data back from global memory into the host memory space, and continues processing as needed.

The FPGA can accommodate multiple kernel instances at one time; this can occur between different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.

SDAccel Build Process

TheSDAccelenvironment offers all of the features of a standard software development environment:

  • Optimized compiler for host applications
  • Cross-compilers for the FPGA
  • Robust debugging environment to help identify and resolve issues in the code
  • Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software elements, and the hardware elements of the project. As shown in the following figure, the host application is built through one process using standard GCC compiler, and the FPGA binary is built through a separate process using theXilinxxocccompiler.

Figure:Software/Hardware Build Process



  1. Host application build process using GCC:
    • Each host application source file is compiled to an object file (.o).
    • The object files (.o) are linked with theXilinxSDAccelruntime shared library to create the executable (.exe).
  2. FPGA build process is highlighted in the following figure:
    • Each kernel is independently compiled to aXilinxobject (.xo) file.
      • C/C++ andOpenCLC kernels are compiled for implementation on an FPGA using thexocccompiler. This step leverages theVivado®HLS compiler. Pragmas and attributes supported byVivadoHLS can be used in C/C++ andOpenCLC kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
      • RTL kernels are compiled using thepackage_xoutility. The RTL kernel wizard in theSDAccelenvironment can be used to simplify this process.
    • The kernel.xofiles are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
      • When the build target is software or hardware emulation, as described below,xoccgenerates simulation models of the device contents.
      • When the build target is the system (actual hardware),xoccgenerates the FPGA binary for the device leveraging theVivado Design Suiteto run synthesis and implementation.

Figure:FPGA Build Process



Note:The xocccompiler automatically uses the VivadoHLS and Vivado Design Suitetools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccelenvironment and the xocccompiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.

Build Targets

TheSDAcceltool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). TheSDAccelbuild target defines the nature of FPGA binary generated by the build process.

TheSDAcceltool provides three different build targets, two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary:

Software Emulation ( sw_emu)
Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with application, and verifying the behavior of the system.
Hardware Emulation ( hw_emu)
The kernel code is compiled into a hardware model (RTL) which is run in a dedicated simulator. This build and run loop takes longer but provides a detailed, cycle-accurate, view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
System ( hw)
The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.

SDAccel Optimization Flow Overview

SDAccelenvironment is a complete software development environment for creating, compiling, and optimizing C/C++/OpenCLapplications to be accelerated onXilinxFPGAs. TheSDAccelenvironment includes the three recommended flows for optimizing an application. For information on each flow, refer to the following:

Baselining Functionalities and Performance

It is very important to understand the performance of your application before you start any optimization effort. This is achieved by baselining the application in terms of functionalities and performance.

Figure:Baselining Functionalities and Performance Flow

Identify Bottlenecks

The first step is to identify the bottlenecks of the current application running on your existing platform. The most effective way is to run the application with profiling tools, likevalgrind,callgrind, and GNUgprof. The profiling data generated by these tools show the call graph with the number of calls to all functions and their execution time. The functions that consume the most execution time are good candidates to be offloaded and accelerated onto FPGAs.

Convert Target Functions

After the target functions are selected, convert them toOpenCLC kernels or C/C++ kernels without any optimization. The application code calling these kernels will also need to be converted to useOpenCLAPIs for data movement and task scheduling.

TIP:Keep everything as simple as possible and minimize changes to the existing code in this step so you can quickly generate a working design on the FPGA and get the baselined performance and resource number.

Run Software and Hardware Emulation

Next, run software and hardware emulation to verify the function correctness and generate profiling data on the host code and the kernels. Analyze the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baselined performance estimate such as timing, interval, and latency and resource utilization, such as DSP and block RAM.

Build and Run the Application

The last step in baselining is building and running the application on an FPGA acceleration card. Analyze the reports from the system compilation and the profiling data from application execution to see the actual performance and resource utilization.

TIP:Save all the reports during baselining, so that you can reference and compare results during optimization.

Optimizing Data Movement

Figure:Optimizing Data Movement Flow

In theOpenCLAPI, all data is transferred from the host memory to the global memory on the device first and then from the global memory to the kernel for computation. The computation results are written back from the kernel to the global memory and lastly from the global memory to the host memory. A key factor in determining strategies for kernel optimization is understanding how data can be efficiently moved around.

Note:Optimize the data movement in the application before optimizing computation.

During data movement optimization, it is important to isolate data transfer code from computation code because inefficiency in computation might cause stalls in data movement.Xilinxrecommends that you modify the host code and kernels with data transfer code only for this optimization step. The goal is to maximize the system level data throughput by maximizingPCIebandwidth usage and DDR bandwidth usage. It usually takes multiple iterations of running software emulation, hardware emulation, as well as execution on FPGAs, to achieve the goal.

Optimizing Kernel Computation

Figure:Optimizing Kernel Computation Flow

One of the key benefits of an FPGA is that you can create custom logic for your specific application. The goal of kernel computation optimization is to create processing logic that can consume all the data as soon as they arrive at kernel interfaces. The key metric during this step is the initiation interval (II). This is generally achieved by expanding the processing code to match the data path with techniques such as function pipelining, loop unrolling, array partitioning, data flowing, etc. TheSDAccelenvironment produces various compilation reports and profiling data during hardware emulation and system run to assist your optimization effort. Refer toSDAccel Profiling and Optimization Featuresfor details on the compilation and profiling report.