Introduction to Debugging in SDAccel

This document introduces the debugging capabilities of the SDAccel™ environment. The goal is to provide detailed instructions on how to analyze any failure encountered within the SDAccel flow. If no tool problem is encountered and the behavior of the design is deemed functionally correct, refer to the SDAccel Environment Profiling and Optimization Guide to determine if the performance of the design can be further improved.

SDAccel Execution Model

In the SDAccel framework, an application program is split between a host application and hardware accelerated kernels with a communication channel between them. The host application, written in C/C++ and using API abstractions like OpenCL, runs on an x86 server while hardware accelerated kernels run within the Xilinx FPGA. The API calls, managed by the Xilinx Runtime (XRT), are used to communicate with the hardware accelerators. Communication between the host x86 machine and the accelerator board, including control and data transfers, occurs across the PCIe bus. While control information is transferred between specific memory locations in hardware, global memory is used to transfer data between the host application and the kernels. Global memory is accessible by both the host processor and hardware accelerators, while host memory is only accessible by the host application.

For instance, in a typical application, the host will first transfer data, to be operated on by the kernel, from host memory into global memory. The kernel would subsequently operate on the data, storing results back to the global memory. Upon kernel completion, the host would transfer the results back into the host memory. Data transfers between the host and global memory introduce latency which can be costly to the overall acceleration. To achieve acceleration in a real system, the benefits achieved by hardware acceleration kernels must outweigh the extra latency of the data transfers. The general structure of this acceleration platform is shown in the following figure.

Figure: Architecture of an SDAccel Application



The FPGA hardware platform, on the right-hand side, contains the hardware accelerated kernels, global memory along with the DMA for memory transfers. Kernels can have one or more global memory interfaces and are programmable. The SDAccel execution model can be broken down into these steps:

  1. The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface.
  2. The host application sets up the kernel with its input parameters.
  3. The host application triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading data from global memory, as necessary.
  5. The kernel writes data back to global memory and notifies the host that it has completed its task.
  6. The host application reads data back from global memory into the host memory and continues processing as needed.
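The six steps above can be sketched as a small, self-contained C++ model. This is only an illustration under stated assumptions: `GlobalMemory`, `scale_kernel`, and `run_on_device` are hypothetical names, plain `std::vector` objects stand in for host and global memory, and no FPGA or runtime is involved. The comments name the standard OpenCL host calls that typically play each role in a real host application.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for device global memory, which both host and kernel can access.
struct GlobalMemory {
    std::vector<int> input;
    std::vector<int> output;
};

// Hypothetical kernel: scales every element by `factor`.
// It reads and writes global memory only, never host memory.
void scale_kernel(GlobalMemory& gmem, int factor) {
    gmem.output.resize(gmem.input.size());
    for (std::size_t i = 0; i < gmem.input.size(); ++i) {
        // Step 4: compute while reading from global memory.
        // Step 5: write the result back to global memory.
        gmem.output[i] = gmem.input[i] * factor;
    }
}

std::vector<int> run_on_device(const std::vector<int>& host_in, int factor) {
    GlobalMemory gmem;

    // Step 1: host memory -> global memory (e.g. clEnqueueWriteBuffer).
    gmem.input = host_in;

    // Step 2: set up the kernel arguments (e.g. clSetKernelArg).
    const int kernel_arg = factor;

    // Step 3: trigger kernel execution (e.g. clEnqueueTask or
    // clEnqueueNDRangeKernel); here it is a plain function call.
    scale_kernel(gmem, kernel_arg);

    // The kernel has completed: one result per input element is expected.
    assert(gmem.output.size() == host_in.size());

    // Step 6: global memory -> host memory (e.g. clEnqueueReadBuffer).
    std::vector<int> host_out = gmem.output;
    return host_out;
}
```

Calling `run_on_device` with a small input vector and a factor walks through all six steps in order. In a real application each commented step is an asynchronous runtime call rather than a direct assignment or function call.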

The FPGA can accommodate multiple kernel instances at one time; this can occur between different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.
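The earlier point about transfer latency can be made concrete with a little arithmetic. The sketch below is a simplified model, not a measurement method: `OffloadTiming`, `offload_time`, and `effective_speedup` are hypothetical names, and the model assumes the two transfers and the kernel run back to back with no overlap.

```cpp
#include <cassert>

// Hypothetical timing breakdown for one accelerated call (seconds).
struct OffloadTiming {
    double host_to_global;  // host memory -> global memory transfer
    double kernel;          // kernel execution on the FPGA
    double global_to_host;  // global memory -> host memory transfer
};

// Total time of the offloaded path, assuming no overlap between phases.
double offload_time(const OffloadTiming& t) {
    assert(t.host_to_global >= 0 && t.kernel >= 0 && t.global_to_host >= 0);
    return t.host_to_global + t.kernel + t.global_to_host;
}

// Speedup relative to running the same work on the CPU alone.
// A value above 1.0 means the kernel gain outweighs the transfer latency.
double effective_speedup(double cpu_time, const OffloadTiming& t) {
    const double total = offload_time(t);
    assert(total > 0);
    return cpu_time / total;
}
```

For example, a kernel that takes 0.5 s with 0.25 s of transfer in each direction only pays off if the CPU-only version takes longer than 1.0 s.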

SDAccel Build Process

The SDAccel environment offers all of the features of a standard software development environment:

  • Optimized compiler for host applications
  • Cross-compilers for the FPGA
  • Robust debugging environment to help identify and resolve issues in the code
  • Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software elements and the hardware elements of the project. As shown in the following figure, the host application is built through one process using the standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler.

Figure: Software/Hardware Build Process



  1. Host application build process using GCC:
    • Each host application source file is compiled to an object file (.o).
    • The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe).
  2. FPGA build process is highlighted in the following figure:
    • Each kernel is independently compiled to a Xilinx object (.xo) file.
      • C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
      • RTL kernels are compiled using the package_xo utility. The RTL kernel wizard in the SDAccel environment can be used to simplify this process.
    • The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
      • When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents.
      • When the build target is the system (actual hardware), xocc generates the FPGA binary for the device leveraging the Vivado Design Suite to run synthesis and implementation.

Figure: FPGA Build Process



Note: The xocc compiler automatically uses the Vivado HLS and Vivado Design Suite tools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccel environment and the xocc compiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.
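The two-stage FPGA build (compile each kernel to a .xo file, then link against the platform) can be illustrated by a small C++ helper that assembles the corresponding command lines. This is a sketch under assumptions: the helper names are hypothetical, the file naming is illustrative, and the xocc options shown (-c, -l, -t, --platform, -k, --nk, -o) are quoted from memory of the xocc command line and should be checked against the SDAccel Environment User Guide for your release.

```cpp
#include <cassert>
#include <string>

// Compile step: one kernel source file -> one Xilinx object (.xo) file.
// All arguments are illustrative placeholders supplied by the caller.
std::string xocc_compile_cmd(const std::string& target,
                             const std::string& platform,
                             const std::string& kernel,
                             const std::string& source) {
    assert(!kernel.empty() && !source.empty());
    return "xocc -c -t " + target + " --platform " + platform +
           " -k " + kernel + " -o " + kernel + ".xo " + source;
}

// Link step: kernel .xo file + platform -> FPGA binary (.xclbin).
// The number of kernel instances is fixed here, at link time.
std::string xocc_link_cmd(const std::string& target,
                          const std::string& platform,
                          const std::string& kernel,
                          int instances) {
    assert(instances >= 1);
    return "xocc -l -t " + target + " --platform " + platform +
           " --nk " + kernel + ":" + std::to_string(instances) +
           " -o " + kernel + ".xclbin " + kernel + ".xo";
}
```

The helpers only build strings; running the resulting commands requires an installed SDAccel environment and a valid platform name.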

Build Targets

The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of the FPGA binary generated by the build process.

The SDAccel tool provides three build targets: two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary.

Software Emulation (sw_emu)
Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with application, and verifying the behavior of the system.
Hardware Emulation (hw_emu)
The kernel code is compiled into a hardware model (RTL) which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
System (hw)
The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.
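Because each build target produces a different FPGA binary, a host application often needs to pick the matching .xclbin at run time. The sketch below assumes the common SDAccel convention that emulation runs set the XCL_EMULATION_MODE environment variable to sw_emu or hw_emu and that it is unset for hardware runs; the function names and the `<kernel>.<target>.xclbin` naming scheme are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Map an emulation-mode string to one of the three build targets.
// An empty or unrecognized mode falls back to the hardware target.
std::string build_target_from_mode(const std::string& mode) {
    if (mode == "sw_emu" || mode == "hw_emu") {
        return mode;
    }
    return "hw";
}

// Hypothetical naming scheme: <kernel>.<target>.xclbin
std::string xclbin_name(const std::string& kernel, const std::string& target) {
    assert(target == "sw_emu" || target == "hw_emu" || target == "hw");
    return kernel + "." + target + ".xclbin";
}

// Pick the binary that matches the current run configuration.
// Assumption: XCL_EMULATION_MODE is set only for emulation runs.
std::string select_xclbin(const std::string& kernel) {
    const char* mode = std::getenv("XCL_EMULATION_MODE");
    return xclbin_name(kernel, build_target_from_mode(mode ? mode : ""));
}
```

With this approach the same host executable can be used across all three targets, which keeps the host code identical while moving from software emulation to hardware execution.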

SDAccel Debug Flow Overview

This section presents the general debug flow of the SDAccel environment by detailing the general steps of a proven development process. This process allows you to focus rapidly on potential errors in the design. It also gives developers a baseline for where to start if an error occurs in their own development steps.

The debug flow described here assumes that an SDAccel platform board is installed and the initial setup checks have passed. It is possible to configure the SDAccel environment to work with custom hardware platforms that require a platform shell which defines the foundational components of the board.

The SDAccel environment provides application-level debug features which allow the host code, the kernel code, and the interactions between them to be efficiently debugged. The recommended application-level debugging flow consists of three levels of debugging: software emulation, hardware emulation, and hardware execution.

This three-tiered approach allows debugging of the host and kernel code and their interactions at different levels of abstraction. Each of the execution models described below is supported through the SDAccel IDE as well as through a batch flow using basic compile time and runtime setup options.

Software Emulation

Purpose
Algorithm verification
Execution Model
During software emulation, all processes are running pure C/C++ models. OpenCL kernel models are transformed to execute concurrently.

Figure: Software Emulation

Verify that both the host and kernel code are functionally correct by running software emulation. Because software emulation compiles and executes quickly, spend time here to iterate through the code until the host and kernel code function correctly. Both hardware emulation and hardware execution take more time to compile and execute.

Hardware Emulation

Purpose
RTL debugging, finding protocol violations.
Execution Model
During hardware emulation, the host code is executed concurrently with a simulation of the RTL model of the kernel. The RTL model is either directly imported or created through Vivado HLS from the C/C++/OpenCL kernel code.

Figure: Hardware Emulation

Verify that the host code and the kernel hardware implementation are correct by running hardware emulation on a data set. Hardware emulation performs detailed verification using an accurate model of the hardware (RTL) together with the host code C/OpenCL model. The hardware emulation flow invokes the hardware simulator in the SDAccel environment to test the functionality of the logic that is to be executed on the FPGA compute fabric. The interface between the models is represented by a transaction-level model (TLM) to limit the impact of the interface model on the overall execution time. The execution time for hardware emulation is longer than for software emulation.

TIP: Xilinx recommends that you use small data sets for debug and validation.

During the hardware emulation stage you can optionally modify the kernel code to improve performance. Iterate in hardware emulation until the functionality is correct and the estimated kernel performance is sufficient. See the SDAccel Environment Profiling and Optimization Guide for more information.

Hardware Execution

Purpose
Final verification of the complete system, finding protocol violations (hardware hangs), and debugging system performance.
Execution Model
During hardware execution, the actual hardware platform is used to execute the kernels. The difference between this debug configuration and the final compilation of the kernel code is the inclusion of special hardware logic in the platform, such as ILA and VIO debug cores, and AXI performance monitors for debug purposes.

Figure: Hardware Execution

At this stage, a system image (xclbin) is compiled and executed on the actual hardware platform. Refer to the SDAccel Environment User Guide for more information on generating the xclbin file. Once the kernels are confirmed to execute correctly on the actual FPGA hardware, your focus can shift from debugging to performance tuning. See the SDAccel Environment Profiling and Optimization Guide.

Nevertheless, the hardware execution model might not be functional due to protocol issues or issues with the hardware configuration. To address this, the SDAccel environment provides specific hardware debug capabilities to localize these critical hardware issues. These include ChipScope™ debug cores (such as System ILAs), which can be viewed in the Vivado hardware manager, together with waveform analysis, kernel activity reports, and memory access analysis.

IMPORTANT: Debugging the kernel on the platform hardware requires additional logic to be incorporated into the overall hardware model. This means that if hardware debugging is enabled, there is some impact on resource use of the FPGA, as well as some impact on the kernel performance.