SDAccel Profiling and Optimization Features

The SDAccel™ environment generates various reports on kernel resource usage and performance during compilation. It also collects profiling data during application execution in emulation mode and on the FPGA acceleration card. The reports and profiling data provide you with information on performance bottlenecks in the application and on optimization techniques that can be used to improve performance. This chapter describes how to generate the reports and how to collect, display, and interpret the profiling results in the SDAccel environment.

System Estimate

In the SDAccel development environment, generating FPGA binary files is the step with the longest execution time. The execution time is also most affected by the FPGA architecture and the number of compute units placed on the FPGA fabric. Therefore, it is essential for you to have a quicker way to understand the performance of the application before running it on hardware, so you can spend more time iterating and optimizing your application instead of waiting for the FPGA programming file to generate.

The system estimate in the SDAccel development environment takes into account the target hardware device and each compute unit in the application. Although an exact performance metric can only be measured by running the application on the FPGA, the estimation report in the development environment provides an accurate representation of the expected behavior.

GUI Flow

This report is automatically generated during the hardware emulation flow. There is one report generated for each kernel and a top report for the complete binary container. It is easy to access the reports from the Assistant window in the Emulation-HW folder.

The following figure shows the Assistant window with a System Estimate report for the binary_container_1 and the kernel named run.

Figure: System Estimate Report in the Assistant Window

Command Line

The following command generates the system performance estimate report system_estimate.xtxt for all kernels in kernel.cl:

xocc -c -t hw_emu --platform xilinx:adm-pcie-7v3:1ddr:3.0 --report estimate kernel.cl

The performance estimate report generated by the xocc --report estimate option provides information on every binary container in the application, as well as every compute unit in the design. The report is structured as follows:

  • Target device information
  • Summary of every kernel in the application
  • Detailed information on every binary container in the solution

Data Interpretation

The following example report file represents the information generated for the estimate report:

---------------------------------------------------------------------
Design Name:             _xocc_compile_kernel_bin.dir
Target Device:           xilinx:adm-pcie-ku3:2ddr-xpr:3.3
Target Clock:            200MHz
Total number of kernels: 1
---------------------------------------------------------------------

Kernel Summary
Kernel Name    Type  Target              OpenCL Library  Compute Units
-------------  ----  ------------------  --------------  -------------
smithwaterman  clc   fpga0:OCL_REGION_0  xcl_xocc        1

----------------------------------------------------------------------
OpenCL Binary:     xcl_xocc
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit     Kernel Name    Module Name    Target Frequency  Estimated Frequency
---------------  -------------  -------------  ----------------  -------------------
smithwaterman_1  smithwaterman  smithwaterman  200               202.020203

Latency Information (clock cycles)
Compute Unit     Kernel Name    Module Name    Start Interval  Best Case  Avg Case  Worst Case
---------------  -------------  -------------  --------------  ---------  --------  ----------
smithwaterman_1  smithwaterman  smithwaterman  29468           29467      29467     29467

Area Information
Compute Unit     Kernel Name    Module Name    FF    LUT   DSP  BRAM
---------------  -------------  -------------  ----  ----  ---  ----
smithwaterman_1  smithwaterman  smithwaterman  2925  4304  1    10
----------------------------------------------------------------------

Design and Target Device Summary

All design estimate reports begin with an application summary and information about the target device. The device information is provided in the following section of the report:

---------------------------------------------------------------------
Design Name:             _xocc_compile_kernel_bin.dir
Target Device:           xilinx:adm-pcie-ku3:2ddr-xpr:3.3
Target Clock:            200MHz
Total number of kernels: 1
---------------------------------------------------------------------

For the design summary, the only information you provide is the design name and the selection of the target device. The other information in this section is the target board and the clock frequency:

Target Device
The name of the board that runs the application compiled by the SDAccel development environment.
Target Clock
Defines how fast the logic runs for compute units mapped to the FPGA fabric.

Both of these parameters are fixed by the device developer and cannot be modified from within the SDAccel environment.

Kernel Summary

The Kernel Summary section lists all of the kernels defined for the current SDAccel solution. The following example shows the kernel summary:

Kernel Summary
Kernel Name    Type  Target              OpenCL Library  Compute Units
-------------  ----  ------------------  --------------  -------------
smithwaterman  clc   fpga0:OCL_REGION_0  xcl_xocc        1

In addition to the kernel name, the summary also provides the execution target and the type of the input source. Because there is a difference in compilation and optimization methodology for OpenCL™, C, and C/C++ source files, the type of kernel source file is specified.

The Kernel Summary section is the last summary information in the report. From here on, detailed information on each binary container and its compute units is presented.

Timing Information

For each binary container, the detail section begins with the execution target of all compute units. It also provides timing information for every compute unit. As a general rule, if the estimated frequency is higher than the device target, the compute unit will be able to run on the device. If the estimated frequency is below the target, the kernel code for the compute unit needs to be further optimized to run correctly on the FPGA fabric. This information is shown in the following example:

OpenCL Binary:     xcl_xocc
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit     Kernel Name    Module Name    Target Frequency  Estimated Frequency
---------------  -------------  -------------  ----------------  -------------------
smithwaterman_1  smithwaterman  smithwaterman  200               202.020203

It is important to understand the difference between the target and estimated frequencies. Compute units are not placed in isolation into the FPGA fabric. Compute units are placed as part of a valid FPGA design that can include other components defined by the device developer to support a class of applications.

Because the compute unit custom logic is generated one kernel at a time, an estimated frequency that is higher than the device target indicates to the developer using the SDAccel environment that there should not be any timing problems during the creation of the FPGA programming files.

Latency Information

The latency information presents the execution profile of each compute unit in the binary container. When analyzing this data, it is important to keep in mind that all values are measured from the compute unit boundary through the custom logic. In-system latencies associated with data transfers to global memory are not reported as part of these values. Also, the latency numbers reported are only for compute units targeted at the FPGA fabric. The following is an example of the latency report:

Latency Information (clock cycles)
Compute Unit     Kernel Name    Module Name    Start Interval  Best Case  Avg Case  Worst Case
---------------  -------------  -------------  --------------  ---------  --------  ----------
smithwaterman_1  smithwaterman  smithwaterman  29468           29467      29467     29467

The latency report is divided into the following fields:

  • Start interval
  • Best case latency
  • Average case latency
  • Worst case latency

The start interval defines the number of clock cycles that must pass between invocations of a compute unit for a given kernel.

The best, average, and worst case latency numbers refer to how much time it takes the compute unit to generate the results of one NDRange data tile for the kernel. For cases where the kernel does not have data-dependent computation loops, the latency values will be the same. Data-dependent execution of loops introduces data-specific latency variation that is captured by the latency report.

The interval or latency numbers are reported as "undef" for kernels with one or more of the conditions listed below:

  • OpenCL kernels that do not have an explicit reqd_work_group_size(x,y,z) attribute
  • Kernels that have loops with variable bounds
Note: In the case of undefined counters, consider using the TRIPCOUNT pragma to provide trip count estimates for the affected loops.
Note: The latency information reflects estimates based on the analysis of the loop transformations and exploited parallelism of the model. Advanced transformations such as pipelining and dataflow can heavily change the actual throughput numbers. Therefore, latency numbers should only be used as relative guides between different runs.
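
As a hypothetical illustration of the TRIPCOUNT pragma mentioned in the note above (the kernel and variable names are placeholders, not taken from this design), a C/C++ kernel loop with a variable bound can be annotated so that the report shows latency estimates instead of "undef":

void vadd(const int *a, const int *b, int *c, int n) {
    // The loop bound n is only known at run time, so the tool cannot
    // derive latency numbers without a trip count estimate.
    for (int i = 0; i < n; i++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=256 avg=128
        c[i] = a[i] + b[i];
    }
}

The pragma affects only reporting; it does not change the generated hardware. For OpenCL kernels, providing an explicit reqd_work_group_size attribute similarly resolves undefined interval values.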

Area Information

There are a limited number of fundamental building blocks available in each FPGA. These fundamental blocks (FF, LUT, DSP, block RAM) are used by the SDAccel development environment to generate the custom logic for each compute unit in the design. The number of each fundamental resource needed to implement the custom logic in a compute unit determines how many compute units can be simultaneously loaded into the FPGA fabric. The following example shows the area information reported for a compute unit:

Area Information
Compute Unit     Kernel Name    Module Name    FF    LUT   DSP  BRAM
---------------  -------------  -------------  ----  ----  ---  ----
smithwaterman_1  smithwaterman  smithwaterman  2925  4304  1    10

HLS Report

After compiling a kernel using the SDx™ development environment GUI or the XOCC command line, the Vivado® High-Level Synthesis (HLS) tool report is available. The HLS report includes details about the performance and logic usage of the custom hardware logic generated from the user kernel code. These details provide advanced users with many insights into the kernel compilation results to guide kernel optimization.

GUI Flow

After compiling a kernel using the SDx environment GUI, you can view the HLS Report in the Assistant window. The report is located under the Emulation-HW or System build configuration and is named after the kernel and its binary container. This is illustrated in the following Assistant window:

Figure: Assistant Window



Command Line

The HLS Report is designed to be viewed in the SDAccel environment GUI. However, for command line users, a textual representation of this report is also published. This report can be found inside the report directory situated under the kernel synthesis directory in the Vivado HLS tool solution directory.

Because the xocc command generates several additional levels of hierarchy above this synthesis directory, it is best to simply locate the file by name:

find . -name <kernel_name>_csynth.rpt

Where <kernel_name> is the name of the kernel.

Note: The find command also supports lookup using wildcards, such that the following command finds all synthesis reports in any subdirectory:
find . -name "*_csynth.rpt"

Data Interpretation

The left pane of the HLS Report shows the module hierarchy. Each module generated as part of the high-level synthesis run is represented in this hierarchy. You can select any of these modules to display its synthesis details on the right side of the Synthesis Report window.

Figure: HLS Report Window

The Synthesis Report is separated into several sections, namely:

  • General Information
  • Performance Estimates (timing and latency)
  • Utilization Estimates
  • Interface Information

If a module is part of a hierarchical block, its numbers sum up the information of the blocks contained in the hierarchy. The hierarchy can therefore also be navigated from within the report when it is clear which instance contributes what to the overall design.

CAUTION:
Regarding the absolute counts of cycles and latency, these numbers are based on estimates identified during synthesis. Especially with advanced transformations, such as pipelining and dataflow, these numbers might not accurately reflect the final results. If you encounter question marks in the report, this might be due to variable-bound loops; you are encouraged to set trip counts for such loops so that relative estimates can be presented in the report.

Profile Summary Report

The SDAccel environment runtime automatically collects profiling data on host applications. After the application finishes execution, the profile summary is saved in HTML, CSV, and Google Protocol Buffer formats in the solution report directory or working directory. These reports can be reviewed in a web browser, spreadsheet viewer, or the integrated Profile Summary view in the SDAccel environment. The profile reports are generated in both the SDAccel GUI and XOCC command line flows.

GUI Flow

When you compile and execute an application from the SDAccel environment, the profile summary is automatically generated.

To control the generation of profile information, edit the run configuration through the context menu of the build configuration, and select Run > Run Configurations.

After the configuration is run, the report is easily accessed in the Assistant window below the Run Configuration item. Once the run configuration has executed, it can be modified directly through the context menu of the run configuration item in the Assistant window.

Figure: Profile Summary Access in SDAccel GUI Flow

Double-click the report to open it.

Command Line

Command line users execute standalone applications outside the SDAccel environment. To generate the profile summary data, you can compile your design without any additional options. However, linking the bitstream file (xclbin) requires the --profile_kernel option.

The argument provided through the --profile_kernel option can be used to limit data collection, which might be required in large systems. The general syntax for the --profile_kernel option with respect to the profile summary report is:

--profile_kernel <[data]:<[kernel_name|all]:[compute_unit_name|all]:[interface_name|all]:[counters|all]>

The following three fields can be specified to determine which interface the performance monitor is applied to:

  • kernel_name
  • compute_unit_name
  • interface_name

However, you can also specify the keyword all to apply the monitoring to all existing kernels, compute units, and interfaces with a single option. The last field, counters|all, allows you to restrict information gathering to just counters for large designs, while all (the default) also includes the collection of actual trace information.

Note: The --profile_kernel option is additive and can be used multiple times on the link line.
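
For example, a hypothetical link command (the platform and file names are placeholders) that monitors all kernels, compute units, and interfaces while restricting collection to counters might look like this:

xocc -l -t hw --platform <platform_name> --profile_kernel data:all:all:all:counters kernel.xo -o kernel.xclbin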

If the profile=true option is specified in the sdaccel.ini file, a profile_summary.csv file is created when the program is executed:

[Debug]
profile = true

The .csv file needs to be manually converted to Google Protocol Buffer format (.xprf) before the profiling result can be viewed in the integrated Profile Summary view. The following command line example generates an .xprf file from the .csv input file:

sdx_analyze profile profile_summary.csv

Display the Profile Summary

Use the following methods to display the SDAccel environment Profile Summary view created from the command line.
  1. To display the report in your web browser of choice, do the following:
    1. Run the following command:
      sdx_analyze profile -i profile_summary.csv -f html

      This creates an HTML file representing the data that can be opened in the web browser of your choice. The file contains the same profiling result as presented in GUI Flow.

    2. Navigate to the file location, and double-click the generated HTML file.
  2. To display the report in the integratedSDAccelProfile Summary view, do the following:
    1. Use the following command to convert the .csv data file into the protobuf format.
      sdx_analyze profile -i profile_summary.csv -f protobuf
    2. Start the SDAccel tool GUI by running the sdx command:
      $sdx
    3. Choose the default workspace when prompted.
    4. Select File > Open File.
    5. Browse to and then open the .xprf file created by the sdx_analyze command run in the first step.

      The following figure shows the Profile Summary view that displays OpenCL API calls, kernel executions, data transfers, and profile rule checks (PRCs).

Data Interpretation

The profile summary includes a number of useful statistics for your OpenCL application. This can provide you with a general idea of the functional bottlenecks in your application. The profile summary consists of the following sections:

Top Operations
Top Data Transfer: Kernels and Global Memory
This table displays the profile data for top data transfers between FPGA and device memory.
Device
Name of device
Compute Unit
Name of compute unit
Number of Transfers
Sum of write and read AXI transactions monitored on device
Average Bytes per Transfer
(Total Read Bytes + Total Write Bytes) / (Total Read AXI Transactions + Total Write AXI Transactions)
Transfer Efficiency (%)
(Average Bytes per Transfer) / min(4K, (Memory Bit Width/8 * 256))

The AXI4 specification limits the maximum burst length to 256 beats and the maximum burst size to 4K bytes.

Total Data Transfer (MB)
(Total Read Bytes + Total Write Bytes) / 1.0e6
Total Write (MB)
(Total Write Bytes) / 1.0e6
Total Read (MB)
(Total Read Bytes) / 1.0e6
Transfer Rate (MB/s)
(Total Data Transfer) / (Compute Unit Total Time)
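
As a hypothetical worked example of the Transfer Efficiency formula above (the numbers are illustrative, not from this design), assume a 512-bit memory interface and an average of 1,024 bytes per transfer:

min(4K, (512/8) * 256) = min(4096, 16384) = 4096 bytes
Transfer Efficiency = 1024 / 4096 = 25%
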
Top Kernel Execution
Kernel Instance Address
Host address of kernel instance (in hex)
Kernel
Name of kernel
Context ID
Context ID on host
Command Queue ID
Command queue ID on host
Device
Name of device where kernel was executed (format: <device>-<ID>)
Start Time (ms)
Start time of execution (in ms)
Duration (ms)
Duration of execution (in ms)
Global Work Size
NDRange of kernel
Local Work Size
Work group size of kernel
Top Memory Writes: Host and Device Global Memory
Buffer Address
Host address of buffer (in hex)
Context ID
Context ID on host
Command Queue ID
Command queue ID on host
Start Time (ms)
Start time of write transfer (in ms)
Duration (ms)
Duration of write transfer (in ms)
Buffer Size (KB)
Size of write transfer (in KB)
Writing Rate (MB/s)
Writing Rate = (Buffer Size) / (Duration)
Top Memory Reads: Host and Device Global Memory
Buffer Address
Host address of buffer (in hex)
Context ID
Context ID on host
Command Queue ID
Command queue ID on host
Start Time (ms)
Start time of read transfer (in ms)
Duration (ms)
Duration of read transfer (in ms)
Buffer Size (KB)
Size of read transfer (in KB)
Reading Rate (MB/s)
Reading Rate = (Buffer Size) / (Duration)
Kernels & Compute Units
Kernel Execution (includes estimated device times)
This table displays the profile data summary for all kernel functions scheduled and executed.
Kernel
Name of kernel
Number of Enqueues
Number of times kernel is enqueued
Total Time (ms)
Sum of runtimes of all enqueues (measured from START to END in the OpenCL execution model)
Minimum Time (ms)
Minimum runtime of all enqueues
Average Time (ms)
(Total Time) / (Number of Enqueues)
Maximum Time (ms)
Maximum runtime of all enqueues
Compute Unit Utilization (includes estimated device times)
This table displays the summary profile data for all compute units on the FPGA.
Device
Name of device (format: <device>-<ID>)
Compute Unit
Name of Compute Unit
Kernel
Kernel this Compute Unit is associated with
Global Work Size
NDRange of kernel (format is x:y:z)
Local Work Size
Local work group size (format is x:y:z)
Number of Calls
Number of times the Compute Unit is called
Dataflow Execution
Indicates if top level dataflow execution is enabled
Maximum Overlapping Executions
The maximum number of executions actually operating in parallel at any point during execution
Dataflow Acceleration
Estimated improvement due to dataflow acceleration
Total Time (ms)
Sum of runtimes of all calls
Minimum Time (ms)
Minimum runtime of all calls
Average Time (ms)
(Total Time) / (Number of Work Groups)
Maximum Time (ms)
Maximum runtime of all calls
Clock Frequency (MHz)
Clock frequency used for a given accelerator (in MHz)
Data Transfers
Data Transfer: Host and Global Memory
This table displays the profile data for all read and write transfers between the host and device memory via the PCI Express® link.
Context: Number of Devices
Context ID and number of devices in context
Transfer Type
READ or WRITE
Number of Transfers
Number of host data transfers
Note: May contain printf transfers
Transfer Rate (MB/s)
(Total Bytes Sent) / (Total Time in usec)

where Total Time includes software overhead

Average Bandwidth Utilization (%)
(Transfer Rate) / (Max. Transfer Rate)

where Max. Transfer Rate = (256/8 bytes) * (300 MHz) = 9.6 GBps

Average Size (KB)
(Total KB sent) / (number of transfers)
Total Time (ms)
Sum of transfer times
Average Time (ms)
(Total Time) / (number of transfers)
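
As a hypothetical worked example of the Average Bandwidth Utilization formula above (illustrative numbers only): a host that transfers 96 MB in 20 ms achieves a Transfer Rate of 4,800 MB/s, giving 4800 / 9600 = 50% utilization.
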
Data Transfer: Kernels and Global Memory
This table displays the profile data for all read and write transfers between the FPGA and device memory.
Device
Name of device
Compute Unit/Port Name
<compute unit name>/<port name>
Kernel Arguments
List of arguments connected to this port
DDR Bank
DDR bank number this port is connected to
Transfer Type
READ or WRITE
Number of Transfers
Number of AXI transactions monitored on device
Note: Might contain printf transfers
Transfer Rate (MB/s)
(Total Bytes Sent) / (Compute Unit Total Time)
Compute Unit Total Time
Total execution time of compute unit
Total Bytes Sent
Sum of bytes across all transactions
Average Bandwidth Utilization (%)
(Transfer Rate) / (0.6 * Max. Transfer Rate)

where Max. Transfer Rate = (512/8 bytes) * (300 MHz) = 19200 MBps

Average Size (KB)
(Total KB sent) / (number of AXI transactions)
Average Latency (ns)
(Total latency of all transactions) / (number of AXI transactions)
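
As a hypothetical worked example for this table (illustrative numbers only): with Max. Transfer Rate = 19200 MBps, the denominator is 0.6 * 19200 = 11520 MBps, so a kernel port sustaining 5,760 MB/s reports an Average Bandwidth Utilization of 5760 / 11520 = 50%.
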
OpenCL API Calls
This table displays the profile data for all OpenCL host API function calls executed in the host application.
API Name
Name of API function (for example, clCreateProgramWithBinary, clEnqueueNDRangeKernel)
Number of Calls
Number of calls to this API
Total Time (ms)
Sum of runtimes of all calls
Minimum Time (ms)
Minimum runtime of all calls
Average Time (ms)
(Total Time) / (Number of Calls)
Maximum Time (ms)
Maximum runtime of all calls

Application Timeline

The Application Timeline view collects and displays host and device events on a common timeline to help you understand and visualize the overall health and performance of your systems. These events include:

  • OpenCL API calls from the host code.
  • Device trace data including AXI transaction start/stop, kernel start/stop, etc.

While useful for debugging and profiling the application, timeline and device trace data are not collected by default because the runtime needs to periodically unload the trace data from the FPGA, which can add additional time to the overall application execution. However, the device data is collected with dedicated hardware inside the FPGA, so the data collection does not affect kernel functionality on the FPGA. The following sections describe the setup required to enable timeline and device trace data collection.

Turning on device profiling is intrusive and can negatively affect overall performance. This feature should be used for system performance debugging only.

Note: Device profiling can be used in Emulation-HW without negative impact.

GUI Flow

Timeline and device trace data collection is part of the run configuration for an SDAccel™ project created in the integrated SDAccel environment. Follow the steps below to enable it:

  1. Instrumenting the code is required for System execution. This is done through the Hardware Function Settings dialog box. In the Assistant window, right-click the kernel under the System [Hardware] configuration, and select Settings.

    With respect to application timeline functionality, you can enable Data Transfer, Execute Profiling, and Stall Profiling. These options instrument all ports of each instance of any kernel. Because these options insert additional hardware, instrumenting all ports might be too much; finer control is available through command line options, as detailed in the Command Line section. These options are only valid for system runs. During hardware emulation, this data is generated by default.

    Data Transfer
    This option enables monitoring of data ports.
    Execute Profiling
    This option provides minimum port data collection during system run. This option records the execution times of the compute units. Execute profiling is enabled by default for data and stall profiling.
    Stall Profiling
    This option includes the stall monitoring logic in the bitstream.
  2. Specify what information is actually going to be reported during a run.
    Note: Only information actually exposed from the hardware during system execution is reported.

    To configure reporting, click the down arrow next to the Debug or Run button, and then select Run Configurations to open the Run Configurations window.



  3. In the Run Configurations window, click the Profile tab.

    Ensure the Enable profiling check box is selected. This enables basic profiling support. With respect to trace data, ensure that the Generate timeline trace report setting actually gathers the information in the build configuration you are running.

    The Default setting implies that no trace data capturing is supported in system execution, but that it is enabled by default in hardware emulation.

    You can also select the amount of information to gather during runtime. Select the granularity for trace data collection independently for Data Transfer Trace and Stall Trace.

    The Data Transfer Trace options are as follows:

    Coarse
    Show compute unit transfer activity from beginning of first transfer to end of last transfer (before compute unit transfer ends).
    Fine
    Show all AXI-level burst data transfers.
    None
    Turn off reading and reporting of device-level trace during runtime.

    The Stall Trace Options are as follows:

    None
    Turn off any stall trace information gathering.
    All
    Record all stall trace information.
    External Memory Stall
    Memory stalls to DDR (for example, AXI4 read from DDR).
    Internal Dataflow Stall
    Intra-kernel streams (for example, writing to a full FIFO between data flow blocks).
    Inter CU Pipe Stall
    Inter-kernel pipe (for example, writing to a full OpenCL™ pipe between kernels).

    If you have multiple run configurations for the same project, you must change the profile settings for each run configuration.

  4. After running the configuration, double-click Application Timeline in the Assistant window to open the Application Timeline view.

Command Line

Complete the following steps to enable timeline and device trace data collection in the command line flow:

  1. Instrument the FPGA bitstream with SDx Accel Monitors (SAM) and SDx Performance Monitors (SPMs). The instrumentation is performed through the --profile_kernel option, which has three distinct instrumentation options (data, stall, and exec).
    Note: The --profile_kernel option is ignored except for system compilation and linking. During hardware emulation, this data is generated by default.

    The --profile_kernel option has three fields that are required to determine the specific kernel interface to which the monitors are applied. However, if resource usage is not an issue, the keyword all enables you to apply the monitoring to all existing kernels, compute units, and interfaces with a single option. Otherwise, you can specify the kernel_name, compute_unit_name, and interface_name explicitly to limit instrumentation. The last field, counters|all, allows you to restrict information gathering to just counters for large designs, while all (the default) also includes the collection of actual trace information.

    Note: The --profile_kernel option is additive and can be used multiple times on the link line.
    • data: This option enables monitoring of data ports through SAM and SPM IPs. This option needs to be set only during linking.
      -l --profile_kernel <[data]:<[kernel_name|all]:[compute_unit_name|all]:[interface_name|all]:[counters|all]>
    • stall: This option needs to be applied during compilation:
      -c --profile_kernel <[stall]:<[kernel_name|all]:[compute_unit_name|all]:[counters|all]>
      and during linking:
      -l --profile_kernel <[stall]:<[kernel_name|all]:[compute_unit_name|all]:[counters|all]>
      This option includes the stall monitoring logic (using the SAM IP) in the bitstream. However, it requires that stall ports are present on the kernel interface. To facilitate this, the option is required during compilation of the C/C++/OpenCL API kernel modules.
    • exec: This option provides minimum port data collection during system run. It simply records the execution times of the kernel through the use of SAM IP. This feature is by default enabled on any port that uses the data or stall data collection. This option needs to be provided only during linking.
      -l --profile_kernel <[exec]:<[kernel_name|all]:[compute_unit_name|all]>:[counters|all]
  2. After the kernels are instrumented, data gathering must be enabled during runtime execution. Do this through the sdaccel.ini file that is in the same directory as the host executable. The following sdaccel.ini file will enable maximum information gathering during runtime:
    [Debug]
    profile=true
    timeline_trace=true
    data_transfer_trace=coarse
    stall_trace=all
    • profile=: When this option is specified as true, basic profile monitoring is enabled. Without any additional options, this implies that the host runtime logging profile summary is enabled. However, without this option enabled, no monitoring is performed at all.
    • timeline_trace=: This option enables timeline trace information gathering. Without profile IP added into the FPGA (data), it will only show host information. At a minimum, to get compute unit start and end execution times in the timeline trace, the compute unit needs to be linked with --profile_kernel exec.
    • data_transfer_trace=: This option enables device-level AXI data transfer tracing:
      • coarse: Show compute unit transfer activity from beginning of first transfer to end of last transfer (before compute unit transfer ends).
      • fine: Show all AXI-level burst data transfers.
      • off: Turn off reading and reporting of device-level trace during runtime.
    • stall_trace=: Specify what types of stalls to capture and report in the timeline trace. The default is off.
      • off: Turn off any stall trace information gathering.
        Note: Enabling stall tracing can often fill the trace buffer, which results in incomplete and potentially corrupt timeline traces. This can be avoided by setting stall_trace=off.
      • all: Record all stall trace information.
      • dataflow: Intra-kernel streams (for example, writing to a full FIFO between dataflow blocks).
      • memory: External memory stalls (for example, an AXI4 read from the DDR).
      • pipe: Inter-kernel pipe (for example, writing to a full OpenCL pipe between kernels).
  3. To allow user function analysis in the Application Timeline, it is necessary to run the waveform tool as part of the emulation flow. This requires launching the waveform tool through the sdaccel.ini file:
    [Emulation]
    launch_waveform=batch
    • launch_waveform=: This option automatically starts the waveform tool during emulation. See Waveform View for more details.
      • gui: Start the graphical user interface for the live waveform view.
      • batch: Start the waveform processing as a background process.
  4. In command line mode, CSV files are generated to capture the trace data. These CSV reports need to be converted to the Application Timeline format using the sdx_analyze utility before they can be opened and displayed in the SDAccel environment GUI.
    sdx_analyze trace timeline_trace.csv -k timeline_kernels.csv -f wdb

    This creates the timeline_trace.wdb file by default, which can be opened from the GUI. The timeline_kernels.csv file contains specific kernel trace data, which might not always be available. In that case, the option -k timeline_kernels.csv should be omitted from the sdx_analyze command.

  5. To view the host and device waveforms in the timeline report, do the following:
    1. Start the SDx environment by running the command:
      $sdx
    2. Choose a workspace when prompted.
    3. Select File > Open File, browse to the .wdb file generated during hardware emulation or system run, and open it.
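
Putting these steps together, a hypothetical end-to-end session (the platform, kernel, and host executable names are placeholders, and the --profile_kernel arguments follow the syntax shown in step 1) might look like this:

# Compile with stall instrumentation, then link with data, stall, and exec monitors
xocc -c -t hw --platform <platform_name> --profile_kernel stall:all:all:all kernel.cl -o kernel.xo
xocc -l -t hw --platform <platform_name> --profile_kernel data:all:all:all:all --profile_kernel stall:all:all:all --profile_kernel exec:all:all:all kernel.xo -o kernel.xclbin

# Run the host application with the sdaccel.ini from step 2 in place, then convert the trace
./host.exe kernel.xclbin
sdx_analyze trace timeline_trace.csv -f wdb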

Data Interpretation

The following figure shows the Application Timeline view that displays host and device events on a common timeline. This information helps you to understand details of application execution and identify potential areas for improvements.

Figure: Application Timeline View

The Application Timeline view trace includes two main sections:

Host
Shows the trace of all the activity originating from the host side.
Device
Shows the activity of the compute-units on the FPGA.

Under the host, different activities are categorized as OpenCL™ API calls, Data Transfer, and Kernels.

The complete tree has the following structure:

Host
OpenCL API Calls
All OpenCL API calls are traced here. The activity time is measured from the host perspective.
General
All general OpenCL API calls, such as clCreateProgramWithBinary(), clCreateContext(), and clCreateCommandQueue(), are traced here.
Queue
OpenCL API calls that are associated with a specific command queue are traced here. This includes commands such as clEnqueueMigrateMemObjects() and clEnqueueNDRangeKernel(). If the user application creates multiple command queues, this section shows each queue and its activities.
Data Transfer
In this section, the DMA transfers are traced. Data transfers from the host to the device appear under Write, and transfers from device to host appear under Read. The additional section Copy traces direct communication between kernels.
Kernel Enqueues
The active kernel executions are shown here. The kernels here should not be confused with the kernels/compute units on the device. In this instance, kernels refers to the NDRangeKernels and tasks created by the clEnqueueNDRangeKernel() and clEnqueueTask() APIs, and these are plotted against the time measured from the host's perspective. Multiple kernels can be scheduled to execute at the same time, and they are traced from the point they are scheduled to run until the end of kernel execution. This is the reason for multiple entries. The number of rows depends on the number of overlapping kernel executions.
Note: Overlapping kernels should not be mistaken for actual parallel execution on the device, as a scheduled kernel might not be ready to execute right away.
Device "name"
Binary Container "name"
Accelerator "name"
This is the name of the compute unit (also known as accelerator) on the FPGA.
User Functions
In the case of Vivado HLS tool kernels, functions that are implemented as dataflow processes are traced here. The traces for these functions show the number of active instances that are currently executing in parallel. These names are generated in hardware emulation when the waveform is enabled.
Note: Function-level activity is only possible in hardware emulation.
  • Function: "name a"
  • Function: "name b"
Read
A compute unit reads from the DDR over AXI-MM ports. The trace of data read by a compute unit is shown here. The activity is shown as transactions, and the tooltip for each transaction shows more details of the AXI transaction. These names are generated when --profile_kernel data is used.
  • m_axi_(port)
Write
A compute unit writes to the DDR over AXI-MM ports. The trace of data written by a compute unit is shown here. The activity is shown as transactions, and the tooltip for each transaction shows more details of the AXI transaction. These names are generated when --profile_kernel data is used.
  • m_axi_(port)

Waveform View

The SDx development environment can generate a waveform view and launch a live waveform viewer when running hardware emulation. It displays in-depth details of the emulation results at the system level, compute unit level, and function level. The details include data transfers between the kernel and global memory as well as data flow through inter-kernel and intra-kernel pipes. They provide many insights into performance bottlenecks from the system level down to individual function calls to help developers optimize their applications.

By default, the waveform view and live waveform viewer are not enabled. This is because the views require that the runtime generates a simulation waveform during hardware emulation, which consumes more time and disk space. The following sections describe the setup required to enable data collection.

Note: The waveform view allows you to look directly at the device transactions from within the SDx development environment. In contrast, the live waveform capability actually spawns the simulation waveform viewer, which visualizes the hardware transactions in addition to any user-selected internal signals.

GUI Flow

Follow the steps below to enable waveform data collection and to open the viewer:

  1. Open the Application Project Settings window, and select the Kernel debug check box.

  2. Click the down arrow next to the Run button, and select Run Configurations to open the Run Configurations window.

  3. On the Run Configurations window, click Main, and select the Use waveform for kernel debugging check box.

    Optionally:

    • To bring up the Simulation window and view the live waveform while the hardware emulation is running, select Launch live waveform.
    • To enable basic profiling, select Enable profiling.
  4. If you have multiple run configurations for the same project, change the profile settings for each run configuration.
  5. If you have not selected the Live Waveform viewer to be launched automatically, open the Waveform view from the SDx development environment.

    In the SDx development environment, double-click Waveform in the Assistant window to open the Waveform view window.



Command Line

Use the following instructions to enable waveform data collection from the Command Line during hardware emulation and open the viewer:

  1. Turn on debug code generation during kernel compilation.
    xocc -g -t hw_emu ...
  2. Create an sdaccel.ini file in the same directory as the host executable with the contents below:
    [Debug]
    profile=true
    timeline_trace=true

    This enables maximum observability. The options in detail are:

    profile=
    Setting this option to true enables profile monitoring. Without any additional options, this implies that the host runtime logging profile summary is enabled. However, without this option enabled, no monitoring is performed at all.
    timeline_trace=
    This option enables timeline trace information gathering of the data.
  3. To see the live waveform and additional simulation waveforms, add the following to the emulation section in the sdaccel.ini file:
    [Emulation]
    launch_waveform=gui
    launch_waveform=
    The gui option enables the Live Waveform viewer, while the batch option records the waveform activity for post-processing.

    A Live Waveform viewer is spawned during the execution of the hardware emulation, which allows you to examine the waveforms in detail.

  4. Execute hardware emulation. The hardware transaction data is collected in the file <hardware_platform>-<device_id>-<xclbin_name>.wdb.
  5. If no Live Waveform viewer was requested, follow the steps below to open the Waveform view:
    1. Start the SDx IDE by running the following command: $ sdx
    2. Choose a workspace when prompted.
    3. Select File > Open File, and browse to the .wdb file generated during hardware emulation.

    Alternatively, xsim can be used to open the .wdb file using the following command: xsim --gui <filename>.wdb. For more details about xsim, refer to the Vivado Design Suite User Guide: Logic Simulation (UG900).

Data Interpretation Waveform View

The following image shows the Waveform view:

Figure: Waveform View

The waveform view is organized hierarchically for easy navigation.

Note: This view is based on the actual waveforms generated during hardware emulation (Kernel Trace). This allows the view to descend all the way down to the individual signals responsible for the abstracted data. However, because it post-processes the data, no additional signals can be added, and some of the runtime analysis, such as DATAFLOW transactions, cannot be visualized.

The hierarchy tree and descriptions are:

HLS Process Summary
This summary section contains a hierarchical representation of the activity report of each sequential process contained within the generated RTL. Visualizing the active processes within the HLS design allows you to profile in detail which process is active, and for how long, within each activation of the top module. This view therefore enables analysis of individual process performance as well as the overall concurrent execution of independent processes. According to Amdahl's Law, processes dominating the overall execution have the highest potential to improve performance, if their execution time can be reduced.
Device “name”
Target device name.
Binary Container “name”
Binary container name.
Kernel “name” 1:1:1
For each kernel and for each compute unit of that kernel, this section breaks down the activities originating from the compute unit.
Compute Unit: “name”
Compute unit name.
CU Stalls (%)
Stall signals are provided by the HLS tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (that is, dataflow), or external streams (that is, OpenCL pipes). The stall bus, shown in the detailed kernel trace, combines all of the lowest-level stall signals and reports the percentage that are stalling at any point in time. This provides a measure of how much of the kernel is stalling at any point in the simulation.

For example, if there are 100 lowest level stall signals, and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it would be 9%.

Data Transfers
This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR.
User Functions
This information is available for the HLS tool kernels and shows the user functions.
Function: "name"
Dataflow/Pipeline Activity
This shows the number of parallel executions of the function if the function is implemented as a dataflow process.
Active Iterations
This shows the currently active iterations of the dataflow. The number of rows is dynamically incremented to accommodate the visualization of any concurrent execution.
StallNoContinue
This is a stall signal that tells if there were any output stalls experienced by the dataflow processes (function is done, but it has not received a continue from the adjacent dataflow process).
RTL Signals
These are the underlying RTL control signals that were used to interpret the above transaction view of the dataflow process.
Function Stalls
Shows the different types of stalls experienced by the process.
External Memory
Stalls experienced while accessing the DDR memory.
External Stream
Stalls triggered by streaming.
Function I/O
Actual interface signals grouped according to their associated block interfaces.
Function: "name"
Function name.

Data Interpretation Live Waveform

The following figure shows the live waveform viewer while running hardware emulation.

Figure: Live Waveform Viewer

The live waveform viewer is organized hierarchically for easy navigation. Below are the hierarchy tree and descriptions.

Note: As the live waveform viewer is presented only as part of the actual hardware simulation run (xsim), you can annotate extra signals and internals of the register transfer level (RTL) design in the same view. Also, all grouped and combined groups can be fully expanded to the actual contributing signals. For more information on working with xsim, refer to the Vivado Design Suite User Guide: Logic Simulation (UG900).
HLS Process Summary
This summary section contains a hierarchical representation of the activity report of each sequential process contained within the generated RTL. Visualizing the active processes within the HLS design allows you to profile in detail which process is active, and for how long, within each activation of the top module. This view therefore enables analysis of individual process performance as well as the overall concurrent execution of independent processes. According to Amdahl's Law, processes dominating the overall execution have the highest potential to improve performance if their execution time can be reduced.
Device "name"
Target device name.
Binary Container "name"
Binary container name.
Kernel "name" 1:1:1
For each kernel and for each compute unit of that kernel, this section breaks down the activities originating from the compute unit.
Compute Unit: "name"
Compute unit name.
CU Stalls (%)
Stall signals are provided by the Vivado HLS tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (that is, dataflow), or external streams (that is, OpenCL™ pipes). The stall bus, shown in the detailed kernel trace, combines all of the lowest-level stall signals and reports the percentage that are stalling at any point in time. This provides a measure of how much of the kernel is stalling at any point in the simulation.

For example, if there are 100 lowest-level stall signals, and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it would be 9%.

Data Transfers
This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR.
User Functions
This information is available for the HLS kernels and shows the user functions.
Function: "name"
Dataflow/Pipeline Activity
This shows the number of parallel executions of the function if the function is implemented as a dataflow process.
Active Iterations
This shows the currently active iterations of the dataflow. The number of rows is dynamically incremented to accommodate the visualization of any concurrent execution.
StallNoContinue
This is a stall signal that tells if there were any output stalls experienced by the dataflow processes (function is done, but it has not received a continue from the adjacent dataflow process).
RTL Signals
These are the underlying RTL control signals that were used to interpret the above transaction view of the dataflow process.
Function Stalls
Shows the different types of stalls experienced by the process.
External Memory
Stalls experienced while accessing the DDR memory.
External Stream
Stalls triggered by streaming.
Function I/O
Actual interface signals grouped according to their associated block interfaces.
Function: "name"
Function name.

Guidance View

The Guidance view is designed to provide you with feedback throughout the development process. It presents in a single location all issues encountered from building the actual design all the way through runtime analysis.

It is crucial to understand that the Guidance view is intended to help you identify potential issues in the design. These issues might be source code related or caused by missed tool optimizations. Also, the rules are generic, based on experience with a vast set of reference designs; they might not all be applicable to a specific design. Therefore, it is up to you to understand the specific guidance rules and take appropriate action based on your specific algorithm and requirements.

GUI Flow

The Guidance view is automatically populated and displayed in the lower central tab view. After running hardware emulation, the Guidance view might look like the following:

Figure: Guidance View

Note: You can produce the Guidance view through the Vivado HLS tool post-compilation, but you will not get Profile Rule Checks.

To simplify visualizing the guidance information, the GUI flow allows you to search and filter the Guidance view to locate specific guidance rule entries. It is also possible to collapse or expand the tree view, or even to suppress the hierarchical tree representation and visualize a condensed representation of the guidance rules. Finally, you can select what is shown in the Guidance view. You can enable or disable the visualization of warnings as well as met rules, and restrict the content based on the source of the messages, such as build and emulation.

By default, the Guidance view shows all guidance information for the project selected in the drop-down.

To restrict the content to an individual build or run step, do the following:

  1. Select Window > Preferences.
  2. Select the category Xilinx Sdx > Guidance.
  3. Deselect Group guidance rule checks by project.

Command Line

The Guidance data is best analyzed through the GUI, which consolidates all guidance information for the flow. Nevertheless, the tool automatically generates HTML files containing the guidance information. As guidance information is generated throughout the tool flow, several guidance files are generated. The simplest way to locate the guidance reports is to search for the guidance.html files:

find . -name "*guidance.html" -print

This command lists all guidance files generated, which can be opened with any web browser.

Data Interpretation

The Guidance view places each entry in a separate row. Each row might contain the name of the guidance rule, threshold value, actual value, and a brief but specific description of the rule. The last field provides a link to reference material intended to assist in understanding and resolving any of the rule violations.

In the GUI Guidance view, guidance rules are grouped by categories and unique IDs in the Name column and annotated with symbols representing the severity. These are listed individually in the HTML report. In addition, as the HTML report does not show tooltips, a full Name column is included in the HTML report as well.

The following list describes all fields and their purpose as included in the HTML guidance reports.

Id
Each guidance rule is assigned a unique id. Use this id to uniquely identify a specific message from the guidance report.
Name
The Name column displays a mnemonic name uniquely identifying the guidance rule. These names are designed to assist in memorizing specific guidance rules in the view.
Severity
The Severity column allows the easy identification of the importance of a guidance rule.
Full Name
The Full Name provides a less cryptic name compared to the mnemonic name in the Name column.
Categories
Most messages are grouped within different categories. This allows the GUI to display groups of messages within logical categories under common tree nodes in the Guidance view.
Threshold
The Threshold column displays an expected threshold value, which determines whether or not a rule is met. The threshold values are determined from many applications that follow good design and coding practices.
Actual
The Actual column displays the values actually encountered on the specific design. This value is compared against the expected value to see if the rule is met.
Details
The Details column provides a brief but specific message describing the specifics of the current rule.
Resolution
The Resolution column provides a pointer to common ways the model source code or tool transformations can be modified to meet the current rule. Clicking the link brings up a pop-up window or the documentation with tips and code snippets that you can apply to the specific issue.

Using Implementation Tools

Exploring Kernel Optimizations Using Vivado HLS

All kernel optimizations using OpenCL or C/C++ can be performed from within the SDAccel environment. The primary performance optimizations, such as those discussed in this chapter (pipelining functions and loops, applying dataflow to enable greater concurrency between functions and loops, unrolling loops, and so on), are performed by the Xilinx® FPGA design tool, the Vivado® HLS tool.

The SDAccel environment automatically calls the HLS tool. However, to use the GUI analysis capabilities, you must launch the HLS tool directly from within the SDAccel environment. Using the HLS tool in standalone mode enables the following enhancements to the optimization methodology:

  • The ability to focus solely on kernel optimization, with no requirement to execute emulation.
  • The ability to create multiple solutions, compare their results, and explore the solution space to find the optimal design.
  • The ability to use the interactive Analysis Perspective to analyze the design performance.
IMPORTANT: Only the kernel source code is incorporated back into the SDAccel environment. After exploring the optimization space, ensure that all optimizations are applied to the kernel source code as OpenCL attributes or C/C++ pragmas.

To open the HLS tool in standalone mode, from the Assistant window, right-click the hardware function object, and select Open HLS Project, as shown in the following figure.

Figure: Open HLS Project

Controlling FPGA Implementation with the Vivado Design Suite

The SDAccel development environment provides a smooth flow from an OpenCL/C/C++ model all the way to an FPGA-accelerated implementation. In most cases, this flow completely abstracts away the underlying fact that the programmable region in the FPGA is configured to implement the kernel functionality. This fully isolates the developer from typical hardware constraints such as routing delays and kernel placement. However, in some cases these concerns must be addressed, especially when implementing large designs. To this end, the development environment allows you to fully control the Vivado Design Suite back-end tool.

The SDAccel environment calls the Vivado Design Suite to automatically run RTL synthesis and implementation. You also have the option of launching the design suite directly from within the SDAccel environment. When invoking the Vivado Integrated Design Environment (IDE) in standalone mode from the SDAccel environment, you can open the Vivado synthesis project or the Vivado implementation project to edit, manage, and control the project.

The Vivado project can be opened in the SDAccel environment after the build targeting the system configuration has completed.

To open the Vivado IDE in standalone mode, from the Xilinx drop-down menu, select Vivado Integration and Open Vivado Project. Choose between the Vivado synthesis and implementation projects, and click OK.

Using the Vivado IDE in standalone mode enables the exploration of various synthesis and implementation options for further optimizing the kernel for performance and area. Familiarity with the design suite is recommended to make the most use of these features.

IMPORTANT: The optimization switches applied in the standalone project are not automatically incorporated back into the SDAccel environment. After exploring the optimization space, ensure that all optimization parameters are passed to the SDAccel environment using the --xp option for xocc. For example:
--xp "vivado_prop:run.impl_1.{STEPS.PLACE_DESIGN.ARGS.TCL.POST}={}"

This optimization flow is supported in the command line flow by calling xocc --interactive to bring up the Vivado IDE on the current project. In the IDE, generate a DCP, which can be saved and reused during linking with xocc. The specific options are:

  • --interactive allows the Vivado IDE to be launched from within the xocc environment, with the correct project loaded.
  • --reuse_impl allows a pre-implemented and timing-closed Vivado tool design checkpoint (.dcp) file to be brought in and used directly in the SDx environment flow for xclbin generation.
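
A hypothetical sequence combining these options (the platform, kernel, and checkpoint file names are placeholders; check the exact arguments against your tool version) might be:

xocc -l -t hw --platform <platform_name> --interactive impl kernel.xo -o kernel.xclbin
xocc -l -t hw --platform <platform_name> --reuse_impl <checkpoint>.dcp kernel.xo -o kernel.xclbin

The first command opens the Vivado IDE on the linked design so a routed checkpoint (.dcp) can be generated and saved; the second relinks using that checkpoint to produce the xclbin without rerunning implementation.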