Profiling and Optimization

The SDAccel™ environment generates various system and kernel resource performance reports during compilation. It also collects profiling data during application execution in both emulation and system mode configurations. Examples of the reported data include:

  • Host and device timeline events
  • OpenCL™ API call sequence
  • Kernel execution sequence
  • FPGA trace data including AXI transactions
  • Kernel start and stop signals

Together the reports and profiling data can be used to isolate performance bottlenecks in the application and optimize the design to improve performance.

Optimizing an application requires optimizing both the application host code and any hardware accelerated kernels. The host code must be optimized to facilitate data transfers and kernel execution, while the kernel should be optimized for performance and resource usage.

There are four distinct areas to consider when optimizing an algorithm in SDAccel: system resource usage and performance, kernel optimization, host optimization, and PCIe® bandwidth optimization. The following SDAccel reports and graphical tools support your efforts to profile and optimize these areas:

  • System Estimate
  • Design Guidance
  • HLS Report
  • Profile Summary
  • Application Timeline
  • Waveform View and Live Waveform Viewer

Reports are automatically generated after running the active build via the SDAccel GUI or xocc Makefile flows.

Separate sets of reports are generated for all three build configurations and can be found in the respective report directories.

IMPORTANT: The high-level synthesis (HLS) report and HLS guidance are generated only for hardware emulation and system build configurations, and only for C and OpenCL kernels, not for RTL kernels.

The Profile Summary and Application Timeline reports are generated for all three build configurations and are located under the default application sub-directory.

Reports can be viewed in a web browser, a spreadsheet viewer, or the SDAccel GUI. To access these reports from the SDx™ integrated design environment, make sure the Assistant view is visible and double-click the desired report.

The following sections briefly describe the various reports and graphical visualization tools, and how they can be used to profile and optimize your design. For complete details on each report, along with optimization steps and coding guidelines, see the SDAccel Environment Profiling and Optimization Guide.

Design Guidance

The SDAccel environment has a comprehensive design guidance tool that provides immediate, actionable guidance to software application developers for issues detected in their designs. Guidance is generated from HLS, the SDx Profiler, and the Vivado® Design Suite when invoked from xocc. The generated design guidance can have several severity levels: errors, advisories, warnings, and critical warnings are provided during software emulation, hardware emulation, and system builds.

The guidance includes hyperlinks, examples, and links to documentation. This improves productivity for current users by quickly highlighting issues, and helps new users become proficient with the SDAccel tool more quickly.

Design guidance is automatically generated after building or running a design in the SDx GUI, with results shown in the Guidance view located in the console area of the SDx GUI. Hovering over the guidance highlights solutions and suggestions.

The following image shows an example of guidance given by the SDx GUI. It details ways to increase the bandwidth use of the kernels. Clicking a link displays an expanded view of the actionable guidance. In this case, it displays guidance for maximizing use of global memory bandwidth.

Figure: Design Guidance Example

TIP: In the Assistant, you can right-click a build configuration and select Show Guidance.

There is one HTML guidance report for each command line run of xocc, including compile and link. The report files are generated in the --report_dir location under the specific .xo name.

The names of the report files are given below and include the .xo name:

  • xocc_compile__guidance.html for xocc compilation
  • xocc_link_t_guidance.html for xocc linking

The profile design guidance helps you interpret the profiling results and know exactly where to focus to improve performance. Specific details of the reports and additional design guidance details can be found in the SDAccel Environment Profiling and Optimization Guide.

System Estimate Report

The System Estimate report, generated by SDAccel HLS, provides estimates of FPGA resource usage and the frequency at which the hardware accelerated kernels can operate. It is automatically generated for Emulation-HW and System builds, and can be found under the respective directory of the Assistant view shown below.

TIP: Generating the System Estimate report takes much less time in a Hardware Emulation build than in a System build, which reports actual rather than estimated resources. Xilinx® recommends iterating and optimizing in Hardware Emulation before performing a System build.

Figure: System Estimate Assistant View

The report contains high-level details of the user kernels including resource usage and estimated frequency. The results can be used to guide the design optimization. For instance, if the target frequency is not met, it might be necessary to revisit the source code.

An example report is shown in the following graphic. It shows the following for the krnl_vadd kernel:

  • It is estimated to operate at a frequency of 411 MHz, which exceeds the 300 MHz target frequency.
  • In the best case, it has a latency of one cycle.
  • Its estimated FPGA resource usage is 2353 FFs, 3948 LUTs, no DSPs, and three BRAMs.

Figure: System Estimate
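
As a rough illustration of how the frequency figures in the example above can be read, the following Python sketch converts the estimated and target frequencies into clock periods and reports the timing margin. The function name `timing_margin` is hypothetical and is not part of any SDAccel tool; the 411 MHz and 300 MHz values come from the example report.

```python
def timing_margin(estimated_mhz, target_mhz):
    """Return (meets_target, slack_ns) for an estimated kernel frequency.

    Slack is the target clock period minus the estimated achievable period;
    a negative value means the target frequency is not met.
    """
    target_period_ns = 1000.0 / target_mhz
    estimated_period_ns = 1000.0 / estimated_mhz
    slack_ns = target_period_ns - estimated_period_ns
    return estimated_mhz >= target_mhz, slack_ns


# Values taken from the krnl_vadd example report.
meets, slack = timing_margin(estimated_mhz=411, target_mhz=300)
print(f"meets target: {meets}, slack: {slack:.3f} ns")
```

For the example values, the estimate meets the target with roughly 0.9 ns of margin per clock cycle. A negative slack would indicate that the kernel source might need to be revisited.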

When using the command line flow, you can generate the system estimate report with the following option:
xocc .. --report estimate

For additional details on the System Estimate report, see the SDAccel Environment Profiling and Optimization Guide.

HLS Report

The HLS Report provides details about the high-level synthesis (HLS) process of a user kernel and is generated in Hardware Emulation and System builds. This process translates the C/C++ or OpenCL kernel into a hardware description language that implements the functionality on the FPGA. The report provides estimated FPGA resource usage, operating frequency, latency, and interface signals of the custom-generated hardware logic. These details give the programmer insights to guide kernel optimization.

To open the HLS Report, double-click the report in the Assistant. An example of the HLS report follows.

Figure: HLS Report

When running from the command line, this report can be found in the following directory:

_x/..///solution/syn/report

For additional details on the HLS Report, see the SDAccel Environment Profiling and Optimization Guide.

Profile Summary Report

The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the program is gathered by SDAccel and grouped into categories. The Profile Summary enables the programmer to drill down to the actual data transfer and kernel execution numbers and statistics.

TIP: The Profile Summary report is automatically generated for all build configurations. However, with the Emulation-SW build, the report does not include any data transfer details under kernel execution efficiency and data transfer efficiency. This information is only generated in Emulation-HW or System build configurations.

To open the Profile Summary report in the SDx IDE, double-click the Profile Summary report under the Assistant as shown in the following image.

Figure: Opening Profile Summary Report

An example of the Profile Summary report is shown here.

Figure: Profile Summary

The report has multiple tabs that can be selected. A description of each tab is given in the following table.

Table 1. Profile Summary
Tab | Description
Top Operations | Kernels and Global Memory. This tab shows a summary of top operations. It displays the profile data for top data transfers between FPGA and device memory.
Kernels & Compute Units | Displays the profile data for all kernels and compute units.
Data Transfers | Host and Global Memory. This tab displays the profile data for all read and write transfers between the host and device memory through the PCIe link. It also displays data transfers between kernels and global memory, if enabled.
OpenCL APIs | Displays the profile data for all OpenCL C host API function calls executed in the host application.

For command line users, the profile summary data is generated by using the --profile_kernel option during the linking stage. The --profile_kernel syntax is given below:

--profile_kernel <[data]:<[kernel_name|all]:[compute_unit_name|all]:[interface_name|all]:[counters|all]>

See the SDAccel Environment Profiling and Optimization Guide for complete details.
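
As a minimal sketch of how the colon-separated fields in the syntax above fit together, the following Python helper assembles a --profile_kernel argument string. The helper name `profile_kernel_option` is hypothetical, and the sketch assumes that every field may be set to `all` as the syntax suggests. It only formats text and does not check names against a real design, so consult the guide for the field combinations the tool accepts.

```python
def profile_kernel_option(kernel="all", compute_unit="all",
                          interface="all", counters="all"):
    """Assemble a --profile_kernel argument from its colon-separated fields.

    Hypothetical helper: it follows the field order shown in the syntax
    (data, kernel name, compute unit name, interface name, counters) and
    does not validate the names against a real design.
    """
    fields = ["data", kernel, compute_unit, interface, counters]
    for field in fields:
        # An empty field or an embedded colon would shift the field positions.
        if not field or ":" in field:
            raise ValueError(f"invalid field: {field!r}")
    return "--profile_kernel " + ":".join(fields)


# Monitor all compute units and interfaces of the krnl_vadd kernel.
print(profile_kernel_option(kernel="krnl_vadd"))
```

The resulting string would be passed to xocc at the linking stage, alongside the other link options of the build.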

Application Timeline

The Application Timeline collects and displays host and device events on a common timeline to help you understand and visualize the overall health and performance of your systems. These events include:

  • OpenCL API calls from the host code.
  • Device trace data, including compute units and AXI transaction start/stop.
  • Host events and kernel start/stops.

This graphical representation enables the programmer to identify issues regarding kernel synchronization and efficient concurrent execution.

TIP: By default, timeline and device trace data are only collected during hardware emulation and not during System builds. Turning on device profiling for a System build is intrusive and can negatively affect overall performance. This feature should be used for system performance debugging only. To collect data during system testing, update the run configuration settings. Details can be found in the SDAccel Environment Profiling and Optimization Guide.

Double-click Application Timeline in the Reports window to open the Application Timeline window.

The following is a snapshot of the Application Timeline window which displays host and device events on a common timeline. Host activity is displayed at the top of the image and kernel activity is shown on the bottom of the image. Host activities include creating the program, running the kernel and data transfers between global memory and the host. The kernel activities include read/write accesses and transfers between global memory and the kernel(s). This information helps you understand details of application execution and identify potential areas for improvements.

Figure: Application Timeline

Timeline data can be enabled and collected through the command line flow; however, viewing must be done through the GUI. Complete instructions for enabling timeline data collection and displaying the data through both the command line and GUI flows are given in the SDAccel Environment Profiling and Optimization Guide.

Waveform View and Live Waveform Viewer

The SDx Development Environment can generate a Waveform View when running hardware emulation. It displays in-depth details of the emulation results at the system level, the compute unit (CU) level, and the function level. The details include data transfers between the kernel and global memory, and data flow through inter-kernel pipes. These details provide insight into performance bottlenecks, from the system level down to the individual function call, to help developers optimize their applications.

The Live Waveform Viewer is similar to the Waveform View; however, it provides even lower-level details. It can also be opened using xsim, a Xilinx tool used by hardware designers.

Waveform View and Live Waveform Viewer data are not collected by default because collection requires the runtime to generate simulation waveforms during hardware emulation, which consumes more time and disk space. The SDAccel Environment Profiling and Optimization Guide describes the setup required to enable data collection for the Waveform View and Live Waveform Viewer for both the GUI and command line.

Double-click Waveform in the Assistant view (shown in the following image) to open the Waveform View window.

Figure: Opening Waveform View

An example of the Waveform View is shown here.

Figure: Waveform View Example

The Live Waveform Viewer opens if you select Launch Live Waveform in the Run Configuration Main tab. If Launch Live Waveform is not selected, you can open the waveform (.wdb) file with xsim from the Linux command line. The .wdb file is located in the sub-directory, Emulation-HW/-Default, within the project directory. Use the following Linux command to open xsim:

xsim -gui  &

An example of the xsim Live Waveform Viewer is shown in the following image.

Figure: Live Waveform Viewer Example

Kernel SLR and DDR Memory Assignments

Floorplanning of kernel compute unit (CU) instances and DDR memory resources is key to meeting the quality of results of your design in terms of frequency and resources. Floorplanning involves explicitly allocating CUs (kernel instances) to SLRs and mapping CUs to DDR memory resources. When floorplanning, both CU resource usage and DDR memory bandwidth requirements need to be considered.

The largest Xilinx FPGAs are made up of multiple stacked silicon dies. Each die is referred to as a super logic region (SLR) and has a fixed amount of resources and memory, including DDR interfaces. The device SLR resources available for custom logic can be found in the SDx Environments Release Notes, Installation, and Licensing Guide, or can be displayed using the platforminfo utility described in Platforminfo Utility.

You can use the actual kernel resource utilization values to help distribute CUs across SLRs to reduce congestion in any one SLR. The system estimate report lists the number of resources (LUTs, flip-flops, BRAMs, etc.) used by the kernels early in the design cycle. The report can be generated during hardware emulation and system compilation through the command line or GUI and is described in System Estimate Report.

Use this information along with the available SLR resources to help assign CUs to SLRs such that no one SLR is over-utilized. The less congestion in an SLR, the better the tools can map the design to the FPGA resources and meet your performance target. For mapping memory resources and CUs, see Mapping Kernel Interfaces to Memory Resources and Allocating Compute Units to SLRs, respectively.
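
The balancing idea described above can be sketched in Python as a simple greedy assignment: place the largest CUs first, each into the SLR that would remain least utilized. This is only an illustration under stated assumptions and is not how the SDAccel tools place logic. The function name, the CU names other than krnl_vadd, and all capacity and usage figures other than the krnl_vadd estimates are hypothetical; real values would come from the System Estimate report and the platforminfo utility.

```python
def assign_cus_to_slrs(cu_usage, slr_capacity):
    """Greedily place each CU in the SLR that stays least utilized.

    cu_usage:     {cu_name: {resource: amount}} from the System Estimate report
    slr_capacity: {slr_name: {resource: amount available for custom logic}}
    Returns ({cu_name: slr_name}, {slr_name: peak utilization fraction}).
    """
    used = {slr: {res: 0 for res in cap} for slr, cap in slr_capacity.items()}

    def peak_utilization(slr, extra):
        # Worst-case fraction across resource types if `extra` were added.
        cap = slr_capacity[slr]
        return max((used[slr][res] + extra.get(res, 0)) / cap[res] for res in cap)

    placement = {}
    # Place the CUs with the largest LUT usage first so small CUs fill the gaps.
    for cu in sorted(cu_usage, key=lambda c: -cu_usage[c].get("LUT", 0)):
        best = min(slr_capacity, key=lambda s: peak_utilization(s, cu_usage[cu]))
        if peak_utilization(best, cu_usage[cu]) > 1.0:
            raise ValueError(f"{cu} does not fit in any SLR")
        placement[cu] = best
        for res in slr_capacity[best]:
            used[best][res] += cu_usage[cu].get(res, 0)

    utilization = {slr: peak_utilization(slr, {}) for slr in slr_capacity}
    return placement, utilization


# Hypothetical inputs; only the krnl_vadd figures come from the example report.
cu_usage = {
    "krnl_vadd_1": {"LUT": 3948, "FF": 2353, "BRAM": 3},
    "krnl_a_1": {"LUT": 120000, "FF": 150000, "BRAM": 200},
    "krnl_b_1": {"LUT": 90000, "FF": 100000, "BRAM": 300},
}
slr_capacity = {
    "SLR0": {"LUT": 300000, "FF": 600000, "BRAM": 500},
    "SLR1": {"LUT": 300000, "FF": 600000, "BRAM": 500},
}
placement, utilization = assign_cus_to_slrs(cu_usage, slr_capacity)
print(placement)
print(utilization)
```

With these made-up numbers, the two large CUs end up in different SLRs because placing both in one SLR would exhaust its BRAMs. The sketch ignores DDR bandwidth and memory bank location, which also need to be weighed when choosing an SLR.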

Note: While compute units can be connected to any available DDR memory resource, it is also necessary to account for the bandwidth requirements of the kernels when assigning them to SLRs. The SDAccel Environment Profiling and Optimization Guide provides details on allocating and optimizing DDR bandwidth.

After allocating your CUs to SLRs, map any CU master AXI port(s) to DDR memory resources. Xilinx recommends connecting to a DDR memory resource in the same SLR as the CU. This reduces competition for the limited SLR-crossing connection resources. In addition, connections between SLRs use super long line (SLL) routing resources, which incur a greater delay than standard intra-SLR routing.

It might be necessary to cross an SLR boundary to connect to a DDR resource in a different SLR. However, if both the --sp and the --slr directives are explicitly defined, the tools automatically add crossing logic to minimize the effect of the SLL delay and facilitate better timing closure.

Guidelines for Kernels that Access Multiple Memory Banks

The DDR memory resources are distributed across the super logic regions (SLRs) of the platform. Since the number of connections available for crossing between SLRs is limited, the general guidance is to place a kernel in the same SLR as the DDR memory resource with which it has the most connections. This reduces competition for SLR-crossing connections and avoids consuming extra logic resources associated with SLR crossing.

Figure: Kernel and Memory in Same SLR

Note: The image on the left shows a single AXI interface mapped to a single memory bank. The image on the right shows multiple AXI interfaces mapped to the same memory bank.

As shown in the previous figure, when a kernel has a single AXI interface that maps to only a single memory bank, the platforminfo utility described in Platforminfo Utility lists the SLR that is associated with the memory bank of the kernel, which is therefore the SLR where the kernel would be best placed. In this scenario, the design tools might automatically place the kernel in that SLR without need for extra input; however, you might need to provide an explicit SLR assignment for some of the kernels under the following conditions:

  • The design contains a large number of kernels accessing the same memory bank.
  • A kernel requires specialized logic resources that are not available in the SLR of the memory bank.

When a kernel has multiple AXI interfaces and all of them access the same memory bank, it can be treated much like a kernel with a single AXI interface: the kernel should reside in the same SLR as the memory bank to which its AXI interfaces are mapped.

Figure: Memory Bank in Adjoining SLR

Note: The image on the left shows that one SLR crossing is required when the kernel is placed in SLR0. The image on the right shows that two SLR crossings are required for the kernel to access the memory banks.

When a kernel has multiple AXI interfaces to multiple memory banks in different SLRs, the recommendation is to place the kernel in the SLR that has the majority of the memory banks accessed by the kernel (shown in the figure above). This minimizes the number of SLR crossings required by this kernel, which leaves more SLR crossing resources available for other kernels in your design to reach their memory banks.
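
The majority rule above can be illustrated with a short Python sketch that counts, for each SLR, how many of a kernel's AXI interfaces map to memory banks in that SLR. The function names and the interface names (`m_axi_gmem0` and so on) are hypothetical, and the mapping of interfaces to SLR indices is assumed to have been looked up beforehand, for example with the platforminfo utility.

```python
from collections import Counter


def majority_slr(interface_to_slr):
    """Pick the SLR that hosts the most memory banks accessed by the kernel.

    interface_to_slr maps each AXI interface name to the index of the SLR
    containing the memory bank that the interface is mapped to.
    Ties are broken in favor of the lowest SLR index.
    """
    counts = Counter(interface_to_slr.values())
    best_count = max(counts.values())
    return min(slr for slr, n in counts.items() if n == best_count)


def crossings(kernel_slr, interface_to_slr):
    """Count the interfaces whose memory bank lies in a different SLR."""
    return sum(1 for slr in interface_to_slr.values() if slr != kernel_slr)


# Hypothetical kernel: one interface to a bank in SLR0, two to banks in SLR1.
interfaces = {"m_axi_gmem0": 0, "m_axi_gmem1": 1, "m_axi_gmem2": 1}
best = majority_slr(interfaces)
print(best, crossings(best, interfaces), crossings(0, interfaces))
```

In this example, placing the kernel in SLR1 leaves one interface crossing an SLR boundary, whereas placing it in SLR0 would leave two.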

When the kernel is mapping memory banks from different SLRs, explicitly specify the SLR assignment as described in Kernel SLR and DDR Memory Assignments.

Figure: Memory Banks Two SLRs Away

Note: The image on the left shows that two SLR crossings are required to access all of the mapped memory banks. The image on the right shows that three SLR crossings are required to access all of the mapped memory banks.

As shown in the previous figure, when a platform contains more than two SLRs, a kernel might map a memory bank that is not in an SLR immediately adjacent to that of its most commonly mapped memory bank. In this scenario, memory accesses to the distant memory bank must cross more than one SLR boundary and incur additional SLR-crossing resource costs. To avoid such costs, it might be better to place the kernel in an intermediate SLR where it only requires less expensive crossings into the adjacent SLRs.
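
The trade-off described above can be explored with a small Python sketch that compares candidate placements by the longest crossing any single interface needs and by the total number of SLR boundaries crossed. The cost model is an assumption made for illustration: SLRs are treated as a linear stack indexed from 0, and each boundary between adjacent SLRs counts as one crossing. The function names are hypothetical and do not reflect how the tools evaluate placement.

```python
def boundary_crossings(kernel_slr, bank_slrs):
    """Total SLR boundaries crossed by all interfaces for one placement."""
    return sum(abs(kernel_slr - bank) for bank in bank_slrs)


def longest_crossing(kernel_slr, bank_slrs):
    """Largest number of boundaries any single interface has to cross."""
    return max(abs(kernel_slr - bank) for bank in bank_slrs)


def rank_placements(bank_slrs, num_slrs):
    """Order candidate SLRs: shortest worst-case crossing first, then fewest
    total boundaries crossed, then lowest SLR index."""
    return sorted(
        range(num_slrs),
        key=lambda s: (longest_crossing(s, bank_slrs),
                       boundary_crossings(s, bank_slrs),
                       s),
    )


# Hypothetical kernel on a three-SLR platform: two interfaces map to banks
# in SLR0 and one interface maps to a bank in SLR2.
banks = [0, 0, 2]
for slr in rank_placements(banks, num_slrs=3):
    print(slr, longest_crossing(slr, banks), boundary_crossings(slr, banks))
```

Under this model the intermediate SLR1 ranks first: every interface crosses only one boundary, although the total number of boundaries crossed is higher than with the majority placement in SLR0. Which of the two costs matters more depends on the design, so the ranking should be treated as a starting point rather than a rule.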