Profiling and Optimization

There are two distinct areas for you to consider when performing algorithm optimization in the SDSoC™ environment:

  • Application code optimization
  • Hardware function optimization

Most application developers are familiar with optimizing software targeted to a CPU. This usually requires programmers to analyze algorithmic complexity, overall system performance, and data locality. Many methodology guides and software tools exist to help the developer identify performance bottlenecks. These same techniques can be applied to functions targeted for hardware acceleration in the SDSoC environment.

As a first step, programmers should optimize their overall program performance independently of the final target. The main difference between SDSoC and general-purpose software development is that in SDSoC projects part of the core compute algorithm is pushed onto the FPGA. This means the developer must also be aware of algorithm concurrency, data transfers, memory usage, and the fact that programmable logic is the target.

Generally, you need to identify the section of the algorithm to accelerate and determine how best to keep the hardware accelerator busy while transferring data to and from it. The primary objective is to reduce the overall computation time of the combined hardware accelerator and data motion network compared to a CPU-only software approach.

Software running on the CPU must efficiently manage the hardware function(s), optimize data transfers, and perform any necessary pre- or post-processing steps.

The SDSoC environment supports your optimization efforts by generating reports that help you analyze the application and the hardware functions in detail. The reports are generated automatically when you build the project and are listed in the Assistant view of the SDx™ IDE, as shown in the following figure. Double-click a listed report to open it.

Figure: Assistant View

The following figures show the two main reports:

  • High-Level Synthesis (HLS)
  • Data Motion Network

To access these reports from the GUI, ensure the Assistant view is visible. This view is typically below the Project Explorer view. You can use the Window > Show View > Assistant menu command to display the Assistant view if it is not displayed.

Figure: HLS Report Window

The HLS Report provides details about the high-level synthesis process, which translates the C/C++ model into the hardware description language used to implement the functionality on the FPGA. The report lets you see the impact of your design decisions on the hardware implementation, so you can optimize the hardware function(s) based on this information.

Figure: Data Motion Network Report

The Data Motion Network Report describes the hardware/software connectivity for each hardware function. The Data Motion Network table shows (from the right-most column to the left-most) what sort of data mover is used to transport each hardware function argument, and to which system port that data mover is attached. The Pragmas column shows any SDS pragmas used for the hardware function.

The Accelerator Callsites table shows the following:
  • Accelerator instance name and Accelerator argument.
  • Name of the port on the IP that corresponds to the accelerator argument (typically the same as the argument name, except when bundling).
  • Direction of the data motion transfer.
  • Size, in bytes, of the data to be transferred, to the degree that the compiler can deduce it. If the transfer size is determined at runtime, this value is zero.
  • List of pragmas related to this argument.
  • System port and data mover, if applicable, indicating which platform port and which data mover are used to transport this argument.
  • Accelerator(s) used, as inferred by the compiler, and the CPU cycles consumed for setup and memory transfer.
Generally, the Data Motion report page indicates, in order of precedence:
  • What characteristics are specified in pragmas.
  • In the absence of a pragma, what the compiler was able to infer.

The distinction matters because the compiler might not be able to deduce certain program properties. The most important such property is cacheability. If the Data Motion report indicates cacheable but the data is in fact uncacheable (or vice versa), incorrect cache behavior occurs at runtime. To remove unnecessary cache flushes, structure your program so that the compiler can identify the data as uncacheable.

Additional details for each report, as well as a profiling and optimization methodology and coding guidelines, can be found in the SDSoC Profiling and Optimization Guide.