Improving System Performance

This chapter describes underlying principles and inference rules within the SDSoC™ system compiler to help the programmer improve overall system performance through the following:

  • Increased parallelism in the hardware function.
  • Increased system parallelism and concurrency.
  • Improved access to external memory from programmable logic.
  • An understanding of the data motion network (default behavior and user specification).

There are many factors that affect overall system performance. A well-designed system generally balances computation and communication, so that all hardware components remain occupied doing meaningful work.

  • Some applications are compute-bound; for these applications, concentrate on maximizing throughput and minimizing latency in hardware accelerators.
  • Other applications might be memory-bound, in which case you might need to restructure algorithms to increase temporal and spatial locality in the hardware; for example, adding copy-loops or memcpy calls to pull blocks of data into hardware rather than making random array accesses to external memory (see the sketch following this list).
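The following sketch illustrates the copy-loop idea; the function name, block size, and filter are hypothetical and not taken from any sample design. A block of data is first copied sequentially into a local buffer, so the irregular accesses that follow hit on-chip memory rather than external DDR:

#define BLOCK 256   // hypothetical block size

void filter_block(const int in[BLOCK], int out[BLOCK])
{
    int local[BLOCK];   // maps to on-chip memory in hardware

    // Copy-loop: sequential, burst-friendly reads from external memory
    for (int i = 0; i < BLOCK; i++) {
        local[i] = in[i];
    }

    // Irregular accesses now target the local buffer rather than DDR
    for (int i = 0; i < BLOCK; i++) {
        int prev = local[(i + BLOCK - 1) % BLOCK];
        int next = local[(i + 1) % BLOCK];
        out[i] = (prev + 2 * local[i] + next) / 4;
    }
}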

Control over the various aspects of optimization is provided through the use of pragmas in the code. For a complete description of the available pragmas, refer to the SDSoC environment pragma reference documentation.

Improving Hardware Function Parallelism

This section provides a concise introduction to writing efficient code that can be cross-compiled into programmable logic.

The SDSoC environment employs the Vivado® High-Level Synthesis (HLS) tool as a programmable logic cross-compiler to transform C/C++ functions into hardware.

By applying the principles described in this section, you can dramatically increase the performance of the synthesized functions, which can lead to significant increases in overall system performance for your application.

Top-Level Hardware Function Guidelines

This section describes coding guidelines to ensure that the Vivado HLS tool hardware function has a consistent interface with object code generated by the Arm® core GNU toolchain.

Use Standard C99 Data Types for Top-Level Hardware Function Arguments

  1. Avoid using arrays of bool. An array of bool has a different memory layout between the GNU Arm cross-compiler and the HLS tool.
  2. Avoid using hls::stream at the hardware function top-level interface. This data type helps the HLS tool compiler synthesize efficient logic within a hardware function but does not apply to application software.

Omit HLS Interface Directives for Top-Level Hardware Function Arguments

Although supported, a top-level hardware function should not, in general, contain HLS interface pragmas. The sdscc/sds++ (referred to as sds++) system compiler automatically generates the appropriate HLS tool interface directives.

There are two SDSoC environment pragmas you can specify for a top-level hardware function to guide the sds++ system compiler in generating the required HLS tool interface directives, as illustrated in the example following these descriptions:
#pragma SDS data zero_copy(argument)
Use to generate a shared memory interface implemented as an AXI master interface in hardware.
#pragma SDS data access_pattern(argument:SEQUENTIAL)
Use to generate a streaming interface implemented as a FIFO interface in hardware.
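For illustration, the following hedged example (hypothetical function name and array sizes) shows how these pragmas are placed immediately before the hardware function declaration:

#pragma SDS data zero_copy(A)                  // shared memory: AXI master interface for A
#pragma SDS data access_pattern(B:SEQUENTIAL)  // streaming: FIFO interface for B
void hw_func(int A[1024], int B[1024], int C[1024]);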

If you specify the interface using #pragma HLS interface for a top-level function argument, the SDSoC environment does not generate the HLS tool interface directive for that argument; you should ensure that the generated hardware interface is consistent with all other function argument hardware interfaces.

Note: Because a function with incompatible HLS tool interface types can result in cryptic sds++ system compiler error messages, it is strongly recommended (though not mandatory) that you omit HLS interface pragmas.

Using Vivado Design Suite HLS Libraries

This section describes how to use the Vivado HLS tool libraries with the SDSoC environment.

The HLS tool libraries are provided as source code with the HLS tool installation in the SDSoC environment. Consequently, you can use these libraries as you would any other source code that you plan to cross-compile for programmable logic using the HLS tool. In particular, ensure that the source code conforms to the rules described in Hardware Function Argument Types, which might require you to provide a C/C++ wrapper function to ensure the functions export a software interface to your application.

In the SDSoC IDE, the synthesizable finite impulse response (FIR) example template for all basic platforms provides an example that uses an HLS tool library. You can find several additional code examples that employ the HLS tool libraries in the samples/hls_lib directory. For example, samples/hls_lib/hls_math contains an example that implements and uses a square root function.

The file my_sqrt.h contains:

#ifndef _MY_SQRT_H_
#define _MY_SQRT_H_

#ifdef __SDSVHLS__
#include "hls_math.h"
#else
// The hls_math.h file includes hdl_fpo.h, which contains actual code and
// will cause a linker error in the Arm compiler, hence we add the function
// prototype here
static float sqrtf(float x);
#endif

void my_sqrt(float x, float *ret);

#endif // _MY_SQRT_H_

The file my_sqrt.cpp contains:

#include "my_sqrt.h" void my_sqrt(float x, float *ret) { *ret = sqrtf(x); }

The makefile has the commands to compile these files:

sds++ -c -hw my_sqrt -sds-pf zc702 my_sqrt.cpp
sds++ -c my_sqrt_test.cpp
sds++ my_sqrt.o my_sqrt_test.o -o my_sqrt_test.elf
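The test file my_sqrt_test.cpp ships with the sample; a minimal sketch of what such a test might look like (illustrative only, not the shipped source) is:

#include <stdio.h>
#include "my_sqrt.h"

int main()
{
    float result = 0.0f;
    my_sqrt(16.0f, &result);               // runs in programmable logic when built with -hw
    printf("sqrt(16.0) = %f\n", result);
    return (result == 4.0f) ? 0 : 1;       // non-zero exit code on mismatch
}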

Increasing System Parallelism and Concurrency

Increasing the level of concurrent execution is a standard way to increase overall system performance, and increasing the level of parallel execution is a standard way to increase concurrency. Programmable logic is well-suited to implement architectures with application-specific accelerators that run concurrently, especially communicating through flow-controlled streams that synchronize between data producers and consumers.

In the SDSoC environment, you influence the macro-architecture parallelism at the function and data mover level, and the micro-architecture parallelism within hardware accelerators. By understanding how the sds++/sdscc (referred to as sds++) compiler infers system connectivity and data movers, you can structure application code and apply pragmas, as needed, to control hardware connectivity between accelerators and software, data mover selection, the number of accelerator instances for a given hardware function, and task-level software control.

You can control the micro-architecture parallelism, concurrency, and throughput for hardware functions within the Vivado HLS tool, or within the IP you incorporate as C-callable and linkable libraries.

At the system level, the sds++ compiler chains together hardware functions when the data flow between them does not require transferring arguments out of programmable logic and back to system memory.

For example, consider the code in the following figure, where the mmult and madd functions have been selected for hardware.

Figure: Hardware/Software Connectivity with Direct Connection

Because the intermediate array variable tmp1 is used only to pass data between the two hardware functions, the sds++ compiler chains the two functions together in hardware with a direct connection between them.
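A minimal sketch of the calling code implied by the figure (the matrix dimension N and the exact signatures are assumptions; the actual template code differs in detail):

#define N 32   // hypothetical matrix dimension

void run_hw(float A[N*N], float B[N*N], float C[N*N], float D[N*N])
{
    float tmp1[N*N];      // used only to pass data between the two hardware functions

    mmult(A, B, tmp1);    // hardware function 1
    madd(tmp1, C, D);     // hardware function 2; tmp1 flows over a direct PL connection
}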

It is instructive to consider a timeline for the calls to hardware, as shown in the following figure:

Figure: Timeline for mmult/madd Function Calls



The program preserves the original program semantics, but instead of the standard Arm core procedure calling sequence, each hardware function call is broken into multiple phases involving setup, execution, and cleanup, both for the data movers (DM) and the accelerators. The CPU, in turn, sets up each hardware function (that is, the underlying IP control interface) and the data transfers for the function call with non-blocking APIs, and then waits for all calls and transfers to complete.

In the example shown in the preceding figure, the mmult and madd functions run concurrently whenever their inputs become available. The ensemble of function calls is orchestrated in the compiled program by control code automatically generated by the sds++ system compiler according to the program, data mover, and accelerator structure.

In general, it is impossible for the sds++ system compiler to determine the side effects of function calls in your application code (for example, sds++ might have no access to source code for functions within linked libraries), so any intermediate access of a variable occurring lexically between hardware function calls requires the compiler to transfer data back to memory.

For example, a simple but injudicious change to uncomment the debug print statement (in the "wrong place"), as shown in the following figure, can result in a significantly different data transfer graph and, consequently, an entirely different generated system and application performance.

Figure: Hardware/Software Connectivity with Broken Direct Connection
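A hedged sketch of the kind of change the figure depicts, reusing the hypothetical code from the earlier sketch:

#include <stdio.h>

void run_hw_debug(float A[N*N], float B[N*N], float C[N*N], float D[N*N])
{
    float tmp1[N*N];

    mmult(A, B, tmp1);
    printf("tmp1[0] = %f\n", tmp1[0]);  // the CPU reads tmp1 here, so the compiler must
                                        // transfer tmp1 back to memory between the calls
    madd(tmp1, C, D);
}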

A program can invoke a single hardware function from multiple call sites. In this case, the sds++ system compiler behaves as follows: if any of the function calls results in "direct connection" data flow, the sds++ system compiler creates an instance of the hardware function that services every similar direct connection, and an instance of the hardware function that services the remaining calls between memory (software) and the PL.

One of the best ways to achieve high performance in the PL is to structure your application code with "direct connection" data flow between hardware functions. You can create deep pipelines of accelerators connected with data streams, increasing the opportunity for concurrent execution.

There is another way in which you can increase parallelism and concurrency using the sds++ system compiler. You can direct the system compiler to create multiple instances of a hardware function by inserting the following pragma immediately preceding a call to the function.

#pragma SDS resource(<id>) // <id> is a non-negative integer

This pragma creates a hardware instance that is referenced by <id>.

A simple code snippet that creates two instances of a hardware function mmult is as follows.

{
#pragma SDS resource(1)
    mmult(A, B, C); // instance 1
#pragma SDS resource(2)
    mmult(D, E, F); // instance 2
}

If creating multiple instances of an accelerator is not what you want, the sds_async mechanism gives the programmer the ability to handle "hardware threads" explicitly to achieve very high levels of parallelism and concurrency. However, like any explicit multi-threaded programming model, it requires careful attention to synchronization details to avoid non-deterministic behavior or deadlocks. For more information, refer to the SDSoC Environment Programmers Guide.
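As a hedged illustration of that mechanism, the async/wait pragmas can be used to launch hardware calls without blocking and to synchronize later; the IDs and the CPU-side work shown here are illustrative only:

{
#pragma SDS async(1)
    mmult(A, B, C);        // enqueue the first call; do not block
#pragma SDS async(2)
    mmult(D, E, F);        // enqueue the second call

    do_other_cpu_work();   // hypothetical CPU work overlapped with the accelerators

#pragma SDS wait(1)        // block until call 1 and its transfers complete
#pragma SDS wait(2)        // block until call 2 and its transfers complete
}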

Data Motion Network Generation in SDSoC

This section describes:

  • Components that make up the data motion network in the SDSoC environment, along with guidelines to help you understand the data motion network generated by the SDSoC compiler.
  • Guidelines to help you guide the data motion network generation by using appropriate SDSoC pragmas.

Every transfer between the software program and a hardware function requires a data mover, which consists of a hardware component that moves the data, and an operating system-specific library function. The following table lists supported data movers and various properties for each:

Table 1. SDSoC Data Movers
SDSoC Data Mover | Vivado® IP Data Mover | Accelerator IP Port Types | Transfer Size | Contiguous Memory Only
axi_dma_simple | axi_dma | bram, ap_fifo, axis | ≤ 32 MB | Yes
axi_dma_sg | axi_dma | bram, ap_fifo, axis | N/A | No (but recommended)
axi_fifo | axi_fifo_mm_s | bram, ap_fifo, axis | ≤ 300 B | No
zero_copy | accelerator IP | aximm master | N/A | Yes
  • For array arguments, the data mover inference is based on transfer size, hardware function port mapping, and function call site information. The selection of a data mover is a trade-off between performance and resources, for example:
    • The axi_dma_simple data mover is the most efficient bulk transfer engine and supports up to 32 MB transfers, so it is best for transfers under that limit.
    • The axi_fifo data mover does not require as many hardware resources as the DMA, but because of its slower transfer rates, it is preferred only for payloads of up to 300 bytes.
    • The axi_dma_sg (scatter-gather DMA) data mover provides slower DMA performance and consumes more hardware resources but has fewer limitations, and in the absence of any pragma directives, it is often the best default data mover.

You can specify the data mover selection by inserting a pragma into program source immediately before the function declaration; for example:

#pragma SDS data data_mover(A:AXI_DMA_SIMPLE)
Note: #pragma SDS is always treated as a rule, not a hint, so you must ensure that its use conforms to the data mover requirements in the previous table.
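A hedged example of the placement in context (the function name and array size are hypothetical):

#pragma SDS data data_mover(A:AXI_DMA_SIMPLE)
void foo(int A[1024], int B[1024]);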

The data motion network in the SDSoC environment is made up of three components:

  • The memory system ports on the PS (A)
  • Data movers between the PS and accelerators as well as among accelerators (B)
  • The hardware interface on an accelerator (C)

The following figure illustrates these three components.

Figure: Data Motion Network Components

Without any SDS pragmas, the SDSoC environment automatically generates the data motion network based on an analysis of the source code; however, the SDSoC environment also provides pragmas for you to guide the data motion network generation, as described in the following sections.

System Port

A system port connects a data mover to the PS. It can be an acceptance filter ID (AFI) port, which corresponds to the high-performance ports; a memory interface generator (MIG) port, which is a PL-based DDR memory controller; or a stream port on the Zynq®-7000 SoC or Zynq® UltraScale+™ MPSoC processors.

The AFI port is a non-cache-coherent port. If needed, cache coherency operations, such as cache flushing and cache invalidation, must be performed by software.

Whether to use the AFI port depends on the cache requirements of the transferred data, the cache attributes of the data, and the data size. If the data is allocated with sds_alloc_non_cacheable() or sds_register_dmabuf(), it is better to connect to the AFI port to avoid cache flushing and invalidation.

Note: These functions can be found in sds_lib.h and are described in the environment APIs in the SDSoC Environment Programmers Guide (UG1278).
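A hedged sketch of allocating a non-cacheable buffer for such a transfer (the hardware function name and buffer size are hypothetical):

#include "sds_lib.h"

void my_accel(int *buf);   // hypothetical hardware function consuming the buffer

void transfer_example(void)
{
    // Physically contiguous, non-cacheable allocation: suited to an AFI (HP) port
    int *buf = (int *)sds_alloc_non_cacheable(1024 * sizeof(int));
    if (buf) {
        my_accel(buf);
        sds_free(buf);
    }
}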

The SDSoC system compiler analyzes these memory attributes for the data transactions with the accelerator and connects data movers to the appropriate system ports.

To override the compiler decision, or in some cases where the compiler is not able to do such analysis, you can use the following pragma to specify the system port:

#pragma SDS data sys_port(arg:ip_port)

The ip_port can be either AFI or MIG. For example, the following pragma connects argument A directly to an AXI FIFO interface:

#pragma SDS data sys_port(A:fifo_S_AXI)
void foo(int* A, int* B, int* C);

The pragma can also be used to specify a streaming interface:

#pragma SDS data sys_port(A:stream_fifo_S_AXIS)
void foo(int* A, int* B, int* C);
Note: For more information about AXI functionality in the Vivado® High-Level Synthesis (HLS) tool, see the Vivado Design Suite User Guide: High-Level Synthesis (UG902).

Use the following sds++ system compiler command to see the list of system ports for the platform:

sds++ -sds-pf-info <platform> -verbose

Data Mover

The data mover transfers data between the PS and accelerators, and among accelerators. The SDSoC environment can generate various types of data movers based on the properties and size of the data being transferred.

Scalar
Scalar data is always transferred by the AXI_LITE data mover.
Array
The sds++ system compiler can generate the following data movers, depending on the memory attributes and data size of the array:
  • AXI_DMA_SG
  • AXI_DMA_SIMPLE
  • AXI_FIFO
  • zero_copy (accelerator-mastered AXI4 bus)
  • AXI_LITE

For example, if the array is allocated using malloc(), the memory is not physically contiguous, and the SDSoC environment generates a scatter-gather DMA (AXI_DMA_SG); however, if the data size is less than 300 bytes, AXI_FIFO is generated instead because its data transfer time is less than that of AXI_DMA_SG and it occupies far fewer PL resources.

Struct or Class
The implementation of a struct depends on how the struct is passed to the hardware (passed by value, passed by reference, or as an array of structs) and the type of data mover selected. The following table shows the various implementations.
Table 2. Struct Implementations

Pass by value (struct RGB arg):
  • Default (no pragma): Each field is flattened and passed individually as a scalar or an array.
  • #pragma SDS data zero_copy(arg): This is not supported and will result in an error.
  • #pragma SDS data zero_copy(arg[0:SIZE]): This is not supported and will result in an error.
  • #pragma SDS data copy(arg): The struct is packed into a single wide scalar.
  • #pragma SDS data copy(arg[0:SIZE]): Each field is flattened and passed individually as a scalar or an array. The value of SIZE is ignored.

Pass by pointer (struct RGB *arg) or reference (struct RGB &arg):
  • Default (no pragma): Each field is flattened and passed individually as a scalar or an array.
  • #pragma SDS data zero_copy(arg): The struct is packed into a single wide scalar and transferred as a single value. The data is transferred to the hardware accelerator through an AXI4 bus.
  • #pragma SDS data zero_copy(arg[0:SIZE]): The struct is packed into a single wide scalar. The number of data values transferred to the hardware accelerator through an AXI4 bus is defined by the value of SIZE.
  • #pragma SDS data copy(arg): The struct is packed into a single wide scalar.
  • #pragma SDS data copy(arg[0:SIZE]): The struct is packed into a single wide scalar. The number of data values transferred to the hardware accelerator using an AXI_DMA_SG or AXI_DMA_SIMPLE data mover is defined by the value of SIZE.

Array of struct (struct RGB arg[1024]):
  • Default (no pragma): Each struct element of the array is packed into a single wide scalar.
  • #pragma SDS data zero_copy(arg): Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator through an AXI4 bus.
  • #pragma SDS data zero_copy(arg[0:SIZE]): Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator through an AXI4 bus. The value of SIZE overrides the array size and determines the number of data values transferred to the accelerator.
  • #pragma SDS data copy(arg): Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using a data mover such as AXI_DMA_SG or AXI_DMA_SIMPLE.
  • #pragma SDS data copy(arg[0:SIZE]): Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using a data mover such as AXI_DMA_SG or AXI_DMA_SIMPLE. The value of SIZE overrides the array size and determines the number of data values transferred to the accelerator.

Determining which data mover to use for transferring an array depends on two attributes of the array: data size and physical memory contiguity. For example, if the memory size is 1 MB and the memory is not physically contiguous (allocated by malloc()), use AXI_DMA_SG. The following table shows the applicability of these data movers.

Table 3. Data Mover Selection
Data Mover | Physical Memory Contiguity | Data Size (bytes)
AXI_DMA_SG | Either | > 300
AXI_DMA_SIMPLE | Contiguous | < 32 MB
AXI_FIFO | Non-contiguous | < 300

Normally, the SDSoC cross-compiler analyzes the array that is transferred to the hardware accelerator for these two attributes and selects the appropriate data mover accordingly. However, there are cases where such analysis is not possible; the SDSoC cross-compiler then issues a warning message, as shown in the following example, stating that it is unable to determine the memory attributes, and you can specify them using SDS pragmas.

WARNING: [DMAnalysis 83-4492] Unable to determine the memory attributes passed to rgb_data_in of function img_process at C:/simple_sobel/src/main_app.c:84

The following pragma specifies the memory attributes:

#pragma SDS data mem_attribute(function_argument:contiguity)

The contiguity can be either PHYSICAL_CONTIGUOUS or NON_PHYSICAL_CONTIGUOUS. Use the following pragma to specify the data size:

#pragma SDS data copy(function_argument[offset:size])

The size can be a number or an arbitrary expression.
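A hedged example combining both pragmas for the function named in the warning above (the signature and sizes are assumptions, not the actual sample code):

#pragma SDS data mem_attribute(rgb_data_in:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(rgb_data_in[0:1920*1080])
void img_process(unsigned int *rgb_data_in, unsigned int *rgb_data_out);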

Zero Copy Data Mover

The zero copy data mover is unique because it covers both the accelerator interface and the data mover. The syntax of this pragma is:

#pragma SDS data zero_copy(arg[offset:size])

The [offset:size] is optional, and only needed if the data transfer size for an array cannot be determined at compile time.

By default, the SDSoC environment assumes copy semantics for an array argument, meaning the data is explicitly copied from the PS to the accelerator through a data mover. When this zero_copy pragma is specified, the SDSoC environment generates an AXI master interface for the specified argument on the accelerator, which grabs the data from the PS as specified in the accelerator code.

To use the zero_copy pragma, the memory corresponding to the array must be physically contiguous, that is, allocated with sds_alloc().
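A minimal sketch pairing the zero_copy pragma with a physically contiguous allocation (the function name and sizes are hypothetical):

#include "sds_lib.h"

#pragma SDS data zero_copy(A[0:1024])
void scale2x(int *A);   // hypothetical hardware function that doubles each element in place

void caller(void)
{
    int *A = (int *)sds_alloc(1024 * sizeof(int));  // physically contiguous allocation
    // ... fill A ...
    scale2x(A);   // the accelerator reads and writes A directly over its AXI master interface
    sds_free(A);
}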

Accelerator Interface

The accelerator interface generated depends on the data type of the argument.
Scalar
For a scalar argument, a register interface is generated to pass the value into and/or out of the accelerator.
Arrays
The hardware interface on an accelerator for transferring an array can be either a RAM interface or a streaming interface, depending on how the accelerator accesses the data in the array.

The RAM interface allows the data to be accessed randomly within the accelerator; however, it requires the entire array to be transferred to the accelerator before any memory accesses can happen within the accelerator. Moreover, the use of this interface requires block RAM resources on the accelerator side to store the array.

The streaming interface, on the other hand, does not require memory to store the whole array; it allows the accelerator to pipeline the processing of array elements, for example, the accelerator can start processing a new array element while the previous ones are still being processed. However, the streaming interface requires the accelerator to access the array in a strict sequential order, and the amount of data transferred must match what the accelerator expects.

The SDSoC environment, by default, generates the RAM interface for an array; however, it provides pragmas to direct it to generate the streaming interface.

Struct or Class
The implementation of a struct depends on how the struct is passed to the hardware (passed by value, passed by reference, or as an array of structs) and the type of data mover selected. The previous table shows the various implementations.

The following SDS pragma can be used to guide the interface generation for the accelerator.

#pragma SDS data access_pattern(function_argument:pattern)

where pattern can be either RANDOM or SEQUENTIAL, and function_argument is an array argument name of the accelerator function.

If an array argument's access pattern is specified as RANDOM, a RAM interface is generated. If it is specified as SEQUENTIAL, a streaming interface is generated.
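For example, the following hedged sketch (hypothetical function) requests a streaming interface for in and therefore reads every element exactly once, in order:

#pragma SDS data access_pattern(in:SEQUENTIAL)
void sum(int in[1024], int *out)
{
    int acc = 0;
    for (int i = 0; i < 1024; i++) {
        acc += in[i];     // strictly sequential access, as the pragma requires
    }
    *out = acc;
}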

Note:
  • The default access pattern for an array argument is RANDOM.
  • The specified access pattern must be consistent with the behavior of the accelerator function. For SEQUENTIAL access patterns, the function must access every array element in a strict sequential order.
  • This pragma only applies to arguments without the zero_copy pragma.