Programming Hardware Functions

Programming a function for hardware acceleration in the SDSoC™ environment is as simple as writing a standard C/C++ function. However, to get the significant performance advantages of hardware acceleration through the sds++/sdscc (referred to as sds++) system compiler, there are a few considerations to keep in mind when writing the function, or modifying existing code to be implemented in programmable logic.

Defining the function interface: the data types of inputs and outputs, and data transfers.
What kind of memory access the function will have: DMA (interfacing with DDR) or FIFOs.
How will the data be accessed: contiguous or non-contiguous.
How will the data be processed: loops, arrays.

Determining what data is going to be processed in and out of an accelerator is the first step in creating a hardware function. Knowing the inputs and outputs of the hardware function gives you an idea of what parallelism can be achieved. A critical point when writing a function for acceleration in programmable logic is that data sizes are fixed when implemented in hardware: hardware data sizes cannot change during runtime.
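As a minimal sketch of the fixed-size constraint, the function below declares every array bound as a compile-time constant, so the amount of data moved per call is known before any hardware is built. The name `vadd_fixed` and the size `N` are illustrative assumptions, not part of the SDSoC documentation.

```cpp
#include <cstddef>

// Hypothetical compile-time size; hardware buffers would be dimensioned from it.
constexpr std::size_t N = 8;

// Candidate hardware function: every array bound is a constant expression,
// so the transfer size (N elements per argument) cannot change at run time.
void vadd_fixed(const int a[N], const int b[N], int c[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

A caller that needs to process more than `N` elements would have to invoke the function several times or rebuild the hardware with a larger bound.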

Exporting Hardware Functions as Libraries

After a hardware function, or a library of functions, is written and optimized as needed, you can create an exported library for reuse in other projects. The general flow for exporting a library is to make sure that all the function definitions are grouped appropriately, and to use the sds++/sdscc command with the -shared option (which is interpreted as -fPIC for gcc) to build a shared library when compiling the functions. More detailed information can be found in Exporting a Library for GCC.

C-Callable IP

An accelerator can also be created using RTL and provided as an IP core through the Vivado® Design Suite, called a C-Callable IP. In this case, to use the C-Callable IP in the application code, add the static library compiled for the appropriate Arm® processor, typically an archive (.a) file, and the header file (.h/.hpp) as source files in the SDSoC application project. These files need to be added to the source directory of the application or specified for library search (-L) and include search (-I), and must be properly included and linked with the main function. Then, like any other hardware function, it can be called as a typical C/C++ function, making sure the data sizes match the function parameters. More information on creating and using C-Callable IP can be found in the SDSoC Environment User Guide.

Coding Guidelines

This section contains general coding guidelines for application programming using the sds++ system compilers, with the assumption of starting from application code that has already been cross-compiled for the Arm CPU within the Zynq®-7000 device, using the GNU toolchain included as part of the SDSoC environment.

General Hardware Function Guidelines

Hardware functions can execute concurrently under the control of a master thread. Multiple master threads are supported.
A top-level hardware function must be a global function, not a class method, and it cannot be an overloaded function.
There is no support for exception handling in hardware functions.
It is an error to refer to a global variable within a hardware function or any of its sub-functions when this global variable is also referenced by other functions running in software.
Hardware functions support scalar types up to 1024 bits, including double, long, packed structs, etc.
A hardware function must have at least one argument.
An output or inout scalar argument to a hardware function can be assigned multiple times, but only the last written value is read upon function exit.
Use predefined macros to guard code with #ifdef and #ifndef preprocessor statements; the macro names begin and end with two underscore characters ‘_’. For examples, see "SDSCC/SDS++ Compiler Commands" in the SDx Command and Utility Reference Guide.
- The __SDSCC__ macro is defined and passed as a -D option to sub-tools whenever sds++ is used to compile source files, and can be used to guard code dependent on whether it is compiled by sds++ or by another compiler, for example a GNU host compiler.
- When sds++ compiles source files targeted for hardware acceleration using Vivado HLS, the __SDSVHLS__ macro is defined and passed as a -D option, and can be used to guard code dependent on whether high-level synthesis is run or not.
- Vivado HLS employs some 32-bit libraries irrespective of the host machine. Furthermore, the tool does not provide a true cross-compilation.
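The macro guards described above can be sketched as follows. `__SDSCC__` and `__SDSVHLS__` are the macros named in the text; the two helper functions are hypothetical and exist only to make the effect of each guard observable.

```cpp
// Returns true only when this translation unit is compiled through sds++/sdscc.
bool built_with_sds() {
#ifdef __SDSCC__
    return true;
#else
    return false;  // for example, a GNU host compiler
#endif
}

// Returns true only when high-level synthesis is processing this source.
bool built_for_hls() {
#ifdef __SDSVHLS__
    return true;
#else
    return false;
#endif
}
```

Compiled with an ordinary host compiler, both helpers return `false`, which is a convenient way to keep test benches or debug output out of the hardware build.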

All object code for the Arm CPUs is generated with the GNU toolchains, but the sds++ compiler, built upon Clang/LLVM frameworks, is generally less forgiving of C/C++ language violations than the GNU compilers. As a result, you might find that some libraries needed for your application cause front-end compiler errors when using sds++. In such cases, compile the source files directly through the GNU toolchain rather than through sds++, either in your makefiles or by setting the compiler to arm-linux-gnueabihf-g++. To set the compiler, right-click the file (or folder) in the Project Explorer and select C/C++ Build > Settings > SDSCC/SDS++ Compiler. See "Compiling and Running Applications on an Arm Processor" in the SDSoC Environment User Guide for more information.

Hardware Function Argument Types

The sds++ compiler supports hardware function arguments with types that resolve to a single or array of C99 basic arithmetic type (scalar), a struct or class whose members flatten to a single or array of C99 basic arithmetic type (hierarchical structs are supported), and an array of struct whose members flatten to a single C99 basic arithmetic type. Scalar arguments must fit in a 1024-bit container. The SDSoC environment automatically infers hardware interface types for each hardware function argument based on the argument type and the following pragmas:

#pragma SDS data copy|zero_copy
#pragma SDS data access_pattern

To avoid interface incompatibilities, you should only incorporate Vivado HLS interface type directives and pragmas in your source code when sds++ fails to generate a suitable hardware interface directive.

Vivado HLS provides arbitrary precision types ap_fixed, ap_int, and an hls::stream class. In the SDSoC environment, ap_fixed types must be specified as having widths greater than 7 but less than 1025 (7 < width < 1025). The hls::stream data type is not supported as the function argument to any hardware function.
By default, an array argument to a hardware function is transferred by copying the data, that is, it is equivalent to using #pragma SDS data copy. As a consequence, an array argument must be either used as an input or produced as an output, but not both. For an array that is both read and written by the hardware function, you must use #pragma SDS data zero_copy to tell the compiler that the array should be kept in the shared memory and not copied.
To ensure alignment across the hardware/software interface, do not use hardware function arguments that are an array of bool.
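The difference between the default copy semantics and zero_copy can be sketched as below. The function names and the length `N` are hypothetical; the pragma is written in the `zero_copy(array[offset:length])` form, and a host compiler simply ignores it.

```cpp
constexpr int N = 16;  // hypothetical array length

// Default copy semantics: 'in' is only read and 'out' is only written,
// so each array moves in exactly one direction.
void scale_copy(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) {
        out[i] = 2 * in[i];
    }
}

// 'buf' is both read and written, so it is kept in shared memory
// rather than copied.
#pragma SDS data zero_copy(buf[0:N])
void scale_in_place(int buf[N]) {
    for (int i = 0; i < N; ++i) {
        buf[i] = 2 * buf[i];
    }
}
```

Splitting an in/out array into separate input and output arguments, as in `scale_copy`, is an alternative when shared-memory access is not wanted.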

Pointers

Pointer arguments for a hardware function require special consideration. Hardware functions operate on physical addresses, which typically are not available to user space programs, so pointers cannot be embedded in data structures passed to hardware functions.

In the absence of any pragmas, a pointer argument is taken to be a scalar parameter by default, even though in C/C++ a pointer might denote a one-dimensional array type. The following are the permitted pragmas:

The DATA ZERO_COPY pragma provides pointer semantics using shared memory.
```
#pragma SDS data zero_copy
```
The DATA COPY and DATA ACCESS_PATTERN pragma pair maps the argument onto a stream, and requires that the array elements are accessed in sequential index order.
```
#pragma SDS data copy(p[0:])
#pragma SDS data access_pattern(p:SEQUENTIAL)
```
The DATA COPY pragma is only required when the sds++ system compiler is unable to determine the data transfer size and issues an error.

TIP: If you require non-sequential access to the array in the hardware function, you should change the pointer argument to an array with an explicit declaration of its dimensions, for example A[1024].
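The two argument styles can be contrasted with a short sketch. The function names and the length argument `n` are hypothetical; the pragma pair is the one named above, with `n` assumed as the transfer size.

```cpp
// Streamed pointer argument: the pragma pair maps `p` onto a stream,
// so elements must be read exactly once, in ascending index order.
#pragma SDS data copy(p[0:n])
#pragma SDS data access_pattern(p:SEQUENTIAL)
int sum_stream(const int *p, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += p[i];  // one read per element, index always increasing
    }
    return sum;
}

// Array with explicit dimensions: non-sequential indexing is acceptable.
int sum_mirrored_products(const int a[1024]) {
    int sum = 0;
    for (int i = 0; i < 512; ++i) {
        sum += a[i] * a[1023 - i];  // reads from both ends of the array
    }
    return sum;
}
```

The second function could not be expressed over a stream without buffering, which is why the explicit array declaration is the better fit for it.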

Hardware Function Interfacing

After defining what function is needed for acceleration, there are a few key items to ensure compilation is valid. The Vivado HLS tool data types (ap_int, ap_uint, ap_fixed, etc.) cannot be part of the function parameter list that the software part of the application calls. These data types are unique to the HLS tool and have no bearing outside of the intended tool and associated compiler.

For example, if the following function was written in the HLS tool, the parameter list needs to be adjusted, and the function body has to handle moving the data from the HLS tool to a more generic data type, as shown below:

void foo(ap_int<32> *a, ap_int<32> *b, ap_int<32> *c) {
  /* Function body */
}

This needs to be modified if using local variables:

void foo(int *a, int *b, int *c) {
  ap_int<32> *local_a = a;
  ap_int<32> *local_b = b;
  ap_int<32> *local_c = c;
  // Remaining function body
}

IMPORTANT: Initializing local variables with input data can consume too much memory in the accelerator. Casting the input data types to the appropriate HLS data types avoids this.

Hardware Function Call Guidelines

Stub functions generated in the SDSoC environment transfer the exact number of bytes according to the compile-time determinable array bound of the corresponding argument in the hardware function declaration. If a hardware function admits a variable data size, you can use the following pragma to direct the SDSoC environment to generate code to transfer data whose size is defined by an arithmetic expression:
```
#pragma SDS data copy(arg[0:<size>])
#pragma SDS data zero_copy(arg[0:<size>])
```
Here, the size expression must compile in the scope of the function declaration.
The zero_copy pragma directs the SDSoC environment to map the argument into shared memory.

IMPORTANT: Be aware that mismatches between intended and actual data transfer sizes can cause the system to hang at runtime, requiring laborious hardware debugging. See the SDSoC Environment Debugging Guide.
Align arrays transferred by DMAs on cache-line boundaries (for L1 and L2 caches). Use the sds_alloc() API provided with the SDSoC environment instead of malloc() to allocate these arrays.
Align arrays to page boundaries to minimize the number of pages transferred with the scatter-gather DMA, for example, for arrays allocated with malloc.
You must use sds_alloc to allocate an array for the following two cases:
1. You are using zero-copy pragma for the array.
2. You are using pragmas to explicitly direct the system compiler to use Simple-DMA.
Note: To use sds_alloc() from sds_lib.h, you must include stdlib.h before including sds_lib.h. stdlib.h is included to provide the size_t type.
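The allocation guidance above can be sketched as a small wrapper. The helper names `alloc_dma_buffer` and `free_dma_buffer` are hypothetical; the sketch assumes `sds_alloc()` and `sds_free()` are available from sds_lib.h when building with sds++, and falls back to `malloc()`/`free()` elsewhere so the code still builds on a host.

```cpp
#include <stdlib.h>   // included before sds_lib.h, as required (provides size_t)
#ifdef __SDSCC__
#include "sds_lib.h"  // sds_alloc() / sds_free()
#endif

// Allocate a buffer intended for DMA transfers to a hardware function.
// Under sds++ this uses sds_alloc(); on a host build it uses malloc().
int *alloc_dma_buffer(size_t count) {
#ifdef __SDSCC__
    return static_cast<int *>(sds_alloc(count * sizeof(int)));
#else
    return static_cast<int *>(malloc(count * sizeof(int)));
#endif
}

// Release a buffer with the deallocator that matches its allocator.
void free_dma_buffer(int *buf) {
#ifdef __SDSCC__
    sds_free(buf);
#else
    free(buf);
#endif
}
```

Keeping allocation behind one helper makes it harder to mix `sds_alloc()` buffers with `free()`, or the reverse.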

Data Movers

A data mover is added by thesds++compiler to move data into and out of the hardware accelerator. Generally, a data mover is a FIFO, or a direct memory access (DMA) interface between the processor and the programmable logic. The data mover is inferred by the compiler based on the volume of data being transferred, characteristics of memory being transferred, and access pattern expectations of the accelerator consuming or producing the data in the hardware function. The data mover is implemented into the PL region to support data transfers to and from the hardware function. The system compiler implements one or more accelerators in the programmable region, including the data movers, automating control signals, and interrupts.

You can specify the SDS DATA COPY or DATA ZERO_COPY pragmas in the source code of the hardware function to influence the behavior of the data movers. The DATA COPY pragma indicates that data is explicitly copied between memory and the hardware function, using a suitable data mover for the transfer. The DATA ZERO_COPY pragma means that the hardware function accesses the data directly from shared memory through an AXI master bus interface.

With the data size boundaries known, methods of transferring data can be optimized. For example, to access data like a large image (1920 x 1080), it is advantageous to store the data in the DDR in a physically contiguous way. To retrieve the data, use direct interfacing to the DDR and then determine if it needs to be sequentially accessed or not. By default, the sds++ compiler infers the data movers, but in cases where the data being transferred is large, you need to identify the appropriate data mover to use. See the SDSoC Profiling and Optimization Guide for information on setting up and optimizing the data motion network. For instance, if the function interface uses wide data widths (over 64 bits), applying the FASTDMA over the AXIDMA_SIMPLE would allow for higher bandwidth and possibly faster throughput.

To incorporate these types of data movers, pragmas are used to tell the compiler how to interface the accelerator to the rest of the system. For example, for storing and retrieving an image in DDR the following SDS pragmas can be used:

#pragma SDS DATA COPY(out[0:size])
#pragma SDS DATA ACCESS_PATTERN(data:SEQUENTIAL, out:SEQUENTIAL)
void accelerator(float *data, int *out, int size);

The DATA ACCESS_PATTERN pragma is used to specify how the data is being accessed by the accelerator. It is needed when the data is determined at runtime, but unknown at compile time. For SEQUENTIAL data access, SDSoC creates a streaming interface, while a RANDOM data access pattern creates a RAM interface, which is the default. A key point is that SEQUENTIAL only accesses elements from the array one time, while RANDOM can access any array value in any order. Note though that using a random access pattern still transfers the complete volume of data, regardless of the volume being accessed.

TIP: Depending on how the accelerator is written, the sds++ compiler can automatically determine how the data is going to be accessed, and the use of ACCESS_PATTERN would not be needed. For the example code, it can be safely assumed that the accelerator's memory access is not easily determined and the pragma is needed to make sure the compiler treats the data appropriately.

You can also use the SDS pragma DATA MEM_ATTRIBUTE as a hint to the compiler to trust that memory is physically contiguous or not, and is cacheable or not.

Knowing how the data is being allocated can help tune the system performance depending on the accelerator. For example, the compiler can use simple DMA transfer for physically contiguous memory, which is smaller and faster than AXIDMA_SG. For physically contiguous memory, you must use sds_alloc, while for non-physically contiguous memory use malloc. Specifying the DATA MEM_ATTRIBUTE helps determine what kind of data mover the compiler can use.

See the for more information on SDS pragmas and examples.

You can also direct the creation of data movers as elements of a packaged C-Callable IP by applying pragmas to the software function signature defined in the header file associated with the C-Callable IP. See the SDSoC Environment User Guide for more information.

Function Body

After determining the function interfaces and the data transfer mechanism, writing the function body is all that remains. The body of a hardware function should not be all that different from a function written for the processor. However, a key point to remember is that there are opportunities to improve the performance of the accelerator, and of the overall application. To this end, you can examine and rewrite the structure of hardware functions to increase instruction-level or task-level parallelism, use bit-accurate data types, manage loop unrolling and pipelining, and overall dataflow.

Data Types

As it is faster to write and verify the code by using native C data types such as int, float, or double, it is a common practice to use these data types when coding for the first time. However, the hardware function code is implemented in hardware and all the operator sizes used in the hardware are dependent on the data types used in the accelerator code. The default native C/C++ data types can result in larger and slower hardware resources that can limit the performance of the hardware function. Instead, consider using bit-accurate data types to ensure the code is optimized for implementation in hardware. Using bit-accurate, or arbitrary precision data types, results in hardware operators which are smaller and faster. This allows more logic to be placed into the programmable logic and also allows the logic to execute at higher clock frequencies while using less power.

Consider using bit-accurate data types instead of native C/C++ data types in your code.

Note: Bit-accurate data types should be used within the accelerator function and not at the top-level interface.

Arbitrary Precision Integer Types

Arbitrary precision integer data types are defined by ap_int<N> or ap_uint<N> for signed and unsigned integers respectively, inside the header file ap_int.h. To use arbitrary precision integer data types:

Add the header file ap_int.h to the source code.
Change the bit types to ap_int<N> or ap_uint<N>, where N is a bit-size from 1 to 1024.

The following example shows how the header file is added and the two variables are implemented to use 9-bit integer and 10-bit unsigned integer.

#include "ap_int.h"
ap_int<9> var1;    // 9-bit signed integer
ap_uint<10> var2;  // 10-bit unsigned integer

Arbitrary Precision Fixed-Point Data Types

Some existing applications use floating point data types as they are written for other hardware architectures. However, fixed-point data types are a useful replacement for floating point types which require many clock cycles to complete. Carefully evaluate trade-offs in power, cost, productivity, and precision when choosing to implement floating-point vs. fixed-point arithmetic for your application and accelerators.

As discussed in Deep Learning with INT8 Optimization on Xilinx Devices (WP486), using fixed-point arithmetic instead of floating point for applications like machine learning can increase power efficiency, and lower the total power required. Unless the entire range of the floating-point type is required, the same accuracy can often be achieved with a fixed-point type, resulting in smaller and faster hardware. The paper Reduce Power and Cost by Converting from Floating Point to Fixed Point (WP491) provides some examples of this conversion.

Fixed-point data types model the data as integer and fraction bits. The fixed-point data type requires the ap_fixed.h header, and supports both a signed and unsigned form as follows:

Header file: ap_fixed.h
Signed fixed point: ap_fixed<W,I,Q,O,N>
Unsigned fixed point: ap_ufixed<W,I,Q,O,N>
- W = Total width < 1024 bits
- I = Integer bit width. The value of I must be less than or equal to the width (W). The number of bits to represent the fractional part is W minus I. Only a constant integer expression can be used to specify the integer width.
- Q = Quantization mode. Only predefined enumerated values can be used to specify Q. The accepted values are:
  - AP_RND: Rounding to plus infinity.
  - AP_RND_ZERO: Rounding to zero.
  - AP_RND_MIN_INF: Rounding to minus infinity.
  - AP_RND_INF: Rounding to infinity.
  - AP_RND_CONV: Convergent rounding.
  - AP_TRN: Truncation. This is the default value when Q is not specified.
  - AP_TRN_ZERO: Truncation to zero.
- O = Overflow mode. Only predefined enumerated values can be used to specify O. The accepted values are:
  - AP_SAT: Saturation.
  - AP_SAT_ZERO: Saturation to zero.
  - AP_SAT_SYM: Symmetrical saturation.
  - AP_WRAP: Wrap-around. This is the default value when O is not specified.
  - AP_WRAP_SM: Sign magnitude wrap-around.
- N = The number of saturation bits in the overflow WRAP modes. Only a constant integer expression can be used as the parameter value. The default value is zero.
TIP: The ap_fixed and ap_ufixed data types permit shorthand definition, with only W and I being required, and other parameters assigned default values. However, to define Q or N, you must also specify the parameters before those, even if you just specify the default values.

In the example code below, the ap_fixed type is used to define a signed 18-bit variable with 6 bits representing the integer value above the binary point, and by implication, 12 bits representing the fractional value below the binary point. The quantization mode is set to round to plus infinity (AP_RND). Because the overflow mode and saturation bits are not specified, the defaults AP_WRAP and 0 are used.

#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND> my_type;
...

When performing calculations where the variables have different numbers of bits (W), or different precision (I), the binary point is automatically aligned. See "C++ Arbitrary Precision Fixed-Point Types" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information on using fixed-point data types.
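To make the W, I, Q, and O parameters more concrete, the following plain C++ sketch models the raw value a signed 18-bit, 6-integer-bit fixed-point type would store. It is not the ap_fixed implementation and does not use ap_fixed.h. All names are hypothetical, and it assumes that AP_TRN drops the extra fraction bits (rounding toward minus infinity), that AP_RND rounds to plus infinity, and that AP_WRAP keeps the low W bits in two's complement.

```cpp
#include <cmath>
#include <cstdint>

constexpr int TOTAL_BITS = 18;                    // W
constexpr int INT_BITS = 6;                       // I
constexpr int FRAC_BITS = TOTAL_BITS - INT_BITS;  // 12 fractional bits

// AP_WRAP model: keep the low TOTAL_BITS bits, interpreted as two's complement.
std::int64_t wrap_to_width(std::int64_t raw) {
    const std::int64_t modulus = std::int64_t{1} << TOTAL_BITS;
    std::int64_t r = raw % modulus;
    if (r < 0) r += modulus;             // now in [0, 2^W)
    if (r >= modulus / 2) r -= modulus;  // map the upper half to negatives
    return r;
}

// AP_TRN model: discard bits below the LSB (round toward minus infinity).
std::int64_t quantize_trn(double x) {
    return wrap_to_width(static_cast<std::int64_t>(std::floor(std::ldexp(x, FRAC_BITS))));
}

// AP_RND model: round to plus infinity (add half an LSB, then truncate).
std::int64_t quantize_rnd(double x) {
    return wrap_to_width(static_cast<std::int64_t>(std::floor(std::ldexp(x, FRAC_BITS) + 0.5)));
}

// Convert a raw stored integer back to the real value it represents.
double to_double(std::int64_t raw) {
    return std::ldexp(static_cast<double>(raw), -FRAC_BITS);
}
```

Under this model a value of 1.5 LSBs is stored as 1 LSB with truncation and 2 LSBs with AP_RND, and the out-of-range value 32.0 wraps to -32.0.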

Array Configuration

The SDSoC compiler maps large arrays to the block RAM (BRAM) memory in the PL region. These BRAMs can have a maximum of two access points or ports. This can limit the performance of the application as all the elements of an array cannot be accessed in parallel when implemented in hardware.

IMPORTANT:Use the following array configurations on local buffer variables inside the accelerator, rather than on the function parameters, otherwise it can cause incorrect runtime behavior.

Depending on the performance requirements, you might need to access some or all of the elements of an array in the same clock cycle. To achieve this, the #pragma HLS ARRAY_PARTITION can be used to instruct the compiler to split the elements of an array and map it to smaller arrays, or to individual registers. The compiler provides three types of array partitioning, as shown in the following figure. The three types of partitioning are:

block: The original array is split into equally sized blocks of consecutive elements of the original array.
cyclic: The original array is split into equally sized blocks interleaving the elements of the original array.
complete: Split the array into its individual elements. This corresponds to resolving a memory into individual registers. This is the default for the ARRAY_PARTITION pragma.

For block and cyclic partitioning, the factor option specifies the number of arrays that are created. In the preceding figure, a factor of 2 is used to split the array into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the later arrays will have fewer elements.
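The mapping from an element index to the smaller array ("bank") that holds it can be sketched with two hypothetical helpers. The block size of ceil(n / factor) is an assumption chosen to match the statement that later arrays receive fewer elements; this is an illustration, not a tool-verified layout.

```cpp
// block: consecutive elements share a bank; each bank holds
// ceil(n / factor) elements, except possibly the last ones.
int block_bank(int index, int n, int factor) {
    const int bank_size = (n + factor - 1) / factor;
    return index / bank_size;
}

// cyclic: elements are dealt out round-robin across the banks.
int cyclic_bank(int index, int factor) {
    return index % factor;
}
```

For an 8-element array with a factor of 2, block partitioning places elements 0 to 3 in bank 0 and 4 to 7 in bank 1, while cyclic partitioning places even indices in bank 0 and odd indices in bank 1.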

When partitioning multi-dimensional arrays, the dimension option is used to specify which dimension is partitioned. The following figure shows how the dimension option is used to partition the following example code in three different ways:

void foo (...) {
  // my_array[dim=1][dim=2][dim=3]
  // The following three pragma results are shown in the figure below
  // #pragma HLS ARRAY_PARTITION variable=my_array dim=3 factor=2
  // #pragma HLS ARRAY_PARTITION variable=my_array dim=1 factor=2
  // #pragma HLS ARRAY_PARTITION variable=my_array dim=0 complete
  int my_array[10][6][4];
  ...
}

The examples in the figure demonstrate how partitioning dimension 3 results in four separate arrays and partitioning dimension 1 results in 10 separate arrays. If 0 is specified as the dimension, all dimensions are partitioned.

The Importance of Careful Partitioning

A complete partition of the array maps all the array elements to individual registers. This helps improve the kernel performance because all of these registers can be accessed concurrently in the same cycle.

CAUTION:

Complete partitioning of large arrays consumes a lot of the PL region. It could even cause the compilation process to slow down and run into capacity issues. Partition the array only when it is needed. Consider selectively partitioning a particular dimension or performing a block or cyclic partitioning.

Choosing a Specific Dimension to Partition

Suppose A and B are two-dimensional arrays representing two matrices. Consider the following Matrix Multiplication algorithm:

int A[64][64];
int B[64][64];
ROW_WISE: for (int i = 0; i < 64; i++) {
  COL_WISE: for (int j = 0; j < 64; j++) {
    #pragma HLS PIPELINE
    int result = 0;
    COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
      result += A[i][k] * B[k][j];
    }
    C[i][j] = result;
  }
}

Due to the PIPELINE pragma, the ROW_WISE and COL_WISE loops are flattened together and COMPUTE_LOOP is fully unrolled. To concurrently execute each iteration (k) of the COMPUTE_LOOP, the code must access each column of matrix A and each row of matrix B in parallel. Therefore, the matrix A should be split in the second dimension, and matrix B should be split in the first dimension.

#pragma HLS ARRAY_PARTITION variable=A dim=2 complete
#pragma HLS ARRAY_PARTITION variable=B dim=1 complete

Choosing Between Cyclic and Block Partitions

Here the same matrix multiplication algorithm is used to demonstrate choosing between cyclic and block partitioning and determining the appropriate factor, by understanding the array access pattern of the underlying algorithm.

int A[64 * 64];
int B[64 * 64];
#pragma HLS ARRAY_PARTITION variable=A dim=1 cyclic factor=64
#pragma HLS ARRAY_PARTITION variable=B dim=1 block factor=64
ROW_WISE: for (int i = 0; i < 64; i++) {
  COL_WISE: for (int j = 0; j < 64; j++) {
    #pragma HLS PIPELINE
    int result = 0;
    COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
      result += A[i * 64 + k] * B[k * 64 + j];
    }
    C[i * 64 + j] = result;
  }
}

In this version of the code, A and B are now one-dimensional arrays. To access each column of matrix A and each row of matrix B in parallel, cyclic and block partitions are used as shown in the above example. To access each column of matrix A in parallel, cyclic partitioning is applied with the factor specified as the row size, in this case 64. Similarly, to access each row of matrix B in parallel, block partitioning is applied with the factor specified as the column size, or 64.
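The choice of cyclic for A and block for B can be checked with a small host-side model. The helper names are hypothetical, and the bank formulas assume cyclic partitioning maps index to index % factor and block partitioning maps index to index / (size / factor). The check confirms that the 64 reads issued for one (i, j) pair land in 64 different banks of each array.

```cpp
constexpr int DIM = 64;

int cyclic_bank(int index, int factor) { return index % factor; }

int block_bank(int index, int n, int factor) {
    return index / ((n + factor - 1) / factor);
}

// True when the 64 reads of A (cyclic) and the 64 reads of B (block) for a
// single (i, j) pair each touch a distinct bank, so they can run in parallel.
bool accesses_conflict_free(int i, int j) {
    bool a_used[DIM] = {false};
    bool b_used[DIM] = {false};
    for (int k = 0; k < DIM; ++k) {
        const int a_bank = cyclic_bank(i * DIM + k, DIM);
        const int b_bank = block_bank(k * DIM + j, DIM * DIM, DIM);
        if (a_used[a_bank] || b_used[b_bank]) return false;
        a_used[a_bank] = true;
        b_used[b_bank] = true;
    }
    return true;
}

// Counter-example: with block partitioning, neighbouring elements of one
// row of A fall into the same bank and would compete for its ports.
bool a_row_conflicts_with_block(int i) {
    return block_bank(i * DIM, DIM * DIM, DIM) ==
           block_bank(i * DIM + 1, DIM * DIM, DIM);
}
```

Swapping the two partition types would place a whole row of A, or a whole column of B, in a single bank, which is the situation the counter-example function detects.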

Minimizing Array Accesses with Caching

As arrays are mapped to BRAM with a limited number of access ports, repeated array accesses can limit the performance of the accelerator. You should have a good understanding of the array access pattern of the algorithm, and limit the array accesses by locally caching the data to improve the performance of the hardware function.

The following code example shows a case in which accesses to an array can limit performance in the final implementation. In this example, there are three accesses to the array mem[N] to create a summed result.

#include "array_mem_bottleneck.h"
dout_t array_mem_bottleneck(din_t mem[N]) {
  dout_t sum = 0;
  int i;
  SUM_LOOP: for (i = 2; i < N; ++i) {
    sum += mem[i] + mem[i-1] + mem[i-2];
  }
  return sum;
}
             
The code in the preceding example can be rewritten as shown in the following example to allow the code to be pipelined with II = 1. By performing pre-reads and manually pipelining the data accesses, there is only one array read specified inside each iteration of the loop. This ensures that only a single-port BRAM is needed to achieve the performance.

#include "array_mem_perform.h"
dout_t array_mem_perform(din_t mem[N]) {
  din_t tmp0, tmp1, tmp2;
  dout_t sum = 0;
  int i;
  tmp0 = mem[0];
  tmp1 = mem[1];
  SUM_LOOP: for (i = 2; i < N; i++) {
    tmp2 = mem[i];
    sum += tmp2 + tmp1 + tmp0;
    tmp0 = tmp1;
    tmp1 = tmp2;
  }
  return sum;
}

Note: Consider minimizing the array access by caching to local registers to improve the pipelining performance depending on the algorithm.

For more detailed information related to the configuration of arrays, see the "Arrays" section in the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
          
Loops

Loops are an important aspect of a high performance accelerator. Generally, loops are either pipelined or unrolled to take advantage of the highly distributed and parallel FPGA architecture to provide a performance boost compared to running on a CPU.

By default, loops are neither pipelined nor unrolled. Each iteration of the loop takes at least one clock cycle to execute in hardware. Thinking from the hardware perspective, there is an implicit wait until clock for the loop body. The next iteration of a loop only starts when the previous iteration is finished.
           
Loop Pipelining

By default, every iteration of a loop only starts when the previous iteration has finished. In the loop example below, a single iteration of the loop adds two variables and stores the result in a third variable. Assume that in hardware this loop takes three cycles to finish one iteration. Also, assume that the loop variable len is 20, that is, the vadd loop runs for 20 iterations in the hardware function. Therefore, it requires a total of 60 clock cycles (20 iterations * 3 cycles) to complete all the operations of this loop.

vadd: for(int i = 0; i < len; i++) {
  c[i] = a[i] + b[i];
}
             
TIP: It is good practice to always label a loop as shown in the above code example (vadd:…). This practice helps with debugging when working in the SDSoC environment. Note that the labels generate warnings during compilation, which can be safely ignored.
             
Pipelining the loop executes subsequent iterations in a pipelined manner. This means that subsequent iterations of the loop overlap and run concurrently, executing at different sections of the loop body. Pipelining a loop can be enabled by the pragma HLS PIPELINE. Note that the pragma is placed inside the body of the loop.

vadd: for(int i = 0; i < len; i++) {
  #pragma HLS PIPELINE
  c[i] = a[i] + b[i];
}
             
In the example above, it is assumed that every iteration of the loop takes three cycles: read, add, and write. Without pipelining, each successive iteration of the loop starts in every third cycle. With pipelining, the loop can start subsequent iterations in fewer than three cycles, such as in every second cycle, or in every cycle.

The number of cycles it takes to start the next iteration of a loop is called the initiation interval (II) of the pipelined loop. So II = 2 means each successive iteration of the loop starts every two cycles. An II = 1 is the ideal case, where each iteration of the loop starts in the very next cycle. When you use the pragma HLS PIPELINE, the compiler always tries to achieve II = 1 performance.
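The effect of the initiation interval on total run time can be estimated with a simplified cycle model. This is an approximation for illustration only, not a tool report; the function names are hypothetical, and the pipelined estimate assumes a fixed iteration latency and no stalls.

```cpp
// Cycles for a non-pipelined loop: each iteration runs to completion
// before the next one starts.
int sequential_cycles(int iterations, int latency) {
    return iterations * latency;
}

// Cycles for a pipelined loop: the first iteration takes `latency` cycles
// and every later iteration starts `ii` cycles after the previous one.
int pipelined_cycles(int iterations, int latency, int ii) {
    if (iterations <= 0) return 0;
    return latency + ii * (iterations - 1);
}
```

For the 20-iteration vadd loop with a three-cycle iteration, this model gives 60 cycles without pipelining, 41 cycles at II = 2, and 22 cycles at II = 1.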
The following figure illustrates the difference in execution between pipelined and non-pipelined loops. In this figure, (A) shows the default sequential operation where there are three clock cycles between each input read (II = 3), and it requires eight clock cycles before the last output write is performed.

Figure: Loop Pipelining
             
In the pipelined version of the loop shown in (B), a new input sample is read every cycle (II = 1) and the final output is written after only four clock cycles, substantially improving both the II and latency while using the same hardware resources.

IMPORTANT: Pipelining a loop causes any loops nested inside the pipelined loop to get unrolled.

If there are data dependencies inside a loop, it might not be possible to achieve II = 1, and a larger initiation interval might be the result. Loop dependencies are discussed in Loop Dependencies.
           
            Loop Unrolling
            
             
              The compiler can also unroll a loop, either partially or completely, to perform multiple loop iterations in parallel. This is done using the HLS UNROLL pragma. Unrolling a loop can lead to a very fast design, with significant parallelism. However, because all the operations of the loop iterations are executed in parallel, a large amount of programmable logic resources is required to implement the hardware.
              As a result, the compiler can face challenges dealing with such a large number of resources and can face capacity problems that slow down the hardware function compilation process. It is a good guideline to unroll loops that have a small loop body, or a small number of iterations.
              vadd: for(int i = 0; i < 20; i++) {
              #pragma HLS UNROLL
                c[i] = a[i] + b[i];
              }
             
             In the preceding example, you can see that pragma HLS UNROLL has been inserted into the body of the loop to instruct the compiler to unroll the loop completely. All 20 iterations of the loop are executed in parallel if that is permitted by any data dependency.
             Completely unrolling a loop can consume significant device resources, while partially unrolling the loop provides some performance improvement without causing a significant impact on hardware resources.
             
              Partially Unrolled Loop
              To completely unroll a loop, the loop must have a constant bound (20 in the example above). However, partial unrolling is possible for loops with a variable bound. A partially unrolled loop means that only a certain number of loop iterations can be executed in parallel.
              
               The following code example illustrates how partially unrolled loops work:
               array_sum: for(int i = 0; i < 4; i++) {
               #pragma HLS UNROLL factor=2
                 sum += arr[i];
               }
              
              In the above example the UNROLL pragma is given a factor of 2. This is equivalent to manually duplicating the loop body and running the loop for half as many iterations, with the two copies of the body executing concurrently. The following code shows how this would be written. This transformation allows two iterations of the above loop to execute in parallel.
              array_sum_unrolled: for(int i = 0; i < 4; i += 2) {
                // Manual unroll by a factor 2
                sum += arr[i];
                sum += arr[i+1];
              }
              Just like data dependencies inside a loop impact the initiation interval of a pipelined loop, an unrolled loop performs operations in parallel only if data dependencies allow it. If operations in one iteration of the loop require the result from a previous iteration, they cannot execute in parallel, but execute as soon as the data from one iteration is available to the next.
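              The accumulation in array_sum is a case in point: each addition needs the previous value of sum, so the two copies of the loop body still form a chain. One common way to relax this, sketched below, is to keep one partial sum per unrolled copy and combine them after the loop. The function name `array_sum_split` and the `int` element type are assumptions for illustration; the bound `n` is a variable, so the sketch also handles a leftover element when `n` is odd. Reordering additions this way is exact for integers but can change rounding for floating-point data.

```cpp
#include <cassert>

// Manual unroll by a factor of 2 with one accumulator per copy of the loop body.
// array_sum_split is a hypothetical name; it is not part of the original example.
int array_sum_split(const int *arr, int n) {
    assert(n >= 0);
    int sum_even = 0;  // accumulates arr[0], arr[2], ...
    int sum_odd = 0;   // accumulates arr[1], arr[3], ...
    int i = 0;
    // The two additions below do not depend on each other,
    // so they can be scheduled in the same cycle.
    for (; i + 1 < n; i += 2) {
        sum_even += arr[i];
        sum_odd += arr[i + 1];
    }
    // Leftover element when n is odd (the exit check for a variable bound).
    if (i < n) {
        sum_even += arr[i];
    }
    return sum_even + sum_odd;
}
```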
              
               Note: A good methodology is to PIPELINE loops first, and then UNROLL loops with small loop bodies and limited iterations to improve performance further.
              
             
            
           
           
            Loop Dependencies
            
             
              Data dependencies in loops can impact the results of loop pipelining or unrolling. These loop dependencies can be within a single iteration of a loop or between different iterations of a loop. The straightforward method to understand loop dependencies is to examine an extreme example. In the following code example, the result of the loop is used as the loop continuation or exit condition. Each iteration of the loop must finish before the next can start.
              Minim_Loop: while (a != b) {
                if (a > b)
                  a -= b;
                else
                  b -= a;
              }
             
             This loop cannot be pipelined. The next iteration of the loop cannot begin until the previous iteration ends.
             Dealing with various types of dependencies with the sds++ compiler is an extensive topic requiring a detailed understanding of the high-level synthesis procedures underlying the compiler. Refer to the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information on "Dependencies with Vivado HLS."
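             A less extreme case helps make the distinction concrete. The sketch below contrasts a loop whose iterations are independent with one that carries a value from one iteration to the next. The function names, the `int` element type, and the explicit `II=1` request are illustrative assumptions. The second loop can still be pipelined, but whether II = 1 is achieved depends on how long the carried addition takes on the target device.

```cpp
#include <cassert>

// Independent iterations: c[i] uses only a[i] and b[i].
void vector_add(const int *a, const int *b, int *c, int n) {
    assert(n >= 0);
    indep: for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        c[i] = a[i] + b[i];
    }
}

// Loop-carried dependency: iteration i needs out[i - 1],
// which is produced by iteration i - 1.
void running_sum(const int *in, int *out, int n) {
    assert(n >= 0);
    if (n == 0) return;
    out[0] = in[0];
    carried: for (int i = 1; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = out[i - 1] + in[i];
    }
}
```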
            
           
           
            Nested Loops
            
             Coding with nested loops is a common practice. Understanding how loops are pipelined in a nested loop structure is key to achieving the desired performance.
             If pragma HLS PIPELINE is applied to a loop nested inside another loop, the sds++ compiler attempts to flatten the loops to create a single loop, and applies the PIPELINE pragma to the constructed loop. Loop flattening helps improve the performance of the hardware function.
             
              The compiler is able to flatten the following types of nested loops:
              
               Perfect nested loop:
                
                 Only the inner loop has a loop body.
                 There is no logic or operations specified between the loop declarations.
                 All the loop bounds are constant.
                
               Semi-perfect nested loop:
                
                 Only the inner loop has a loop body.
                 There is no logic or operations specified between the loop declarations.
                 The inner loop bound must be a constant, but the outer loop bound can be a variable.
                
              
             
             
              The following code example illustrates the structure of a perfect nested loop:
              ROW_LOOP: for(int i = 0; i < MAX_HEIGHT; i++) {
                COL_LOOP: for(int j = 0; j < MAX_WIDTH; j++) {
                #pragma HLS PIPELINE
                  // Main computation per pixel
                }
              }
             
             The above example shows a nested loop structure with two loops that performs some computation on incoming pixel data. In most cases, you want to process a pixel in every cycle, hence PIPELINE is applied to the body of the inner loop. The compiler is able to flatten the nested loop structure in the example because it is a perfect nested loop.
             The nested loop in the preceding example contains no logic between the ROW_LOOP and COL_LOOP declarations; all of the processing logic is inside COL_LOOP. Also, both loops have a fixed number of iterations. These two criteria help the sds++ compiler flatten the loops and apply the PIPELINE constraint.
             
              Note: If the outer loop has a variable bound, the compiler can still flatten the loops. Always try to give the inner loop a constant bound.
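              Conceptually, flattening replaces the two loops with a single loop over all pixels, which is why no logic can sit between the loop declarations. The sketch below writes that single loop by hand, assuming hypothetical values for `MAX_HEIGHT` and `MAX_WIDTH` and a placeholder per-pixel computation (`invert_pixel`). It illustrates the idea and is not the code the compiler actually generates.

```cpp
#include <cassert>

const int MAX_HEIGHT = 4;  // assumed value for illustration
const int MAX_WIDTH = 8;   // assumed value for illustration

// Placeholder for the main computation per pixel.
inline unsigned char invert_pixel(unsigned char p) {
    return static_cast<unsigned char>(255 - p);
}

// Hand-written equivalent of the flattened ROW_LOOP/COL_LOOP nest:
// one loop with MAX_HEIGHT * MAX_WIDTH iterations.
void process_flat(const unsigned char *in, unsigned char *out) {
    assert(in != nullptr && out != nullptr);
    FLAT_LOOP: for (int k = 0; k < MAX_HEIGHT * MAX_WIDTH; k++) {
#pragma HLS PIPELINE
        int i = k / MAX_WIDTH;  // row index of the original ROW_LOOP
        int j = k % MAX_WIDTH;  // column index of the original COL_LOOP
        out[i * MAX_WIDTH + j] = invert_pixel(in[i * MAX_WIDTH + j]);
    }
}
```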
             
            
           
           
            Sequential Loops
            
             
              If there are multiple loops in the design, by default they do not overlap, and execute sequentially. This section introduces the concept of dataflow optimization for sequential loops. Consider the following code example:
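              void adder(unsigned int *in, unsigned int *out, int inc, int size) {
                unsigned int in_internal[MAX_SIZE];
                unsigned int out_internal[MAX_SIZE];

                mem_rd: for (int i = 0 ; i < size ; i++){
                #pragma HLS PIPELINE
                  // Reading from the input vector "in" and saving to internal variable
                  in_internal[i] = in[i];
                }
                // The listing is truncated in the source from this point; the two loops
                // below are reconstructed from the description that follows.
                compute: for (int i = 0; i < size; i++) {
                #pragma HLS PIPELINE
                  out_internal[i] = in_internal[i] + inc;
                }
                mem_wr: for (int i = 0; i < size; i++) {
                #pragma HLS PIPELINE
                  out[i] = out_internal[i];
                }
              }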

             

             In the previous example, three sequential loops are shown: mem_rd, compute, and mem_wr.
             
              The mem_rd loop reads input vector data from the memory interface and stores it in internal storage.
              The main compute loop reads from the internal storage, performs an increment operation, and saves the result to another internal storage.
              The mem_wr loop writes the data back to memory from the internal storage.
             
             By default, these loops are executed sequentially without any overlap. First, the mem_rd loop finishes reading all the input data before the compute loop starts its operation. Similarly, the compute loop finishes processing the data before the mem_wr loop starts to write the data. However, the execution of these loops can be overlapped, allowing the compute (or mem_wr) loop to start as soon as there is enough data available to feed its operation, before the mem_rd (or compute) loop has finished processing its data.
             The loop execution can be overlapped using dataflow optimization as described in Dataflow Optimization.
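             One way to prepare this design for overlapped execution is to move each loop into its own function so that each becomes a separate task. The sketch below does this for the three stages described above. The function names, the value of `MAX_SIZE`, and the use of local arrays as the channels between tasks are assumptions for illustration. It compiles as ordinary C++ because a standard compiler ignores the HLS pragmas.

```cpp
#include <cassert>

const int MAX_SIZE = 1024;  // assumed capacity of the internal buffers

// Stage 1: copy the input vector into internal storage.
static void read_input(const unsigned int *in, unsigned int *buf, int size) {
    assert(size >= 0 && size <= MAX_SIZE);
    mem_rd: for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE
        buf[i] = in[i];
    }
}

// Stage 2: add the increment to every element.
static void add_increment(const unsigned int *src, unsigned int *dst, int inc, int size) {
    compute: for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE
        dst[i] = src[i] + inc;
    }
}

// Stage 3: copy the results back to the output vector.
static void write_output(const unsigned int *buf, unsigned int *out, int size) {
    mem_wr: for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE
        out[i] = buf[i];
    }
}

// Each local array has exactly one producer and one consumer,
// and data moves only forward from one stage to the next.
void adder_dataflow(const unsigned int *in, unsigned int *out, int inc, int size) {
    unsigned int in_internal[MAX_SIZE];
    unsigned int out_internal[MAX_SIZE];
#pragma HLS DATAFLOW
    read_input(in, in_internal, size);
    add_increment(in_internal, out_internal, inc, size);
    write_output(out_internal, out, size);
}
```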
            

           

          

          
           Dataflow Optimization
           
            Dataflow optimization is a powerful technique to improve the hardware function performance by enabling task-level pipelining and parallelism inside the hardware function. It allows the sds++ compiler to schedule multiple functions of the hardware function to run concurrently to achieve higher throughput and lower latency. This is also known as task-level parallelism.
            The following figure shows a conceptual view of dataflow pipelining. The default behavior is to execute and complete func_A, then func_B, and finally func_C. With the HLS DATAFLOW pragma enabled, the compiler can schedule each function to execute as soon as data is available. In this example, the original top function has a latency and interval of eight clock cycles. With DATAFLOW optimization, the interval is reduced to only three clock cycles.
            
             Figure: Dataflow Optimization
             
            
           
           
            Dataflow Coding Example
            
             In the dataflow coding example you should notice the following:
             
              The HLS DATAFLOW pragma is applied to instruct the compiler to enable dataflow optimization. This pragma does not control a data mover, which handles the interface between the PS and PL; it controls how the data flows through the accelerator.
              The stream class is used as a data transferring channel between each of the functions in the dataflow region.
               
                TIP: The stream class infers a first-in first-out (FIFO) memory circuit in the programmable logic. This memory circuit, which acts as a queue in software programming, provides data-level synchronization between the functions and achieves better performance. For additional details on the hls::stream class, see the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
               
             
             void compute_kernel(ap_int<256> *inx, ap_int<256> *outx, DTYPE alpha) {
               // The stream element type is missing from the source listing;
               // DTYPE is assumed here.
               hls::stream<DTYPE> inFifo;
             #pragma HLS STREAM variable=inFifo depth=32
               hls::stream<DTYPE> outFifo;
             #pragma HLS STREAM variable=outFifo depth=32

             #pragma HLS DATAFLOW
               read_data(inx, inFifo);
               // Do computation with the acquired data
               compute(inFifo, outFifo, alpha);
               write_data(outx, outFifo);
               return;
             }
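             To see why the FIFO acts as a queue, the host-only sketch below models the three stages in plain C++. `SoftStream` is a hypothetical stand-in built on `std::queue` that mirrors the `read`, `write`, and `empty` methods of hls::stream; it is not the Xilinx class. The stage bodies (multiplying each value by `alpha`) and the extra element count `n` are also assumed, because the listing above does not show them. In software the stages run one after another, whereas in hardware the FIFO lets them run concurrently.

```cpp
#include <cassert>
#include <queue>

// Hypothetical software stand-in for hls::stream<T>; host-side illustration only.
template <typename T>
class SoftStream {
public:
    void write(const T &value) { fifo_.push(value); }
    T read() {
        assert(!fifo_.empty());  // a hardware FIFO would block instead
        T value = fifo_.front();
        fifo_.pop();
        return value;
    }
    bool empty() const { return fifo_.empty(); }
private:
    std::queue<T> fifo_;
};

// Producer: pushes the input values into the first FIFO in order.
void read_data(const int *in, SoftStream<int> &out_fifo, int n) {
    for (int i = 0; i < n; i++) out_fifo.write(in[i]);
}

// Middle stage: consumes values in arrival order and produces scaled values.
void compute(SoftStream<int> &in_fifo, SoftStream<int> &out_fifo, int alpha, int n) {
    for (int i = 0; i < n; i++) out_fifo.write(in_fifo.read() * alpha);
}

// Consumer: drains the second FIFO into the output array.
void write_data(int *out, SoftStream<int> &in_fifo, int n) {
    for (int i = 0; i < n; i++) out[i] = in_fifo.read();
}

void compute_kernel_model(const int *in, int *out, int alpha, int n) {
    SoftStream<int> in_fifo;
    SoftStream<int> out_fifo;
    read_data(in, in_fifo, n);
    compute(in_fifo, out_fifo, alpha, n);
    write_data(out, out_fifo, n);
}
```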
            
           
           
            Canonical Forms of Dataflow Optimization
            
             
              Xilinx recommends writing the code inside a dataflow region using canonical forms. There are canonical forms for dataflow optimizations for both functions and loops.
              
               Functions: The canonical form coding guideline for dataflow inside a function specifies:
                
                 Use only the following types of variables inside the dataflow region:
                  
                   Local non-static scalar/array/pointer variables.
                   Local static hls::stream variables.
                  
                 Function calls transfer data only in the forward direction.
                 An array or hls::stream should have only one producer function and one consumer function.
                 The function arguments (variables coming from outside the dataflow region) should only be read, or written, not both. If performing both read and write on the same function argument then read should happen before write.
                 The local variables (those that are transferring data in forward direction) should be written before being read.
                
The following code example illustrates the canonical form for dataflow within a function. Note that the first function (func1) reads the inputs and the last function (func3) writes the outputs. Also note that one function creates output values that are passed to the next function as input parameters.
void dataflow(Input0, Input1, Output0, Output1) {
  UserDataType C0, C1, C2;
#pragma HLS DATAFLOW
  func1(read Input0, read Input1, write C0, write C1);
  func2(read C0, read C1, write C2);
  func3(read C2, write Output0, write Output1);
}
               Loop: The canonical form coding guideline for dataflow inside a loop body includes the coding guidelines for a function defined above, and also specifies the following:
                
                 Initial value 0.
                 The loop condition is formed by a comparison of the loop variable with a numerical constant or variable that does not vary inside the loop body.
                 Increment by 1.
                
The following code example illustrates the canonical form for dataflow within a loop.
void dataflow(Input0, Input1, Output0, Output1) {
  UserDataType C0, C1, C2;
  for (int i = 0; i < N; ++i) {
#pragma HLS DATAFLOW
    func1(read Input0, read Input1, write C0, write C1);
    func2(read C0, read C1, write C2);
    func3(read C2, write Output0, write Output1);
  }
}
              
             
            
           
           
            Troubleshooting Dataflow
            
             The following behaviors can prevent the sds++ compiler from performing DATAFLOW optimizations:
             
              Single producer-consumer violations.
              Bypassing tasks.
              Feedback between tasks.
              Conditional execution of tasks.
              Loops with multiple exit conditions or conditions defined within the loop.
             
             If any of the above conditions occur inside the dataflow region, you might need to re-architect the code to successfully achieve dataflow optimization.
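             As one illustration of such re-architecting, consider bypassing: a value produced by the first task is consumed by the third task without passing through the second. The sketch below shows one possible fix, in which the middle task forwards the value so that every channel has exactly one producer and one consumer and data moves only between adjacent tasks. All function and variable names, the arithmetic in each task, and the array size `DATA_SIZE` are hypothetical.

```cpp
#include <cassert>

const int DATA_SIZE = 8;  // assumed data size

// Task A produces two outputs: `a` for task B and `keep`, which task C needs.
static void task_a(const int *in, int *a, int *keep) {
    assert(in != nullptr);
    for (int i = 0; i < DATA_SIZE; i++) {
        a[i] = in[i] * 2;
        keep[i] = in[i];
    }
}

// Task B transforms `a` and also forwards `keep` unchanged.
// Without the forwarding copy, `keep` would bypass task B.
static void task_b(const int *a, const int *keep, int *b, int *keep_fwd) {
    for (int i = 0; i < DATA_SIZE; i++) {
        b[i] = a[i] + 1;
        keep_fwd[i] = keep[i];
    }
}

// Task C consumes both of its inputs from task B only.
static void task_c(const int *b, const int *keep_fwd, int *out) {
    for (int i = 0; i < DATA_SIZE; i++) {
        out[i] = b[i] + keep_fwd[i];
    }
}

void no_bypass(const int *in, int *out) {
    int a[DATA_SIZE], keep[DATA_SIZE], b[DATA_SIZE], keep_fwd[DATA_SIZE];
#pragma HLS DATAFLOW
    task_a(in, a, keep);
    task_b(a, keep, b, keep_fwd);
    task_c(b, keep_fwd, out);
}
```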