Programming C/C++ Kernels
In the SDAccel™ environment, the kernel code is generally the compute-intensive part of the algorithm, meant to be accelerated on the FPGA. The SDAccel environment supports kernel code written in C, OpenCL™, and RTL. This guide mainly focuses on the C kernel coding style.
Kernel functions written in C++ must be declared with extern "C" linkage so that the compiled symbol name is not mangled:

```cpp
extern "C" {
void kernel_function(int *in, int *out, int size);
}
```
Data Types
As it is faster to write and verify the code by using native C data types such as int, float, or double, it is common practice to use these data types when coding for the first time. However, the code is implemented in hardware, and all the operator sizes used in the hardware depend on the data types used in the accelerator code. The default native C/C++ data types can result in larger and slower hardware resources that limit the performance of the kernel. Instead, consider using bit-accurate data types to ensure the code is optimized for implementation in hardware. Using bit-accurate, or arbitrary precision, data types results in hardware operators that are smaller and faster. This allows more logic to be placed into the programmable logic, and also allows the logic to execute at higher clock frequencies while using less power.

TIP: Consider using bit-accurate data types instead of native C/C++ data types in your code.

The following sections discuss the two most common arbitrary precision data types supported by the xocc compiler: the arbitrary precision integer type and the arbitrary precision fixed-point type. Note that these data types should be used for C/C++ kernels only, not for OpenCL kernels or inside the host code.
Arbitrary Precision Integer Types
Arbitrary precision integer data types are defined by ap_int&lt;N&gt; for signed and ap_uint&lt;N&gt; for unsigned integers in the header file ap_int.h. To use an arbitrary precision integer data type:

- Add the header file ap_int.h to the source code.
- Change the bit types to ap_int&lt;N&gt; or ap_uint&lt;N&gt;, where N is a bit-size from 1 to 1024.

```cpp
#include "ap_int.h"

ap_int<9>   var1; // 9-bit signed integer
ap_uint<10> var2; // 10-bit unsigned integer
```
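For intuition, the wrap-around behavior of such a narrow type can be sketched in plain C++. This is a host-side illustration only: the function name to_int9 is ours, and real kernels would simply use ap_int&lt;9&gt; from ap_int.h, which handles this automatically.

```cpp
#include <cstdint>

// Emulate the value range of a 9-bit signed integer (like ap_int<9>):
// keep the low 9 bits, then sign-extend bit 8.
int32_t to_int9(int32_t v) {
    v &= 0x1FF;                         // keep 9 bits
    return (v & 0x100) ? v - 0x200 : v; // sign-extend
}
```

For example, to_int9(255) stays 255 (the largest representable value), while to_int9(256) wraps to -256, matching the default wrap-around behavior of arbitrary precision integers.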
Arbitrary Precision Fixed-Point Data Types
Some existing applications use floating point data types as they are written for other hardware architectures. However, fixed-point data types are a useful replacement for floating point types which require many clock cycles to complete. Carefully evaluate trade-offs in power, cost, productivity, and precision when choosing to implement floating-point vs. fixed-point arithmetic for your application and accelerators.
As discussed in Deep Learning with INT8 Optimization on Xilinx Devices (WP486), using fixed-point arithmetic instead of floating point for applications like machine learning can increase power efficiency and lower the total power required. Unless the entire range of the floating-point type is required, the same accuracy can often be implemented with a fixed-point type, resulting in the same accuracy with smaller and faster hardware. The paper Reduce Power and Cost by Converting from Floating Point to Fixed Point (WP491) provides some examples of this conversion.
Fixed-point data types model the data as integer and fractional bits. The fixed-point data type requires the ap_fixed.h header, and supports both a signed and an unsigned form as follows:
- Header file: ap_fixed.h
- Signed fixed point: ap_fixed&lt;W,I,Q,O,N&gt;
- Unsigned fixed point: ap_ufixed&lt;W,I,Q,O,N&gt;
- W = Total width &lt; 1024 bits
- I = Integer bit width. The value of I must be less than or equal to the width (W). The number of bits to represent the fractional part is W minus I. Only a constant integer expression can be used to specify the integer width.
- Q = Quantization mode. Only predefined enumerated values can be used to specify Q. The accepted values are:
  - AP_RND: Rounding to plus infinity.
  - AP_RND_ZERO: Rounding to zero.
  - AP_RND_MIN_INF: Rounding to minus infinity.
  - AP_RND_INF: Rounding to infinity.
  - AP_RND_CONV: Convergent rounding.
  - AP_TRN: Truncation. This is the default value when Q is not specified.
  - AP_TRN_ZERO: Truncation to zero.
- O = Overflow mode. Only predefined enumerated values can be used to specify O. The accepted values are:
  - AP_SAT: Saturation.
  - AP_SAT_ZERO: Saturation to zero.
  - AP_SAT_SYM: Symmetrical saturation.
  - AP_WRAP: Wrap-around. This is the default value when O is not specified.
  - AP_WRAP_SM: Sign magnitude wrap-around.
- N = The number of saturation bits in the overflow WRAP modes. Only a constant integer expression can be used as the parameter value. The default value is zero.
TIP: The ap_fixed and ap_ufixed data types permit shorthand definition, with only W and I being required, and the other parameters assigned default values. However, to define Q or N, you must also specify the parameters before them, even if you only specify their default values.
In the example code below, the ap_fixed type is used to define a signed 18-bit variable with 6 bits representing the integer value above the binary point and, by implication, 12 bits representing the fractional value below the binary point. The quantization mode is set to round to plus infinity (AP_RND). Because the overflow mode and saturation bits are not specified, the defaults AP_WRAP and 0 are used.

```cpp
#include "ap_fixed.h"

ap_fixed<18,6,AP_RND> my_type;
```
When performing calculations where the variables have different numbers of bits (W) or different precision (I), the binary point is automatically aligned. See "C++ Arbitrary Precision Fixed-Point Types" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information on using fixed-point data types.
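The alignment can be illustrated with plain C++ scaled integers. This is a sketch with a hypothetical helper name (add_q12_q10); ap_fixed performs this shift automatically.

```cpp
#include <cstdint>

// Add a Q6.12 value (12 fractional bits) to a Q4.10 value (10 fractional
// bits) by first aligning the binary points: shift the Q4.10 operand
// left by 2 so both operands use 12 fractional bits.
int32_t add_q12_q10(int32_t a_q12, int32_t b_q10) {
    return a_q12 + (b_q10 << 2); // result in Q.12
}
```

For example, 3.25 is 13312 in Q6.12 and 1.75 is 1792 in Q4.10; add_q12_q10(13312, 1792) yields 20480, which is 5.0 in Q.12.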
Interfaces
Two types of data are transferred between the host machine and the kernels on the FPGA device: data transferred through the global memory banks, and scalar data.
Memory Data Inputs and Outputs
The main data processed by the kernel computation, often in a large volume, should be transferred through the global memory banks on the FPGA board. The host machine transfers a large chunk of data to one or more global memory banks. The kernel accesses the data from those global memory banks, preferably in bursts. After the kernel finishes the computation, the resulting data is transferred back to the host machine through the global memory banks.
When writing the kernel interface description, pragmas are used to denote the interfaces coming to and from the global memory banks.
Memory Data
```cpp
void cnn(int *pixel,   // Input pixel
         int *weights, // Input Weight Matrix
         int *out,     // Output pixel
         ...)          // Other input or output ports

#pragma HLS INTERFACE m_axi port=pixel   offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=weights offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out     offset=slave bundle=gmem
```
In the example above, there are three large data interfaces: the two inputs pixel and weights, and one output, out. These inputs and outputs connected to the global memory banks are specified in the C code by using HLS INTERFACE m_axi pragmas, as shown above.
The bundle keyword specifies the name of the port. The compiler creates a port for each unique bundle name. When the same name is used for different interfaces, these interfaces are mapped to the same port.
Sharing ports helps save FPGA resources, but can limit the performance of the kernel because all the memory transfers have to go through a single port. The bandwidth and throughput of the kernel can be increased by creating multiple ports, using different bundle names.
```cpp
void cnn(int *pixel,   // Input pixel
         int *weights, // Input Weight Matrix
         int *out,     // Output pixel
         ...)          // Other input or output ports

#pragma HLS INTERFACE m_axi port=pixel   offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=weights offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out     offset=slave bundle=gmem
```
In the above example, the bundle attribute is used to create two distinct ports: gmem and gmem1. The kernel will access pixel and out through the gmem port, while weights will be accessed through the gmem1 port. As a result, the kernel will be able to make parallel accesses to pixel and weights, potentially improving the throughput of the kernel.
For this performance potential to be fully realized, it is also necessary to connect these different ports to different global memory banks. This is done during the xocc linking stage using the --sp switch. For more details about this option, see Configuring the System Architecture.
Memory Interface Data Width Considerations
In the SDAccel environment, the maximum data width from the global memory to and from the kernel is 512 bits. To maximize the data transfer rate, using this full data width is recommended. The kernel code should be modified to take advantage of the full bit width.
Because a native integer type is used in the prior example, the full data transfer bandwidth is not used. As discussed previously in Data Types, the arbitrary precision data types ap_int and ap_uint can be used to achieve bit-accurate data widths for this purpose.
```cpp
void cnn(ap_uint<512> *pixel,   // Input pixel
         int          *weights, // Input Weight Matrix
         ap_uint<512> *out,     // Output pixel
         ...)                   // Other input or output ports

#pragma HLS INTERFACE m_axi port=pixel   offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=weights offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out     offset=slave bundle=gmem
```
The example above shows the output (out) interface using the ap_uint&lt;512&gt; data type to make use of the full transfer width of 512 bits.
The data width of the memory interfaces should be a power of 2. For data widths less than 32 bits, use native C data types. Use ap_int or ap_uint for data widths greater than 32 bits, in power-of-2 increments.
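The effect of the wide interface can be pictured on the host side. The sketch below packs sixteen 32-bit pixels into one 512-bit word, so a single memory transfer carries 16 elements. The names Word512 and pack are ours; in the kernel the wide word would simply be ap_uint&lt;512&gt;.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

// One 512-bit memory word viewed as sixteen 32-bit lanes.
using Word512 = std::array<uint32_t, 16>;

// Pack a pixel vector into 512-bit words (zero-padding the last word).
std::vector<Word512> pack(const std::vector<uint32_t>& pixels) {
    std::vector<Word512> words((pixels.size() + 15) / 16, Word512{});
    std::memcpy(words.data(), pixels.data(),
                pixels.size() * sizeof(uint32_t));
    return words;
}
```

For 32 pixels, pack returns two 512-bit words, halving the transfer count compared with a 256-bit interface and cutting it 16x compared with 32-bit transfers.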
Reading and Writing by Burst
Accessing the global memory bank interface from the kernel has long latency, so global memory data transfers should be done in bursts. To infer a burst, a pipelined loop coding style is recommended, as shown below:
```cpp
hls::stream<datatype_t> str;

INPUT_READ: for (int i = 0; i < INPUT_SIZE; i++) {
#pragma HLS PIPELINE
    str.write(inp[i]); // Reading from input interface
}
```
In the above code example, a pipelined for loop is used to read data from the input memory interface and write it to an internal hls::stream variable. This coding style reads from the global memory bank in bursts.
It is a recommended coding style to implement the for loop operation in the example above inside a separate function, and to apply the dataflow optimization from the top level, as shown below:
```cpp
top_function(datatype_t *m_in,  // Memory data input
             datatype_t *m_out, // Memory data output
             int inp1,          // Other input
             int inp2) {        // Other input
#pragma HLS DATAFLOW

    hls::stream<datatype_t> in_var1;  // Internal stream to transfer
    hls::stream<datatype_t> out_var1; // data through the dataflow region

    read_function(m_in, in_var1, inp1);              // Read function contains pipelined
                                                     // for loop to infer burst
    execute_function(in_var1, out_var1, inp1, inp2); // Core compute function

    write_function(out_var1, m_out);                 // Write function contains pipelined
                                                     // for loop to infer burst
}
```
Scalar Data Inputs
```cpp
void process_image(int *input, int *output, int width, int height) {
#pragma HLS INTERFACE s_axilite port=width  bundle=control
#pragma HLS INTERFACE s_axilite port=height bundle=control
```
In the example above, there are two scalar inputs that specify the image width and height. These inputs are specified using the #pragma HLS INTERFACE s_axilite. The scalar inputs come to the kernel directly from the host machine, not through the global memory banks.
The bundle name should be the same for all scalar data inputs. In the preceding example, the same bundle name, control, is used for all control inputs.
Enabling Host to Kernel Dataflow

If a kernel can start processing a new set of data while it is still processing the previous set, the overall performance improves. This host-to-kernel dataflow is enabled by applying the ap_ctrl_chain block-level control protocol to the kernel's return port with an HLS INTERFACE pragma:

```cpp
void kernel_name(int *inputs,
                 ...) // Other input or output ports
{
#pragma HLS INTERFACE .....  // Other interface pragmas
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control
```
Loops
Loops are an important aspect for a high performance accelerator. Generally, loops are either pipelined or unrolled to take advantage of the highly distributed and parallel FPGA architecture to provide a performance boost compared to running on a CPU.
By default, loops are neither pipelined nor unrolled. Each iteration of the loop takes at least one clock cycle to execute in hardware. Thinking from the hardware perspective, there is an implicit wait until clock for the loop body. The next iteration of a loop only starts when the previous iteration is finished.
Loop Pipelining
Consider the vadd loop below. Assume that each iteration performs three operations (a read, an add, and a write), each taking one clock cycle, and that the value of len is 20, that is, the vadd loop runs for 20 iterations in the kernel. Therefore, it requires a total of 60 clock cycles (20 iterations * 3 cycles) to complete all the operations of this loop.

```cpp
vadd: for (int i = 0; i < len; i++) {
    c[i] = a[i] + b[i];
}
```
TIP: It is recommended to always label loops as shown in the above code example (vadd:…). This practice helps with debugging when working in the SDAccel environment. Note that the labels generate warnings during compilation, which can be safely ignored.
The loop can be pipelined by applying the pragma HLS PIPELINE. Note that the pragma is placed inside the body of the loop.

```cpp
vadd: for (int i = 0; i < len; i++) {
#pragma HLS PIPELINE
    c[i] = a[i] + b[i];
}
```
In the example above, it is assumed that every iteration of the loop takes three cycles: read, add, and write. Without pipelining, each successive iteration of the loop starts in every third cycle. With pipelining the loop can start subsequent iterations of the loop in fewer than three cycles, such as in every second cycle, or in every cycle.
The number of cycles it takes to start the next iteration of a loop is called the initiation interval (II) of the pipelined loop. So II = 2 means each successive iteration of the loop starts every two cycles. An II = 1 is the ideal case, where each iteration of the loop starts in the very next cycle. When you use the pragma HLS PIPELINE, the compiler always tries to achieve II = 1 performance.
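The effect of II on total loop time can be sketched with a simple cycle model. This is an illustration of the arithmetic only, not something the compiler reports in this form, and pipelined_cycles is our own name.

```cpp
// Rough cycle count for a pipelined loop: the first iteration takes the
// full iteration latency; each later iteration starts II cycles after
// the previous one.
long pipelined_cycles(long trip_count, long iter_latency, long ii) {
    return iter_latency + (trip_count - 1) * ii;
}
```

For the 20-iteration vadd with a 3-cycle body, II = 3 (sequential execution) gives 60 cycles, while II = 1 gives only 22.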
The following figure illustrates the difference in execution between pipelined and non-pipelined loops. In this figure, (A) shows the default sequential operation where there are three clock cycles between each input read (II = 3), and it requires eight clock cycles before the last output write is performed.
Figure: Loop Pipelining
If there are data dependencies inside a loop, it might not be possible to achieve II = 1, and a larger initiation interval might be the result. Loop dependencies are discussed in Loop Dependencies.
Loop Unrolling
A loop can be unrolled using the HLS UNROLL pragma. Unrolling a loop can lead to a very fast design, with significant parallelism. However, because all the operations of the loop iterations are executed in parallel, a large amount of programmable logic resources is required to implement the hardware. As a result, the compiler can face challenges dealing with such a large number of resources, and can run into capacity problems that slow down the kernel compilation process. It is a good guideline to unroll only loops that have a small loop body or a small number of iterations.

```cpp
vadd: for (int i = 0; i < 20; i++) {
#pragma HLS UNROLL
    c[i] = a[i] + b[i];
}
```
In the preceding example, the pragma HLS UNROLL has been inserted into the body of the loop to instruct the compiler to unroll the loop completely. All 20 iterations of the loop are executed in parallel, if permitted by data dependencies.
Completely unrolling a loop can consume significant device resources, while partially unrolling the loop provides some performance improvement without causing a significant impact on hardware resources.
Partially Unrolled Loop
To completely unroll a loop, the loop must have a constant bound (20 in the example above). However, partial unrolling is possible for loops with a variable bound. A partially unrolled loop means that only a certain number of loop iterations can be executed in parallel.
```cpp
array_sum: for (int i = 0; i < 4; i++) {
#pragma HLS UNROLL factor=2
    sum += arr[i];
}
```
In the above example, the UNROLL pragma is given a factor of 2. This is the equivalent of manually duplicating the loop body and running the loop for half as many iterations. The following code shows how this would be written. This transformation allows two iterations of the above loop to execute in parallel.
```cpp
array_sum_unrolled: for (int i = 0; i < 4; i += 2) { // Manual unroll by a factor of 2
    sum += arr[i];
    sum += arr[i + 1];
}
```
Just like data dependencies inside a loop impact the initiation interval of a pipelined loop, an unrolled loop performs operations in parallel only if data dependencies allow it. If operations in one iteration of the loop require the result from a previous iteration, they cannot execute in parallel, but execute as soon as the data from one iteration is available to the next.
TIP: A good guideline is to PIPELINE loops first, and then UNROLL loops with small loop bodies and limited iterations to improve performance further.
Loop Dependencies
Loop dependencies are data dependencies that prevent the optimization of loops. They can occur within a single iteration of a loop or between different loop iterations, as in the following extreme example:

```cpp
Minim_Loop: while (a != b) {
    if (a > b)
        a -= b;
    else
        b -= a;
}
```
This loop cannot be pipelined. The next iteration of the loop cannot begin until the previous iteration ends.
Dealing with various types of dependencies with the xocc compiler is an extensive topic requiring a detailed understanding of the high-level synthesis procedures underlying the compiler. Refer to "Dependencies with Vivado HLS" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Nested Loops
Coding with nested loops is a common practice. Understanding how loops are pipelined in a nested loop structure is key to achieving the desired performance.
If the pragma HLS PIPELINE is applied to a loop nested inside another loop, the xocc compiler attempts to flatten the loops to create a single loop, and applies the PIPELINE pragma to the flattened loop. Loop flattening helps improve the performance of the kernel. The xocc compiler can flatten the following types of nested loops:
- Perfect nested loop:
- Only the inner loop has a loop body.
- There is no logic or operations specified between the loop declarations.
- All the loop bounds are constant.
- Semi-perfect nested loop:
- Only the inner loop has a loop body.
- There is no logic or operations specified between the loop declarations.
- The inner loop bound must be a constant, but the outer loop bound can be a variable.
```cpp
ROW_LOOP: for (int i = 0; i < MAX_HEIGHT; i++) {
    COL_LOOP: for (int j = 0; j < MAX_WIDTH; j++) {
#pragma HLS PIPELINE
        // Main computation per pixel
    }
}
```
The above example shows a nested loop structure with two loops that perform some computation on incoming pixel data. In most cases, you want to process a pixel in every cycle, hence PIPELINE is applied to the nested loop body structure. The compiler is able to flatten the nested loop structure in the example because it is a perfect nested loop.

The nested loop in the preceding example contains no logic between the two loop declarations: no logic is placed between the ROW_LOOP and COL_LOOP, and all of the processing logic is inside the COL_LOOP. Also, both loops have a fixed number of iterations. These two criteria let the xocc compiler flatten the loops and apply the PIPELINE constraint.
Sequential Loops
```cpp
void adder(unsigned int *in, unsigned int *out, int inc, int size) {
    unsigned int in_internal[MAX_SIZE];
    unsigned int out_internal[MAX_SIZE];

    mem_rd: for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE
        // Reading from the input vector "in" and saving to internal variable
        in_internal[i] = in[i];
    }

    compute: for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE
        out_internal[i] = in_internal[i] + inc;
    }

    mem_wr: for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE
        out[i] = out_internal[i];
    }
}
```
In the previous example, three sequential loops are shown: mem_rd, compute, and mem_wr.
- The mem_rd loop reads input vector data from the memory interface and stores it in internal storage.
- The main compute loop reads from the internal storage, performs an increment operation, and saves the result to another internal storage.
- The mem_wr loop writes the data back to memory from the internal storage.
This code example is using two separate loops for reading and writing from/to the memory input/output interfaces to infer burst read/write.
By default, these loops are executed sequentially without any overlap. First, the mem_rd loop finishes reading all the input data before the compute loop starts its operation. Similarly, the compute loop finishes processing the data before the mem_wr loop starts to write the data. However, the execution of these loops can be overlapped, allowing the compute (or mem_wr) loop to start as soon as there is enough data available to feed its operation, before the mem_rd (or compute) loop has finished processing its data.
The loop execution can be overlapped using dataflow optimization, as described in Dataflow Optimization.
Dataflow Optimization
Dataflow optimization is a powerful technique to improve kernel performance by enabling task-level pipelining and parallelism inside the kernel. It allows the xocc compiler to schedule multiple functions of the kernel to run concurrently to achieve higher throughput and lower latency. This is also known as task-level parallelism.
The following figure shows a conceptual view of dataflow pipelining. The default behavior is to execute and complete func_A, then func_B, and finally func_C. With the HLS DATAFLOW pragma enabled, the compiler can schedule each function to execute as soon as data is available. In this example, the original top function has a latency and interval of eight clock cycles. With DATAFLOW optimization, the interval is reduced to only three clock cycles.
Figure: Dataflow Optimization
Dataflow Coding Example
In the dataflow coding example you should notice the following:
- The HLS DATAFLOW pragma is applied to instruct the compiler to enable dataflow optimization. This is not a data mover, which deals with interfacing between the PS and PL; it defines how the data flows through the accelerator.
- The stream class is used as a data transferring channel between each of the functions in the dataflow region.

TIP: The stream class infers a first-in first-out (FIFO) memory circuit in the programmable logic. This memory circuit, which acts as a queue in software programming, provides data-level synchronization between the functions and achieves better performance. For additional details on the hls::stream class, see the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
```cpp
void compute_kernel(ap_int<256> *inx, ap_int<256> *outx, DTYPE alpha) {
    hls::stream<unsigned int> inFifo;
#pragma HLS STREAM variable=inFifo depth=32
    hls::stream<unsigned int> outFifo;
#pragma HLS STREAM variable=outFifo depth=32

#pragma HLS DATAFLOW
    read_data(inx, inFifo);
    // Do computation with the acquired data
    compute(inFifo, outFifo, alpha);
    write_data(outx, outFifo);
    return;
}
```
Canonical Forms of Dataflow Optimization
- Functions: The canonical form coding guideline for dataflow inside a function specifies:
  - Use only the following types of variables inside the dataflow region:
    - Local non-static scalar/array/pointer variables.
    - Local static hls::stream variables.
  - Function calls transfer data only in the forward direction.
  - Arrays or hls::stream variables should have only one producer function and one consumer function.
  - The function arguments (variables coming from outside the dataflow region) should only be read, or written, not both. If performing both read and write on the same function argument, the read should happen before the write.
  - The local variables (those transferring data in the forward direction) should be written before being read.

  The following code example illustrates the canonical form for dataflow within a function. Note that the first function (func1) reads the inputs and the last function (func3) writes the outputs. Also note that one function creates output values that are passed to the next function as input parameters.

```cpp
void dataflow(Input0, Input1, Output0, Output1) {
    UserDataType C0, C1, C2;
#pragma HLS DATAFLOW
    func1(read Input0, read Input1, write C0, write C1);
    func2(read C0, read C1, write C2);
    func3(read C2, write Output0, write Output1);
}
```
- Loop: The canonical form coding guideline for dataflow inside a loop body includes the coding guidelines for a function defined above, and also specifies the following for the loop variable:
  - Initial value of 0.
  - The loop condition is a comparison of the loop variable with a numerical constant, or with a variable that does not vary inside the loop body.
  - Incremented by 1.

  The following code example illustrates the canonical form for dataflow within a loop.

```cpp
void dataflow(Input0, Input1, Output0, Output1) {
    UserDataType C0, C1, C2;
    for (int i = 0; i < N; ++i) {
#pragma HLS DATAFLOW
        func1(read Input0, read Input1, write C0, write C1);
        func2(read C0, read C0, read C1, write C2);
        func3(read C2, write Output0, write Output1);
    }
}
```
Troubleshooting Dataflow
The following behaviors can prevent the xocc compiler from performing DATAFLOW optimization:
- Single producer-consumer violations.
- Bypassing tasks.
- Feedback between tasks.
- Conditional execution of tasks.
- Loops with multiple exit conditions or conditions defined within the loop.
If any of the above conditions occur inside the dataflow region, you might need to re-architect the code to successfully achieve dataflow optimization.
Array Configuration
The SDAccel compiler maps large arrays to the block RAM (BRAM) memory in the PL region. These BRAMs have a maximum of two access ports. This can limit the performance of the application, as all the elements of an array cannot be accessed in parallel when implemented in hardware.
The #pragma HLS ARRAY_PARTITION can be used to instruct the compiler to split the elements of an array and map them to smaller arrays, or to individual registers. The compiler provides three types of array partitioning, as shown in the following figure:

- block: The original array is split into equally sized blocks of consecutive elements of the original array.
- cyclic: The original array is split into equally sized blocks interleaving the elements of the original array.
- complete: The array is split into its individual elements. This corresponds to resolving a memory into individual registers. This is the default for the ARRAY_PARTITION pragma.
Figure: Partitioning Arrays
For block and cyclic partitioning, the factor option specifies the number of arrays that are created. In the preceding figure, a factor of 2 is used to split the array into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the later arrays will have fewer elements.
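The element-to-partition mapping can be sketched in plain C++. The helper names below are ours; the compiler performs this mapping internally.

```cpp
// For an n-element array split into `factor` partitions:
// block  - consecutive elements stay together, so index i maps to
//          partition i / ceil(n / factor);
// cyclic - elements are dealt out round-robin, so index i maps to
//          partition i % factor.
int block_partition(int i, int n, int factor) {
    int block_size = (n + factor - 1) / factor; // ceiling division
    return i / block_size;
}

int cyclic_partition(int i, int factor) {
    return i % factor;
}
```

For a 16-element array with factor=2, block partitioning places indices 0-7 in partition 0 and 8-15 in partition 1, while cyclic partitioning places even indices in partition 0 and odd indices in partition 1.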
For multi-dimensional arrays, the dimension option is used to specify which dimension is partitioned. The following figure shows how the dimension option is used to partition the following example code in three different ways:
```cpp
void foo(...) {
    // my_array[dim=1][dim=2][dim=3]
    // The following three pragma results are shown in the figure below
    // #pragma HLS ARRAY_PARTITION variable=my_array dim=3 factor=2
    // #pragma HLS ARRAY_PARTITION variable=my_array dim=1 factor=2
    // #pragma HLS ARRAY_PARTITION variable=my_array dim=0 complete
    int my_array[10][6][4];
    ...
}
```
Figure: Partitioning the Dimensions of an Array
The examples in the figure demonstrate how partitioning dimension 3 results in four separate arrays and partitioning dimension 1 results in 10 separate arrays. If 0 is specified as the dimension, all dimensions are partitioned.
The Importance of Careful Partitioning
A complete partition of the array maps all the array elements to individual registers. This helps improve the kernel performance because all of these registers can be accessed concurrently in the same cycle.
Choosing a Specific Dimension to Partition
Consider the following matrix multiplication example:

```cpp
int A[64][64];
int B[64][64];

ROW_WISE: for (int i = 0; i < 64; i++) {
    COL_WISE: for (int j = 0; j < 64; j++) {
#pragma HLS PIPELINE
        int result = 0;
        COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
            result += A[i][k] * B[k][j];
        }
        C[i][j] = result;
    }
}
```
Due to the PIPELINE pragma, the ROW_WISE and COL_WISE loops are flattened together, and the COMPUTE_LOOP is fully unrolled. To concurrently execute each iteration (k) of the COMPUTE_LOOP, the code must access each column of matrix A and each row of matrix B in parallel. Therefore, matrix A should be split in the second dimension, and matrix B should be split in the first dimension.
```cpp
#pragma HLS ARRAY_PARTITION variable=A dim=2 complete
#pragma HLS ARRAY_PARTITION variable=B dim=1 complete
```
Choosing Between Cyclic and Block Partitions
Here the same matrix multiplication algorithm is used to demonstrate choosing between cyclic and block partitioning and determining the appropriate factor, by understanding the array access pattern of the underlying algorithm.
```cpp
int A[64 * 64];
int B[64 * 64];
#pragma HLS ARRAY_PARTITION variable=A dim=1 cyclic factor=64
#pragma HLS ARRAY_PARTITION variable=B dim=1 block  factor=64

ROW_WISE: for (int i = 0; i < 64; i++) {
    COL_WISE: for (int j = 0; j < 64; j++) {
#pragma HLS PIPELINE
        int result = 0;
        COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
            result += A[i * 64 + k] * B[k * 64 + j];
        }
        C[i * 64 + j] = result;
    }
}
```
In this version of the code, A and B are now one-dimensional arrays. To access each column of matrix A and each row of matrix B in parallel, cyclic and block partitions are used as shown in the above example. To access each column of matrix A in parallel, cyclic partitioning is applied with the factor specified as the row size, in this case 64. Similarly, to access each row of matrix B in parallel, block partitioning is applied with the factor specified as the column size, or 64.
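A quick host-side check (an illustrative sketch; the function names are ours) confirms why these factors work: for a fixed i, the 64 reads A[i*64 + k] land in 64 distinct cyclic partitions, and for a fixed j, the 64 reads B[k*64 + j] land in 64 distinct block partitions, so every read in one COMPUTE_LOOP iteration can occur in parallel.

```cpp
#include <set>

// Count distinct partitions touched by the 64 reads of one pipelined
// iteration: cyclic factor=64 maps index -> index % 64, and block
// factor=64 on a 4096-element array maps index -> index / 64.
int distinct_parts_A(int i) {
    std::set<int> parts;
    for (int k = 0; k < 64; k++)
        parts.insert((i * 64 + k) % 64); // A is cyclic-partitioned
    return static_cast<int>(parts.size());
}

int distinct_parts_B(int j) {
    std::set<int> parts;
    for (int k = 0; k < 64; k++)
        parts.insert((k * 64 + j) / 64); // B is block-partitioned
    return static_cast<int>(parts.size());
}
```

Both functions return 64 for any valid row or column index, meaning no two reads in an iteration contend for the same partition.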
Minimizing Array Accesses with Caching
As arrays are mapped to BRAM with a limited number of access ports, repeated array accesses can limit the performance of the accelerator. You should have a good understanding of the array access pattern of the algorithm, and limit the array accesses by locally caching the data to improve the performance of the kernel.
The following example shows a case in which repeated accesses to an array can limit performance. In this example, there are three accesses to the array mem[N] to create a summed result:

```cpp
#include "array_mem_bottleneck.h"

dout_t array_mem_bottleneck(din_t mem[N]) {
    dout_t sum = 0;
    int i;
    SUM_LOOP: for (i = 2; i < N; i++)
        sum += mem[i] + mem[i - 1] + mem[i - 2];
    return sum;
}
```
The code can be rewritten as shown below to use a local cache, reducing the accesses to mem to one per iteration:

```cpp
#include "array_mem_perform.h"

dout_t array_mem_perform(din_t mem[N]) {
    din_t tmp0, tmp1, tmp2;
    dout_t sum = 0;
    int i;
    tmp0 = mem[0];
    tmp1 = mem[1];
    SUM_LOOP: for (i = 2; i < N; i++) {
        tmp2 = mem[i];
        sum += tmp2 + tmp1 + tmp0;
        tmp0 = tmp1;
        tmp1 = tmp2;
    }
    return sum;
}
```
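The two coding styles compute the same result, which can be checked with a small host-side C++ sketch. Plain int types and our own function names are used here, since the din_t/dout_t typedefs and headers above belong to the kernel project.

```cpp
// Naive version: three array reads per iteration.
long sum_naive(const int* mem, int n) {
    long sum = 0;
    for (int i = 2; i < n; i++)
        sum += mem[i] + mem[i - 1] + mem[i - 2];
    return sum;
}

// Cached version: one array read per iteration; the previous two values
// are kept in local "registers" and shifted each iteration.
long sum_cached(const int* mem, int n) {
    int tmp0 = mem[0], tmp1 = mem[1];
    long sum = 0;
    for (int i = 2; i < n; i++) {
        int tmp2 = mem[i];
        sum += tmp2 + tmp1 + tmp0;
        tmp0 = tmp1;
        tmp1 = tmp2;
    }
    return sum;
}
```

In hardware, the cached version needs only one memory port per iteration instead of three, which is what allows the pipelined loop to reach a lower II.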
For more detailed information related to the configuration of arrays, see the "Arrays" section in the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Function Inlining
C code generally consists of several functions. By default, each function is compiled and optimized separately by the xocc compiler. A unique hardware module is generated for the function body and reused as needed.

From a performance perspective, it is generally better to inline the function, that is, to dissolve the function hierarchy. This helps the xocc compiler optimize more globally across function boundaries. For example, if a function is called inside a pipelined loop, inlining the function helps the compiler do more aggressive optimization, resulting in better pipeline performance of the loop (a lower initiation interval, or II).
The HLS INLINE pragma, placed inside the function body, instructs the compiler to inline the function:

```cpp
foo_sub(p, q) {
#pragma HLS INLINE
    ....
}
```
However, if the function body is very big and called several times inside the main kernel function, then inlining the function can cause capacity issues due to consuming too many resources. In such cases, you might not want to inline those functions, and instead let the xocc compiler optimize each function separately in its local context.
Summary
As discussed in the earlier topics, several important aspects of coding a kernel for FPGA acceleration using C/C++ include the following points:

- Consider using arbitrary precision data types, ap_int and ap_fixed.
- Understand the kernel interfaces to determine scalar and memory interfaces. Use the bundle switch with different names if separate DDR memory banks will be specified in the linking stage.
- Use a burst read and write coding style from and to the memory interface.
- Consider exploiting the full width of the DDR banks during data transfer when selecting the width of memory data inputs and outputs.
- Get the greatest performance boost using pipelining and dataflow.
- Write perfect or semi-perfect nested loop structures so that the xocc compiler can flatten and apply pipelining effectively.
- Unroll loops with a small number of iterations and a low operation count inside the loop body.
- Consider understanding the array access pattern, and apply complete partitioning to specific dimensions, or apply block or cyclic partitioning instead of a complete partition of the whole array.
- Minimize array accesses by using a local cache to improve kernel performance.
- Consider inlining functions, specifically inside a pipelined region. Functions inside a dataflow region should not be inlined.