Kernel Optimization
One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput vs. power consumption. The downside of creating custom logic is that the design needs to go through the traditional FPGA design flow.
The following guidelines help manage the design complexity and achieve the desired design goals.
Interface Attributes (Detailed Kernel Trace)
The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.
Figure: Accelerated Algorithm Kernel Trace
The fields most relevant to performance are:
- Burst Length
- Describes how many packets are sent within one transaction
- Burst Size
- Describes the number of bytes being transferred as part of one packet
Given a burst length of 1 and just 4 bytes per packet, it would require many individual AXI transactions to transfer any reasonable amount of data.
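To make that cost concrete, the following standalone C++ sketch (illustrative only; the function name and the 4 KB buffer in the usage note are assumptions, not part of any Xilinx API) counts the AXI transactions needed to move a buffer at a given burst length and burst size:

```cpp
#include <cassert>
#include <cstddef>

// Number of AXI transactions needed to move total_bytes, given a burst
// length (packets per transaction) and burst size (bytes per packet).
std::size_t axi_transactions(std::size_t total_bytes,
                             std::size_t burst_length,
                             std::size_t burst_size) {
    std::size_t bytes_per_transaction = burst_length * burst_size;
    // Round up: a partial final burst still costs one transaction.
    return (total_bytes + bytes_per_transaction - 1) / bytes_per_transaction;
}
```

With a burst length of 1 and 4-byte packets, a 4 KB buffer costs 1,024 transactions; with bursts of 16 packets of 64 bytes each, the same buffer moves in 4 transactions.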
Using Burst Data Transfers
Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller.
If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:
Figure: Burst Data Transfer with Detailed Kernel Trace
In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are implemented rather differently (shorter transaction time). If you hover over these transactions, you would see that the AXI interconnect has packed the 16 x 4-byte transaction into a single transaction of 1 x 64 bytes. This effectively utilizes the AXI4 bandwidth, which is even more favorable. The next section focuses on this optimization technique in more detail.
Burst inference is heavily dependent on coding style and access pattern. To avoid potential modeling pitfalls, refer to the SDAccel Environment Programmers Guide (UG1277). However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:
void kernel(T in[1024], T out[1024]) {
  T tmpIn[1024];
  T tmpOut[1024];
  read(in, tmpIn);
  process(tmpIn, tmpOut);
  write(tmpOut, out);
}
In short, the function read is responsible for reading from the AXI input to an internal variable (tmpIn). The computation is implemented by the function process, working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes it to the AXI output.
The isolation of the read and write functions from the computation results in:
- Simple control structures (loops) in the read/write functions, which makes burst detection simpler.
- The isolation of the computational function away from the AXI interfaces simplifies potential kernel optimization. See the Kernel Optimization chapter for more information.
- The internal variables are mapped to on-chip memory, which allows faster access compared to AXI transactions. Acceleration platforms supported in the SDAccel environment can have as much as 10 MB of on-chip memory that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.
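The structure described above can be sketched as plain, self-contained C++ (a hypothetical int kernel standing in for the templated snippet; the doubling in process is an arbitrary example computation, and the HLS interface pragmas a real kernel would carry are omitted):

```cpp
#include <cassert>

constexpr int N = 1024;

// Simple, burst-friendly copy loop: sequential accesses, no computation.
static void read_in(const int in[N], int tmp[N]) {
    for (int i = 0; i < N; ++i) tmp[i] = in[i];
}

// The computation touches only on-chip buffers, never the AXI interface.
static void process(const int tmpIn[N], int tmpOut[N]) {
    for (int i = 0; i < N; ++i) tmpOut[i] = tmpIn[i] * 2;
}

// Another simple sequential loop, again easy to infer as a burst.
static void write_out(const int tmp[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = tmp[i];
}

void kernel(const int in[N], int out[N]) {
    int tmpIn[N];
    int tmpOut[N];
    read_in(in, tmpIn);
    process(tmpIn, tmpOut);
    write_out(tmpOut, out);
}
```

The read and write loops contain nothing but a sequential copy, which is exactly the control structure burst inference handles best.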
Using Full AXI Data Width
The user data width between the kernel and the memory controller can be configured by the SDAccel compiler based on the data types of the kernel arguments. To maximize the data throughput, Xilinx recommends that you choose data types that map to the full data width of the memory controller. The memory controller in all supported acceleration cards supports a 512-bit user interface, which can be mapped to OpenCL™ vector data types, such as int16, or the C/C++ arbitrary precision data type ap_int<512>.
As shown in the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit packet size (Burst Size 64 bytes).
Figure: Burst AXI Transactions
This example shows a good interface configuration, as it maximizes the AXI data width and uses actual burst transactions.
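Outside the HLS flow, ap_int<512> and the OpenCL int16 type are not available, but the idea of moving 16 32-bit values per transfer can be sketched with a plain C++ 16-lane aggregate (a hypothetical stand-in for the wide interface type, not the actual API):

```cpp
#include <array>
#include <cassert>

// Hypothetical 512-bit word: 16 lanes of 32-bit data, mirroring the
// int16 / ap_int<512> types mentioned in the text.
using Word512 = std::array<int, 16>;

// One wide word in, one wide word out: 16 additions per "transfer",
// instead of 16 separate narrow transactions.
Word512 add_wide(const Word512& a, const Word512& b) {
    Word512 c{};
    for (int lane = 0; lane < 16; ++lane)
        c[lane] = a[lane] + b[lane];
    return c;
}
```

Each call consumes a full 64-byte word, which is the behavior the 512-bit interface provides in hardware.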
Complex structs or classes used to declare interfaces can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.
OpenCL Attributes
The OpenCL API provides attributes to support a more automatic approach to increasing AXI data width usage. Changing the interface data types, as stated above, is supported in the API as well, but requires the same code changes as in C/C++ to accommodate the larger input vector.
To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm. A detailed description can be found in the SDx Pragma Reference Guide (UG1253).
- vec_type_hint
- reqd_work_group_size
- xcl_zero_global_work_offset
Examine the combined functionality in the following case:
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
  size_t idx = get_global_id(0);
  c[idx] = a[idx] + b[idx];
}
In this case, the hard-coded interface is a 32-bit wide data path (int* c, int* a, int* b), which drastically limits the memory throughput if implemented directly. However, automatic widening and transformation is applied based on the values of the three attributes.
- __attribute__((vec_type_hint(int)))
- Declares that int is the main type used for computation and memory transfer (32 bits). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32 bits. This implies that, in theory, 16 values could be processed if vectorization can be applied.
- __attribute__((reqd_work_group_size(X, Y, Z)))
- Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items, therefore defining the maximum possible vectorization factor that would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1 = 64.
The actual vectorization factor to be applied will be the greatest common divisor of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.
The quotient of the maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.
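The factor arithmetic described in the two attributes above can be written out explicitly; the following sketch (an illustrative helper, not a tool API) reproduces the numbers from this example:

```cpp
#include <cassert>
#include <numeric>  // std::gcd

struct Vectorization {
    int applied_factor;   // lanes actually processed in parallel
    int remaining_loops;  // pipelined iterations left per work group
};

// axi_bits / type_bits gives the widening factor from vec_type_hint;
// the applied factor is its gcd with the work-group size, and the
// quotient is the remaining pipelined loop count.
Vectorization plan(int axi_bits, int type_bits, int work_group_size) {
    int hint_factor = axi_bits / type_bits;  // e.g. 512 / 32 = 16
    int applied = std::gcd(hint_factor, work_group_size);
    return {applied, work_group_size / applied};
}
```

For the kernel above, plan(512, 32, 64) yields an applied factor of 16 with 4 remaining pipelined iterations.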
There is one optional attribute that is highly recommended for performance optimization on OpenCL interfaces:
- __attribute__((xcl_zero_global_work_offset))
- Instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information with regard to the alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (less hardware).
Note that applying these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of the local arrays in which data is stored. This usually behaves well, but can interact poorly in rare situations.
For example:
- For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor:
- The resulting access requires many multiplexers and creates a difficult problem for the scheduler (it might severely increase memory usage and compilation time). Xilinx recommends that you use partitioning factors that are powers of two (as the vectorization factor is always a power of two).
- If the loop being vectorized has an unrelated resource constraint, the scheduler reports that the II (initiation interval) is not met:
- This is not necessarily correlated with a loss of performance (usually the design still performs better), since the II is computed on the unrolled loop (which therefore has a multiplied throughput for each iteration).
- The scheduler informs you of the possible resource constraints; resolving them will further improve performance.
- Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in a non-vectorizable way).
Reducing Kernel to Kernel Communication Latency with OpenCL Pipes
The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency.
In the SDAccel development environment, pipes must be statically defined outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not currently supported. The depth of a pipe must be specified by using the xcl_reqd_pipe_depth attribute in the pipe declaration.
The valid depth values are as follows:
- 16
- 32
- 64
- 128
- 256
- 512
- 1024
- 2048
- 4096
- 8192
- 16384
- 32768
A given pipe can have one and only one producer and consumer, each in a different kernel. For example:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
Pipes can be accessed using the standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode, or using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions. See the OpenCL C Specification, Version 2.0 from Khronos Group for more details on these built-in functions.
The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types.
int read_pipe_block (pipe gentype p, gentype *ptr)
int write_pipe_block (pipe gentype p, const gentype *ptr)
The following “Blocking Pipes Example” from SDAccel Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));

// Input Stage Kernel: Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size) {
  __attribute__((xcl_pipeline_loop))
  mem_rd: for (int i = 0; i < size; i++) {
    // Blocking write command to Pipe P0
    write_pipe_block(p0, &input[i]);
  }
}

// Adder Stage Kernel: Read input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size) {
  __attribute__((xcl_pipeline_loop))
  execute: for (int i = 0; i < size; i++) {
    int input_data, output_data;
    // Blocking read command from Pipe P0
    read_pipe_block(p0, &input_data);
    output_data = input_data + inc;
    // Blocking write command to Pipe P1
    write_pipe_block(p1, &output_data);
  }
}

// Output Stage Kernel: Read result from Pipe P1 and write the result to
// Global Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size) {
  __attribute__((xcl_pipeline_loop))
  mem_wr: for (int i = 0; i < size; i++) {
    // Blocking read command from Pipe P1
    read_pipe_block(p1, &output[i]);
  }
}
The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.
Figure: Device Traceline View
Optimizing Computational Parallelism
By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. The OpenCL API, on the other hand, does model computational parallelism with respect to work groups, but it does not use any additional parallelism within the algorithm description. However, fully configurable computational engines like FPGAs allow more freedom to exploit computational parallelism.
Coding Data Parallelism
To leverage computational parallelism during the implementation of an algorithm on the FPGA, the synthesis tool must first be able to recognize computational parallelism in the source code. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism, as in some cases the SDx tool might not be able to apply the desired transformation due to the structure of the source code.
It is quite common that some computational parallelism might not be reflected in the source code to begin with. In this case, it will need to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in the Using Full AXI Data Width section. A 512-bit interface can be created using OpenCL vector data types such as int16 or the C/C++ arbitrary precision data type ap_int<512>. Refer to the Median Filter Example in the vision category at SDAccel Getting Started Examples on GitHub for the recommended way to use vectors.
Loop Parallelism
Loops are the basic C/C++/OpenCL™ API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:
for(int i = 0; i < 255; i++) {
  out[i] = in[i] + in[i+1];
}
out[255] = in[255];
This code iterates over an array of values and adds consecutive values, except for the last value. If this loop is implemented as written, each loop iteration requires two cycles, which results in a total of 510 cycles. This can be analyzed in detail through the Schedule Viewer in the HLS Project:
Figure: Implemented Loop Structure in Schedule Viewer
This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:
Figure: Synthesis Results Performance Estimates
The key numbers here are the latency numbers and the total LUT usage. For example, depending on the configuration, you could get a latency of 511 and a total LUT usage of 47. As you will see, these values can vary widely based on the implementation choices. While this implementation requires very little area, it results in significant latency.
Unrolling Loops
An OpenCL API loop can be unrolled with the attribute:
__attribute__((opencl_unroll_hint))
or a C/C++ loop can utilize the unroll pragma:
#pragma HLS UNROLL
When applied to this specific example, the Schedule Viewer in the HLS Project will be:
Figure: Schedule Viewer
With an estimated performance of:
Figure: Performance Estimates
As you can see, the total latency was considerably improved, to 127 cycles, and, as expected, the computational hardware increased to 4845 LUTs to perform the same computation in parallel.
However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface used for the variable out. The SDx™ environment uses dual-port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Thus, to see a fully parallel implementation, you must specify that the variable out should be kept in registers, as in this example:
#pragma HLS array_partition variable=out complete dim=0
For more information, see the pragma HLS array_partition section in the SDx Pragma Reference Guide.
The results of this transformation can be observed in the following Schedule Viewer:
Figure: Transformation Results in Schedule Viewer
The associated estimates are:
Figure: Transformation Results Performance Estimates
As you can see, this code can be implemented as a combinatorial function requiring only a fraction of a cycle to complete.
Pipelining Loops
Pipelining loops allows you to overlap iterations of a loop in time. Allowing iterations to operate concurrently is often a good compromise, as resources can be shared between iterations (less resource utilization), while requiring less execution time compared to loops that are executed sequentially.
Pipelining is enabled in C/C++ via the following pragma:
#pragma HLS PIPELINE
while the OpenCL API uses the following attributes:
__attribute__((xcl_pipeline_loop))
__attribute__((xcl_pipeline_workitems))
More details on these specifications are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide.
In this example, the Schedule Viewer in the HLS Project produces the following information:
Figure: Pipelining Loops in Schedule Viewer
With the overall estimates being:
Figure: Performance Estimates
Because each iteration of the loop consumes only two cycles of latency, there can only be a single iteration overlap. This still cuts the total latency in half compared to the original, resulting in 257 cycles. Moreover, this reduction in latency was achieved using fewer resources compared to unrolling.
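The latency numbers above follow the standard pipeline model latency = (trip_count - 1) * II + iteration_latency. A minimal sketch (the extra loop-entry cycle that brings 256 up to the reported 257 is an assumption on our part):

```cpp
#include <cassert>

// Classic pipeline latency model: after the first iteration fills the
// pipeline, each further iteration starts II cycles after the previous.
long pipelined_latency(long trip_count, long ii, long iter_latency) {
    return (trip_count - 1) * ii + iter_latency;
}
```

For the loop above (255 iterations, 2-cycle iterations), an II of 1 gives 256 cycles, which matches the reported 257 once one loop-entry cycle is added; the sequential case (II of 2, no overlap beyond the model) reproduces the original 510 cycles.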
In most cases, loop pipelining by itself can improve overall performance. However, the effectiveness of the pipelining will depend on the structure of the loop. Some common limitations are:
- Resources with limited availability, such as memory ports or process channels, can limit the overlap of the iterations (initiation interval).
- Similarly, loop-carried dependencies, such as those created by variables or conditions computed in one iteration affecting the next, might increase the initiation interval of the pipeline.
These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to eliminate these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array or breaking the dependencies altogether.
Task Parallelism
Task parallelism allows you to take advantage of dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel, taking advantage of the extra buffering introduced between the tasks.
Look at the following example:
void run (ap_uint<16> in[1024], ap_uint<16> out[1024]) {
  ap_uint<16> tmp[128];
  for(int i = 0; i < 8; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}
When this code is executed, the functions processA and processB are executed sequentially 128 times in a row. Given that the combined latency for processA and processB in the loop is 278, the total latency can be estimated as:
Figure: Performance Estimates
The extra cycle is due to loop setup and can be observed in the Schedule Viewer.
For C/C++ code, Task Parallelism is performed by adding the DATAFLOW pragma into the for-loop:
#pragma HLS DATAFLOW
For OpenCL API code, the same transformation is performed through the attribute:
__attribute__ ((xcl_dataflow))
Refer to the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide for more details regarding the specifics and limitations of these modifiers.
As illustrated by the estimates in the HLS Report, applying the transformation considerably improves the overall performance, effectively using a double (ping-pong) buffer scheme between the tasks:
Figure: Performance Estimates
The overall latency of the design has almost halved in this case due to the concurrent execution of the different tasks across the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, the total latency becomes:
(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles
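This arithmetic can be checked with a minimal sketch (illustrative latency models, not tool output; the cycle counts are taken from the example above):

```cpp
#include <cassert>

// Sequential: every iteration runs processA then processB back to back.
long sequential_latency(long iters, long task_cycles) {
    return iters * 2 * task_cycles;
}

// Dataflow: the tasks overlap across iterations, so only the very first
// processA and the very last processB are not hidden:
// (1 + (iters - 1) + 1) task slots in total.
long dataflow_latency(long iters, long task_cycles) {
    return (iters + 1) * task_cycles;
}
```

With 128 iterations and 139 cycles per task, the dataflow estimate is 129 * 139 = 17931 cycles, against 35584 cycles for the sequential schedule, which matches the "almost halved" observation.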
Using task parallelism is a very powerful way to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code can vary greatly. The coding guidelines for applying DATAFLOW effectively are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide. However, it is often necessary to actually look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Towards that end, the SDAccel environment provides the Detailed Kernel Trace, which nicely illustrates concurrent execution.
Figure: Detailed Kernel Trace
For this Detailed Kernel Trace, the tool displays the start of the dataflowed loop, as shown in the previous figure. It illustrates how processA starts up right away at the beginning of the loop, while processB waits until the completion of processA before it can start its first iteration. However, while processB completes the first iteration of the loop, processA begins operating on the second iteration, and so forth.
A more abstract representation of this information is presented in the Application Timeline (Host & Device) and Device Hardware Transaction View (device-only during hardware emulation).
Optimizing Compute Units
Data Width
One of the most important aspects of performance is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL™ API code might only utilize large data types, such as integers, even at the ports of the design. However, as the algorithm gets mapped to a fully configurable implementation, smaller data types, such as 10 or 12 bits, might often suffice. It is therefore beneficial to check the size of basic operations in the HLS Synthesis report during optimization. In general, when the SDx™ environment maps an algorithm onto the FPGA, much processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. To perform this mapping, the SDx environment generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.
In the following table, the basic operations and their bit widths are reported.
Figure: Operations Utilization Estimates
Look for bit widths of 16, 32, and 64 bits, commonly used in algorithmic descriptions, and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. Reducing these widths can considerably improve the implementation of the algorithm, as smaller operations require less computation time.
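When verifying whether an operation really needs its reported width, it helps to reason from the value range. The following small helper (an illustrative utility, not part of any tool) computes the minimum bit width for an unsigned value:

```cpp
#include <cassert>

// Smallest number of bits able to represent max_value as an unsigned
// quantity: the width the hardware operator actually needs.
int bits_needed(unsigned long max_value) {
    int bits = 1;
    while ((1UL << bits) - 1 < max_value) ++bits;
    return bits;
}
```

For example, a variable known never to exceed 1023 needs only 10 bits, even if the source declares it as a 32-bit int.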
Fixed Point Arithmetic
Some applications use floating-point computation only because they were optimized for another hardware architecture. As explained in Deep Learning with INT8 Optimization on Xilinx Devices, using fixed-point arithmetic for applications like deep learning can significantly improve power efficiency and area while maintaining the same level of accuracy. It is recommended to explore fixed-point arithmetic for your application before committing to floating-point operations.
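As a concrete illustration, fixed-point arithmetic can be emulated in plain C++ with integers; the sketch below uses a hypothetical Q8.8 format (in the SDx flow the HLS ap_fixed type would normally be used instead, and the chosen format and tolerance are assumptions):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Q8.8 fixed point: a 16-bit value with 8 fractional bits.
constexpr int FRAC_BITS = 8;

std::int16_t to_fixed(double x) {
    return static_cast<std::int16_t>(std::lround(x * (1 << FRAC_BITS)));
}

double to_double(std::int16_t q) {
    return static_cast<double>(q) / (1 << FRAC_BITS);
}

// Multiply in a wider type, then shift back to Q8.8: the same pattern a
// fixed-point multiplier implements in far less area than a float unit.
std::int16_t fixed_mul(std::int16_t a, std::int16_t b) {
    std::int32_t wide = static_cast<std::int32_t>(a) * b;
    return static_cast<std::int16_t>(wide >> FRAC_BITS);
}
```

The entire multiply reduces to one integer multiplication and a shift, which is why fixed-point datapaths are so much cheaper than floating-point ones on an FPGA.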
Macro Operations
It is sometimes advantageous to think about larger computational elements. The tool will operate on the source code of such a function independently of the remaining source code, effectively mapping the algorithm without consideration of surrounding operations onto the FPGA. When applied, the SDx tool keeps the operational boundaries, effectively creating macro operations for the specific code. This utilizes the following principles:
- Operational locality to the mapping process.
- Reduction in complexity for the heuristics.
The relevant controls are the C/C++ pragma:
#pragma HLS inline off
and the OpenCL API attribute:
__attribute__((always_inline))
Using Optimized Libraries
The OpenCL specification provides many math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions and will typically have better performance compared to the corresponding functions (without the native_ prefix). The accuracy, and in some cases the input ranges, of these functions is implementation-defined. In the SDAccel™ environment these native_ built-in functions use the equivalent functions in the Vivado® High-Level Synthesis (HLS) tool Math library, which are already optimized for Xilinx® FPGAs in terms of area and performance. Xilinx recommends that you use native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirements.
Optimizing Memory Architecture
Memory architecture is a key aspect of implementation. Due to the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:
void run (ap_uint<16> in[256][4], ap_uint<16> out[256]) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  ... // Preprocess input to local memory
  for(int j = 0; j < 256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for(int i = 0; i < 4; i++) {
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  }
  ... // Postprocess: write local memory to output
}
Figure: Performance Estimates
The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This can be observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.
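The latency breakdown can be reproduced with a one-line model (illustrative only; the 16-cycle inner loop and the 2 cycles of per-iteration overhead are the numbers quoted above):

```cpp
#include <cassert>

// Latency model for the nested loop above: each outer iteration pays
// the inner-loop cycles plus a fixed per-iteration overhead.
long outer_latency(long outer_iters, long inner_cycles, long overhead) {
    return outer_iters * (inner_cycles + overhead);
}
```

With 256 outer iterations, 16 inner-loop cycles, and 2 cycles of overhead (the reset of sum plus the output write), this gives 256 * 18 = 4608 cycles.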
Figure: Performance Estimates
However, this improvement is largely due to the fact that this process uses both ports of a dual-port memory. This can be seen in the Schedule Viewer in the HLS Project:
Figure: Schedule Viewer
As you can see, two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result, as it completely blocks other access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
This results in four array reads, all executed on different memories using a single port:
Figure: Executed Four Arrays Results
Figure: Performance Estimates
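The effect of the complete partition along dim=2 can be emulated behaviorally in plain C++ by splitting the array into four independent buffers, one per column (a sketch of the resulting access pattern, not the tool's actual transformation):

```cpp
#include <cassert>
#include <cstdint>

constexpr int ROWS = 256;

// After a complete partition on dim=2, inMem[256][4] behaves like four
// independent memories, one per column; each is read once per outer
// iteration, so a single port per memory suffices.
void run_partitioned(const std::uint16_t col0[ROWS], const std::uint16_t col1[ROWS],
                     const std::uint16_t col2[ROWS], const std::uint16_t col3[ROWS],
                     std::uint16_t out[ROWS]) {
    for (int j = 0; j < ROWS; ++j) {
        // One read per memory per cycle: no port conflicts.
        out[j] = static_cast<std::uint16_t>(col0[j] + col1[j] + col2[j] + col3[j]);
    }
}
```

All four reads of an iteration hit different memories, which is exactly why the partitioned version no longer saturates the two ports of a single dual-port RAM.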
Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the pragma:
#pragma HLS array_reshape variable=inMem complete dim=2
This results in the same latency as with array partitioning, but with a single memory using a single port:
Figure: Latency Result
Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making it the preferred solution.
void run (ap_uint<16> in[256][4], ap_uint<16> out[256]) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  #pragma HLS array_reshape variable=inMem complete dim=2
  ... // Preprocess input to local memory
  for(int j = 0; j < 256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for(int i = 0; i < 4; i++) {
      #pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  }
  ... // Postprocess: write local memory to output
}