Kernel Optimization
One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput vs. power consumption. The downside of creating custom logic is that the design needs to go through the traditional FPGA design flow.
The following guidelines help manage the design complexity and achieve the desired design goals.
Interface Attributes (Detailed Kernel Trace)
The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.
Figure: Accelerated Algorithm Kernel Trace
The fields most relevant to performance are:
- Burst Length
- Describes how many packets are sent within one transaction
- Burst Size
- Describes the number of bytes being transferred as part of one packet
Given a burst length of 1 and just 4 bytes per packet, it would require many individual AXI transactions to transfer any reasonable amount of data.
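To make that cost concrete, the following standalone C++ sketch (illustrative only; the function name and the 4 KB buffer in the usage note are assumptions, not part of any Xilinx API) counts the AXI transactions needed to move a buffer at a given burst length and burst size:

```cpp
#include <cassert>
#include <cstddef>

// Number of AXI transactions needed to move total_bytes, given a burst
// length (packets per transaction) and burst size (bytes per packet).
std::size_t axi_transactions(std::size_t total_bytes,
                             std::size_t burst_length,
                             std::size_t burst_size) {
    std::size_t bytes_per_transaction = burst_length * burst_size;
    // Round up: a partial final burst still costs one transaction.
    return (total_bytes + bytes_per_transaction - 1) / bytes_per_transaction;
}
```

With a burst length of 1 and 4-byte packets, a 4 KB buffer costs 1,024 transactions; with bursts of 16 packets of 64 bytes each, the same buffer moves in 4 transactions.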
Using Burst Data Transfers
Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller.
If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:
Figure: Burst Data Transfer with Detailed Kernel Trace
In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are implemented rather differently (shorter transaction time). If you hover over these transactions, you would see that the AXI interconnect has packed the 16 x 4-byte transaction into a single transaction of 1 x 64 bytes. This effectively utilizes the AXI4 bandwidth, which is even more favorable. The next section focuses on this optimization technique in more detail.
Burst inference is heavily dependent on coding style and access pattern. To avoid potential modeling pitfalls, refer to the SDAccel Environment Programmers Guide (UG1277). However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:
void kernel(T in[1024], T out[1024]) {
  T tmpIn[1024];
  T tmpOut[1024];
  read(in, tmpIn);
  process(tmpIn, tmpOut);
  write(tmpOut, out);
}
In short, the function read is responsible for reading from the AXI input to an internal variable (tmpIn). The computation is implemented by the function process, working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes it to the AXI output.
The isolation of the read and write functions from the computation results in:
- Simple control structures (loops) in the read/write functions, which makes burst detection simpler.
- The isolation of the computational function away from the AXI interfaces simplifies potential kernel optimization. See the Kernel Optimization chapter for more information.
- The internal variables are mapped to on-chip memory, which allows faster access compared to AXI transactions. Acceleration platforms supported in the SDAccel environment can have as much as 10 MB of on-chip memory that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.
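The structure described above can be sketched as plain, self-contained C++ (a hypothetical int kernel standing in for the templated snippet; the doubling in process is an arbitrary example computation, and the HLS interface pragmas a real kernel would carry are omitted):

```cpp
#include <cassert>

constexpr int N = 1024;

// Simple, burst-friendly copy loop: sequential accesses, no computation.
static void read_in(const int in[N], int tmp[N]) {
    for (int i = 0; i < N; ++i) tmp[i] = in[i];
}

// The computation touches only on-chip buffers, never the AXI interface.
static void process(const int tmpIn[N], int tmpOut[N]) {
    for (int i = 0; i < N; ++i) tmpOut[i] = tmpIn[i] * 2;
}

// Another simple sequential loop, again easy to infer as a burst.
static void write_out(const int tmp[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = tmp[i];
}

void kernel(const int in[N], int out[N]) {
    int tmpIn[N];
    int tmpOut[N];
    read_in(in, tmpIn);
    process(tmpIn, tmpOut);
    write_out(tmpOut, out);
}
```

The read and write loops contain nothing but a sequential copy, which is exactly the control structure burst inference handles best.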
Using Full AXI Data Width
The user data width between the kernel and the memory controller can be configured by the SDAccel compiler based on the data types of the kernel arguments. To maximize the data throughput, Xilinx recommends that you choose data types that map to the full data width of the memory controller. The memory controller in all supported acceleration cards supports a 512-bit user interface, which can be mapped to OpenCL™ vector data types, such as int16, or the C/C++ arbitrary precision data type ap_int<512>.
As shown in the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit packet size (Burst Size 64 bytes).
Figure: Burst AXI Transactions
This example shows a good interface configuration, as it maximizes the AXI data width and uses actual burst transactions.
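Outside the HLS flow, ap_int<512> and the OpenCL int16 type are not available, but the idea of moving 16 32-bit values per transfer can be sketched with a plain C++ 16-lane aggregate (a hypothetical stand-in for the wide interface type, not the actual API):

```cpp
#include <array>
#include <cassert>

// Hypothetical 512-bit word: 16 lanes of 32-bit data, mirroring the
// int16 / ap_int<512> types mentioned in the text.
using Word512 = std::array<int, 16>;

// One wide word in, one wide word out: 16 additions per "transfer",
// instead of 16 separate narrow transactions.
Word512 add_wide(const Word512& a, const Word512& b) {
    Word512 c{};
    for (int lane = 0; lane < 16; ++lane)
        c[lane] = a[lane] + b[lane];
    return c;
}
```

Each call consumes a full 64-byte word, which is the behavior the 512-bit interface provides in hardware.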
Complex structs or classes used to declare interfaces can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.
OpenCL Attributes
The OpenCL API provides attributes to support a more automatic approach to increasing AXI data width usage. Changing the interface data types, as stated above, is supported in the API as well, but requires the same code changes as in C/C++ to accommodate the larger input vector.
To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm. A detailed description can be found in the SDx Pragma Reference Guide (UG1253).
- vec_type_hint
- reqd_work_group_size
- xcl_zero_global_work_offset
Examine the combined functionality in the following case:
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
  size_t idx = get_global_id(0);
  c[idx] = a[idx] + b[idx];
}
In this case, the hard-coded interface is a 32-bit wide data path (int* c, int* a, int* b), which drastically limits the memory throughput if implemented directly. However, automatic widening and transformation is applied based on the values of the three attributes.
- __attribute__((vec_type_hint(int)))
- Declares that int is the main type used for computation and memory transfer (32 bits). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32 bits. This implies that, in theory, 16 values could be processed if vectorization can be applied.
- __attribute__((reqd_work_group_size(X, Y, Z)))
- Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items, therefore defining the maximum possible vectorization factor that would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1 = 64.
The actual vectorization factor to be applied will be the greatest common divisor of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.
The quotient of the maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.
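The factor arithmetic described in the two attributes above can be written out explicitly; the following sketch (an illustrative helper, not a tool API) reproduces the numbers from this example:

```cpp
#include <cassert>
#include <numeric>  // std::gcd

struct Vectorization {
    int applied_factor;   // lanes actually processed in parallel
    int remaining_loops;  // pipelined iterations left per work group
};

// axi_bits / type_bits gives the widening factor from vec_type_hint;
// the applied factor is its gcd with the work-group size, and the
// quotient is the remaining pipelined loop count.
Vectorization plan(int axi_bits, int type_bits, int work_group_size) {
    int hint_factor = axi_bits / type_bits;  // e.g. 512 / 32 = 16
    int applied = std::gcd(hint_factor, work_group_size);
    return {applied, work_group_size / applied};
}
```

For the kernel above, plan(512, 32, 64) yields an applied factor of 16 with 4 remaining pipelined iterations.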
There is one optional attribute that is highly recommended for performance optimization on OpenCL interfaces:
- __attribute__((xcl_zero_global_work_offset))
- Instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information with regard to the alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (less hardware).
Note that applying these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of the local arrays in which data is stored. This usually behaves well, but can interact poorly in rare situations.
For example:
- For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor:
- The resulting access requires many multiplexers and creates a difficult problem for the scheduler (it might severely increase memory usage and compilation time). Xilinx recommends that you use partitioning factors that are powers of two (as the vectorization factor is always a power of two).
- If the loop being vectorized has an unrelated resource constraint, the scheduler reports that the II (initiation interval) is not met:
- This is not necessarily correlated with a loss of performance (usually the design still performs better), since the II is computed on the unrolled loop (which therefore has a multiplied throughput for each iteration).
- The scheduler informs you of the possible resource constraints; resolving them will further improve performance.
- Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in a non-vectorizable way).
Reducing Kernel to Kernel Communication Latency with OpenCL Pipes
The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency.
In the SDAccel development environment, pipes must be statically defined outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not currently supported. The depth of a pipe must be specified by using the xcl_reqd_pipe_depth attribute in the pipe declaration.
The valid depth values are as follows:
- 16
- 32
- 64
- 128
- 256
- 512
- 1024
- 2048
- 4096
- 8192
- 16384
- 32768
A given pipe can have one and only one producer and consumer, each in a different kernel. For example:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
Pipes can be accessed using the standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode, or using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions. See the OpenCL C Specification, Version 2.0 from Khronos Group for more details on these built-in functions.
The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types.
int read_pipe_block (pipe gentype p, gentype *ptr)
int write_pipe_block (pipe gentype p, const gentype *ptr)
The following “Blocking Pipes Example” from SDAccel Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));

// Input Stage Kernel: Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size) {
  __attribute__((xcl_pipeline_loop))
  mem_rd: for (int i = 0; i < size; i++) {
    // Blocking write command to Pipe P0
    write_pipe_block(p0, &input[i]);
  }
}

// Adder Stage Kernel: Read input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size) {
  __attribute__((xcl_pipeline_loop))
  execute: for (int i = 0; i < size; i++) {
    int input_data, output_data;
    // Blocking read command from Pipe P0
    read_pipe_block(p0, &input_data);
    output_data = input_data + inc;
    // Blocking write command to Pipe P1
    write_pipe_block(p1, &output_data);
  }
}

// Output Stage Kernel: Read result from Pipe P1 and write the result to
// Global Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size) {
  __attribute__((xcl_pipeline_loop))
  mem_wr: for (int i = 0; i < size; i++) {
    // Blocking read command from Pipe P1
    read_pipe_block(p1, &output[i]);
  }
}
The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.
Figure: Device Traceline View
Optimizing Computational Parallelism
By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. The OpenCL API, on the other hand, does model computational parallelism with respect to work groups, but it does not use any additional parallelism within the algorithm description. However, fully configurable computational engines like FPGAs allow more freedom to exploit computational parallelism.
Coding Data Parallelism
To leverage computational parallelism during the implementation of an algorithm on the FPGA, the synthesis tool must first be able to recognize computational parallelism in the source code. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism, as in some cases the SDx tool might not be able to apply the desired transformation due to the structure of the source code.
It is quite common that some computational parallelism might not be reflected in the source code to begin with. In this case, it will need to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in the Using Full AXI Data Width section. A 512-bit interface can be created using OpenCL vector data types such as int16 or the C/C++ arbitrary precision data type ap_int<512>. Refer to the Median Filter Example in the vision category at SDAccel Getting Started Examples on GitHub for the recommended way to use vectors.
Loop Parallelism
Loops are the basic C/C++/OpenCL™ API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:
for(int i = 0; i < 255; i++) {
  out[i] = in[i] + in[i+1];
}
out[255] = in[255];
This code iterates over an array of values and adds consecutive values, except for the last value. If this loop is implemented as written, each loop iteration requires two cycles, which results in a total of 510 cycles. This can be analyzed in detail through the Schedule Viewer in the HLS Project:
Figure: Implemented Loop Structure in Schedule Viewer
This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:
Figure: Synthesis Results Performance Estimates
The key numbers here are the latency numbers and the total LUT usage. For example, depending on the configuration, you could get a latency of 511 and a total LUT usage of 47. As you will see, these values can vary widely based on the implementation choices. While this implementation requires very little area, it results in significant latency.
Unrolling Loops
An OpenCL API loop can be unrolled with the attribute:
__attribute__((opencl_unroll_hint))
or a C/C++ loop can utilize the unroll pragma:
#pragma HLS UNROLL
When applied to this specific example, the Schedule Viewer in the HLS Project will be:
Figure: Schedule Viewer
With an estimated performance of:
Figure: Performance Estimates
As you can see, the total latency was considerably improved, to 127 cycles, and, as expected, the computational hardware increased to 4845 LUTs to perform the same computation in parallel.
However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface used for the variable out. The SDx™ environment uses dual-port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Thus, to see a fully parallel implementation, you must specify that the variable out should be kept in registers, as in this example:
#pragma HLS array_partition variable=out complete dim=0
For more information, see the pragma HLS array_partition section in the SDx Pragma Reference Guide.
The results of this transformation can be observed in the following Schedule Viewer:
Figure: Transformation Results in Schedule Viewer
The associated estimates are:
Figure: Transformation Results Performance Estimates
As you can see, this code can be implemented as a combinatorial function requiring only a fraction of a cycle to complete.
Pipelining Loops
Pipelining loops allows you to overlap iterations of a loop in time. Allowing iterations to operate concurrently is often a good compromise, as resources can be shared between iterations (less resource utilization), while requiring less execution time compared to loops that are executed sequentially.
Pipelining is enabled in C/C++ via the following pragma:
#pragma HLS PIPELINE
while the OpenCL API uses the following attributes:
__attribute__((xcl_pipeline_loop))
__attribute__((xcl_pipeline_workitems))
More details on these specifications are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide.
In this example, the Schedule Viewer in the HLS Project produces the following information:
Figure: Pipelining Loops in Schedule Viewer
With the overall estimates being:
Figure: Performance Estimates
Because each iteration of the loop consumes only two cycles of latency, there can only be a single iteration overlap. This still cuts the total latency in half compared to the original, resulting in 257 cycles. Moreover, this reduction in latency was achieved using fewer resources compared to unrolling.
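The latency numbers above follow the standard pipeline model latency = (trip_count - 1) * II + iteration_latency. A minimal sketch (the extra loop-entry cycle that brings 256 up to the reported 257 is an assumption on our part):

```cpp
#include <cassert>

// Classic pipeline latency model: after the first iteration fills the
// pipeline, each further iteration starts II cycles after the previous.
long pipelined_latency(long trip_count, long ii, long iter_latency) {
    return (trip_count - 1) * ii + iter_latency;
}
```

For the loop above (255 iterations, 2-cycle iterations), an II of 1 gives 256 cycles, which matches the reported 257 once one loop-entry cycle is added; the sequential case (II of 2, no overlap beyond the model) reproduces the original 510 cycles.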
In most cases, loop pipelining by itself can improve overall performance. However, the effectiveness of the pipelining will depend on the structure of the loop. Some common limitations are:
- Resources with limited availability, such as memory ports or process channels, can limit the overlap of the iterations (initiation interval).
- Similarly, loop-carried dependencies, such as those created by variables or conditions computed in one iteration affecting the next, might increase the initiation interval of the pipeline.
These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to eliminate these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array or breaking the dependencies altogether.
Task Parallelism
Task parallelism allows you to take advantage of dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel, taking advantage of the extra buffering introduced between the tasks.
Look at the following example:
void run (ap_uint<16> in[1024], ap_uint<16> out[1024]) {
  ap_uint<16> tmp[128];
  for(int i = 0; i < 8; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}
When this code is executed, the functions processA and processB are executed sequentially 128 times in a row. Given that the combined latency for processA and processB in the loop is 278, the total latency can be estimated as:
Figure: Performance Estimates
The extra cycle is due to loop setup and can be observed in the Schedule Viewer.
For C/C++ code, Task Parallelism is performed by adding the DATAFLOW pragma into the for-loop:
#pragma HLS DATAFLOW
For OpenCL API code, the same transformation is performed through the attribute:
__attribute__ ((xcl_dataflow))
Refer to the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide for more details regarding the specifics and limitations of these modifiers.
As illustrated by the estimates in the HLS Report, applying the transformation considerably improves the overall performance, effectively using a double (ping-pong) buffer scheme between the tasks:
Figure: Performance Estimates
The overall latency of the design has almost halved in this case due to the concurrent execution of the different tasks across the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, the total latency becomes:
(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles
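This arithmetic can be checked with a minimal sketch (illustrative latency models, not tool output; the cycle counts are taken from the example above):

```cpp
#include <cassert>

// Sequential: every iteration runs processA then processB back to back.
long sequential_latency(long iters, long task_cycles) {
    return iters * 2 * task_cycles;
}

// Dataflow: the tasks overlap across iterations, so only the very first
// processA and the very last processB are not hidden:
// (1 + (iters - 1) + 1) task slots in total.
long dataflow_latency(long iters, long task_cycles) {
    return (iters + 1) * task_cycles;
}
```

With 128 iterations and 139 cycles per task, the dataflow estimate is 129 * 139 = 17931 cycles, against 35584 cycles for the sequential schedule, which matches the "almost halved" observation.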
Using task parallelism is a very powerful way to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code can vary greatly. The coding guidelines for applying DATAFLOW effectively are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide. However, it is often necessary to actually look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Towards that end, the SDAccel environment provides the Detailed Kernel Trace, which nicely illustrates concurrent execution.
Figure: Detailed Kernel Trace
For this Detailed Kernel Trace, the tool displays the start of the dataflowed loop, as shown in the previous figure. It illustrates how processA starts up right away at the beginning of the loop, while processB waits until the completion of processA before it can start its first iteration. However, while processB completes the first iteration of the loop, processA begins operating on the second iteration, and so forth.
A more abstract representation of this information is presented in the Application Timeline (Host & Device) and Device Hardware Transaction View (device-only during hardware emulation).
Optimizing Compute Units
Data Width
One of the most important aspects of performance is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL™ API code might only utilize large data types, such as integers, even at the ports of the design. However, as the algorithm gets mapped to a fully configurable implementation, smaller data types, such as 10 or 12 bits, might often suffice. It is therefore beneficial to check the size of basic operations in the HLS Synthesis report during optimization. In general, when the SDx™ environment maps an algorithm onto the FPGA, much processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. To perform this mapping, the SDx environment generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.
In the following table, the basic operations and their bit widths are reported.
Figure: Operations Utilization Estimates
Look for bit widths of 16, 32, and 64 bits, commonly used in algorithmic descriptions, and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. Reducing these widths can considerably improve the implementation of the algorithm, as smaller operations require less computation time.
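When verifying whether an operation really needs its reported width, it helps to reason from the value range. The following small helper (an illustrative utility, not part of any tool) computes the minimum bit width for an unsigned value:

```cpp
#include <cassert>

// Smallest number of bits able to represent max_value as an unsigned
// quantity: the width the hardware operator actually needs.
int bits_needed(unsigned long max_value) {
    int bits = 1;
    while ((1UL << bits) - 1 < max_value) ++bits;
    return bits;
}
```

For example, a variable known never to exceed 1023 needs only 10 bits, even if the source declares it as a 32-bit int.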
Fixed Point Arithmetic
Some applications use floating-point computation only because they were optimized for another hardware architecture. As explained in Deep Learning with INT8 Optimization on Xilinx Devices, using fixed-point arithmetic for applications like deep learning can significantly improve power efficiency and area while maintaining the same level of accuracy. It is recommended to explore fixed-point arithmetic for your application before committing to floating-point operations.
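As a concrete illustration, fixed-point arithmetic can be emulated in plain C++ with integers; the sketch below uses a hypothetical Q8.8 format (in the SDx flow the HLS ap_fixed type would normally be used instead, and the chosen format and tolerance are assumptions):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Q8.8 fixed point: a 16-bit value with 8 fractional bits.
constexpr int FRAC_BITS = 8;

std::int16_t to_fixed(double x) {
    return static_cast<std::int16_t>(std::lround(x * (1 << FRAC_BITS)));
}

double to_double(std::int16_t q) {
    return static_cast<double>(q) / (1 << FRAC_BITS);
}

// Multiply in a wider type, then shift back to Q8.8: the same pattern a
// fixed-point multiplier implements in far less area than a float unit.
std::int16_t fixed_mul(std::int16_t a, std::int16_t b) {
    std::int32_t wide = static_cast<std::int32_t>(a) * b;
    return static_cast<std::int16_t>(wide >> FRAC_BITS);
}
```

The entire multiply reduces to one integer multiplication and a shift, which is why fixed-point datapaths are so much cheaper than floating-point ones on an FPGA.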
Macro Operations
It is sometimes advantageous to think about larger computational elements. The tool will operate on the source code of such a function independently of the remaining source code, effectively mapping the algorithm without consideration of surrounding operations onto the FPGA. When applied, the SDx tool keeps the operational boundaries, effectively creating macro operations for the specific code. This utilizes the following principles:
- Operational locality to the mapping process.
- Reduction in complexity for the heuristics.
The relevant controls are the C/C++ pragma:
#pragma HLS inline off
and the OpenCL API attribute:
__attribute__((always_inline))
Using Optimized Libraries
The OpenCL specification provides many math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions and will typically have better performance compared to the corresponding functions (without the native_ prefix). The accuracy, and in some cases the input ranges, of these functions is implementation-defined. In the SDAccel™ environment these native_ built-in functions use the equivalent functions in the Vivado® High-Level Synthesis (HLS) tool Math library, which are already optimized for Xilinx® FPGAs in terms of area and performance. Xilinx recommends that you use native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirements.
Optimizing Memory Architecture
Memory architecture is a key aspect of implementation. Due to the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:
void run (ap_uint<16> in[256][4], ap_uint<16> out[256]) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  ... // Preprocess input to local memory
  for(int j = 0; j < 256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for(int i = 0; i < 4; i++) {
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  }
  ... // Postprocess: write local memory to output
}
Figure: Performance Estimates
The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This can be observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.
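The latency breakdown can be reproduced with a one-line model (illustrative only; the 16-cycle inner loop and the 2 cycles of per-iteration overhead are the numbers quoted above):

```cpp
#include <cassert>

// Latency model for the nested loop above: each outer iteration pays
// the inner-loop cycles plus a fixed per-iteration overhead.
long outer_latency(long outer_iters, long inner_cycles, long overhead) {
    return outer_iters * (inner_cycles + overhead);
}
```

With 256 outer iterations, 16 inner-loop cycles, and 2 cycles of overhead (the reset of sum plus the output write), this gives 256 * 18 = 4608 cycles.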
Figure: Performance Estimates
However, this improvement is largely due to the fact that this process uses both ports of a dual-port memory. This can be seen in the Schedule Viewer in the HLS Project:
Figure: Schedule Viewer
As you can see, two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result, as it completely blocks other access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
This results in four array reads, all executed on different memories using a single port:
Figure: Executed Four Arrays Results
Figure: Performance Estimates
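The effect of the complete partition along dim=2 can be emulated behaviorally in plain C++ by splitting the array into four independent buffers, one per column (a sketch of the resulting access pattern, not the tool's actual transformation):

```cpp
#include <cassert>
#include <cstdint>

constexpr int ROWS = 256;

// After a complete partition on dim=2, inMem[256][4] behaves like four
// independent memories, one per column; each is read once per outer
// iteration, so a single port per memory suffices.
void run_partitioned(const std::uint16_t col0[ROWS], const std::uint16_t col1[ROWS],
                     const std::uint16_t col2[ROWS], const std::uint16_t col3[ROWS],
                     std::uint16_t out[ROWS]) {
    for (int j = 0; j < ROWS; ++j) {
        // One read per memory per cycle: no port conflicts.
        out[j] = static_cast<std::uint16_t>(col0[j] + col1[j] + col2[j] + col3[j]);
    }
}
```

All four reads of an iteration hit different memories, which is exactly why the partitioned version no longer saturates the two ports of a single dual-port RAM.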
Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the pragma:
#pragma HLS array_reshape variable=inMem complete dim=2
This results in the same latency as with array partitioning, but with a single memory using a single port:
Figure: Latency Result
Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making it the preferred solution.
void run (ap_uint<16> in[256][4], ap_uint<16> out[256]) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  #pragma HLS array_reshape variable=inMem complete dim=2
  ... // Preprocess input to local memory
  for(int j = 0; j < 256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for(int i = 0; i < 4; i++) {
      #pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  }
  ... // Postprocess: write local memory to output
}