OpenCL Attributes

Optimizations in OpenCL

This section describesOpenCL™attributes that can be added to source code to assist system optimization by theSDAccel™development environment, andVivado®High-Level Synthesis (HLS) tool synthesis.

TheSDx™environment providesOpenCLattributes to optimize your code for data movement and kernel performance. The goal of data movement optimization is to maximize the system level data throughput by maximizing interface bandwidth usage and DDR bandwidth usage. The goal of kernel computation optimization is to create processing logic that can consume all the data as soon as they arrive at kernel interfaces. This is generally achieved by expanding the processing code to match the data path with techniques such as function inlining and pipelining, loop unrolling, array partitioning, dataflowing, etc.

TheOpenCLattributes include the types specified below:

Table 1.OpenCLAttributes by Type
Type	Attributes
Kernel Optimization	reqd_work_group_size vec_type_hint work_group_size_hint xcl_latency xcl_max_work_group_size xcl_zero_global_work_offset
Function Inlining	always_inline
Task-level Pipeline	xcl_latency xcl_reqd_pipe_depth
Pipeline	xcl_pipeline_loop xcl_pipeline_workitems
Loop Optimization	opencl_unroll_hint xcl_loop_tripcount xcl_pipeline_loop
Array Optimization	xcl_array_partition xcl_array_reshape Note:Array variables only accept a single array optimization attribute.

TIP:The SDAccelcompiler also supports many of the standard attributes supported by gcc, such as:

ALWAYS_INLINE
NOINLINE
UNROLL
NOUNROLL

always_inline

Description

The ALWAYS_INLINE attribute indicates that a function must be inlined. This attribute is a standard feature of GCC, and a standard feature of theSDxtool compilers.

TIP:The NOINLINE attribute is also a standard feature of GCC, and is also supported by SDxtool compilers.

This attribute enables a compiler optimization to have a function inlined into the calling function. The inlined function is dissolved and no longer appears as a separate level of hierarchy in the RTL.

In some cases, inlining a function allows operations within the function to be shared and optimized more effectively with surrounding operations in the calling function. However, an inlined function can no longer be shared with other functions, so the logic might be duplicated between the inlined function and a separate instance of the function which can be more broadly shared. While this can improve performance, this will also increase the area required for implementing the RTL.

ForOpenCLkernels, theSDxcompiler uses its own rules to inline or not inline a function. To directly control inlining functions, use the ALWAYS_INLINE or NOINLINE attributes.

By default, inlining is only performed on the next level of function hierarchy, not sub-functions.

IMPORTANT:When used with the XCL_DATAFLOW attribute, the compiler will ignore the ALWAYS_INLINE attribute and not inline the function.

Syntax

Place the attribute in the OpenCLAPI source before the function definition to always have it inlined whenever the function is called.

__attribute__((always_inline))

Examples

This example adds the ALWAYS_INLINE attribute to functionfoo:

__attribute__((always_inline)) void foo ( a, b, c, d ) { ... }

This example prevents the inlining of the functionfoo:

__attribute__((noinline)) void foo ( a, b, c, d ) { ... }

opencl_unroll_hint

Description

IMPORTANT:This is a compiler hint which the compiler may ignore.

Loop unrolling is the first optimization technique available in theSDAcceldevelopment environment. The purpose of the loop unroll optimization is to expose concurrency to the compiler. This newly exposed concurrency reduces latency and improves performance, but also consumes more FPGA fabric resources.

The OPENCL_UNROLL_HINT attribute is part of theOpenCLLanguage Specification, and specifies that loops (for,while,do) can be unrolled by theOpenCLcompiler.

The OPENCL_UNROLL_HINT attribute qualifier must appear immediately before the loop to be affected. You can use this attribute to specify full unrolling of the loop, partial unrolling by a specified amount, or to disable unrolling of the loop.

Syntax

Place the attribute in theOpenCLsource before the loop definition:

__attribute__((opencl_unroll_hint()))

Where:

is an optional loop unrolling factor and must be a positive integer, or compile time constant expression. An unroll factor of 1 disables unrolling.
TIP:If is not specified, the compiler automatically determines the unrolling factor for the loop.

Example 1

The following example unrolls theforloop by a factor of 2. This results in two parallel loop iterations instead of four sequential iterations for the compute unit to complete the operation.

__attribute__((opencl_unroll_hint(2))) for(int i = 0; i < LENGTH; i++) { bufc[i] = bufa[i] * bufb[i]; }

Conceptually the compiler transforms the loop above to the following code:

for(int i = 0; i < LENGTH; i+=2) { bufc[i] = bufa[i] * bufb[i]; bufc[i+1] = bufa[i+1] * bufb[i+1]; }

reqd_work_group_size

Description

WhenOpenCLAPI kernels are submitted for execution on anOpenCLdevice, they execute within an index space, called an ND range, which can have 1, 2, or 3 dimensions. This is called the global size in theOpenCLAPI. The work-group size defines the amount of the ND range that can be processed by a single invocation of a kernel compute unit. The work-group size is also called the local size in theOpenCLAPI. TheOpenCLcompiler can determine the work-group size based on the properties of the kernel and selected device. After the work-group size (local size) is determined, the ND range (global size) is divided automatically into work-groups, and the work-groups are scheduled for execution on the device.

Although theOpenCLcompiler can define the work-group size, the specification of the REQD_WORK_GROUP_SIZE attribute on the kernel to define the work-group size is highly recommended for FPGA implementations of the kernel. The attribute is recommended for performance optimization during the generation of the custom logic for a kernel. SeeOpenCL Execution Modelin theSDAccel Environment Profiling and Optimization Guide(UG1207)for more information.

TIP:In the case of an FPGA implementation, the specification of the REQD_WORK_GROUP_SIZE attribute is highly recommended as it can be used for performance optimization during the generation of the custom logic for a kernel.

OpenCLkernel functions are executed exactly one time for each point in the ND range index space. This unit of work for each point in the ND range is called a work-item. Work-items are organized into work-groups, which are the unit of work scheduled onto compute units. The optional REQD_WORK_GROUP_SIZE attribute defines the work-group size of a compute unit that must be used as thelocal_work_sizeargument toclEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel.

Syntax

Place this attribute before the kernel definition, or before the primary function specified for the kernel:

__attribute__((reqd_work_group_size(, , )))

Where:

, , : Specifies the ND range of the kernel. This represents each dimension of a three dimensional matrix specifying the size of the work-group for the kernel.

Examples

The followingOpenCLC kernel code shows a vector addition design where two arrays of data are summed into a third array. The required size of the work-group is 16x1x1. This kernel will execute 16 times to produce a valid result.

#include  // For VHLS OpenCL C kernels, the full work group is synthesized __attribute__ ((reqd_work_group_size(16, 1, 1))) __kernel void vadd(__global int* a, __global int* b, __global int* c) { int idx = get_global_id(0); c[idx] = a[idx] + b[idx]; }

vec_type_hint

Description

IMPORTANT:This is a compiler hint which the compiler may ignore.

The optional__attribute__((vec_type_hint()))is part of theOpenCLLanguage Specification, and hints to theOpenCLcompiler representing the computational width of the kernel, providing a basis for calculating processor bandwidth usage when the compiler is looking to auto-vectorize the code.

By default, the kernel is assumed to have the__attribute__((vec_type_hint(int)))qualifier. This lets you specify a different vectorization type.

Implicit in autovectorization is the assumption that any libraries called from the kernel must be re-compilable at runtime to handle cases where the compiler decides to merge or separate workitems. This means that these libraries can never be hard-coded binaries or that hard-coded binaries must be accompanied either by source or some re-targetable intermediate representation. This might be a code security question for some.

Syntax

Place this attribute before the kernel definition, or before the primary function specified for the kernel:

__attribute__((vec_type_hint()))

Where:

: is one of the built-in vector types listed in the following table, or the constituent scalar element types.
Note:When not specified, the kernel is assumed to have an INT type.

Table 2.Vector Types
Type	Description
char	A vector of 8-bit signed two’s complement integer values.
uchar	A vector of 8-bit unsigned integer values.
short	A vector of 16-bit signed two’s complement integer values.
ushort	A vector of 16-bit unsigned integer values.
int	A vector of 32-bit signed two’s complement integer values.
uint	A vector of 32-bit unsigned integer values.
long	A vector of 64-bit signed two’s complement integer values.
ulong	A vector of 64-bit unsigned integer values.
float	A vector of 32-bit floating-point values.
double	A vector of 64-bit floating-point values.

Note: is assumed to be 1 when not specified. The vector data type names defined above where is any value other than 2, 3, 4, 8 and 16, are also reserved. Therefore, < n> can only be specified as 2,3,4,8, and 16.

Examples

The following example autovectorizes assuming double-wide integer as the basic computation width:

#include  // For VHLS OpenCL C kernels, the full work group is synthesized __attribute__((vec_type_hint(double))) __attribute__ ((reqd_work_group_size(16, 1, 1))) __kernel void ...

work_group_size_hint

Description

IMPORTANT:This is a compiler hint, which the compiler may ignore.

The work-group size in theOpenCLAPI standard defines the size of the ND range space that can be handled by a single invocation of a kernel compute unit. WhenOpenCLkernels are submitted for execution on anOpenCLdevice, they execute within an index space, called an ND range, which can have 1, 2, or 3 dimensions.See "OpenCL Execution Model" inSDAccel Environment Profiling and Optimization Guide(UG1207)for more information.

OpenCLkernel functions are executed exactly one time for each point in the ND range index space. This unit of work for each point in the ND range is called a work-item. Unlikeforloops in C, where loop iterations are executed sequentially and in-order, anOpenCLruntime and device is free to execute work-items in parallel and in any order.

Work-items are organized into work-groups, which are the unit of work scheduled onto compute units. The optional WORK_GROUP_SIZE_HINT attribute is part of theOpenCLLanguage Specification, and is a hint to the compiler that indicates the work-group size value most likely to be specified by thelocal_work_sizeargument toclEnqueueNDRangeKernel. This allows the compiler to optimize the generated code according to the expected value.

TIP:In the case of an FPGA implementation, the specification of the REQD_WORK_GROUP_SIZE attribute, instead of the WORK_GROUP_SIZE_HINT is highly recommended because it can be used for performance optimization during the generation of the custom logic for a kernel.

Syntax

Place this attribute before the kernel definition, or before the primary function specified for the kernel:

__attribute__((work_group_size_hint(, , )))

Where:

, , : Specifies the ND range of the kernel. This represents each dimension of a three dimensional matrix specifying the size of the work-group for the kernel.

Examples

The following example is a hint to the compiler that the kernel will most likely be executed with a work-group size of 1:

__attribute__((work_group_size_hint(1, 1, 1))) __kernel void ...

xcl_array_partition

Description

IMPORTANT:Array variables only accept one attribute. While XCL_ARRAY_PARTITION does support multi-dimensional arrays, you can only reshape one dimension of the array with a single attribute.

An advantage of using the FPGA over other compute devices forOpenCLprograms is the ability for the application programmer to customize the memory architecture all throughout the system and into the compute unit. By default, theSDAccelcompiler generates a memory architecture within the compute unit that maximizes local and private memory bandwidth based on static code analysis of the kernel code. Further optimization of these memories is possible based on attributes in the kernel source code, which can be used to specify physical layouts and implementations of local and private memories. The attribute in theSDAccelcompiler to control the physical layout of memories in a compute unit isarray_partition.

For one-dimensional arrays, the XCL_ARRAY_PARTITION attribute implements an array declared within kernel code as multiple physical memories instead of a single physical memory. The selection of which partitioning scheme to use depends on the specific application and its performance goals. The array partitioning schemes available in theSDAcceltool compiler arecyclic,block, andcomplete.

Syntax

Place the attribute with the definition of the array variable:

__attribute__((xcl_array_partition(, , )))

Where:

: Specifies one of the following partition types:
- cyclic: Cyclic partitioning is the implementation of an array as a set of smaller physical memories that can be accessed simultaneously by the logic in the compute unit. The array is partitioned cyclically by putting one element into each memory before coming back to the first memory to repeat the cycle until the array is fully partitioned.
- block: Block partitioning is the physical implementation of an array as a set of smaller memories that can be accessed simultaneously by the logic inside of the compute unit. In this case, each memory block is filled with elements from the array before moving on to the next memory.
- complete: Complete partitioning decomposes the array into individual elements. For a one-dimensional array, this corresponds to resolving a memory into individual registers. The defaultiscomplete.
: For cyclic type partitioning, thespecifies how many physical memories to partition the original array into in the kernel code. For block type partitioning, thespecifies the number of elements from the original array to store in each physical memory.
IMPORTANT:For completetype partitioning, the > is not specified.
: Specifies which array dimension to partition. Specified as an integer from 1 to .SDAccelenvironment supports arrays of N dimensions and can partition the array on any single dimension.

Example 1

For example, consider the following array declaration:

int buffer[16];

The integer array, named buffer, stores 16 values that are 32-bits wide each. Cyclic partitioning can be applied to this array with the following declaration:

int buffer[16] __attribute__((xcl_array_partition(cyclic,4,1)));

In this example, the cyclicattribute tells theSDAccelcompiler to distribute the contents of the array among four physical memories. This attribute increases the immediate memory bandwidth for operations accessing the array buffer by a factor of four.

All arrays inside of a compute unit in the context of theSDAccelenvironment are capable of sustaining a maximum of two concurrent accesses. By dividing the original array in the code into four physical memories, the resulting compute unit can sustain a maximum of eight concurrent accesses to the array buffer.

Example 2

Using the same integer array as found in Example 1, block partitioning can be applied to the array with the following declaration:

int buffer[16] __attribute__((xcl_array_partition(block,4,1)));

Because the size of the block is four, theSDAccelcompiler will generate four physical memories, sequentially filling each memory with data from the array.

Example 3

Using the same integer array as found in Example 1, complete partitioning can be applied to the array with the following declaration:

int buffer[16] __attribute__((xcl_array_partition(complete, 1)));

In this example, the array is completely partitioned into distributed RAM, or 16 independent registers in the programmable logic of the kernel. Because complete is the default, the same effect can also be accomplished with the following declaration:

int buffer[16] __attribute__((xcl_array_partition));

While this creates an implementation with the highest possible memory bandwidth, it is not suited to all applications. The way in which data is accessed by the kernel code through either constant or data dependent indexes affects the amount of supporting logic that theSDAccelcompiler has to build around each register to ensure functional equivalence with the usage in the original code. As a general best practice guideline for theSDxenvironment, the complete partitioning attribute is best suited for arrays in which at least one dimension of the array is accessed through the use of constant indexes.

xcl_array_reshape

Description

IMPORTANT:Array variables only accept one attribute. While the XCL_ARRAY_RESHAPE attribute does support multi-dimensional arrays, you can only reshape one dimension of the array with a single attribute.

This attribute combines array partitioning with vertical array mapping.

The XCL_ARRAY_RESHAPE attribute combines the effect of XCL_ARRAY_PARTITION, breaking an array into smaller arrays, and concatenating elements of arrays by increasing bit-widths. This reduces the number of block RAM consumed while providing parallel access to the data. This attribute creates a new array with fewer elements but with greater bit-width, allowing more data to be accessed in a single clock cycle.

Given the following code:

void foo (...) { int array1[N] __attribute__((xcl_array_reshape(block, 2, 1))); int array2[N] __attribute__((xcl_array_reshape(cycle, 2, 1))); int array3[N] __attribute__((xcl_array_reshape(complete, 1))); ... }

The ARRAY_RESHAPE attribute transforms the arrays into the form shown in the following figure:

Syntax

Place the attribute with the definition of the array variable:

__attribute__((xcl_array_reshape(,, )))

Where:

: Specifies one of the following partition types:
- cyclic: Cyclic partitioning is the implementation of an array as a set of smaller physical memories that can be accessed simultaneously by the logic in the compute unit. The array is partitioned cyclically by putting one element into each memory before coming back to the first memory to repeat the cycle until the array is fully partitioned.
- block: Block partitioning is the physical implementation of an array as a set of smaller memories that can be accessed simultaneously by the logic inside of the compute unit. In this case, each memory block is filled with elements from the array before moving on to the next memory.
- complete: Complete partitioning decomposes the array into individual elements. For a one-dimensional array, this corresponds to resolving a memory into individual registers. The defaultiscomplete.
: For cyclic type partitioning, thespecifies how many physical memories to partition the original array into in the kernel code. For Block type partitioning, thespecifies the number of elements from the original array to store in each physical memory.
IMPORTANT:For completetype partitioning, the should not be specified.
: Specifies which array dimension to partition. Specified as an integer from 1 to .SDAccelenvironment supports arrays of dimensions and can partition the array on any single dimension.

Example 1

Reshapes (partition and maps) an 8-bit array with 17 elements, AB[17], into a new 32-bit array with five elements using block mapping.

int AB[17] __attribute__((xcl_array_reshape(block,4,1)));

TIP:A of 4 indicates that the array should be divided into four. As a result, the 17 elements are reshaped into an array of five elements, with four times the bit-width. In this case, the last element, AB[17], is mapped to the lower eight bits of the fifth element, and the rest of the fifth element is empty.

Example 2

Reshapes the two-dimensional array AB[6][4] into a new array of dimension [6][2], in which dimension 2 has twice the bit-width:

int AB[6][4] __attribute__((xcl_array_reshape(block,2,2)));

Example 3

Reshapes the three-dimensional 8-bit array, AB[4][2][2] in functionfoo, into a new single element array (a register), 128 bits wide (4*2*2*8):

int AB[4][2][2] __attribute__((xcl_array_reshape(complete,0)));

TIP:A of 0 means to reshape all dimensions of the array.

xcl_latency

Description

The XCL_LATENCY attribute specifies a minimum, or maximum latency value, or both, for the completion of functions, loops, and regions. Latency is defined as the number of clock cycles required to produce an output. Function or region latency is the number of clock cycles required for the code to compute all output values, and return. Loop latency is the number of cycles to execute all iterations of the loop. See "Performance Metrics Example" ofVivado Design Suite User Guide: High-Level Synthesis(UG902).

TheVivadoHigh-Level Synthesis (HLS) tool always tries to minimize latency in the design. When the XCL_LATENCY attribute is specified, the tool behavior is as follows:

When latency is greater than the minimum, or less than the maximum: The constraint is satisfied. No further optimizations are performed.
When latency is less than the minimum: If the HLS tool can achieve less than the minimum specified latency, it extends the latency to the specified value, potentially increasing sharing.
When latency is greater than the maximum: If the HLS tool cannot schedule within the maximum limit, it increases effort to achieve the specified constraint. If it still fails to meet the maximum latency, it issues a warning, and produces a design with the smallest achievable latency in excess of the maximum.

TIP:You can also use the XCL_LATENCY attribute to limit the efforts of the tool to find a optimum solution. Specifying latency constraints for scopes within the code: loops, functions, or regions, reduces the possible solutions within that scope and improves tool runtime. Refer to "Improving Run Time and Capacity" of Vivado Design Suite User Guide: High-Level Synthesis(UG902)for more information.

Syntax

Assign the XCL_LATENCY attribute before the body of the function, loop, or region:

__attribute__((xcl_latency(min, max)))

Where:

: Specifies the minimum latency for the function, loop, or region of code.
: Specifies the maximum latency for the function, loop, or region of code.

Example 1

Theforloop in thetestfunction is specified to have a minimum latency of 4 and a maximum latency of 8:

__kernel void test(__global float *A, __global float *B, __global float *C, int id) { for (unsigned int i = 0; i < id; i++) __attribute__((xcl_latency(4, 12))) { C[id] = A[id] * B[id]; } }

xcl_latency

Description

TheVivadoHigh-Level Synthesis (HLS) tool always tries to minimize latency in the design. When the XCL_LATENCY attribute is specified, the tool behavior is as follows:

When latency is greater than the minimum, or less than the maximum: The constraint is satisfied. No further optimizations are performed.
When latency is less than the minimum: If the HLS tool can achieve less than the minimum specified latency, it extends the latency to the specified value, potentially increasing sharing.
When latency is greater than the maximum: If the HLS tool cannot schedule within the maximum limit, it increases effort to achieve the specified constraint. If it still fails to meet the maximum latency, it issues a warning, and produces a design with the smallest achievable latency in excess of the maximum.

Syntax

Assign the XCL_LATENCY attribute before the body of the function, loop, or region:

__attribute__((xcl_latency(min, max)))

Where:

: Specifies the minimum latency for the function, loop, or region of code.
: Specifies the maximum latency for the function, loop, or region of code.

Example 1

Theforloop in thetestfunction is specified to have a minimum latency of 4 and a maximum latency of 8:

__kernel void test(__global float *A, __global float *B, __global float *C, int id) { for (unsigned int i = 0; i < id; i++) __attribute__((xcl_latency(4, 12))) { C[id] = A[id] * B[id]; } }

xcl_loop_tripcount

Description

The XCL_LOOP_TRIPCOUNT attribute can be applied to a loop to manually specify the total number of iterations performed by the loop.

IMPORTANT:The XCL_LOOP_TRIPCOUNT attribute is for analysis only, and does not impact the results of synthesis.

TheVivadoHigh-Level Synthesis (HLS) reports the total latency of each loop, which is the number of clock cycles to execute all iterations of the loop. The loop latency is therefore a function of the number of loop iterations, or tripcount.

The tripcount can be a constant value. It may depend on the value of variables used in the loop expression (for example,x), or depend on control statements used inside the loop. In some cases, the HLS tool cannot determine the tripcount, and the latency is unknown. This includes cases in which the variables used to determine the tripcount are:


             
              Input arguments, or
              Variables calculated by dynamic operation.
             
             In cases where the loop latency is unknown or cannot be calculated, the XCL_LOOP_TRIPCOUNT attribute lets you specify minimum, maximum, and average iterations for a loop. This lets the tool analyze how the loop latency contributes to the total design latency in the reports, and helps you determine appropriate optimizations for the design.


            
             Syntax
             Place the attribute in theOpenCLsource before the loop declaration:
             __attribute__((xcl_loop_tripcount(, , )))
             Where:
             
              : Specifies the minimum number of loop iterations.
              : Specifies the maximum number of loop iterations.
              : Specifies the average number of loop iterations.
             
            
            
             Examples
             In this example the WHILE loop in functionfis specified to have a minimum tripcount of 2, a maximum tripcount of 64, and an average tripcount of 33:
             __kernel void f(__global int *a) { unsigned i = 0; __attribute__((xcl_loop_tripcount(2, 64, 33))) while(i < 64) { a[i] = i; i++; } }
            
            
             See Also
             
              pragma HLS loop_tripcount
              Vivado Design Suite User Guide: High-Level Synthesis(UG902)


          
           xcl_max_work_group_size
           
            
             Description
             Use this attribute instead of REQD_WORK_GROUP_SIZE when you need to specify a larger kernel than the 4K size.
             Extends the default maximum work group size supported in theSDxenvironment by thereqd_work_group_sizeattribute.SDxenvironment supports work size larger than 4096 with the XCL_MAX_WORK_GROUP_SIZE attribute.
             
              Note:The actual workgroup size limit is dependent on the
              Xilinxdevice selected for the platform.
             
            
            
             Syntax
             Place this attribute before the kernel definition, or before the primary function specified for the kernel:
             __attribute__((xcl_max_work_group_size(, , )))
             Where:
             
              , , : Specifies the ND range of the kernel. This represents each dimension of a three dimensional matrix specifying the size of the work-group for the kernel.
             
            
            
             Example 1
             Below is the kernel source code for an un-optimized adder. No attributes were specified for this design, other than the work size equal to the size of the matrices (for example, 64x64). That is, iterating over an entire workgroup will fully add the input matrices, a and b, and output the result. All three are global integer pointers, which means each value in the matrices is four bytes, and is stored in off-chip DDR global memory.
             #define RANK 64 __kernel __attribute__ ((reqd_work_group_size(RANK, RANK, 1))) void madd(__global int* a, __global int* b, __global int* output) { int index = get_local_id(1)*get_local_size(0) + get_local_id(0); output[index] = a[index] + b[index]; }
             This local work size of (64, 64, 1) is the same as the global work size. It should be noted that this setting creates a total work size of 4096.
             
              Note:This is the largest work size that
              SDAccelenvironment supports with the standard
              OpenCLattribute REQD_WORK_GROUP_SIZE.
              SDAccelenvironment supports work size larger than 4096 with the
              Xilinxattribute
              xcl_max_work_group_size.
             
             Any matrix larger than 64x64 would need to only use one dimension to define the work size. That is, a 128x128 matrix could be operated on by a kernel with a work size of (128, 1, 1), where each invocation operates on an entire row, or column of data.
            
            
             See Also
             
              SDAccel Environment Profiling and Optimization Guide(UG1207)
              https://www.khronos.org/
              The OpenCL C Specification
             
            
           
          
          
           xcl_pipeline_loop
           
            
             Description
             You can pipeline a loop to improve latency and maximize kernel throughput and performance.
             Although unrolling loops increases concurrency, it does not address the issue of keeping all elements in a kernel data path busy at all times. Even in an unrolled case, loop control dependencies can lead to sequential behavior. The sequential behavior of operations results in idle hardware and a loss of performance.
             Xilinxaddresses this issue by introducing a vendor extension on top of theOpenCL2.0 API specification for loop pipelining using the XCL_PIPELINE_LOOP attribute.
             By default, the XOCC compiler automatically pipelines loops with a trip count more than 64, or unrolls loops with a trip count less than 64. This should provide good results. However, you can choose to pipeline loops (instead of the automatic unrolling) by explicitly specifying the NOUNROLL attribute and XCL_PIPELINE_LOOP attribute before the loop.
            
            
             Syntax
             Place the attribute in theOpenCLsource before the loop definition:
             __attribute__((xcl_pipeline_loop()))
             Where:
             
              : Specifies the desired initiation interval (II) for the pipeline. TheVivadoHigh-Level Synthesis (HLS) tool tries to meet this request, however, based on data dependencies the loop might have a larger initiation interval. When the II is not specified, the default is 1.
             
            
            
             Example 1
             The following example specifies an II target of 3 for theforloop in the specified function:
             __kernel void f(__global int *a) { __attribute__((xcl_pipeline_loop(3))) for (unsigned i = 0; i < 64; ++i) a[i] = i; }
            
            
             See Also
             
              pragma HLS pipeline
              SDAccel Environment Profiling and Optimization Guide(UG1207)
              Vivado Design Suite User Guide: High-Level Synthesis(UG902)
             
            
           
          
          
           xcl_pipeline_workitems
           
            
             Description
             Pipeline a work item to improve latency and throughput. Work item pipelining is the extension of loop pipelining to the kernel work group. This is necessary for maximizing kernel throughput and performance.
            
            
             Syntax
             Place the attribute in theOpenCLAPI source before the elements to pipeline:
             __attribute__((xcl_pipeline_workitems))
            
            
             Example 1
             In order to handle thereqd_work_group_sizeattribute in the following example,SDAcceltool automatically inserts a loop nest to handle the three-dimensional characteristics of the ND range (3,1,1). As a result of the added loop nest, the execution profile of this kernel is like an unpipelined loop. Adding the XCL_PIPELINE_WORKITEMS attribute adds concurrency and improves the throughput of the code.
             kernel __attribute__ ((reqd_work_group_size(3,1,1))) void foo(...) { ... __attribute__((xcl_pipeline_workitems)) { int tid = get_global_id(0); op_Read(tid); op_Compute(tid); op_Write(tid); } ... }
            
            
             Example 2
             The following example adds the work-item pipeline to the appropriate elements of the kernel:
             __kernel __attribute__ ((reqd_work_group_size(8, 8, 1))) void madd(__global int* a, __global int* b, __global int* output) { int rank = get_local_size(0); __local unsigned int bufa[64]; __local unsigned int bufb[64]; __attribute__((xcl_pipeline_workitems)) { int x = get_local_id(0); int y = get_local_id(1); bufa[x*rank + y] = a[x*rank + y]; bufb[x*rank + y] = b[x*rank + y]; } barrier(CLK_LOCAL_MEM_FENCE); __attribute__((xcl_pipeline_workitems)) { int index = get_local_id(1)*rank + get_local_id(0); output[index] = bufa[index] + bufb[index]; } }
            
            
             See Also
             
              pragma HLS pipeline
              SDAccel Environment Profiling and Optimization Guide(UG1207)
              Vivado Design Suite User Guide: High-Level Synthesis(UG902)
             
            
           
          
          
           xcl_reqd_pipe_depth
           
            
             Description
             
              IMPORTANT:Pipes must be declared in lower case alphanumerics. In addition,
              printf()is not supported with variables used in pipes.
             
             TheOpenCLframework 2.0 specification introduces a new memory object called pipe. A pipe stores data organized as a FIFO. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency.
             
              In the
              SDAcceldevelopment environment, pipes must be statically defined outside of all kernel functions:. The depth of a pipe must be specified by using the XCL_REQD_PIPE_DEPTH attribute in the pipe declaration:
              pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
             
             Pipes can only be accessed using standardOpenCLread_pipe()andwrite_pipe()built-in functions in non-blocking mode, or usingXilinxextendedread_pipe_block()andwrite_pipe_block()functions in blocking mode.
             
              IMPORTANT:A given pipe can have one and only one producer and consumer in different kernels.
             
             Pipe objects are not accessible from the host CPU. The status of pipes can be queried usingOpenCLget_pipe_num_packets()andget_pipe_max_packets()built-in functions. SeeThe OpenCL C Specificationfrom KhronosOpenCLWorking Group for more details on these built-in functions.
            
            
             Syntax
             This attribute must be assigned at the declaration of the pipe object:
             pipe int  __attribute__((xcl_reqd_pipe_depth()));
             Where:
             
              : Specifies an identifier for the pipe, which must consist of lower-case alphanumerics. For example  not .
              : Specifies the depth of the pipe. Valid depth values are 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768.
             
            
            
             Examples
             The following is thedataflow_pipes_oclexample fromXilinx GitHubthat use pipes to pass data from one processing stage to the next using blockingread_pipe_block()andwrite_pipe_block()functions:
             pipe int p0 __attribute__((xcl_reqd_pipe_depth(32))); pipe int p1 __attribute__((xcl_reqd_pipe_depth(32))); // Input Stage Kernel : Read Data from Global Memory and write into Pipe P0 kernel __attribute__ ((reqd_work_group_size(1, 1, 1))) void input_stage(__global int *input, int size) { __attribute__((xcl_pipeline_loop)) mem_rd: for (int i = 0 ; i < size ; i++) { //blocking Write command to pipe P0 write_pipe_block(p0, &input[i]); } } // Adder Stage Kernel: Read Input data from Pipe P0 and write the result // into Pipe P1 kernel __attribute__ ((reqd_work_group_size(1, 1, 1))) void adder_stage(int inc, int size) { __attribute__((xcl_pipeline_loop)) execute: for(int i = 0 ; i < size ; i++) { int input_data, output_data; //blocking read command to Pipe P0 read_pipe_block(p0, &input_data); output_data = input_data + inc; //blocking write command to Pipe P1 write_pipe_block(p1, &output_data); } } // Output Stage Kernel: Read result from Pipe P1 and write the result to // Global Memory kernel __attribute__ ((reqd_work_group_size(1, 1, 1))) void output_stage(__global int *output, int size) { __attribute__((xcl_pipeline_loop)) mem_wr: for (int i = 0 ; i < size ; i++) { //blocking read command to Pipe P1 read_pipe_block(p1, &output[i]); } }
            
            
             See Also
             
              SDAccel Environment Profiling and Optimization Guide(UG1207)
              https://www.khronos.org/
              The OpenCL C Specification
             
            
           
          
          
           xcl_zero_global_work_offset
           
            
             Description
             If you useclEnqueueNDRangeKernelwith theglobal_work_offsetset to NULL or all zeros, you can use this attribute to tell the compiler that theglobal_work_offsetis always zero.
             This attribute can improve memory performance when you have memory accesses like:
             A[get_global_id(x)] = ...;
             
              Note:You can specify REQD_WORK_GROUP_SIZE, VEC_TYPE_HINT, and XCL_ZERO_GLOBAL_WORK_OFFSET together to maximize performance.
             
            
            
             Syntax
             Place this attribute before the kernel definition, or before the primary function specified for the kernel:
             __kernel __attribute__((xcl_zero_global_work_offset)) void test (__global short *input, __global short *output, __constant short *constants) { }
            
            
             See Also
             
              reqd_work_group_size
              vec_type_hint
              clEnqueueNDRangeKernel
              SDAccel Environment Profiling and Optimization Guide(UG1207)