Unrolling Loops

Loop unrolling is the first optimization technique available in the SDAccel compiler. The purpose of the loop unroll optimization is to expose concurrency to the compiler. This is an official attribute in the OpenCL 2.0 specification.

For example, starting with the code:

#define LENGTH 64 __kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) krnl_vmult( __global int* a, __global int* b, __global int* c) { local int bufa[LENGTH]; local int bufb[LENGTH]; local int bufc[LENGTH]; for(int i = 0; i < LENGTH; i++) { bufa[i] = a[i]; bufb[i] = b[i]; } for(int i = 0; i < LENGTH; i++) { bufc[i] = bufa[i] * bufb[i]; } for(int i = 0; i < LENGTH; i++) { c[i] = bufc[i]; } return; }

This kernel multiplies two integer vectors,aandb. The length of the vectors is 64. Since we want to isolate the performance of the for loop, we first read the two vectors into local. Also, a third local memory is used to store the output vector,c, so all data in the for loop uses local memories. Once the loop is completed, the entire output vector is written back to DDR.

Below is the latency and area estimate after running xocc with--report estimateoption with the baseline design without any loop unroll attribute. Note that in this particular example, the actual computation only uses one 32-bit integer multiplier (4 DSPs) and takes 64 cycles to complete.

The performance of the vector multiplier can be improved by using theopencl_unroll_hintattribute with an unroll factor of 2:

__attribute__((opencl_unroll_hint(2))) for(int i = 0; i < LENGTH; i++) { bufc[i] = bufa[i] * bufb[i]; }

The code above tells SDAccel to unroll the loop by a factor of two. Conceptually the compiler transforms the loop above to the code below:

for(int i = 0; i < LENGTH; i+=2) { bufc[i] = bufa[i] * bufb[i]; bufc[i+1] = bufa[i+1] * bufb[i+1]; }

This results in LENGTH/2 or 32 loop iterations for the compute unit to complete the operation. By enabling SDAccel to reduce the loop iteration count, the programmer has exposed more concurrency to the compiler. This newly exposed concurrency reduces latency and improves performance, but also consumes more FPGA fabric resources. Below is the latency and area estimate for the unroll factor of 2. Note that 2x DSPs are used for the compute unit and the computation is reduced by 32 clock cycles.

Another variety of this attribute is to unroll the loop completely. The syntax for the fully unrolled version of the vector multiplier example is as shown below:

__attribute__((opencl_unroll_hint)) for(int i = 0; i < LENGTH; i++) { bufc[i] = bufa[i] * bufb[i]; }

Due to resource constraints, the full unrolling is appropriate for loops of small or medium length. Large loops may require too many resources to implement on the FPGA device. For larger loops, it is recommended to use loop pipeline (see the Piplining Loops section of this guide).