Loop Pipelining and Loop Unrolling

Both loop pipelining and loop unrolling improve the hardware function's performance by exploiting the parallelism between loop iterations. This section introduces the basic concepts of loop pipelining and loop unrolling, shows example code for applying each technique, and discusses the factors that limit the performance achievable with them.

Loop Pipelining

In sequential languages such as C/C++, the operations in a loop are executed sequentially and the next iteration of the loop can only begin when the last operation in the current loop iteration is complete. Loop pipelining allows the operations in a loop to be implemented in a concurrent manner as shown in the following figure.

Figure: Loop Pipelining



As shown in the above figure, without pipelining there are three clock cycles between the two RD operations, and the entire loop requires six clock cycles to finish. With pipelining, there is only one clock cycle between the two RD operations and the entire loop requires only four clock cycles; that is, the next iteration of the loop can start before the current iteration has finished.

An important term for loop pipelining is the Initiation Interval (II), which is the number of clock cycles between the start times of consecutive loop iterations. In the above figure, the Initiation Interval (II) is one because only one clock cycle separates the start times of consecutive loop iterations.
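A useful back-of-the-envelope estimate (a sketch based on the figure's numbers, not a formula from the original text) is:

    total cycles ≈ (trip count - 1) × II + iteration latency

For the figure above, with two iterations, an iteration latency of three cycles, and II = 1, this gives (2 - 1) × 1 + 3 = 4 cycles, matching the pipelined case; without pipelining, the same loop takes 2 × 3 = 6 cycles.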

To pipeline a loop, put #pragma HLS PIPELINE at the beginning of the loop body, as illustrated in the following code snippet. Vivado HLS then tries to pipeline the loop with the minimum Initiation Interval.
for (index_a = 0; index_a < A_NROWS; index_a++) {
    for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
        float result = 0;
        for (index_d = 0; index_d < A_NCOLS; index_d++) {
            float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
            result += product_term;
        }
        out_C[index_a * B_NCOLS + index_b] = result;
    }
}
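Note that in Vivado HLS, pipelining a loop unrolls all loops nested below it, so the index_d loop above is fully unrolled. As a point of comparison, the following sketch (a hypothetical variant, not from the original example) places the pragma on the innermost loop instead; here the accumulation into result forms a loop-carried dependence (discussed later in this section), so the achievable II is bounded by the latency of the floating-point adder rather than one:

for (index_a = 0; index_a < A_NROWS; index_a++) {
    for (index_b = 0; index_b < B_NCOLS; index_b++) {
        float result = 0;
        for (index_d = 0; index_d < A_NCOLS; index_d++) {
#pragma HLS PIPELINE II=1
            /* result += ... needs the sum produced by the previous
               iteration, so consecutive iterations cannot start every
               cycle; the real II is set by the float add latency. */
            result += in_A[index_a][index_d] * in_B[index_d][index_b];
        }
        out_C[index_a * B_NCOLS + index_b] = result;
    }
}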

Loop Unrolling

Loop unrolling is another technique to exploit parallelism between loop iterations. It creates multiple copies of the loop body and adjusts the loop iteration counter accordingly. The following code snippet shows a normal rolled loop:
int sum = 0;
for (int i = 0; i < 10; i++) {
    sum += a[i];
}
After the loop is unrolled by a factor of 2, the loop becomes:
int sum = 0;
for (int i = 0; i < 10; i += 2) {
    sum += a[i];
    sum += a[i+1];
}
So unrolling a loop by a factor of N creates N copies of the loop body; the loop variable referenced by each copy is updated accordingly (such as a[i+1] in the above code snippet), and the loop iteration counter is updated as well (such as i += 2 in the above code snippet).

Loop unrolling creates more operations in each loop iteration, so that Vivado HLS can exploit more parallelism among these operations. More parallelism means more throughput and higher system performance. If the factor N is less than the total number of loop iterations (10 in the example above), it is called a partial unroll. If the factor N equals the number of loop iterations, it is called a full unroll. A full unroll requires that the loop bound be known at compile time, but it exposes the most parallelism.
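As an illustration (a conceptual sketch of the transformation, not actual tool output), fully unrolling the ten-iteration loop above removes the loop entirely and exposes all ten additions to the scheduler, which can then balance them into an adder tree:

int sum = 0;
/* One copy of the body per iteration; no loop counter remains. */
sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];
sum += a[4];
sum += a[5];
sum += a[6];
sum += a[7];
sum += a[8];
sum += a[9];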

To unroll a loop, put #pragma HLS unroll [factor=N] at the beginning of the desired loop's body. Without the optional factor=N, the loop is fully unrolled.
int sum = 0;
for (int i = 0; i < 10; i++) {
#pragma HLS unroll factor=2
    sum += a[i];
}

Factors Limiting the Parallelism Achieved by Loop Pipelining and Loop Unrolling

Both loop pipelining and loop unrolling exploit the parallelism between loop iterations. However, the parallelism between loop iterations is limited by two main factors: the data dependencies between loop iterations and the number of available hardware resources.

A data dependence from an operation in one iteration to an operation in a subsequent iteration is called a loop-carried dependence. It implies that the operation in the subsequent iteration cannot start until the operation in the current iteration has finished computing the data input for the operation in the subsequent iteration. Loop-carried dependencies fundamentally limit the initiation interval that can be achieved using loop pipelining and the parallelism that can be exploited using loop unrolling.

The following example demonstrates loop-carried dependencies among operations producing and consuming variables a and b.
while (a != b) {
    if (a > b)
        a -= b;
    else
        b -= a;
}
Obviously, operations in the next iteration of this loop cannot start until the current iteration has calculated and updated the values of a and b. Array accesses are a common source of loop-carried dependencies, as shown in the following example:
for (i = 1; i < N; i++)
    mem[i] = mem[i-1] + i;
In this case, the next iteration of the loop must wait until the current iteration has updated the content of the array. In the case of loop pipelining, the minimum Initiation Interval is the total number of clock cycles required for the memory read, the add operation, and the memory write.
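For example (with assumed latencies, not figures from the original): if the memory read takes two cycles, the addition takes one cycle, and the memory write takes one cycle, then the minimum Initiation Interval for this loop is 2 + 1 + 1 = 4 cycles, regardless of any pipeline pragma applied.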
Another performance-limiting factor for loop pipelining and loop unrolling is the number of available hardware resources. The following figure shows an example of the issues created by resource limitations, which in this case prevent the loop from being pipelined with an initiation interval of one.

Figure: Resource Contention



In this example, each iteration of the loop performs two read operations. If the loop is pipelined with an initiation interval of one and the memory has only a single port, the two read operations cannot be executed simultaneously and must be executed in two cycles. The minimum initiation interval is therefore two, as shown in part (B) of the figure. The same can happen with other hardware resources. For example, if op_compute is implemented with a DSP core that cannot accept new inputs every cycle, and there is only one such DSP core, then op_compute cannot be issued to the DSP core each cycle, and an initiation interval of one is not possible.
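The following sketch illustrates the situation in part (B) of the figure (hypothetical code; the names read_twice, op_compute, mem, and out are assumptions for illustration, and op_compute stands in for the computation shown in the figure):

#define N 64

int op_compute(int x, int y) { return x + y; }  /* placeholder computation */

void read_twice(int mem[2 * N], int out[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        /* Two reads of mem in the same iteration: with a single-port
           memory they must be serialized over two cycles, so the
           achievable initiation interval is two, not one. */
        int x = mem[2 * i];
        int y = mem[2 * i + 1];
        out[i] = op_compute(x, y);
    }
}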