Loop Pipelining and Loop Unrolling
Both loop pipelining and loop unrolling improve the performance of a hardware function by exploiting parallelism between loop iterations. This section introduces the basic concepts of loop pipelining and loop unrolling, shows example code for applying each technique, and discusses the factors that limit the performance achievable with them.
Loop Pipelining
In sequential languages such as C/C++, the operations in a loop are executed sequentially and the next iteration of the loop can only begin when the last operation in the current loop iteration is complete. Loop pipelining allows the operations in a loop to be implemented in a concurrent manner as shown in the following figure.
Figure: Loop Pipelining
As shown in the above figure, without pipelining there are three clock cycles between the two RD operations, and the entire loop requires six clock cycles to finish. With pipelining, there is only one clock cycle between the two RD operations, and the entire loop requires only four clock cycles; that is, the next iteration of the loop can start before the current iteration has finished.
An important term for loop pipelining is the Initiation Interval (II), which is the number of clock cycles between the start times of consecutive loop iterations. In the above figure, the Initiation Interval is one because there is only one clock cycle between the start times of consecutive loop iterations.
Loop pipelining can be applied by placing
#pragma HLS pipeline
at the beginning of the loop body, as illustrated in the following code snippet. Vivado HLS then tries to pipeline the loop with the minimum Initiation Interval.
for (index_a = 0; index_a < A_NROWS; index_a++) {
    for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
        float result = 0;
        for (index_d = 0; index_d < A_NCOLS; index_d++) {
            float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
            result += product_term;
        }
        out_C[index_a * B_NCOLS + index_b] = result;
    }
}
Loop Unrolling
Loop unrolling is another technique to exploit parallelism between loop iterations: it creates multiple copies of the loop body and adjusts the loop iteration counter accordingly. Consider the following simple loop:

int sum = 0;
for(int i = 0; i < 10; i++) {
    sum += a[i];
}
After the loop is unrolled by a factor of 2, the loop becomes:
int sum = 0;
for(int i = 0; i < 10; i+=2) {
    sum += a[i];
    sum += a[i+1];
}
So unrolling a loop by a factor of N creates N copies of the loop body; the loop variable referenced by each copy is updated accordingly (such as the a[i+1] in the above code snippet), and the loop iteration counter is also updated accordingly (such as the i+=2 in the above code snippet).
Loop unrolling creates more operations in each loop iteration, so that Vivado HLS can exploit more parallelism among these operations. More parallelism means more throughput and higher system performance. If the factor N is less than the total number of loop iterations (10 in the example above), it is called a "partial unroll". If the factor N equals the number of loop iterations, it is called a "full unroll". A "full unroll" requires the loop bounds to be known at compile time, but exposes the most parallelism.
Loop unrolling can be applied by placing
#pragma HLS unroll [factor=N]
at the beginning of the desired loop. Without the optional
factor=N
, the loop is fully unrolled.
int sum = 0;
for(int i = 0; i < 10; i++) {
#pragma HLS unroll factor=2
    sum += a[i];
}
Factors Limiting the Parallelism Achieved by Loop Pipelining and Loop Unrolling
Both loop pipelining and loop unrolling exploit the parallelism between loop iterations. However, this parallelism is limited by two main factors: the data dependencies between loop iterations, and the number of available hardware resources.
A data dependence from an operation in one iteration to an operation in a subsequent iteration is called a loop-carried dependence. It implies that the operation in the subsequent iteration cannot start until the operation in the current iteration has finished computing its data input. Loop-carried dependencies fundamentally limit the Initiation Interval that can be achieved with loop pipelining and the parallelism that can be exploited with loop unrolling.
For example, in the following loop, the next iteration cannot begin until the current iteration has computed the updated values of a and b:

while (a != b) {
    if (a > b)
        a -= b;
    else
        b -= a;
}

Loop-carried dependencies can also arise through memory accesses:

for (i = 1; i < N; i++)
    mem[i] = mem[i-1] + i;
In this case, the next iteration of the loop must wait until the current iteration has updated the content of the array. For loop pipelining, the minimum Initiation Interval is then the total number of clock cycles required for the memory read, the add operation, and the memory write.
Figure: Resource Contention
In the example shown in the above figure, suppose
op_compute
is implemented with a DSP core that cannot accept new inputs every cycle, and only one such DSP core is available. Then
op_compute
cannot be issued to the DSP core each cycle, and an Initiation Interval of one is not possible.