Data Flow Pipelining

The previously discussed optimization techniques are all "fine grain" parallelizing optimizations at the level of operators, such as multipliers, adders, and memory load/store operations; they optimize the parallelism between these operators. Data flow pipelining, on the other hand, exploits "coarse grain" parallelism at the level of functions and loops. Data flow pipelining can increase the concurrency between functions and loops.

Function Data Flow Pipelining

The default behavior for a series of function calls in Vivado HLS is to complete a function before starting the next function. Part (A) in the following figure shows the latency without function data flow pipelining. Assuming it takes eight cycles for the three functions to complete, the code requires eight cycles before a new input can be processed by "func_A" and also eight cycles before an output is written by "func_C" (assume the output is written at the end of "func_C").

Figure: Function Data Flow Pipelining



An example execution with data flow pipelining is shown in part (B) of the figure above. Assuming the execution of func_A takes three cycles, func_A can begin processing a new input every three clock cycles rather than waiting for all three functions to complete, resulting in increased throughput. The complete execution to produce an output then requires only five clock cycles, resulting in shorter overall latency.

Vivado HLS implements function data flow pipelining by inserting "channels" between the functions. These channels are implemented as either ping-pong buffers or FIFOs, depending on the access patterns of the producer and the consumer of the data.
  • If a function parameter (producer or consumer) is an array, the corresponding channel is implemented as a multi-buffer using standard memory accesses (with associated address and control signals).
  • For scalar, pointer and reference parameters as well as the function return, the channel is implemented as a FIFO, which uses less hardware resources (no address generation) but requires that the data is accessed sequentially.
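As a sketch of the two channel types (the function names, data type, and array size below are hypothetical, not from the original text): an internal array such as buf would be implemented as a multi-buffer between the two functions, while a scalar result such as sum would map to a FIFO. Note that both the array and the scalar are written and then read strictly in order, which is what a FIFO channel requires.

```cpp
#include <cassert>

#define N 8

// Producer fills an array channel; in Vivado HLS this channel would be
// implemented as a ping-pong (multi-) buffer between the two functions.
static void produce(const int in[N], int buf[N]) {
    for (int i = 0; i < N; i++) buf[i] = in[i] * 2;
}

// Consumer reads the channel sequentially; the scalar written through
// &sum would instead be implemented as a FIFO channel.
static void consume(const int buf[N], int &sum) {
    sum = 0;
    for (int i = 0; i < N; i++) sum += buf[i];
}

void region(const int in[N], int &sum) {
#pragma HLS dataflow
    int buf[N]; // channel inserted by Vivado HLS between the two functions
    produce(in, buf);
    consume(buf, sum);
}
```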
To use function data flow pipelining, put #pragma HLS dataflow where the data flow optimization is desired. The following code snippet shows an example:
    void top(a, b, c, d) {
    #pragma HLS dataflow
        func_A(a, b, i1);
        func_B(c, i1, i2);
        func_C(i2, d);
    }
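The snippet above omits types and function bodies. A complete, compilable version might look as follows; the int data type and the arithmetic in the three function bodies are assumptions added for illustration, not part of the original example:

```cpp
#include <cassert>

// Hypothetical bodies for the three chained functions.
// In Vivado HLS, i1 and i2 become the channels between them.
static void func_A(int a, int b, int &i1) { i1 = a + b; }
static void func_B(int c, int i1, int &i2) { i2 = c * i1; }
static void func_C(int i2, int &d)        { d  = i2 - 1; }

void top(int a, int b, int c, int &d) {
#pragma HLS dataflow
    int i1, i2;
    func_A(a, b, i1);
    func_B(c, i1, i2);
    func_C(i2, d);
}
```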

Loop Data Flow Pipelining

Data flow pipelining can also be applied to loops in a similar manner as it is applied to functions. It enables a sequence of loops, normally executed sequentially, to execute concurrently. Data flow pipelining should be applied to a function, loop, or region which contains either all functions or all loops: do not apply it to a scope which contains a mixture of loops and functions.

The following figure shows the advantages data flow pipelining can produce when applied to loops. Without data flow pipelining, loop N must execute and complete all iterations before loop M can begin. The same applies to the relationship between loops M and P. In this example, it is eight cycles before loop N can start processing the next value and eight cycles before an output is written (assuming the output is written when loop P finishes).

Figure: Loop Data Flow Pipelining



With data flow pipelining, these loops can operate concurrently. An example execution with data flow pipelining is shown in part (B) of the figure above. Assuming loop M takes three cycles to execute, the code can accept new inputs every three cycles. Similarly, it can produce an output value every five cycles, using the same hardware resources. Vivado HLS automatically inserts channels between the loops to ensure data can flow asynchronously from one loop to the next. As with function data flow pipelining, the channels between the loops are implemented either as multi-buffers or FIFOs.

To use loop data flow pipelining, put #pragma HLS dataflow where the data flow optimization is desired.
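A minimal sketch of this pattern is shown below, with three chained loops corresponding to loops N, M, and P in the figure. The array size and the arithmetic inside each loop are assumptions for illustration; the internal arrays tmp1 and tmp2 are the channels Vivado HLS would insert between the loops.

```cpp
#include <cassert>

#define N 8

// Three sequential loops inside one function; with #pragma HLS dataflow
// each loop can start consuming data as soon as the previous loop
// produces it, rather than waiting for all iterations to finish.
void loops(const int in[N], int out[N]) {
#pragma HLS dataflow
    int tmp1[N], tmp2[N]; // channels between the loops

loop_N:
    for (int i = 0; i < N; i++) tmp1[i] = in[i] + 1;
loop_M:
    for (int i = 0; i < N; i++) tmp2[i] = tmp1[i] * 2;
loop_P:
    for (int i = 0; i < N; i++) out[i] = tmp2[i] - 3;
}
```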