Algorithm With Optimal Data Access Patterns

The key to implementing the convolution example reviewed in the previous section as a high-performance design with minimal resources is to:

Maximize the flow of data through the system. Refrain from using any coding techniques or algorithm behavior that inhibits the continuous flow of data.
Maximize the reuse of data. Use local caches to ensure there are no requirements to re-read data and the incoming data can keep flowing.
Embrace conditional branching. This is expensive on a CPU, GPU, or DSP but optimal in an FPGA.
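
As a rough illustration of the data-reuse point, the following sketch applies a K-tap vertical filter to pixels arriving in raster-scan order. It keeps the previous K-1 rows in a small local line buffer, so each input pixel is read exactly once. This is plain C++ rather than synthesizable HLS code, and the names vconv_linebuf, linebuf, and reads are hypothetical, introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original design): K-tap vertical filter
// over an image stored in raster-scan order. The "reads" counter is only
// there to show that every input pixel is fetched exactly once.
template <typename T, int K>
std::vector<T> vconv_linebuf(const std::vector<T>& in, int width, int height,
                             const T (&coeff)[K], std::size_t& reads) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);

    // Local cache: the K-1 most recent rows, oldest row at index 0.
    std::vector<std::vector<T>> linebuf(K - 1, std::vector<T>(width, T(0)));
    std::vector<T> out;
    reads = 0;

    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            T in_val = in[row * width + col];  // the only read of this pixel
            ++reads;
            T acc = 0;
            for (int i = 0; i < K; ++i) {
                // Older samples come from the cache, the newest from the input.
                T v = (i < K - 1) ? linebuf[i][col] : in_val;
                acc += v * coeff[i];
                // Shift this column of the cache up by one row.
                if (i > 0) linebuf[i - 1][col] = v;
            }
            // Conditional branch: emit only once K rows have been seen.
            if (row >= K - 1) out.push_back(acc);
        }
    }
    return out;
}
```

The loop nest never stalls to re-fetch an earlier row. The cache absorbs the reuse, and the single conditional decides whether a result is valid yet.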

The first step is to understand how data flows through the system into and out of the FPGA. The convolution algorithm is performed on an image. When data from an image is produced and consumed, it is transferred in a standard raster-scan manner as shown in the following figure.

If the data is transferred to the FPGA in a streaming manner, the FPGA should process it in a streaming manner and transfer it back from the FPGA in the same manner.
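
A minimal sketch of this stream-in, stream-out style is shown below. It assumes a std::queue as a software stand-in for a hardware stream. The Stream alias, the scale_strm function, and the gain parameter are hypothetical and are not part of the design described here. The stage consumes one pixel per iteration in raster-scan order and produces one pixel in the same order, without ever holding the whole image.

```cpp
#include <cassert>
#include <queue>

// Assumption: a FIFO queue models a stream that is read and written in order.
template <typename T>
using Stream = std::queue<T>;

// Hypothetical pixel-wise stage: reads each pixel once, in raster-scan order,
// and writes the result out in the same order.
template <typename T>
void scale_strm(int width, int height, Stream<T>& src, Stream<T>& dst, T gain) {
    assert(width > 0 && height > 0);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            T pix = src.front();   // consume the next pixel from the input stream
            src.pop();
            dst.push(pix * gain);  // produce exactly one output pixel
        }
    }
}
```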

The convolution algorithm shown below embraces this style of coding. At this level of abstraction a concise view of the code is shown. However, there are now intermediate buffers, hconv and vconv, between each loop. Because these are accessed in a streaming manner, they are optimized into single registers in the final implementation.

    template<typename T, int K>
    static void convolution_strm(
        int width,
        int height,
        T src[TEST_IMG_ROWS][TEST_IMG_COLS],
        T dst[TEST_IMG_ROWS][TEST_IMG_COLS],
        const T *hcoeff,
        const T *vcoeff)
    {
        T hconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
        T vconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
        T *phconv, *pvconv;

        // These assertions let HLS know the upper bounds of loops
        assert(height < MAX_IMG_ROWS);
        assert(width < MAX_IMG_COLS);
        assert(vconv_xlim < MAX_IMG_COLS - (K - 1));

        // Horizontal convolution
        HConvH:for(int col = 0; col < height; col++) {
            HConvW:for(int row = 0; row < width; row++) {
                HConv:for(int i = 0; i < K; i++) {
                }
            }
        }

        // Vertical convolution
        VConvH:for(int col = 0; col < height; col++) {
            VConvW:for(int row = 0; row < vconv_xlim; row++) {
                VConv:for(int i = 0; i < K; i++) {
                }
            }
        }

        Border:for (int i = 0; i < height; i++) {
            for (int j = 0; j < width; j++) {
            }
        }
    }

All three processing loops now embrace conditional branching to ensure the continuous processing of data.
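
One way a loop body might use conditional branching to keep data flowing is sketched below for the horizontal pass. The listing above elides the loop bodies, so this is an illustrative assumption rather than the original implementation. A std::queue stands in for the stream, and the names Stream, hconv_strm, and hwin are hypothetical. Every iteration reads one pixel and shifts it into a K-element window, and a conditional decides whether the window is full enough to emit a result.

```cpp
#include <cassert>
#include <queue>

// Assumption: a FIFO queue models a stream that is read and written in order.
template <typename T>
using Stream = std::queue<T>;

// Hypothetical horizontal K-tap filter over a raster-scan pixel stream.
template <typename T, int K>
void hconv_strm(int width, int height, Stream<T>& src, Stream<T>& hconv,
                const T (&hcoeff)[K]) {
    assert(width >= K && height > 0);
    T hwin[K] = {};  // sliding window holding the K most recent pixels

    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            T in_val = src.front();  // exactly one read per iteration
            src.pop();
            T out_val = 0;
            for (int i = 0; i < K; ++i) {
                // Shift the window left; the newest pixel enters at the end.
                hwin[i] = (i < K - 1) ? hwin[i + 1] : in_val;
                out_val += hwin[i] * hcoeff[i];
            }
            // Conditional branch: write only when the window lies fully
            // inside the current row. The input is consumed regardless.
            if (col >= K - 1) hconv.push(out_val);
        }
    }
}
```

In this sketch the read is unconditional and only the write is guarded, so the input stream is never blocked while the window refills at the start of each row.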