Optimal Vertical Convolution

The vertical convolution represents a challenge to the streaming data model preferred by an FPGA. The data must be accessed by column but you do not wish to store the entire image. The solution is to use line buffers, as shown in the following figure.



Once again, the samples are read in a streaming manner, this time from the local bufferhconv. The algorithm requires at least K-1 lines of data before it can process the first sample. All the calculations performed before this are discarded through the use of conditionals.

A line buffer allows K-1 lines of data to be stored. Each time a new sample is read, another sample is pushed out the line buffer. An interesting point to note here is that the newest sample is used in the calculation, and then the sample is stored into the line buffer and the old sample ejected out. This ensures that only K-1 lines are required to be cached rather than an unknown number of lines, and minimizes the use of local storage. Although a line buffer does require multiple lines to be stored locally, the convolution kernel size K is always much less than the 1080 lines in a full video image.

The first calculation can be performed when the first sample on the Kth line is read. The algorithm then proceeds to output values until the final pixel is read.

// Vertical convolution phconv=hconv_buffer; // set/reset pointer to start of buffer pvconv=vconv_buffer; // set/reset pointer to start of buffer VConvH:for(int col = 0; col < height; col++) { VConvW:for(int row = 0; row < vconv_xlim; row++) { #pragma HLS DEPENDENCE variable=linebuf inter false #pragma HLS PIPELINET in_val = *phconv++;// Reset pixel value on-the-fly - eliminates an O(height*width) loop T out_val = 0; VConv:for(int i = 0; i < K; i++) { T vwin_val = i < K - 1 ? linebuf[i][row] : in_val; out_val += vwin_val * vcoeff[i]; if (i > 0) linebuf[i - 1][row] = vwin_val; } if (col >= K - 1) {*pvconv++ = out_val;} }

The code above once again processes all the samples in the design in a streaming manner. The task is constantly running. Following a coding style where you minimize the number of re-reads (or re-writes) forces you to cache the data locally. This is an ideal strategy when targeting an FPGA.