Optimal Horizontal Convolution

To perform the calculation in a more efficient manner for FPGA implementation, the horizontal convolution is computed as shown in the following figure.



The algorithm must use the K previous samples to compute the convolution result. It therefore copies each sample into a temporary cache, hwin. This use of local storage means there is no need to re-read values from the PS and interrupt the flow of data. For the first few calculations, hwin does not yet hold enough values to compute a result, so the write is conditional and no output values are written.

The algorithm keeps reading input samples and caching them into hwin. Each time it reads a new sample, it pushes the oldest, no longer needed sample out of hwin. The first output value can be written once the Kth input has been read. The algorithm proceeds in this manner along the rows until the final sample has been read. At that point, only the last K samples are stored in hwin, which is all that is required to compute the convolution.
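The window behavior described above can be sketched for a single row in plain C++. This is a minimal software model under stated assumptions, not the HLS source: the function name hconv_row, the int sample type, and the value K = 3 are all hypothetical choices made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical window size for this sketch; the text only calls it K.
constexpr int K = 3;

// Streams one row through a K-entry window and returns the outputs written.
// hconv_row is an illustrative name, not part of the original design.
std::vector<int> hconv_row(const std::vector<int>& row, const int (&coeff)[K]) {
    int hwin[K] = {0};  // local cache of the last K samples
    std::vector<int> out;
    for (std::size_t x = 0; x < row.size(); ++x) {
        int acc = 0;
        for (int i = 0; i < K; ++i) {
            // Shift the window by one position and append the new sample.
            hwin[i] = (i < K - 1) ? hwin[i + 1] : row[x];
            acc += hwin[i] * coeff[i];
        }
        // The window is full only once the Kth sample has been read.
        if (x + 1 >= static_cast<std::size_t>(K)) {
            out.push_back(acc);
        }
    }
    return out;
}
```

With these assumptions, a row of N samples yields N - (K - 1) outputs, and a row shorter than K yields none, because the conditional write never fires.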

As shown below, the code to perform these operations uses both local storage to prevent re-reads from the PL – the reads from local storage can be performed in parallel in the final implementation – and the extensive use of conditional branching to ensure each new data sample can be processed in a different manner.

// Horizontal convolution
phconv = hconv_buffer; // set / reset pointer to start of buffer
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
HConvH: for (int col = 0; col < height; col++) {
  HConvW: for (int row = 0; row < width; row++) {
#pragma HLS PIPELINE
    T in_val = *src++;
    // Reset pixel value on-the-fly - eliminates an O(height*width) loop
    T out_val = 0;
    HConv: for (int i = 0; i < K; i++) {
      hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
      out_val += hwin[i] * hcoeff[i];
    }
    if (row >= K - 1) {
      *phconv++ = out_val;
    }
  }
}

An interesting point to note in the code above is the use of the temporary variable out_val to perform the convolution calculation. This variable is set to zero before each calculation is performed, removing the need to spend two million clock cycles resetting the values, as in the previous example.

Throughout the entire process, the samples in the src input are processed in a raster-streaming manner. Every sample is read in turn. The outputs from the task are either discarded or used, but the task keeps computing continuously. This differs from code written to run on a CPU.
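To make the raster-streaming behavior concrete, the following software model counts how many samples are read and how many results are discarded. It is only a sketch: StreamStats, hconv_stream, the int sample type, and K = 3 are hypothetical names and values introduced here, and the window is assumed to start out zeroed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical window size for this sketch; the text only calls it K.
constexpr int K = 3;

// Illustrative bookkeeping, not part of the original design.
struct StreamStats {
    std::vector<int> out;  // results that were actually written
    int reads = 0;         // samples consumed from the input stream
    int discarded = 0;     // results computed but not written
};

// Software model of the streaming loop: every sample is read exactly once
// and a result is computed every iteration, but the write is conditional.
StreamStats hconv_stream(const std::vector<int>& src, int height, int width,
                         const int (&hcoeff)[K]) {
    StreamStats s;
    int hwin[K] = {0};
    std::size_t next = 0;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int in_val = src[next++];
            ++s.reads;
            int out_val = 0;  // reset on the fly, no separate clearing loop
            for (int i = 0; i < K; ++i) {
                hwin[i] = (i < K - 1) ? hwin[i + 1] : in_val;
                out_val += hwin[i] * hcoeff[i];
            }
            if (x >= K - 1) {
                s.out.push_back(out_val);
            } else {
                ++s.discarded;  // window not yet filled from the current row
            }
        }
    }
    return s;
}
```

In this model the window is never cleared between rows. That is safe because the first K - 1 results of each row, which mix samples from two rows, are exactly the ones that are discarded.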