Standard Vertical Convolution

The next step is to perform the vertical convolution shown in the following figure.

The process for the vertical convolution is similar to the horizontal convolution. A set of K data samples is required to convolve with the convolution coefficients, Vcoeff in this case. After the first output is created using the first K samples in the vertical direction, the next set of K values is used to create the second output. The process continues down through each column until the final output is created.

After the vertical convolution, the image is now smaller than the source image src due to both the horizontal and vertical border effects.
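The size reduction can be worked out directly. The helper below is a minimal sketch, not part of the original design: the name valid_size is hypothetical, and it assumes K is odd, that the image dimension is at least K, and that border_width is K / 2 as in the loop bounds of the convolution code.

```c
/* Hypothetical helper: number of valid outputs along one dimension
   after convolving with K taps and no padding.
   Assumes K is odd and size >= K. */
int valid_size(int size, int K)
{
    int border_width = K / 2;          /* samples lost on each side */
    return size - 2 * border_width;    /* same as size - K + 1 for odd K */
}
```

For example, with an assumed K of 11, a 1920*1080 source leaves a valid region of 1910*1070 pixels.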

The code for performing these operations is shown below.

    Clear_Dst:for(int i = 0; i < height * width; i++){
      dst[i]=0;
    }
    // Vertical convolution
    VconvH:for(int col = border_width; col < height - border_width; col++){
      VconvW:for(int row = 0; row < width; row++){
        int pixel = col * width + row;
        Vconv:for(int i = - border_width; i <= border_width; i++){
          int offset = i * width;
          dst[pixel] += local[pixel + offset] * vcoeff[i + border_width];
        }
      }
    }

This code highlights similar issues to those already discussed with the horizontal convolution code.

Many clock cycles are spent setting the values in the output image dst to zero. In this case, that is approximately another two million cycles for a 1920*1080 image.
There are multiple accesses per pixel to re-read data stored in array local.
There are multiple writes per pixel to the output array/port dst.
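To make the first and third issues concrete, the following sketch performs the same arithmetic as the listing above but accumulates each result in a local variable, so dst is written once per pixel and does not need to be cleared first. It is an illustration under stated assumptions rather than the recommended implementation: the function name vconv_single_write and the int data type are hypothetical, row and col are used here in their conventional sense (the listing above uses col for the vertical index), and the border rows of dst are left unwritten. The repeated reads of local remain.

```c
/* Sketch only: each output is accumulated in a local variable and
   written to dst exactly once. Rows inside the top and bottom border
   are not written by this function. */
void vconv_single_write(const int *local, int *dst,
                        int height, int width,
                        const int *vcoeff, int border_width)
{
    for (int row = border_width; row < height - border_width; row++) {
        for (int col = 0; col < width; col++) {
            int pixel = row * width + col;
            int acc = 0;   /* local accumulator replaces dst[pixel] += ... */
            for (int i = -border_width; i <= border_width; i++) {
                /* step one full image row (width samples) per tap */
                acc += local[pixel + i * width] * vcoeff[i + border_width];
            }
            dst[pixel] = acc;   /* single write per output pixel */
        }
    }
}
```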

The access patterns in the code above are in fact what create the requirement for such a large local array. The algorithm requires the data on row K to be available to perform the first calculation. Processing data down the rows before proceeding to the next column requires the entire image to be stored locally, which results in large local storage on the FPGA.

In addition, when you reach the stage where you wish to use compiler directives to optimize the performance of the hardware function, the flow of data between the horizontal and vertical loops cannot be managed via a FIFO (a high-performance and low-resource unit). The data is not streamed out of array local, and a FIFO can only be used with sequential access patterns. Instead, this code, which requires arbitrary/random accesses, needs a ping-pong block RAM to improve performance. This doubles the memory requirement for the local array to approximately four million data samples, which is too large for an FPGA.
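The storage figures quoted above follow from simple arithmetic. The helpers below are a hypothetical sketch (the function names are not from the original code) that counts data samples only, not bits or block RAM primitives.

```c
/* Samples held by the local array for a width x height image. */
long local_samples(long width, long height)
{
    return width * height;
}

/* A ping-pong buffer keeps two full copies of the array so that one
   can be written while the other is read. */
long ping_pong_samples(long width, long height)
{
    return 2 * local_samples(width, height);
}
```

For a 1920*1080 image, local_samples gives 2,073,600 samples and ping_pong_samples gives 4,147,200, which are the "two million" and "four million" figures in round numbers.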