Standard Border Pixel Convolution

The final step in performing the convolution is to create the data around the border. These pixels can be created by simply reusing the nearest pixel in the convolved output. The following figures shows how this is achieved.

The border region is populated with the nearest valid value. The following code performs the operations shown in the figure.

int border_width_offset = border_width * width; int border_height_offset = (height - border_width - 1) * width; // Border pixels Top_Border:for(int col = 0; col < border_width; col++){ int offset = col * width; for(int row = 0; row < border_width; row++){ int pixel = offset + row; dst[pixel] = dst[border_width_offset + border_width]; } for(int row = border_width; row < width - border_width; row++){ int pixel = offset + row; dst[pixel] = dst[border_width_offset + row]; } for(int row = width - border_width; row < width; row++){ int pixel = offset + row; dst[pixel] = dst[border_width_offset + width - border_width - 1]; } } Side_Border:for(int col = border_width; col < height - border_width; col++){ int offset = col * width; for(int row = 0; row < border_width; row++){ int pixel = offset + row; dst[pixel] = dst[offset + border_width]; } for(int row = width - border_width; row < width; row++){ int pixel = offset + row; dst[pixel] = dst[offset + width - border_width - 1]; } } Bottom_Border:for(int col = height - border_width; col < height; col++){ int offset = col * width; for(int row = 0; row < border_width; row++){ int pixel = offset + row; dst[pixel] = dst[border_height_offset + border_width]; } for(int row = border_width; row < width - border_width; row++){ int pixel = offset + row; dst[pixel] = dst[border_height_offset + row]; } for(int row = width - border_width; row < width; row++){ int pixel = offset + row; dst[pixel] = dst[border_height_offset + width - border_width - 1]; } }

The code suffers from the same repeated access for data. The data stored outside the FPGA in the arraydstmust now be available to be read as input data re-read multiple times. Even in the first loop,dst[border_width_offset + border_width]is read multiple times but the values ofborder_width_offsetandborder_widthdo not change.

This code is very intuitive to both read and write. When implemented with the SDSoC environment it is approximately 120M clock cycles, which meets or slightly exceeds the performance of a CPU. However, as shown in the next section, optimal data access patterns ensure this same algorithm can be implemented on the FPGA at a rate of one pixel per clock cycle, or approximately 2M clock cycles.

The summary from this review is that the following poor data access patterns negatively impact the performance and size of the FPGA implementation:

Multiple accesses to read and then re-read data. Use local storage where possible.
Accessing data in an arbitrary or random access manner. This requires the data to be stored locally in arrays and costs resources.
Setting default values in arrays costs clock cycles and performance.