Memory Access Optimizations

The first category of optimizations that can be made to a hardware function is improving memory accesses and the data transfer rate between the processing system (PS) and the programmable logic (PL). Efficient memory accesses are critical to the performance of hardware functions running on an FPGA.

Data is often stored in external DDR memory. Because accesses to off-chip DDR are slow compared with on-chip accesses and reduce system performance, high-performance systems also store data in a local cache to reduce memory access times and improve run time. In addition to these memories, an FPGA provides local memory where small- to medium-sized data blocks can be stored and accessed efficiently. The memory architecture of a system that uses FPGAs to accelerate performance is similar to that of a CPU+GPU or CPU+DSP system: in each case, the design goal is to make the most common memory accesses through the most local memory.

A well-designed hardware function minimizes the latency impact of accessing and storing the data through the extensive use of local memories.

A few suggested techniques include the following:

Data Motion Optimization - This covers the transfer of data between the PS and the PL fabric. The SDSoC environment implements a default data motion architecture based on the arguments of the functions selected for the PL. However, optimization directives can be used, for example, to ensure that data is stored in contiguous memory to improve the efficiency of the data transfer, or to use a scatter-gather transfer to move very large sets of data more efficiently.
Data Access Patterns - An FPGA excels at processing data quickly and in a highly concurrent manner. Poor data access patterns interrupt the flow of data and leave the computational power of the FPGA waiting for data. Good data access patterns minimize the re-reading of data and instead use conditional branching so that each sample is read once and processed in multiple ways.
On-Chip Memories - On-chip memories use the block RAM on the FPGA and are physically located near the computation. They allow one-cycle reads and writes, which drastically improves memory access performance. Data can be copied from the DDR to these local memories very quickly using burst transactions, provided the copy follows an optimal (sequential) data access pattern, and doing so can considerably improve performance.
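
The techniques above can be illustrated with a minimal sketch in plain C++. It is not taken from the SDSoC documentation: the function name `scale_clamp`, the block size `N`, and the clamp-and-scale computation are hypothetical and chosen only for illustration. The sketch copies the input sequentially into a local array (the kind of loop a high-level synthesis tool can typically map to burst transfers into block RAM), reads each sample exactly once, and uses a conditional to decide how that sample is processed. The SDSoC and HLS directives appear only as comments, so the code compiles with any standard C++ compiler; whether and where they apply in a real design should be checked against the tool documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical block size, small enough to fit in on-chip memory.
constexpr std::size_t N = 256;

// Hypothetical hardware function: negative samples are clamped to zero,
// all other samples are multiplied by `gain`.
//
// In an SDSoC project, data motion directives would typically be placed
// immediately before the function declaration, for example:
//   #pragma SDS data mem_attribute(in:PHYSICAL_CONTIGUOUS, out:PHYSICAL_CONTIGUOUS)
//   #pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
// They are shown as comments here (an assumption about how one might use them).
void scale_clamp(const int32_t in[N], int32_t out[N], int32_t gain) {
    assert(in != nullptr && out != nullptr);

    // Local arrays stand in for on-chip block RAM.
    int32_t local_in[N];
    int32_t local_out[N];

    // Sequential copy from external memory into the local buffer.
    // A linear, unconditional loop like this is a candidate for burst access.
    for (std::size_t i = 0; i < N; ++i) {
        local_in[i] = in[i];
    }

    // Compute loop: each sample is read once from local memory, and a
    // conditional selects the processing path instead of re-reading the data.
    // (In HLS, a directive such as "#pragma HLS PIPELINE II=1" could go here.)
    for (std::size_t i = 0; i < N; ++i) {
        const int32_t sample = local_in[i];
        local_out[i] = (sample < 0) ? 0 : sample * gain;
    }

    // Sequential copy of the results back to external memory.
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = local_out[i];
    }
}
```

Separating the copy-in, compute, and copy-out loops keeps every access to external memory strictly sequential, while all data-dependent accesses and branching happen against the local buffers.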