Optimizing the Hardware Function

The SDSoC™ environment employs heterogeneous cross-compilation, with Arm® CPU-specific compilers for the Zynq®-7000 and Zynq® UltraScale+™ MPSoC processor CPUs, and the Vivado® High-Level Synthesis (HLS) tool as a programmable logic (PL) cross-compiler for hardware functions. This section explains the default behavior and optimization directives associated with the HLS cross-compiler.

The default behavior of the HLS tool is to execute functions and loops sequentially so that the hardware is an accurate reflection of the C/C++ code. Optimization directives can be used to enhance the performance of the hardware function; pipelining, in particular, substantially increases performance. This chapter outlines a general methodology for optimizing your design for high performance.

There are many possible goals when trying to optimize a design using the HLS tool. The methodology assumes you want to create a design with the highest possible performance, processing one sample of new input data every clock cycle, and so addresses those optimizations before the ones used for reducing latency or resources.

Detailed explanations of the optimizations discussed here are provided in the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Note: Xilinx® recommends reviewing the methodology and obtaining a global perspective of hardware function optimization before reviewing the details of specific optimizations.

Understanding the Hardware Function Optimization Methodology

Hardware functions are synthesized in the PL by the Vivado HLS tool compiler. This compiler automatically translates C/C++ code into an FPGA hardware implementation and, as with all compilers, does so using compiler defaults.

In addition to the compiler defaults, the HLS tool provides a number of optimizations that are applied to the C/C++ code through the use of pragmas in the code. This chapter explains the optimizations that can be applied and a recommended methodology for applying them.

There are two flows for optimizing the hardware functions:

Top-down flow
In this flow, program decomposition into hardware functions proceeds top-down within the SDSoC environment, letting the system cross-compiler create pipelines of functions that automatically operate in dataflow mode. The microarchitecture for each hardware function is optimized using the HLS tool.
Bottom-up flow
In this flow, the hardware functions are optimized in isolation from the system using the HLS tool compiler provided in the Vivado Design Suite. The hardware functions are analyzed, optimization directives can be applied to create an implementation other than the default, and the resulting optimized hardware functions are then incorporated into the SDSoC environment.

The bottom-up flow is often used in organizations where the software and hardware are optimized by different teams, and it can be used by software programmers who wish to take advantage of existing hardware implementations from within their organization or from partners. Both flows are supported, and the same optimization methodology is used in either case. Both workflows result in the same high-performance system. Xilinx sees the choice as a workflow decision made by individual teams and organizations and provides no recommendation on which flow to use.

The optimization methodology for hardware functions is shown in the following figure:

Figure: Hardware Function Optimization Methodology

This figure details all the steps in the methodology and the subsequent sections in this chapter explain the optimizations in detail.

Note: When optimizing code, the build should be done in the Release configuration, not the Debug configuration, because of the higher -O optimization level. Use Debug to confirm that the functions and system are working, and Release to confirm performance optimizations.
IMPORTANT: Designs reach optimum performance after Step 3.
  • Step 1: See Optimizing Metrics, and review the topics in this chapter prior to attempting to optimize.
  • Step 2: See Pipelining for Performance.
  • Step 3: See Optimizing Structures for Performance.
  • Step 4: See Reducing Latency. This step is used to minimize or specifically control the latency through the design and is required only for applications where this is of concern.
  • Step 5: See Reducing Area. This topic explains how to reduce the resources required for hardware implementation and is typically applied only when larger hardware functions fail to implement in the available resources. The FPGA has a fixed number of resources, and there is typically no benefit in creating a smaller implementation if the performance goals have been met.

Baselining Hardware Functions

Before you perform any hardware function optimization, it is important to understand the performance achieved with the existing code and compiler defaults, and to appreciate how performance is measured. Select the functions to implement in hardware and build the project.

After you build a project, a report is available in the Hardware Reports section of the IDE. The report is also available from <project>/<build_config>/_sds/vhls/<hw_function>/solution/syn/report/<hw_function>.rpt. This report details the performance and resource usage estimates.

The key factors in the performance estimates are, in order, the timing, the interval (which includes the loop initiation interval), and the latency.

  • The timing summary shows the target and estimated clock period. If the estimated clock period is greater than the target, the hardware will not function at this clock period. Reduce the clock frequency by using the Project Settings > Data Motion Network Clock Frequency option. Alternatively, because this is only an estimate at this point in the flow, it might be possible to proceed through the remainder of the flow if the estimate exceeds the target by no more than 20%. Further optimizations are applied when the bitstream is generated, and it might still be possible to satisfy the timing requirements. However, this is an indication that the hardware function is not guaranteed to meet timing.
  • The function initiation interval (II) is the number of clock cycles before the function can accept new inputs and is generally the most critical performance metric in any system. In an ideal hardware function, the hardware processes data at the rate of one sample per clock cycle. If the largest data set passed into the hardware is of size N (for example: my_array[N]), the most optimal II is N + 1. This means the hardware function processes N data samples in N clock cycles and can accept new data one clock cycle after all N samples are processed. It is possible to create a hardware function with an II < N; however, this requires greater resources in the programmable logic (PL) with typically little benefit. Often, this hardware function is ideal because it consumes and produces data at a rate faster than the rest of the system.
  • The loop initiation interval is the number of clock cycles before the next iteration of a loop starts to process data. This metric becomes important as you delve deeper into the analysis to locate and remove performance bottlenecks.
  • The latency is the number of clock cycles required for the function to compute all output values. This is simply the lag from when data is applied until when it is ready. For most applications this is of little concern, especially when the latency of the hardware function vastly exceeds that of the software or system functions, such as DMA; however, it is a performance metric that you should review and confirm is not an issue for your application.
  • The loop iteration latency is the number of clock cycles it takes to complete one iteration of a loop, and the loop latency is the number of cycles to execute all iterations of the loop. See Optimizing Metrics.

The Area Estimates section of the report details how many resources are required in the PL to implement the hardware function and how many are available. The key metric here is the Utilization (%), which should not exceed 100% for any of the resources. A figure greater than 100% means there are not enough resources to implement the hardware function, and a larger FPGA device might be required. As with the timing, at this point in the flow, this is an estimate. If the numbers are only slightly over 100%, it might be possible for the hardware to be optimized during bitstream creation.

You should already have an understanding of the required performance of your system and what metrics are required from the hardware functions; however, even if you are unfamiliar with hardware concepts such as clock cycles, you now know that the highest performing hardware functions have an II = N + 1, where N is the size of the largest data set processed by the function. With an understanding of the current design performance and a set of baseline performance metrics, you can now proceed to apply optimization directives to the hardware functions.

Optimizing Metrics

The following table shows the first directive for you to consider adding to your design.

Table 1. Optimization Strategy Step 1: Optimization for Metrics
Directives and Configurations Description
LOOP_TRIPCOUNT Used for loops that have variable bounds. Provides an estimate for the loop iteration count. This has no impact on synthesis, only on reporting.

A common issue when hardware functions are first compiled is report files showing the latency and interval as a question mark “?” rather than as numerical values. If the design has loops with variable loop bounds, the compiler cannot determine the latency or II and uses the “?” to indicate this condition. Variable loop bounds are where the loop iteration limit cannot be resolved at compile time, as when the loop iteration limit is an input argument to the hardware function, such as variable height, width, or depth parameters.

To resolve this condition, use the hardware function report to locate the lowest-level loop that fails to report a numerical value, and use the LOOP_TRIPCOUNT directive to apply an estimated tripcount. The tripcount is the minimum, average, and/or maximum number of expected iterations. This allows values for latency and interval to be reported and allows implementations with different optimizations to be compared.

Because the LOOP_TRIPCOUNT value is used only for reporting and has no impact on the resulting hardware implementation, any value can be used. However, an accurate expected value results in more useful reports.
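The following sketch shows how the directive might be applied; the function, loop label, and tripcount values are illustrative rather than taken from a specific design:

int accumulate(int *data, int len) {
    int sum = 0;
    SUM_LOOP: for (int i = 0; i < len; i++) {
#pragma HLS LOOP_TRIPCOUNT min=32 max=1024 avg=512
        // len is a function argument, so the loop bound is variable; the
        // tripcount estimate lets the report show latency and II values.
        sum += data[i];
    }
    return sum;
}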

Pipelining for Performance

The next stage in creating a high-performance design is to pipeline the functions, loops, and operations. Pipelining results in the greatest level of concurrency and a very high level of performance. The following table shows the directives you can use for pipelining.

Table 2. Optimization Strategy Step 2: Pipeline for Performance
Directives and Configurations Description
PIPELINE Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function.
DATAFLOW Enables task-level pipelining, allowing functions and loops to execute concurrently. Used to minimize interval.
RESOURCE Specifies pipelining on the hardware resource used to implement a variable (array, arithmetic operation).
Config Compile Allows loops to be automatically pipelined based on their iteration count when using the bottom-up flow.

At this stage of the optimization process, you want to create as much concurrent operation as possible. You can apply the PIPELINE directive to functions and loops. You can use the DATAFLOW directive at the level that contains the functions and loops to make them work in parallel. Although rarely required, the RESOURCE directive can be used to squeeze out the highest levels of performance.

A recommended strategy is to work from the bottom up and be aware of the following:

  • Some functions and loops contain sub-functions. If the sub-function is not pipelined, the function above it might show limited improvement when it is pipelined. The non-pipelined sub-function will be the limiting factor.
  • Some functions and loops contain sub-loops. When you use the PIPELINE directive, the directive automatically unrolls all loops in the hierarchy below. This can create a great deal of logic. It might make more sense to pipeline the loops in the hierarchy below.
  • For cases where it does make sense to pipeline the upper hierarchy and unroll any loops lower in the hierarchy, loops with variable bounds cannot be unrolled, and any loops and functions in the hierarchy above these loops cannot be pipelined. To address this issue, pipeline these loops with variable bounds, and use the DATAFLOW optimization to ensure the pipelined loops operate concurrently to maximize the performance of the task that contains the loops. Alternatively, rewrite the loop to remove the variable bound: apply a maximum upper bound with a conditional break, as sketched below.
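The following sketch illustrates the conditional-break rewrite; the MAX_WIDTH constant and the width variable are assumed names, not taken from the examples in this chapter:

for (int i = 0; i < MAX_WIDTH; i++) {  // fixed upper bound allows unrolling
    if (i >= width) break;             // break preserves the true, variable bound
    out[i] = in1[i] * in2[i];
}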

The basic strategy at this point in the optimization process is to pipeline the tasks (functions and loops) as much as possible. For detailed information on which functions and loops to pipeline, seeHardware Function Pipeline Strategies.

Although not commonly used, you can also apply pipelining at the operator level. For example, wire routing in the FPGA can introduce large and unanticipated delays that make it difficult for the design to be implemented at the required clock frequency. In this case, you can use the RESOURCE directive to pipeline specific operations such as multipliers, adders, and block RAM accesses, adding pipeline register stages at the logic level and allowing the hardware function to process data at the highest possible performance level.

Note: The configuration commands are used to change the optimization default settings and are available only from within the Vivado® HLS tool when using a bottom-up flow. For more details, see the Vivado Design Suite User Guide: High-Level Synthesis (UG902).

Hardware Function Pipeline Strategies

The key optimization directives for obtaining a high-performance design are the PIPELINE and DATAFLOW directives. This section discusses in detail how to apply these directives for various C code architectures.

There are two types of C/C++ functions: those that are frame-based and those that are sample-based. No matter which coding style is used, the hardware function can be implemented with the same performance. The difference is only in how the optimization directives are applied.

Frame-Based C Code

The primary characteristic of a frame-based coding style is that the function processes multiple data samples—a frame of data—typically supplied as an array or pointer with data accessed through pointer arithmetic during each transaction (a transaction is considered to be one complete execution of the C function). In this coding style, the data is typically processed through a series of loops or nested loops.

The following is an example outline of frame-based C code:

void foo(data_t in1[HEIGHT][WIDTH], data_t in2[HEIGHT][WIDTH],
         data_t out[HEIGHT][WIDTH]) {
  Loop1: for (int i = 0; i < HEIGHT; i++) {
    Loop2: for (int j = 0; j < WIDTH; j++) {
      out[i][j] = in1[i][j] * in2[i][j];
      Loop3: for (int k = 0; k < NUM_BITS; k++) {
        . . . .
      }
    }
  }
}

When seeking to pipeline any C/C++ code for maximum performance in hardware, you want to place the pipeline optimization directive at the level where a sample of data is processed.

The above example is representative of code used to process an image or video frame and can be used to highlight how to effectively pipeline hardware functions. Two sets of input are provided as frames of data to the function, and the output is also a frame of data. There are multiple locations where this function can be pipelined:

  • At the level of function foo.
  • At the level of loop Loop1.
  • At the level of loop Loop2.
  • At the level of loop Loop3.

There are advantages and disadvantages for placing the PIPELINE directive at various locations. Understanding them helps guide you to the best location to place the pipeline directive in your code.

Function Level
The function accepts a frame of data as input (in1 and in2). If the function is pipelined with II = 1 (read a new set of inputs every clock cycle), this informs the compiler to read all HEIGHT*WIDTH values of in1 and in2 in a single clock cycle. This is a lot of data to read in one cycle and is unlikely to be the design you want.

If the PIPELINE directive is applied to function foo, all loops in the hierarchy below this level must be unrolled. This is a requirement for pipelining: there cannot be sequential logic inside the pipeline. This would create HEIGHT*WIDTH*NUM_BITS copies of the logic, which would lead to a large design.

Because the data is accessed in a sequential manner, the arrays on the interface to the hardware function can be implemented as multiple types of hardware interface:

  • Block RAM interface
  • AXI4 interface
  • AXI4-Lite interface
  • AXI4-Stream interface
  • FIFO interface

A block RAM interface can be implemented as a dual-port interface supplying two samples per clock. The other interface types can only supply one sample per clock. This would result in a bottleneck; there would be a highly parallel but large hardware design unable to process all the data in parallel, resulting in a waste of hardware resources.

Loop1 Level
The logic in Loop1 processes an entire row of the two-dimensional matrix. Placing the PIPELINE directive here would create a design that seeks to process one row in each clock cycle. Again, this would unroll the loops below and create additional logic. To make use of the additional hardware, an entire row of data must be transferred each clock cycle: the input becomes an array of HEIGHT data words, with each word being WIDTH times the size of the data type in bits.

Because it is unlikely the host code running on the PS can process such large data words, this would again be a case where there are many highly parallel hardware resources that cannot operate in parallel due to bandwidth limitations.

Loop2 Level
The logic in Loop2 seeks to process one sample from the arrays. In an image algorithm, this is the level of a single pixel. This is the level to pipeline if the design is to process one sample per clock cycle. This is also the rate at which the interfaces consume and produce data to and from the PS.

This causes Loop3 to be completely unrolled and process one sample per clock. It is a requirement that all the operations in Loop3 execute in parallel. In a typical design, the logic in Loop3 is a shift register or is processing bits within a word. To execute at one sample per clock, you want these processes to occur in parallel and hence you want to unroll the loop. The hardware function created by pipelining Loop2 processes one data sample per clock and creates parallel logic only where needed to achieve the required level of data throughput.

Loop3 Level
As stated above, given that Loop2 operates on each data sample or pixel, Loop3 will typically be performing bit-level or data-shifting tasks, so this level performs multiple operations per pixel. Pipelining this level would mean performing each operation in this loop once per clock and thus NUM_BITS clocks per pixel: processing at the rate of multiple clocks per pixel or data sample.

For example, Loop3 might contain a shift register holding the previous pixels required for a windowing or convolution algorithm. Adding the PIPELINE directive at this level informs the compiler to shift one data value every clock cycle. The design would only return to the logic in Loop2 and read the next inputs after NUM_BITS iterations, resulting in a very slow data processing rate.

The ideal location to pipeline in this example is Loop2.
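For reference, the following sketch shows the PIPELINE directive placed at Loop2 in the example above; the body of Loop3 is omitted as in the original outline:

void foo(data_t in1[HEIGHT][WIDTH], data_t in2[HEIGHT][WIDTH],
         data_t out[HEIGHT][WIDTH]) {
  Loop1: for (int i = 0; i < HEIGHT; i++) {
    Loop2: for (int j = 0; j < WIDTH; j++) {
#pragma HLS PIPELINE II=1
      out[i][j] = in1[i][j] * in2[i][j];
      Loop3: for (int k = 0; k < NUM_BITS; k++) {
        // bit-level operations, unrolled as part of the Loop2 pipeline
      }
    }
  }
}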

When dealing with frame-based code, pipeline at the loop level, and typically pipeline the loop that operates at the level of a sample. If in doubt, place a print statement into the C code to confirm this is the level you wish to execute on each clock cycle.

For cases where there are multiple loops at the same level of hierarchy—the example above shows only a set of nested loops—the best location to place the PIPELINE directive can be determined for each loop and then the DATAFLOW directive applied to the function to ensure each of the loops executes in a concurrent manner.

Sample-Based C Code

An example outline of sample-based C code is shown below. The primary characteristic of this coding style is that the function processes a single data sample during each transaction.

void foo(data_t *in, data_t *out) {
  static data_t acc;
  Loop1: for (int i = N - 1; i >= 0; i--) {
    acc += ..some calculation..;
  }
  *out = acc >> N;
}

Another characteristic of sample-based coding style is that the function often contains a static variable: a variable whose value must be remembered between invocations of the function, such as an accumulator or sample counter.

With sample-based code, the location of the PIPELINE directive is clear: to achieve an II = 1 and process one data value each clock cycle, the function must be pipelined.

This unrolls any loops inside the function and creates additional hardware logic, but there is no way around this. If Loop1 is not pipelined, it takes a minimum of N clock cycles to complete. Only then can the function read the next input value.

When dealing with C code that processes at the sample level, the strategy is always to pipeline the function.

In this type of coding style, the loops are typically operating on arrays and performing shift-register or line-buffer functions. It is not uncommon to partition these arrays into individual elements, as discussed in Optimizing Structures for Performance, to ensure all samples are shifted in a single clock cycle. If the array is implemented in a block RAM, a maximum of only two samples can be read or written in each clock cycle, creating a data processing bottleneck.

The solution here is to pipeline function foo. Doing so results in a design that processes one sample per clock.
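Placed in the code, the directive looks as follows; the loop body placeholder is kept from the outline above:

void foo(data_t *in, data_t *out) {
#pragma HLS PIPELINE II=1
  static data_t acc;
  Loop1: for (int i = N - 1; i >= 0; i--) {
    acc += ..some calculation..;
  }
  *out = acc >> N;
}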

Optimizing Structures for Performance

C code can contain descriptions that prevent a function or loop from being pipelined with the required performance. This is often implied by the structure of the C code or the default logic structures used to implement the PL. In some cases, this might require a code modification, but in most cases these issues can be addressed using additional optimization directives.

The following example shows a case where an optimization directive is used to improve the structure of the implementation and the performance of pipelining. In this initial example, the PIPELINE directive is added to a loop to improve the performance of the loop. This example code shows a loop being used inside a function.

#include "bottleneck.h"
dout_t bottleneck(...) {
  ...
  SUM_LOOP: for (i = 3; i < N; i = i + 4) {
#pragma HLS PIPELINE
    sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
  }
  ...
}

When the code above is compiled into hardware, the following message appears as output:

INFO: [SCHED 61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 69] Unable to schedule 'load' operation ('mem_load_2', bottleneck.c:62) on array 'mem' due to limited memory ports.
INFO: [SCHED 61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.

The issue in this example is that arrays are implemented using the efficient block RAM resources in the PL. This results in a small, cost-efficient, fast design. The disadvantage of block RAMs is that, like other memories such as DDR or SRAM, they have a limited number of data ports, typically a maximum of two.

In the code above, four data values from mem are required to compute the value of sum. Because mem is an array and implemented in a block RAM that only has two data ports, only two values can be read (or written) in each clock cycle. With this configuration, it is impossible to compute the value of sum in one clock cycle and thus consume or produce data with an II of 1 (process one data sample per clock).

The memory port limitation issue can be solved by using the ARRAY_PARTITION directive on thememarray. This directive partitions arrays into smaller arrays, improving the data structure by providing more data ports and allowing a higher performance pipeline.

With the additional directive shown below, array mem is partitioned into two dual-port memories so that all four reads can occur in one clock cycle. There are multiple options for partitioning an array. In this case, cyclic partitioning with a factor of two ensures the first partition contains elements 0, 2, 4, and so forth, from the original array, and the second partition contains elements 1, 3, 5, and so forth. Because the partitioning ensures there are now two dual-port block RAMs (with a total of four data ports), elements 0, 1, 2, and 3 can be read in a single clock cycle.

Note: The ARRAY_PARTITION directive cannot be used on arrays that are arguments of the function selected as an accelerator.
#include "bottleneck.h"
dout_t bottleneck(...) {
#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=2 dim=1
  ...
  SUM_LOOP: for (i = 3; i < N; i = i + 4) {
#pragma HLS PIPELINE
    sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
  }
  ...
}

Other such issues might be encountered when trying to pipeline loops and functions. The following table lists the directives that are likely to address these issues by helping to reduce bottlenecks in data structures.

Table 3. Optimization Strategy Step 3: Optimize Structures for Performance
Directives and Configurations Description
ARRAY_PARTITION Partitions large arrays into multiple smaller arrays or into individual registers to improve access to data and remove block RAM bottlenecks.
DEPENDENCE Provides additional information that can overcome loop-carried dependencies and allow loops to be pipelined (or pipelined with lower intervals).
INLINE Inlines a function, removing all function hierarchy. Enables logic optimization across function boundaries and improves latency/interval by reducing function call overhead.
UNROLL Unrolls for-loops to create multiple independent operations rather than a single collection of operations, allowing greater hardware parallelism. This also allows for partial unrolling of loops.
Config Array Partition This configuration determines how arrays are automatically partitioned, including global arrays, and if the partitioning impacts array ports.
Config Compile Controls synthesis specific optimizations such as the automatic loop pipelining and floating point math optimizations.
Config Schedule Determines the effort level to use during the synthesis scheduling phase, the verbosity of the output messages, and whether the II should be relaxed in pipelined tasks to achieve timing.
Config Unroll Allows all loops below the specified number of loop iterations to be automatically unrolled.

In addition to the ARRAY_PARTITION directive, the configuration for array partitioning can be used to automatically partition arrays.

The DEPENDENCE directive might be required to remove implied dependencies when pipelining loops. Such dependencies are reported by message SCHED-68.

@W [SCHED-68] Target II not met due to carried dependence(s)
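The following sketch shows how the directive is applied; the histogram loop is hypothetical, and the directive is safe only if the algorithm guarantees that consecutive iterations never access the same address:

HIST_LOOP: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS DEPENDENCE variable=hist inter false
    // Declares the inter-iteration read-after-write on 'hist' false so the
    // loop can be pipelined; using this when the dependence is real
    // produces incorrect hardware.
    int val = in[i];
    hist[val] = hist[val] + 1;
}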

The INLINE directive removes function boundaries. It can be used to bring logic or loops up one level of hierarchy. It might be more efficient to pipeline the logic in a function by including it in the function above it, and to merge loops into the function above them, where the DATAFLOW optimization can then be used to execute all the loops concurrently without the overhead of the intermediate sub-function call. This can lead to a higher performing design.

The UNROLL directive might be required for cases where a loop cannot be pipelined with the required II. If a loop can only be pipelined with II = 4, it constrains the other loops and functions in the system to II = 4. In some cases, it might be worth unrolling or partially unrolling the loop to create more logic and remove a potential bottleneck. If the loop can only achieve II = 4, unrolling the loop by a factor of 4 creates logic that can process four iterations of the loop in parallel and achieve II = 1.
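A sketch of this strategy; the loop, its body, and the assumption that it can only achieve II = 4 without unrolling are illustrative:

ACC_LOOP: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
    // Partially unrolling by 4 creates four copies of the body, so four
    // iterations are processed in parallel in each pipeline cycle.
    sum += data[i];
}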

The configuration commands are used to change the optimization default settings and are available only from within the Vivado HLS tool when using a bottom-up flow. For more details, see the Vivado Design Suite User Guide: High-Level Synthesis (UG902). If optimization directives cannot be used to improve the initiation interval, the code itself might require changes; see the same guide for examples.

Reducing Latency

When the compiler finishes minimizing the initiation interval (II), it automatically seeks to minimize the latency. The optimization directives listed in the following table can be used to specify a particular latency, or to instruct the compiler to achieve a latency lower than the one currently produced, that is, to satisfy the latency directive even if it results in a higher II. This could result in a lower performance design.

Latency directives are generally not required because most applications have a required throughput but no required latency. When hardware functions are integrated with a processor, the latency of the processor is generally the limiting factor in the system.

If the loops and functions are not pipelined, the throughput is limited by the latency because the task does not start reading the next set of inputs until the current task has completed.

Table 4. Optimization Strategy Step 4: Reduce Latency
Directive Description
LATENCY Allows a minimum and maximum latency constraint to be specified.
LOOP_FLATTEN Allows nested loops to be collapsed into a single loop. This removes the loop transition overhead and improves the latency. Nested loops are automatically flattened when the PIPELINE directive is applied.
LOOP_MERGE Merges consecutive loops to reduce overall latency, increase logic resource sharing, and improve logic optimization.

The loop optimization directives can be used to flatten a loop hierarchy or merge consecutive loops together. The benefit to the latency comes from the fact that it typically costs a clock cycle in the control logic to enter and leave the logic created by a loop. The fewer the transitions between loops, the fewer clock cycles the design takes to complete.
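For example, two consecutive loops such as the following sketch (the function and loop bodies are illustrative) can be merged so that the design pays the loop entry and exit overhead only once:

void calc(int in[N], int outA[N], int outB[N]) {
#pragma HLS LOOP_MERGE
  L1: for (int i = 0; i < N; i++) outA[i] = in[i] * 2;
  L2: for (int i = 0; i < N; i++) outB[i] = in[i] + 3;
}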

Reducing Area

In hardware, the number of resources required to implement a logic function is referred to as the design area. Design area also refers to how much of the fixed-size PL fabric those resources consume. The area is important when the hardware is too large to be implemented in the target device, and when the hardware function consumes a very high percentage (> 90%) of the available area. This can result in difficulties when trying to wire the hardware logic together because the wires themselves require resources.

After meeting the required performance target or initiation interval (II), the next step might be to reduce the area while maintaining the same performance. This step can be optional: there is no advantage to reducing the area if the hardware function is operating at the required performance and no other hardware functions are to be implemented in the remaining space in the PL.

The most common area optimization is the optimization of dataflow memory channels to reduce the number of block RAM resources required to implement the hardware function. Each device has a limited number of block RAM resources.

If you used the DATAFLOW optimization and the compiler cannot determine whether the tasks in the design are streaming data, it implements the memory channels between dataflow tasks using ping-pong buffers. These require two block RAMs, each of size N, where N is the number of samples to be transferred between the tasks (typically the size of the array passed between tasks). If the design is pipelined and the data is streaming from one task to the next, with values produced and consumed in a sequential manner, you can greatly reduce the area by using the STREAM directive to specify that the arrays are to be implemented in a streaming manner using a simple FIFO, for which you can specify the depth. FIFOs with a small depth are implemented using registers, and the PL fabric has many registers.

For most applications, the depth can be specified as 1, which results in the memory channel being implemented as a simple register. If the algorithm implements data compression or extrapolation, where some tasks consume more data than they produce or produce more data than they consume, some arrays must be specified with a higher depth:

  • For tasks which produce and consume data at the same rate, specify the array between them to stream with a depth of 1.
  • For tasks which reduce the data rate by a factor of X-to-1, specify arrays at the input of the task to stream with a depth of X. All arrays prior to this in the function should also have a depth of X to ensure the hardware function does not stall because the FIFOs are full.
  • For tasks which increase the data rate by a factor of 1-to-Y, specify arrays at the output of the task to stream with a depth of Y. All arrays after this in the function should also have a depth of Y to ensure the hardware function does not stall because the FIFOs are full.
Note: If the depth is set too small, the hardware function will stall (hang) during hardware emulation, resulting in lower performance or even deadlock in some cases, due to full FIFOs causing the rest of the system to wait.
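The following sketch shows how the STREAM directive might be applied inside a dataflow region; the task functions, array names, and types are placeholders:

void top(int in[N], int out[N]) {
#pragma HLS DATAFLOW
    int tmp1[N];
    int tmp2[N];
#pragma HLS STREAM variable=tmp1 depth=1
#pragma HLS STREAM variable=tmp2 depth=1
    // All three tasks produce and consume at the same rate, so a depth of
    // 1 replaces each ping-pong buffer with a single register.
    task_A(in, tmp1);
    task_B(tmp1, tmp2);
    task_C(tmp2, out);
}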

The following table lists the other directives and configurations to consider when attempting to minimize the resources used to implement the design.

Table 5. Optimization Strategy Step 5: Reduce Area
Directives and Configurations Description
ALLOCATION Specifies a limit for the number of operations, hardware resources, or functions used. This can force the sharing of hardware resources but might increase latency.
ARRAY_MAP Combines multiple smaller arrays into a single large array to help reduce the number of block RAM resources.
ARRAY_RESHAPE Reshapes an array from one with many elements to one with greater word width. Useful for improving block RAM accesses without increasing the number of block RAM.
DATA_PACK Packs the data fields of an internal struct into a single scalar with a wider word width, allowing a single control signal to control all fields.
LOOP_MERGE Merges consecutive loops to reduce overall latency, increase sharing, and improve logic optimization.
OCCURRENCE Used when pipelining functions or loops to specify that the code in a location is executed at a lesser rate than the code in the enclosing function or loop.
RESOURCE Specifies that a specific hardware resource (core) is used to implement a variable (array, arithmetic operation).
STREAM Specifies that a specific memory channel is to be implemented as a FIFO with an optional specific depth.
Config Bind Determines the effort level to use during the synthesis binding phase and can be used to globally minimize the number of operations used.
Config Dataflow This configuration specifies the default memory channel and FIFO depth in dataflow optimization.

The ALLOCATION and RESOURCE directives are used to limit the number of operations and to select which cores (hardware resources) are used to implement the operations. For example, you could limit the function or loop to using only one multiplier, and specify it to be implemented using a pipelined multiplier.
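A sketch of both directives together; limiting the design to a single multiplier and choosing a pipelined multiplier core are illustrative choices, and the core name can vary with the tool version:

int scale(int a, int b, int c, int d) {
#pragma HLS ALLOCATION instances=mul limit=1 operation
    // Both multiplications share one multiplier instance, which might
    // increase latency.
    int p1 = a * b;
#pragma HLS RESOURCE variable=p1 core=MulnS
    int p2 = c * d;
    return p1 + p2;
}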

If the ARRAY_PARTITION directive is used to improve the initiation interval, you might want to consider using the ARRAY_RESHAPE directive instead. The ARRAY_RESHAPE optimization performs a similar task to array partitioning; however, the reshape optimization recombines the elements created by partitioning into a single block RAM with wider data ports. This might prevent an increase in the number of block RAM resources required.

If the C code contains a series of loops with similar indexing, merging the loops with the LOOP_MERGE directive might allow some optimizations to occur. Finally, in cases where a section of code in a pipeline region is only required to operate at a rate lower than the rest of the region, the OCCURRENCE directive is used to indicate that this logic can be optimized to execute at the lower rate.

Note: The configuration commands are used to change the optimization default settings and are available only from within the Vivado HLS tool when using a bottom-up flow. For more information, see the Vivado Design Suite User Guide: High-Level Synthesis (UG902).

Design Optimization Workflow

Before performing any optimizations, it is recommended to create a new build configuration within the project. Using different build configurations allows one set of results to be compared against another. In addition to the standard Debug and Release configurations, custom configurations with more useful names (for example, Opt_ver1 and UnOpt_ver) can be created using the Project Settings > Manage Build Configurations for the Project toolbar button.

Different build configurations allow you to compare not only the results, but also the log files and even output RTL files used to implement the FPGA (the RTL files are only recommended for users very familiar with hardware design).

The basic optimization strategy for a high-performance design is:

  • Create an initial or baseline design.
  • Pipeline the loops and functions. Apply the DATAFLOW optimization to execute loops and functions concurrently.
  • Address any issues that limit pipelining, such as array bottlenecks and loop dependencies (with the ARRAY_PARTITION and DEPENDENCE directives).
  • Specify a specific latency, or reduce the size of the dataflow memory channels and use the ALLOCATION and RESOURCE directives to further reduce area.
Note: It might sometimes be necessary to make adjustments to the code to meet performance.

In summary, the goal is always to meet performance first, before reducing area. If the strategy is to create a design with the fewest resources, omit the steps for improving performance, although the baseline results might already be very close to the smallest possible design.

Throughout the optimization process, it is highly recommended to review the console output (or log file) after compilation. When the compiler cannot reach the specified performance goals of an optimization, it automatically relaxes the goals (except the clock frequency) and creates a design with the goals that can be satisfied. It is important to review the output from the compilation log files and reports to understand what optimizations have been performed.

For specific details on applying optimizations, refer to the Vivado Design Suite User Guide: High-Level Synthesis (UG902).

Optimization Guidelines

This section documents several fundamental optimization techniques for enhancing hardware function performance using the Vivado HLS tool: function inlining, loop and function pipelining, loop unrolling, increasing local memory bandwidth, and streaming data flow between loops and functions.

Function Inlining

Similar to function inlining of software functions, it can be beneficial to inline hardware functions.

Function inlining replaces a function call by substituting a copy of the function body after resolving the actual and formal arguments. After that, the inlined function is dissolved and no longer appears as a separate level of hierarchy. Function inlining allows operations within the inlined function to be optimized more effectively with surrounding operations, thus improving the overall latency or the initiation interval for a loop.

To inline a function, put #pragma HLS inline at the beginning of the body of the desired function. The following code snippet directs the Vivado HLS tool to inline the mmult_kernel function:

void mmult_kernel(float in_A[A_NROWS][A_NCOLS],
                  float in_B[A_NCOLS][B_NCOLS],
                  float out_C[A_NROWS][B_NCOLS]) {
#pragma HLS INLINE
    int index_a, index_b, index_d;
    // rest of code body omitted
}

Loop Pipelining and Loop Unrolling

Both loop pipelining and loop unrolling improve the performance of hardware functions by exploiting the parallelism between loop iterations. This section presents the basic concepts of loop pipelining and loop unrolling, example code for applying these techniques, and the factors that limit the parallelism they can achieve.

Loop Pipelining

In sequential languages such as C/C++, the operations in a loop are executed sequentially, and the next iteration of the loop can only begin when the last operation in the current loop iteration is complete. Loop pipelining allows the operations in a loop to be implemented in a concurrent manner as shown in the following figure.

Figure: Loop Pipelining



As shown in the previous figure, without pipelining, there are three clock cycles between the two RD operations, and it requires six clock cycles for the entire loop to finish. However, with pipelining, there is only one clock cycle between the two RD operations, and it requires four clock cycles for the entire loop to finish; that is, the next iteration of the loop can start before the current iteration is finished.

An important term for loop pipelining is the initiation interval (II), which is the number of clock cycles between the start times of consecutive loop iterations. In the above figure, the II is one because there is only one clock cycle between the start times of consecutive loop iterations.

To pipeline a loop, put #pragma HLS pipeline at the beginning of the loop body, as illustrated in the following code snippet. The Vivado HLS tool tries to pipeline the loop with the minimum II.

for (index_a = 0; index_a < A_NROWS; index_a++) {
    for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
        float result = 0;
        for (index_d = 0; index_d < A_NCOLS; index_d++) {
            float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
            result += product_term;
        }
        out_C[index_a * B_NCOLS + index_b] = result;
    }
}

Loop Unrolling

Loop unrolling is another technique to exploit parallelism between loop iterations. It creates multiple copies of the loop body and adjusts the loop iteration counter accordingly. The following code snippet shows a normal rolled loop:

int sum = 0;
for (int i = 0; i < 10; i++) {
    sum += a[i];
}

After the loop is unrolled by a factor of two, the loop becomes:

int sum = 0;
for (int i = 0; i < 10; i += 2) {
    sum += a[i];
    sum += a[i+1];
}

Unrolling a loop by a factor of N creates N copies of the loop body; the array index referenced by each copy is updated accordingly (such as the a[i+1] in the above code snippet), and the loop iteration counter is also updated accordingly (such as the i += 2 in the above code snippet).

Loop unrolling creates more operations in each loop iteration, so that theVivadoHLS tool can exploit more parallelism among these operations. More parallelism means more throughput and higher system performance.

  • When the factor is less than the total number of loop iterations (10 in the example above), it is called a partial unroll.
  • When the factor is the same as the number of loop iterations, it is called a full unroll. While a full unroll requires that the loop bounds be known at compile time, it exposes the most parallelism.

To unroll a loop, put #pragma HLS unroll [factor=N] at the beginning of the loop. Without the optional factor=N, the loop is fully unrolled.

int sum = 0;
for (int i = 0; i < 10; i++) {
#pragma HLS unroll factor=2
    sum += a[i];
}

Factors Limiting the Parallelism Achieved by Loop Pipelining and Loop Unrolling

Both loop pipelining and loop unrolling exploit the parallelism between loop iterations. However, parallelism between loop iterations is limited by two main factors:

  • The data dependencies between loop iterations.
  • The number of available hardware resources.

A data dependence from an operation in one iteration to another operation in a subsequent iteration is called a loop-carried dependence. It implies that the operation in the subsequent iteration cannot start until the operation in the current iteration has finished computing the data input for the operation in the subsequent iteration. Loop-carried dependencies fundamentally limit the initiation interval that can be achieved using loop pipelining and the parallelism that can be exploited using loop unrolling.

The following example demonstrates loop-carried dependencies among operations producing and consuming variables a and b.

while (a != b) {
    if (a > b)
        a -= b;
    else
        b -= a;
}

Operations in the next iteration of this loop cannot start until the current iteration has calculated and updated the values of a and b. Array accesses are a common source of loop-carried dependencies, as shown in the following example:

for (i = 1; i < N; i++)
    mem[i] = mem[i-1] + i;

In this case, the next iteration of the loop must wait until the current iteration updates the content of the array. In the case of loop pipelining, the minimum initiation interval (II) is the total number of clock cycles required for the memory read, the add operation, and the memory write.

Another performance-limiting factor for loop pipelining and loop unrolling is the number of available hardware resources. The following figure shows an example of the issues created by resource limitations, which in this case prevent the loop from being pipelined with an initiation interval of 1.

Figure: Resource Contention



In this example, if the loop is pipelined with an initiation interval of one, there are two read operations per cycle. If the memory has only a single port, the two read operations cannot be executed simultaneously and must be executed over two cycles, so the minimum initiation interval can only be two, as shown in part (B) of the figure. The same can happen with other hardware resources. For example, if op_compute is implemented with a DSP core that cannot accept new inputs every cycle, and only one such DSP core is available, then op_compute cannot be issued to the DSP core each cycle, and an initiation interval of one is not possible.

Increasing Local Memory Bandwidth

This section shows several ways provided by the Vivado HLS tool to increase local memory bandwidth, which can be used together with loop pipelining and loop unrolling to improve system performance.

Arrays are intuitive and useful constructs in C/C++ programs. They allow the algorithm to be easily captured and understood. In the HLS tool, each array is implemented by default with a single-port memory resource; however, such a memory implementation might not be the ideal memory architecture for performance-oriented programs. Refer to Loop Pipelining and Loop Unrolling for an example of resource contention caused by limited memory ports.

Array Partitioning

Arrays can be partitioned into smaller arrays. Physical implementations of memories have only a limited number of read ports and write ports, which can limit the throughput of a load/store intensive algorithm. The memory bandwidth can sometimes be improved by splitting the original array (implemented as a single memory resource) into multiple smaller arrays (implemented as multiple memories), effectively increasing the number of load/store ports.

The Vivado HLS tool provides three types of array partitioning, as shown in the following figure.

block
The original array is split into equally sized blocks of consecutive elements of the original array.
cyclic
The original array is split into equally sized blocks interleaving the elements of the original array.
complete
The default operation is to split the array into its individual elements. This corresponds to implementing an array as a collection of registers rather than as a memory.

Figure: Array Partitioning



To partition an array in the HLS tool, insert this in the hardware function source code:

#pragma HLS array_partition variable=<variable> <block|cyclic|complete> factor=<int> dim=<int>

For block and cyclic partitioning, the factor option can be used to specify the number of arrays that are created. In the figure above, a factor of two is used, dividing the array into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the last array will have fewer than average elements.
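For example, applied to a hypothetical array of 16 elements:

int samples[16];
#pragma HLS array_partition variable=samples block factor=2 dim=1
// Block partitioning creates two arrays of 8 consecutive elements each,
// samples[0..7] and samples[8..15]; with 'cyclic', even-indexed elements
// would go to one array and odd-indexed elements to the other.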

When partitioning multi-dimensional arrays, the dim option can be used to specify which dimension is partitioned. The following figure shows an example of partitioning different dimensions of a multi-dimensional array.

Figure:Multi-dimension Array Partitioning



Array Reshaping

Arrays can also be reshaped to increase the memory bandwidth. Reshaping takes different elements from a dimension in the original array, and combines them into a single wider element. Array reshaping is similar to array partitioning, but instead of partitioning into multiple arrays, it widens array elements. The following figure illustrates the concept of array reshaping.

Figure: Array Reshaping



To use array reshaping in the Vivado HLS tool, insert this in the hardware function source code:

#pragma HLS array_reshape variable=<variable> <block|cyclic|complete> factor=<int> dim=<int>

The options have the same meaning as the array partition pragma.
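For example, for a hypothetical array of 8-bit samples:

char samples[4][N];
#pragma HLS array_reshape variable=samples block factor=4 dim=1
// Packs the four 8-bit elements of dimension 1 into a single 32-bit word,
// so one memory access supplies all four values.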

Data Flow Pipelining

The previously discussed optimization techniques are all "fine grain" parallelizing optimizations at the level of operators, such as multipliers, adders, and memory load/store operations. These techniques optimize the parallelism between these operators. Data flow pipelining, on the other hand, exploits the "coarse grain" parallelism at the level of functions and loops. Data flow pipelining can increase the concurrency between functions and loops.

Function Data Flow Pipelining

The default behavior for a series of function calls in the Vivado HLS tool is to complete a function before starting the next function. In the following figure, part (A) shows the latency without function data flow pipelining. Assuming it takes eight cycles for the three functions to complete, the code requires eight cycles before a new input can be processed by func_A and also eight cycles before an output is written by func_C (assuming the output is written at the end of func_C).

Figure: Function Data Flow Pipelining



An example execution with data flow pipelining is shown in part (B) of the figure above. Assuming the execution of func_A takes three cycles, func_A can begin processing a new input every three clock cycles rather than waiting for all three functions to complete, resulting in increased throughput. The complete execution to produce an output then requires only five clock cycles, resulting in shorter overall latency.

The HLS tool implements function data flow pipelining by inserting "channels" between the functions. These channels are implemented as either ping-pong buffers or FIFOs, depending on the access patterns of the producer and the consumer of the data.

  • If a function parameter (producer or consumer) is an array, the corresponding channel is implemented as a multi-buffer using standard memory accesses (with associated address and control signals).
  • For scalar, pointer and reference parameters, as well as the function return, the channel is implemented as a FIFO, which uses less hardware resources (no address generation), but requires that the data is accessed sequentially.

To use function data flow pipelining, put #pragma HLS dataflow where the data flow optimization is desired. The following code snippet shows an example:

void top(a, b, c, d) {
#pragma HLS dataflow
    func_A(a, b, i1);
    func_B(c, i1, i2);
    func_C(i2, d);
}

Loop Dataflow Pipelining

Data flow pipelining can also be applied to loops in a similar manner as it is applied to functions. It enables a sequence of loops, normally executed sequentially, to execute concurrently. Data flow pipelining should be applied to a function, loop, or region that contains either all functions or all loops: do not apply it to a scope that contains a mixture of loops and functions.

The following figure shows the advantages data flow pipelining can produce when applied to loops. Without data flow pipelining, loop N must execute and complete all iterations before loop M can begin. The same applies to the relationship between loops M and P. In this example, it is eight cycles before loop N can start processing the next value and eight cycles before an output is written (assuming the output is written when loop P finishes).

Figure: Loop Data Flow Pipelining



With data flow pipelining, these loops can operate concurrently. An example execution with data flow pipelining is shown in part (B) of the figure above. Assuming loop M takes three cycles to execute, the code can accept new inputs every three cycles. Similarly, it can produce an output value every five cycles, using the same hardware resources. The Vivado HLS tool automatically inserts channels between the loops to ensure data can flow asynchronously from one loop to the next. As with function data flow pipelining, the channels between the loops are implemented either as multi-buffers or FIFOs.

To use loop data flow pipelining, put #pragma HLS dataflow where you want the data flow optimization.
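A sketch of the structure described above, with three loops standing in for loops N, M, and P; the stage functions, types, and intermediate arrays are placeholders:

void top(int in[SIZE], int out[SIZE]) {
#pragma HLS dataflow
    int tmp1[SIZE];
    int tmp2[SIZE];
    // The tool inserts channels on tmp1 and tmp2 so the loops run concurrently.
    Loop_N: for (int i = 0; i < SIZE; i++) tmp1[i] = stage_n(in[i]);
    Loop_M: for (int i = 0; i < SIZE; i++) tmp2[i] = stage_m(tmp1[i]);
    Loop_P: for (int i = 0; i < SIZE; i++) out[i] = stage_p(tmp2[i]);
}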

Hardware Function Interfacing

After defining the function to accelerate, there are a few key items to ensure the compilation is valid. The Vivado HLS tool data types (ap_int, ap_uint, ap_fixed, and so on) cannot be part of the function parameter list that the software part of the application calls. These data types are unique to the HLS tool and have no meaning outside of the tool and its associated compiler.

For example, if the following function were written in the HLS tool, the parameter list needs to be adjusted, and the function body has to handle moving the data between the generic data types and the HLS tool data types, as shown below:

void foo(ap_int *a, ap_int *b, ap_int *c) {
    /* Function body */
}

This needs to be modified to use local variables:

void foo(int *a, int *b, int *c) {
    ap_int *local_a = a;
    ap_int *local_b = b;
    ap_int *local_c = c;
    // Remaining function body
}
IMPORTANT: Initializing local variables with input data can consume too much memory in the accelerator. Therefore, cast the input data types to the appropriate HLS data types instead.
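A sketch of that casting approach, assuming 32-bit data; the width and the use of ap_int from ap_int.h are illustrative:

#include "ap_int.h"

void foo(int *a, int *b, int *c) {
    ap_int<32> local_a = (ap_int<32>)*a;  // cast each value as it is read
    ap_int<32> local_b = (ap_int<32>)*b;
    ap_int<32> local_c = local_a + local_b;  // computation on HLS types
    *c = (int)local_c;                       // cast back for the caller
}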