SDSoC Introduction and Overview

The SDSoC™ environment provides a framework for developing and delivering hardware accelerated embedded processor applications using standard programming languages. It includes a familiar embedded processor development flow with an Eclipse-based integrated development environment (IDE), and compilers for the embedded processor application and for hardware functions implemented on the programmable logic resources of the Xilinx® device. The sdscc/sds++ (referred to as sds++) system compiler analyzes a program to determine the dataflow between software and hardware functions, generating an application-specific SoC supporting bare metal, Linux, and FreeRTOS as the target operating system. The sds++ system compiler generates hardware IP and software control code that automatically implements data transfers and synchronizes hardware accelerators and application software, thereby pipelining communication and computation.

Using SoC devices from Xilinx, such as the Zynq®-7000 SoC and the Zynq UltraScale+™ MPSoC, you can implement elements of your application in hardware accelerators that run many times faster than optimized code running on a processor. Xilinx FPGAs and SoC devices offer many advantages over traditional CPU/GPU acceleration, including a custom architecture capable of implementing any function that can run on a processor, resulting in better performance at lower power dissipation. To realize the advantages of software acceleration on a Xilinx device, you should look to accelerate large compute intensive portions of your application in hardware. Implementing these functions in custom hardware allows you to achieve an ideal balance between performance and power. The SDSoC environment provides tools and reports to profile the performance of your embedded processor application and determine where the opportunities for acceleration are. The tools also provide automated runtime instrumentation of cache, memory, and bus utilization to track real-time performance on the hardware.

Developers of hardware accelerated applications can make use of a familiar software-centric programming workflow to take advantage of FPGA acceleration with little or no prior FPGA or hardware design experience. For a software programmer, calling a hardware function is the same as calling a software function; the compiler implements the hardware/software partitioning. However, developers can also create predefined hardware accelerators for use in an embedded processor application, using a hardware-centric approach working through the Vivado® HLS compiler, or creating and packaging optimized RTL accelerators for distribution as a library of C-Callable IP.
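As a minimal sketch of that calling convention, the following shows a candidate hardware function and its caller. The names `vadd` and `add_buffers` and the buffer size `N` are hypothetical and chosen only for illustration. Nothing in the source code marks `vadd` as a hardware function; the assumption here is that the selection is made in the project or build settings, so the same file also builds and runs with an ordinary C++ compiler.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1024;  // hypothetical fixed buffer size

// Candidate hardware function (hypothetical name). Fixed-size array
// arguments tell a hardware compiler how much data each call moves.
void vadd(const int a[N], const int b[N], int out[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Caller: an ordinary function call, whether vadd runs on the CPU or is
// later selected for implementation in programmable logic.
// Both input vectors are expected to hold at least N elements.
std::vector<int> add_buffers(const std::vector<int>& a,
                             const std::vector<int>& b) {
    std::vector<int> out(N);
    vadd(a.data(), b.data(), out.data());
    return out;
}
```

The caller contains no accelerator-specific code, which is what lets the same application be profiled in software first and partitioned later.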

The SDSoC environment provides predefined platforms for standard ZCU102, ZCU104, ZCU106, ZC702, and ZC706, which are Zynq-based development boards. Third-party platforms are also available including: the Zedboard, Microzed, Zybo, Avnet Embedded Vision Kit, Video and Imaging Kit, SDR kit, and more. You can also create a custom platform to meet your specific market requirements. An SDSoC platform consists of a hardware portion defining the embedded processor, the hardware function, and any peripherals supported by the platform; and a software portion defining the operating system boot images, drivers, and the application code. You can start your project using one of the standard SDSoC platforms to evaluate a design concept, to be later implemented on a custom platform for production.

Software Acceleration with SDSoC

When compared with processor architectures, the structures that comprise the programmable logic (PL) in a Xilinx device enable a high degree of parallelism in application execution. The custom processing architecture generated by the sds++/sdscc (referred to as sds++) system compiler for a hardware function in an accelerator presents a different execution paradigm from CPU execution, and provides an opportunity for significant performance gains. While you can re-target an existing embedded processor application for acceleration in PL, writing your application to use the source code libraries of existing hardware functions, such as the Xilinx xfOpenCV library, or modifying your code to better use the PL device architecture, yields significant performance gains and power reduction.

CPUs have fixed resources and offer limited opportunities for parallelization of tasks or operations. A processor, regardless of its type, executes a program as a sequence of instructions generated by processor compiler tools, which transform an algorithm expressed in C/C++ into assembly language constructs that are native to the target processor. Even a simple operation, such as the multiplication of two values, results in multiple assembly instructions that must be executed across multiple clock cycles.

An FPGA is an inherently parallel processing device capable of implementing any function that can run on a processor. Xilinx devices have an abundance of resources that can be programmed and configured to implement any custom architecture and achieve virtually any level of parallelism. Unlike a processor, where all computations share the same ALU, the FPGA programmable logic acts as a blank canvas to define and implement your acceleration functions. The FPGA compiler creates a unique circuit optimized for each application or algorithm; for example, implementing only multiply and accumulate hardware for a neural net rather than a whole ALU.

The sds++ system compiler invoked with the -c option compiles a file into a hardware IP by invoking the Vivado High-Level Synthesis (HLS) tool on the desired function definition. Before calling the HLS tool, the sds++ compiler translates #pragma SDS into pragmas understood by the HLS tool. The HLS tool performs hardware-oriented transformations and optimizations, including scheduling, pipelining, and dataflow operations to increase concurrency.
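The following sketch shows where the two kinds of pragmas typically sit in a hardware function. The function `dot` and the length `N` are hypothetical. The `access_pattern` data pragma and the `PIPELINE` pragma are used here as they are commonly documented for SDSoC and Vivado HLS; treat the exact pragma choices as an assumption about what suits this loop, not as a recommendation from the text above. A standard C++ compiler ignores both pragmas, so the function can be checked in software.

```cpp
constexpr int N = 256;  // hypothetical vector length

// SDS pragma, placed directly before the function declaration: states that
// both inputs are read in order, so a streaming interface can be used.
#pragma SDS data access_pattern(a:SEQUENTIAL, b:SEQUENTIAL)
int dot(const int a[N], const int b[N]);

int dot(const int a[N], const int b[N]) {
    int acc = 0;
    for (int i = 0; i < N; ++i) {
// HLS pragma: request that a new loop iteration starts every clock cycle.
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];  // multiply-accumulate
    }
    return acc;  // non-void return: an output scalar of the function
}
```

The loop body is a single multiply-accumulate, which is the kind of operation the HLS tool can turn into a dedicated pipelined datapath.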

The sds++ linker analyzes program dataflow involving calls into and between hardware functions, mapping into a system hardware data motion network, and software control code (called stubs) to orchestrate accelerators and data transfers through data movers. As described in the following section, the sds++ linker performs data transfer scheduling to identify operations that can be shared, and to insert wait barrier API calls into stubs to ensure program semantics are preserved.

Execution Model of an SDSoC Application

The execution model for an SDSoC environment application can be understood in terms of the normal execution of a C++ program running on the target CPU after the platform has booted. It is useful to understand how a C++ binary executable interfaces to hardware.

The set of declared hardware functions within a program is compiled into hardware accelerators that are accessed with the standard C runtime through calls into these functions. Each hardware function call in effect invokes the accelerator as a task and each of the arguments to the function is transferred between the CPU and the accelerator, accessible by the program after accelerator task completion. Data transfers between memory and accelerators are accomplished through data movers, such as a DMA engine, automatically inserted into the system by the sds++ system compiler taking into account user data mover pragmas such as zero_copy.

To ensure program correctness, the system compiler intercepts each call to a hardware function, and replaces it with a call to a generated stub function that has an identical signature but with a derived name. The stub function orchestrates all data movement and accelerator operation, synchronizing software and accelerator hardware at the exit of the hardware function call. Within the stub, all accelerator and data mover control is realized through a set of send and receive APIs provided by the sds_lib library.

When program dataflow between hardware function calls involves array arguments that are not accessed after the function calls have been invoked within the program (other than destructors or free() calls), and when the hardware accelerators can be connected using streams, the system compiler transfers data from one hardware accelerator to the next through direct hardware stream connections, rather than implementing a round trip to and from memory. This optimization can result in significant performance gains and reduction in hardware resources.
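A minimal sketch of code that has this shape follows. The functions `scale`, `offset`, and `scale_then_offset` are hypothetical, as is the buffer length `N`. The intermediate array `tmp` is written by the first function, read by the second, and otherwise only passed to `free()`, and both functions declare sequential access, which is the situation described above. Whether a direct stream connection is actually generated depends on the system compiler and platform; this example only illustrates the program pattern.

```cpp
#include <cstdlib>

constexpr int N = 512;  // hypothetical buffer length

#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
void scale(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in[i] * 3;
}

#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
void offset(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in[i] + 7;
}

// Chains the two functions. Returns false if the intermediate buffer
// cannot be allocated.
bool scale_then_offset(const int src[N], int dst[N]) {
    int* tmp = static_cast<int*>(std::malloc(N * sizeof(int)));
    if (tmp == nullptr) return false;

    scale(src, tmp);   // producer of tmp
    offset(tmp, dst);  // consumer of tmp

    // The CPU never reads or writes tmp; it is only released here.
    std::free(tmp);
    return true;
}
```

If the caller inspected `tmp` between the two calls, the data would have to be visible in memory, and the round trip could no longer be avoided.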

The SDSoC program execution model includes the following steps:

1. Initialization of the sds_lib library occurs during the program constructor before entering main().
2. Within a program, every call to a hardware function is intercepted by a function call into a stub function with the same function signature (other than name) as the original function. Within the stub function, the following steps occur:
   a. A synchronous accelerator task control command is sent to the hardware.
   b. For each argument to the hardware function, an asynchronous data transfer request is sent to the appropriate data mover, with an associated wait() handle. A non-void return value is treated as an implicit output scalar argument.
   c. A barrier wait() is issued for each transfer request. If a data transfer between accelerators is implemented as a direct hardware stream, the barrier wait() for this transfer occurs in the stub function for the last in the chain of accelerator functions for this argument.
3. Clean up of the sds_lib library occurs during the program destructor, upon exiting main().

TIP: Steps 2a through 2c ensure that program correctness is preserved at the entrance and exit of accelerator pipelines while enabling concurrent execution within the pipelines.
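The control flow inside a stub can be modeled in plain C++ as follows. This is a single-threaded illustration only: `TransferHandle`, `request_transfer`, and `vadd_stub` are hypothetical names, not the generated code or the sds_lib API, and the "transfers" here are simple copies that complete when `wait()` is called. The point is the ordering: task command, one transfer request per argument, then a barrier wait on every handle before control returns to the caller.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical wait handle for one queued transfer. In this single-threaded
// model the transfer is carried out when wait() is called.
struct TransferHandle {
    std::function<void()> work;
    bool done;
    void wait() {
        if (!done) {
            work();
            done = true;
        }
    }
};

// Queue a transfer and hand back its wait handle (hypothetical helper).
TransferHandle request_transfer(std::function<void()> work) {
    return TransferHandle{std::move(work), false};
}

// Stub for a hypothetical hardware function vadd(a, b, out, n): same
// signature as the original, different name.
void vadd_stub(const int* a, const int* b, int* out, std::size_t n) {
    // Stand-ins for accelerator-side buffers.
    std::vector<int> dev_a(n), dev_b(n), dev_out(n);

    // a. Task control command: the work the accelerator performs once its
    //    inputs have arrived.
    auto accelerator_task = [&] {
        for (std::size_t i = 0; i < n; ++i) dev_out[i] = dev_a[i] + dev_b[i];
    };

    // b. One transfer request per argument, each with its own handle.
    TransferHandle handles[] = {
        request_transfer([&] { dev_a.assign(a, a + n); }),
        request_transfer([&] { dev_b.assign(b, b + n); }),
        request_transfer([&] {
            accelerator_task();  // the output exists only after the task runs
            std::copy(dev_out.begin(), dev_out.end(), out);
        }),
    };

    // c. Barrier wait on every transfer before returning to the caller.
    for (TransferHandle& h : handles) h.wait();
}
```

Because every handle is waited on before the stub returns, the caller sees the same behavior as a normal synchronous function call.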

Sometimes, the programmer has insight into the potential concurrent execution of accelerator tasks that cannot be automatically inferred by the system compiler. In this case, the sds++ system compiler supports a #pragma SDS async(ID) that can be inserted immediately preceding a call to a hardware function. This pragma instructs the compiler to generate a stub function without any barrier wait() calls for data transfers. As a result, after issuing all data transfer requests, control returns to the program, enabling concurrent execution of the program while the accelerator is running. In this case, it is your responsibility to insert a #pragma SDS wait(ID) within the program at appropriate synchronization points, which are resolved into sds_wait(ID) API calls to correctly synchronize hardware accelerators, their implicit data movers, and the CPU.

IMPORTANT: Every async(ID) pragma requires a matching wait(ID) pragma.
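A minimal sketch of a matched async/wait pair is shown below. The functions `square` and `square_and_sum`, the ID value 1, and the buffer length `N` are hypothetical. With a standard C++ compiler the pragmas are ignored and the call is simply synchronous, so the result is the same; under the system compiler the CPU loop is intended to overlap with the accelerator.

```cpp
constexpr int N = 128;  // hypothetical buffer length

// Candidate hardware function (hypothetical name).
void square(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = in[i] * in[i];
}

// Starts the accelerator, does unrelated CPU work, then synchronizes.
int square_and_sum(const int in[N], int out[N]) {
    // With async, the call returns once its transfers have been requested.
#pragma SDS async(1)
    square(in, out);

    // Independent CPU work: reads 'in' only and never touches 'out'.
    int sum = 0;
    for (int i = 0; i < N; ++i) sum += in[i];

    // Synchronization point: 'out' may be read only after this wait.
#pragma SDS wait(1)
    return sum;
}
```

The CPU work placed between the two pragmas must not depend on the accelerator's outputs; anything that reads `out` belongs after the wait.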

SDSoC Build Process

The SDSoC build process uses a standard compilation and linking process. Similar to g++, the sds++ system compiler invokes sub-processes to accomplish compilation and linking.

As shown in the following figure, compilation is extended not only to object code that runs on the CPU, but it also includes compilation and linking of hardware functions into IP blocks using the Vivado High-Level Synthesis (HLS) tool, and creating standard object files (.o) using the target CPU toolchain. System linking consists of program analysis of caller/callee relationships for all hardware functions, and the generation of an application-specific hardware/software network to implement every hardware function call. The sds++ system compiler invokes all necessary tools, including Vivado HLS (function compiler), the Vivado Design Suite to implement the generated hardware system, and the Arm compiler and sds++ linker to create the application binaries that run on the CPU invoking the accelerator (stubs) for each hardware function by outputting a complete bootable system for an SD card.

The compilation process includes the following tasks:

- Analyzing the code and running a compilation for the main application on the Arm core, as well as a separate compilation for each of the hardware accelerators.
- Compiling the application code through standard GNU Arm compilation tools with an object (.o) file produced as final output.
- Running the hardware accelerated functions through the HLS tool to start the process of custom hardware creation with an object (.o) file as output.

After compilation, the linking process includes the following tasks:

- Analyzing the data movement through the design and modifying the hardware platform to accept the accelerators.
- Implementing the hardware accelerators into the programmable logic (PL) region using the Vivado Design Suite to run synthesis and implementation, and generate the bitstream for the device.
- Updating the software images with hardware access APIs to call the hardware functions from the embedded processor application.
- Producing an integrated SD card image that can boot the board with the application in an Executable and Linkable Format (ELF) file.

SDSoC Development Methodologies

The SDSoC environment supports two primary use cases:

Software-centric design: The development of an accelerated application written by software programmers using standard programming languages, accelerating compute intensive functions into programmable logic, or identifying application bottlenecks for acceleration by profiling the application.

Hardware-centric design: The development of predefined accelerated functions for use in embedded processor applications like a library of intrinsic functions. This design methodology can be driven from a top-down approach of writing the hardware function in a standard programming language like C or C++, and then synthesized into RTL for implementation into programmable logic; or by using standard RTL design techniques to create and optimize the accelerated function.

The two use cases are often combined, letting software and hardware developer teams define hardware accelerators and develop embedded processor applications to use them. This combined methodology involves different components of the application, developed by different people, and potentially from different companies. You can use predefined hardware functions from libraries available for use in your accelerated application, such as the Xilinx xfOpenCV library, or develop all the accelerators within your own team.

Software-Centric Design

The software-centric approach to accelerated application development, or accelerator development, begins with the use of the C or C++ programming language. The code is written as a standard software program, with some attention to the specific architecture of the code. The software-centric development flow typically uses the following steps:

Table 1. Software-Centric Design Flow
Task	Steps
Profile the embedded processor application.	Baseline the performance, identify bottlenecks, and functions to accelerate. Assess acceleration potential, plan budgets, and requirements.
Code the desired accelerators.	Convert the desired functions to define the hardware function code without optimization.
Verify functionality, iterate as needed.	Run system emulation to generate application and accelerator profiling data, including: estimated FPGA resource usage; overall application performance; visual timeline showing application calls and accelerator start/stop times. Address design recommendations provided by tool guidance.
Optimize for performance, iterate as needed.	Analyze the profile summary and application timeline. Optimize data movement throughout the system: application to DDR, DDR to accelerator, and hardware function interface to local buffers (bursting); maximize DDR bandwidth usage with efficient transfer sizes; overlap transfers; prefetch. Optimize the accelerator code for performance: task-level parallelism (dataflow); instruction-level parallelism (pipelining and loop unrolling); match datapath size to interface bandwidth (arbitrary bit-width).
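Two of the optimizations named in the last row, local buffering and loop pipelining, can be sketched as follows. The function `smooth`, its three-point averaging, and the length `N` are hypothetical and serve only to give the loops something to do. The `PIPELINE` pragma is used as commonly documented for Vivado HLS; a standard C++ compiler ignores it, so the function can be verified in software before any hardware build.

```cpp
constexpr int N = 64;  // hypothetical buffer length

// Three-point moving average with edge samples repeated at the boundaries.
void smooth(const int in[N], int out[N]) {
    // Local buffer: the input is read once, in order, into on-chip storage
    // so the compute loop does not go back to external memory.
    int buf[N];
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }

    // Compute loop: pipelined so that a new output can start every cycle.
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        int left = buf[i > 0 ? i - 1 : i];
        int right = buf[i < N - 1 ? i + 1 : i];
        out[i] = (left + buf[i] + right) / 3;
    }
}
```

Reading the input sequentially into a local buffer is what allows the interface to use burst transfers, while the neighbor accesses in the compute loop stay inside the accelerator.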

Hardware-Centric Design

A hardware-centric flow first focuses on developing and optimizing the accelerators and typically leverages advanced FPGA design techniques to create a library of C-Callable IP. This begins with the definition of the hardware function in C or C++ for use in Vivado HLS, or the use of an RTL language, or an existing IP design or block design in the Vivado Design Suite. The hardware function is defined in RTL code, synthesized, and implemented into the programmable logic of the target device. A software function signature is needed to use the C-Callable IP in the accelerator application, or a compiled library of functions is created for use across multiple applications. The hardware-centric development flow typically uses the following steps:

Table 2. Hardware-Centric Design Methodology
Task	Steps
Study the SDSoC platform specification, and the Zynq-7000 SoC device specification and programming model.	Hardware platform, software platform, data movers, AXI interface, DDR.
Identify cycle budgets and performance requirements.
Define the accelerator architecture and interfaces.
Develop the accelerator.	Use Vivado HLS for C or C++ hardware functions. Use traditional RTL design techniques in the Vivado Design Suite.
Verify functionality and performance, iterate as needed.	Run hardware/software co-simulation in Vivado HLS. Run logic simulation in the Vivado simulator.
Optimize the quality of results to reduce resource utilization and increase frequency, iterate as needed.	For HLS, ensure the design rules check (DRC) is clean. Run the Vivado implementation flow, using the techniques specified in the UltraFast Design Methodology Guide for the Vivado Design Suite (UG949). Use best practices for out-of-context synthesis and estimation.
Import the C-Callable IP into the SDSoC environment.	For the HLS flow, import the C or C++ code into your SDSoC project. For RTL flow, use the C-Callable IP wizard. See C-Callable Libraries for more information.
Develop sample application code to test the hardware function.	Test sample applications with a dummy function having the same interfaces as the C-Callable IP. See C-Callable Libraries for more information.
Verify the hardware function works properly with application, iterate as needed.	Use system emulation for debug. Use the Hardware debug methodology for complex internal debug problems.
Optimize host code for performance, iterate as needed.	Use the Profile Summary report, the Activity Timeline, and event timers in the host application to measure performance. Ensure the DRC is clean. Work to achieve an Activity Timeline that matches the desired performance. Techniques: Overlapping transactions, out-of-order (OOO) synthesis queues, and sub-devices.
Finalize the Software Acceleration Layer deliverable (API, share lib, plug-in…).
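The "dummy function" step in the table above can be sketched as follows. The IP name `rtl_scale`, its argument list, and the helper `scaled_sum` are hypothetical; a real C-Callable IP defines its own prototype in the header shipped with its library. The idea is that the sample application is written against the prototype only, so the dummy body can later be replaced by the packaged IP without changing the caller.

```cpp
#include <cstdint>

constexpr int LEN = 256;  // hypothetical transfer length

// Prototype as it might appear in the header delivered with the library
// (hypothetical interface).
void rtl_scale(const std::uint32_t in[LEN], std::uint32_t out[LEN],
               std::uint32_t gain);

// Dummy software body with the same interface as the IP, used only while
// developing and testing the sample application.
void rtl_scale(const std::uint32_t in[LEN], std::uint32_t out[LEN],
               std::uint32_t gain) {
    for (int i = 0; i < LEN; ++i) out[i] = in[i] * gain;
}

// Sample application code: depends only on the prototype above.
std::uint32_t scaled_sum(const std::uint32_t in[LEN], std::uint32_t gain) {
    std::uint32_t out[LEN];
    rtl_scale(in, out, gain);

    std::uint32_t sum = 0;
    for (int i = 0; i < LEN; ++i) sum += out[i];
    return sum;
}
```

Keeping the dummy in its own source file makes it easy to drop from the build once the static library is linked in.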

Best Practices for Acceleration with SDSoC

The following are best practices when developing your application code and hardware functions in the SDSoC environment:

General guidelines:
- Reduce resource utilization and improve parallelism by streaming data instead of copying data into the PL region. For example, in an image processing application, stream rows of pixels that make up a frame instead of copying the image frame in one long data transfer.
- Reuse the data local to the PL region rather than transferring it back and forth to limit DMA.
- Look to accelerate functions that have:
  - A high compute time to data transfer time ratio.
  - Predictable communication streams.
  - Self-contained control structure not needing control logic outside the accelerator.
- Look for opportunities to increase task-level parallelization by launching multiple accelerators concurrently, or multiple instances of an accelerator.
For a software-centric approach:
- Use good memory management techniques, such as having known array sizes, and using sds_alloc()/sds_free() to allocate/de-allocate physically contiguous memory, thereby reducing the device footprint and increasing baseline performance.
- Use system emulation to validate your code frequently to ensure it is functionally correct.
- Write/migrate hardware functions to separate C/C++ files so that incremental changes do not require re-compiling the entire design.
For a hardware-centric approach using C-Callable IP:
- Keep track of the AXI4 interface offsets for an IP, or accelerator, and which function definition parameters require which data type. The interfaces need to be byte aligned.
- Maintain the original Vivado IP project so that modifications to it can be quickly implemented.
- Keep the static library (.a) file and corresponding header file together.
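The memory management guideline for the software-centric approach can be sketched as follows. The function `negate`, the helper `run_negate`, and the length `N` are hypothetical. The host-side stand-ins for `sds_alloc()` and `sds_free()` are an assumption made so that the file also builds off-target with a standard compiler; they return ordinary heap memory, which is not physically contiguous. The `__SDSCC__` guard and the `zero_copy` data pragma are used as commonly documented for SDSoC.

```cpp
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"  // real sds_alloc()/sds_free() when built with sds++
#else
// Host-only stand-ins (assumption): plain heap memory for software testing.
static void* sds_alloc(std::size_t size) { return std::malloc(size); }
static void sds_free(void* ptr) { std::free(ptr); }
#endif

constexpr int N = 1024;  // known array size (hypothetical value)

// zero_copy: the accelerator accesses the buffers directly in memory, which
// is why they are allocated with sds_alloc() in the caller.
#pragma SDS data zero_copy(in[0:N], out[0:N])
void negate(const int in[N], int out[N]) {
    for (int i = 0; i < N; ++i) out[i] = -in[i];
}

// Allocates the buffers, runs the function, and checks the result.
bool run_negate() {
    int* in = static_cast<int*>(sds_alloc(N * sizeof(int)));
    int* out = static_cast<int*>(sds_alloc(N * sizeof(int)));
    bool ok = (in != nullptr) && (out != nullptr);

    if (ok) {
        for (int i = 0; i < N; ++i) in[i] = i;
        negate(in, out);
        for (int i = 0; i < N; ++i) ok = ok && (out[i] == -i);
    }

    if (in != nullptr) sds_free(in);
    if (out != nullptr) sds_free(out);
    return ok;
}
```

Placing `negate` in its own source file, separate from `run_negate`, would also follow the guideline on keeping hardware functions in separate files.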