Introduction to Programming with SDSoC

The SDSoC™ environment provides tools for developing embedded systems on the Xilinx® Zynq®-7000 SoC, the Zynq® UltraScale+™ MPSoC, or MicroBlaze™ embedded processors on Xilinx devices.

It includes:
  • An Eclipse-based integrated development environment (IDE) with compilers, debuggers, and profilers for Arm® and MicroBlaze processors.
  • A hardware emulator.
  • A hardware compiler that synthesizes C/C++ functions into optimized hardware functions to be used in the programmable logic (PL).
  • A system compiler that generates complete hardware/software systems, including custom hardware accelerators and data mover hardware blocks (for example, DMA engines), from application code written in the C/C++ programming languages.
The sdscc/sds++ (referred to as sds++) system compiler provides options to perform hardware/software event tracing, which provides detailed timeline visibility into accelerator tasks running in hardware, data transfers between accelerators and memory, and application code running on the CPUs.

Xilinx FPGAs and SoC devices offer many advantages, including a programmable hardware architecture for implementing custom data paths, multi-level distributed memory architectures, and interfaces to custom input/output devices, with full customizability of the hardware/software interface to high-performance embedded CPUs. By building a custom hardware system, you can achieve higher performance and lower power dissipation for your embedded applications.

The SDSoC environment is unique because it provides the programmer the ability to create hardware and software, while working within familiar software development workflows including cross-compiling, linking, profiling, debugging, and running application binaries on target hardware and in an emulator. Using the sds++ system compiler, you can target parts of your application to be implemented as hardware accelerators running many times faster than optimized code running on a processor.

The programmer's view of the target device is heterogeneous computing, where code written in C/C++ is running on multi-core Arm CPUs, as well as in custom hardware accelerators, typically with a non-uniform memory architecture and custom interfaces to input/output devices. Paying attention to where code will run, how data is mapped into memory, and how hardware and software interact allows you to achieve better application performance.

In general, application code should reflect the heterogeneity of the target system. Take into consideration that C/C++ code compiled into hardware accelerators benefits from programming idioms that reflect microarchitecture details, while code running on CPUs benefits from idioms that reflect the instruction set, cache, and memory architecture.
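For example, the following sketch (names and sizes are illustrative, not from this guide) contrasts a reduction written naturally for a CPU with a form that exposes parallelism to the hardware compiler by keeping partial sums:

    #define N 1024

    // CPU-friendly form: a simple accumulation, served well by caches.
    float sum_cpu(const float a[N]) {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    // Hardware-friendly form: four partial sums relax the loop-carried
    // dependence so the HLS tool can pipeline the loop effectively.
    float sum_hw(const float a[N]) {
        float part[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int i = 0; i < N; i += 4) {
    #pragma HLS PIPELINE
            for (int j = 0; j < 4; j++)
                part[j] += a[i + j];
        }
        return (part[0] + part[1]) + (part[2] + part[3]);
    }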

When working in the SDSoC environment, the hardware/software interface between the CPU and hardware accelerators is described through function calls and APIs specific to the underlying devices. The majority of the code accesses accelerators through function calls rather than device driver APIs, with the sds++ system compiler generating highly efficient access from user space, automatically managing low-level considerations such as cache management through custom drivers provided by the system compiler.
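For example, a hardware function is called like any other C/C++ function. The following minimal sketch is illustrative (the mmult function, sizes, and values are not from this guide); it uses sds_alloc() from sds_lib to obtain physically contiguous buffers, which lets the system compiler choose more efficient data movers:

    #include <stdlib.h>
    #include "sds_lib.h"

    #define N 32

    // Selected as a hardware function in the SDSoC IDE or on the sds++
    // command line; the body is ordinary C/C++ code.
    void mmult(const float *A, const float *B, float *C) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float acc = 0.0f;
                for (int k = 0; k < N; k++)
                    acc += A[i * N + k] * B[k * N + j];
                C[i * N + j] = acc;
            }
    }

    int main() {
        float *A = (float *)sds_alloc(N * N * sizeof(float));
        float *B = (float *)sds_alloc(N * N * sizeof(float));
        float *C = (float *)sds_alloc(N * N * sizeof(float));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 2.0f; }
        mmult(A, B, C);   // ordinary call; the generated stub moves data and
                          // synchronizes before returning
        sds_free(A);
        sds_free(B);
        sds_free(C);
        return 0;
    }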

Software Acceleration with SDSoC

When compared with processor architectures, the structures that comprise the programmable logic (PL) in a Xilinx device enable a high degree of parallelism in application execution. The custom processing architecture generated by the sds++/sdscc (referred to as sds++) system compiler for a hardware function in an accelerator presents a different execution paradigm from CPU execution, and provides an opportunity for significant performance gains. While you can re-target an existing embedded processor application for acceleration in PL, writing your application to use the source code libraries of existing hardware functions, such as the Xilinx xfOpenCV library, or modifying your code to better use the PL device architecture, yields significant performance gains and power reduction.

CPUs have fixed resources and offer limited opportunities for parallelization of tasks or operations. A processor, regardless of its type, executes a program as a sequence of instructions generated by processor compiler tools, which transform an algorithm expressed in C/C++ into assembly language constructs that are native to the target processor. Even a simple operation, such as the multiplication of two values, results in multiple assembly instructions that must be executed across multiple clock cycles.

An FPGA is an inherently parallel processing device capable of implementing any function that can run on a processor. Xilinx devices have an abundance of resources that can be programmed and configured to implement any custom architecture and achieve virtually any level of parallelism. Unlike a processor, where all computations share the same ALU, the FPGA programmable logic acts as a blank canvas to define and implement your acceleration functions. The FPGA compiler creates a unique circuit optimized for each application or algorithm; for example, only implementing multiply and accumulate hardware for a neural net—not a whole ALU.

The sds++ system compiler invoked with the -c option compiles a file into a hardware IP by invoking the Vivado High-Level Synthesis (HLS) tool on the desired function definition. Before calling the HLS tool, the sds++ compiler translates #pragma SDS directives into pragmas understood by the HLS tool. The HLS tool performs hardware-oriented transformations and optimizations, including scheduling, pipelining, and dataflow operations to increase concurrency.
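As a minimal sketch of what the HLS tool sees (function name and sizes are illustrative), the following hardware function pipelines its inner loop so that, once the pipeline fills, one result is produced per clock cycle:

    #define N 32

    void madd(const int A[N * N], const int B[N * N], int C[N * N]) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
    #pragma HLS PIPELINE II=1
                // Elementwise add; the pipelined loop issues one iteration
                // per clock after the pipeline fills.
                C[i * N + j] = A[i * N + j] + B[i * N + j];
            }
        }
    }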

The sds++ linker analyzes program dataflow involving calls into and between hardware functions, maps these onto a system hardware data motion network, and generates software control code (called stubs) to orchestrate accelerators and data transfers through data movers. As described in the following section, the sds++ linker performs data transfer scheduling to identify operations that can be shared, and inserts wait barrier API calls into stubs to ensure program semantics are preserved.

Execution Model of an SDSoC Application

The execution model for an SDSoC environment application can be understood in terms of the normal execution of a C++ program running on the target CPU after the platform has booted. It is useful to understand how a C++ binary executable interfaces to hardware.

The set of declared hardware functions within a program is compiled into hardware accelerators that are accessed with the standard C runtime through calls into these functions. Each hardware function call in effect invokes the accelerator as a task, and each of the arguments to the function is transferred between the CPU and the accelerator, accessible by the program after accelerator task completion. Data transfers between memory and accelerators are accomplished through data movers, such as a DMA engine, automatically inserted into the system by the sds++ system compiler, taking into account user data mover pragmas such as zero_copy.
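For example, the following sketch (function and argument names are illustrative) applies the zero_copy pragma so the accelerator accesses the buffer directly over an AXI master interface instead of copying it through a DMA data mover; the caller is expected to allocate buf with sds_alloc():

    #define N 1024

    // zero_copy: the accelerator reads and writes buf in place rather than
    // having the system insert a copying data mover for it.
    #pragma SDS data zero_copy(buf[0:N])
    void scale(int buf[N], int factor) {
        for (int i = 0; i < N; i++)
            buf[i] *= factor;   // in-place update through the AXI master
    }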

Figure: Architecture of an SDSoC System

To ensure program correctness, the system compiler intercepts each call to a hardware function, and replaces it with a call to a generated stub function that has an identical signature but with a derived name. The stub function orchestrates all data movement and accelerator operation, synchronizing software and accelerator hardware at the exit of the hardware function call. Within the stub, all accelerator and data mover control is realized through a set of send and receive APIs provided by the sds_lib library.

When program dataflow between hardware function calls involves array arguments that are not accessed after the function calls have been invoked within the program (other than destructors or free() calls), and when the hardware accelerators can be connected using streams, the system compiler transfers data from one hardware accelerator to the next through direct hardware stream connections, rather than implementing a round trip to and from memory. This optimization can result in significant performance gains and reduction in hardware resources.
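The following sketch (with illustrative names and bodies) satisfies this condition: tmp is produced by one hardware function, consumed only by the next, and never read again, so the compiler is free to connect the two accelerators with a direct stream:

    #define N 4096

    void stage1(const int in[N], int out[N]) {   // hardware function
        for (int i = 0; i < N; i++) out[i] = in[i] + 1;
    }

    void stage2(const int in[N], int out[N]) {   // hardware function
        for (int i = 0; i < N; i++) out[i] = in[i] * 2;
    }

    void run(const int in[N], int out[N]) {
        int tmp[N];        // produced by stage1, consumed only by stage2,
        stage1(in, tmp);   // and never accessed again afterwards, so the
        stage2(tmp, out);  // compiler may stream it accelerator-to-accelerator
    }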

The SDSoC program execution model includes the following steps:
  1. Initialization of the sds_lib library occurs during the program constructor before entering main().
  2. Within a program, every call to a hardware function is intercepted by a function call into a stub function with the same function signature (other than name) as the original function. Within the stub function, the following steps occur:
    1. A synchronous accelerator task control command is sent to the hardware.
    2. For each argument to the hardware function, an asynchronous data transfer request is sent to the appropriate data mover, with an associated wait() handle. A non-void return value is treated as an implicit output scalar argument.
    3. A barrier wait() is issued for each transfer request. If a data transfer between accelerators is implemented as a direct hardware stream, the barrier wait() for this transfer occurs in the stub function for the last in the chain of accelerator functions for this argument.
  3. Cleanup of the sds_lib library occurs during the program destructor, upon exiting main().
TIP: Steps 2a–2c ensure that program correctness is preserved at the entrance and exit of accelerator pipelines while enabling concurrent execution within the pipelines.

Sometimes, the programmer has insight into the potential concurrent execution of accelerator tasks that cannot be automatically inferred by the system compiler. In this case, the sds++ system compiler supports a #pragma SDS async(ID) that can be inserted immediately preceding a call to a hardware function. This pragma instructs the compiler to generate a stub function without any barrier wait() calls for data transfers. As a result, after issuing all data transfer requests, control returns to the program, enabling concurrent execution of the program while the accelerator is running. In this case, it is your responsibility to insert a #pragma SDS wait(ID) within the program at appropriate synchronization points, which are resolved into sds_wait(ID) API calls to correctly synchronize hardware accelerators, their implicit data movers, and the CPU.

IMPORTANT: Every async(ID) pragma requires a matching wait(ID) pragma.
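A minimal sketch of this pattern, reusing the illustrative mmult function from the earlier example:

    void mmult(const float *A, const float *B, float *C);   // hardware function

    void overlap(const float *A, const float *B, float *C) {
    #pragma SDS async(1)
        mmult(A, B, C);    // the stub issues transfer requests and returns
                           // immediately, with no barrier wait() calls
        // ... independent CPU work can overlap with the accelerator here ...
    #pragma SDS wait(1)    // resolved into sds_wait(1); C is valid after this
    }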

SDSoC Build Process

The SDSoC build process uses a standard compilation and linking process. Similar to g++, the sds++ system compiler invokes sub-processes to accomplish compilation and linking.

As shown in the following figure, compilation is extended not only to object code that runs on the CPU; it also includes compilation and linking of hardware functions into IP blocks using the Vivado High-Level Synthesis (HLS) tool, and creating standard object files (.o) using the target CPU toolchain. System linking consists of program analysis of caller/callee relationships for all hardware functions, and the generation of an application-specific hardware/software network to implement every hardware function call. The sds++ system compiler invokes all necessary tools, including Vivado HLS (function compiler), the Vivado Design Suite to implement the generated hardware system, and the Arm compiler and sds++ linker to create the application binaries that run on the CPU and invoke the accelerator stubs for each hardware function. The output is a complete bootable system for an SD card; an illustrative command sequence follows the task lists below.

Figure: SDSoC Build Process

The compilation process includes the following tasks:

  1. Analyzing the code and running a compilation for the main application on the Arm core, as well as a separate compilation for each of the hardware accelerators.
  2. Compiling the application code through standard GNU Arm compilation tools with an object (.o) file produced as final output.
  3. Running the hardware accelerated functions through the HLS tool to start the process of custom hardware creation with an object (.o) file as output.

After compilation, the linking process includes the following tasks:

  1. Analyzing the data movement through the design and modifying the hardware platform to accept the accelerators.
  2. Implementing the hardware accelerators into the programmable logic (PL) region using the Vivado Design Suite to run synthesis and implementation, and generate the bitstream for the device.
  3. Updating the software images with hardware access APIs to call the hardware functions from the embedded processor application.
  4. Producing an integrated SD card image that can boot the board with the application in an Executable and Linkable Format (ELF) file.
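The following command sequence sketches this flow (the platform, file, and function names are examples, and the exact options depend on your project):

    # Compile the hardware function; the -sds-hw option hands it to Vivado HLS
    sds++ -sds-pf zcu102 -sds-hw mmult mmult.cpp -sds-end -c mmult.cpp -o mmult.o
    # Compile the application code with the GNU Arm toolchain
    sds++ -sds-pf zcu102 -c main.cpp -o main.o
    # Link: generates the data motion network, runs Vivado synthesis and
    # implementation, and produces a bootable SD card image
    sds++ -sds-pf zcu102 mmult.o main.o -o app.elf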

SDSoC Programming Flow Overview

Embedded system development follows the typical steps of developing the code, compiling and linking it for the platform/device, profiling the system for performance, and measuring the actual performance.

The SDSoC environment follows this standard software-centric flow, but also supports a more hardware-centric flow for defining hardware functions first, and then integrating them into the embedded application. What is unique about these two flows is that they use a heterogeneous programming model: writing code for the CPU side of the system differs from writing code for the programmable logic.

The software-centric approach focuses on the embedded processor application, and the acceleration of specific software functions into hardware functions running in the programmable logic (PL) region of the Xilinx device. This requires converting the C or C++ code of the software function into a Hardware Description Language (HDL) implementation that can be compiled for the programmable logic using Vivado HLS.

A typical hardware function to accelerate is processor-intensive (for example, complex computations that take a long time) and processes large amounts of data. This code should be written so that data transfers are limited to streaming data into and out of the accelerator, and should leverage instruction-level and task-level parallelism to take advantage of the massively parallel architecture of the programmable logic region of the device. The goal is to use parallelism to achieve the desired performance for accelerated functions: the accelerator should consume input data, process it, and output data as quickly as possible.
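For example, the following sketch (names and sizes are illustrative) declares a sequential access pattern on both arrays so the system compiler can stream data into and out of the accelerator, and pipelines the loop to process one element per clock cycle:

    #define N 65536

    // SEQUENTIAL access patterns allow streaming data movers instead of
    // random-access memory interfaces.
    #pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
    void filter(const int in[N], int out[N]) {
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1
            out[i] = in[i] >> 1;   // simple elementwise operation
        }
    }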

After the processor and accelerator code is written, it can be compiled for emulation, or compiled and linked for the hardware platform. For emulation, the code compiles faster, allowing quick design iterations, and it can be used to estimate performance as well as check data integrity; however, it runs slower than on actual hardware. Emulation is very accurate with respect to what executes on the hardware, because the same CPU code runs both on the quick emulator (QEMU) and on the target device.

When built for the hardware platform, the application runs exactly as written, for both the processor and the hardware accelerators. The benefits of running on hardware are measuring actual runtime, as well as being able to adjust later builds for in-circuit debugging or performance analysis.

The hardware-centric approach is used by designers experienced with developing on an FPGA. This approach lets you control what functionality will be in the accelerator and how data and commands will be transported between the logic and the CPU. This flow uses the Vivado Design Suite to create customized IP containing AXI interfaces that are used to communicate between the programmable logic (PL) region and the processing system (PS). This IP can then be packaged with the sdx_pack command, which maps the IP's AXI interfaces to a header file and creates a static library. Using the resulting include file and static library is then as simple as calling a typical library function. The key is ensuring that the data width in the header file matches what is expected by the IP. See the SDSoC Environment User Guide for more information on creating and using C-Callable IP.
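Calling the packaged IP then looks like an ordinary library call, as in the following sketch (the header name and my_ip_func entry point are hypothetical, not from this guide):

    #include "my_ip.h"   // hypothetical header generated for the C-Callable IP

    #define N 256

    int main() {
        int src[N], dst[N];
        for (int i = 0; i < N; i++) src[i] = i;
        my_ip_func(src, dst, N);   // hypothetical entry point; the argument
                                   // widths must match the IP's AXI interfaces
        return 0;
    }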