Introduction to Debugging in SDSoC

The SDSoC™ environment includes an Eclipse-based integrated development environment (IDE) for implementing heterogeneous embedded systems. SDSoC supports Arm® Cortex™-based applications using the Zynq®-7000 SoC and Zynq® UltraScale+™ MPSoC devices, as well as MicroBlaze™ processor-based applications on all Xilinx® SoCs and FPGAs.

This user guide introduces the debugging capabilities of the SDSoC environment, and provides detailed instructions on how to analyze any failure encountered within the SDSoC flow.

Note: This user guide does not cover performance issues. If no tool problems are encountered and the behavior of the design is functionally correct, refer to the SDSoC Profiling and Optimization Guide to determine whether the performance of the design can be further improved.

SDSoC Environment Overview

The SDSoC environment includes a system compiler that transforms C/C++ programs into complete hardware/software systems with select functions compiled into the programmable logic (PL). The SDSoC system compiler analyzes a program to determine the data flow between software and hardware functions, and generates an application-specific system-on-chip (SoC) to realize the program.

To achieve high performance, each hardware function runs as an independent thread; the system compiler generates hardware and software components that ensure synchronization between hardware and software threads, while enabling pipelined computation and communication. Application code can involve many hardware functions, multiple instances of a specific hardware function, and calls to a hardware function from different parts of the program.

The SDx integrated development environment (IDE) supports software development workflows including profiling, compilation, linking, system performance analysis, and debugging. It also provides a fast performance estimation capability to enable exploration of the hardware/software interface before committing to a full hardware compile.

The SDSoC system compiler targets a base platform and invokes the Vivado® High-Level Synthesis (HLS) tool to compile synthesizable C/C++ functions into programmable logic. The system compiler then generates a complete hardware system, including DMAs, interconnects, hardware buffers, other IP, and the FPGA bitstream, by invoking the Vivado Design Suite tools. To ensure that all hardware function calls preserve their original behavior, the SDSoC system compiler generates system-specific software stubs and configuration data. The program includes the function calls to drivers required to use the generated IP blocks. Application and generated software is compiled and linked using a standard GNU toolchain.

By generating complete applications from a single source, the system compiler lets you iterate over design and architecture changes by refactoring at the program level, which reduces the time needed to achieve working programs running on the target platform.

Terminology

The following terms are widely used when designing in the SDSoC environment; their definitions are provided below.

Accelerator
Portions of the application code that have been implemented in hardware in the programmable logic (PL) fabric of the FPGA. These are also called hardware functions.
Data Mover
The data mover transfers data between accelerators, and between the processing system (PS) and accelerators. The SDSoC environment can generate various types of data movers based on the properties and size of the data being transferred.
Pipelining
Pipelining is a technique to increase instruction-level parallelism in the hardware implementation of an algorithm by overlapping independent stages of operations or functions. The data dependence in the original software implementation is preserved for functional equivalence, but the required circuit is divided into a chain of independent stages. All stages in the chain run in parallel on the same clock cycle. The only difference is the source of data for each stage. Each stage in the computation receives its data values from the result computed by the preceding stage during the previous clock cycle.
Pragma
Special directives that can be inserted into the source code to guide the system compiler. In the SDSoC environment, you control the system generation process by structuring hardware functions and calls to hardware functions in a way that balances communication and computation, and by inserting pragmas into your source code to guide the system compiler (see the sketch following this list).
Processor
In the context of the SDSoC environment, a processor is either a soft processor such as the MicroBlaze processor, or a hard processor such as the Arm processors on Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs.
System Port
A system port connects a data mover to the PS. It can be an ACP, AFI (corresponding to high-performance ports), MIG (corresponding to a PL-based DDR memory controller), or a stream port on the Zynq.
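
A minimal sketch tying these terms together is shown below. The function name, array size, and pragma choices are illustrative assumptions, not requirements: the SDS data pragma guides the selection of the data mover for the accelerator's arguments, and the HLS pipeline pragma requests a pipelined hardware implementation of the loop.

    #define N 32

    // Hypothetical hardware function (accelerator). The pragma guides data mover
    // selection: sequential access lets the compiler use a streaming data mover.
    #pragma SDS data access_pattern(A:SEQUENTIAL, B:SEQUENTIAL, C:SEQUENTIAL)
    void madd(const int A[N*N], const int B[N*N], int C[N*N])
    {
        for (int i = 0; i < N * N; i++) {
            // Pipelining: overlap loop iterations in the generated circuit.
            #pragma HLS PIPELINE II=1
            C[i] = A[i] + B[i];
        }
    }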

Elements of SDSoC

The SDSoC environment includes the following features:

  • The sds++ system compiler, which generates complete hardware/software systems. The sds++ system compiler employs underlying features from the Vivado Design Suite System Edition, including the Vivado High-Level Synthesis (HLS) tool, Vivado IP integrator, IP libraries for data movement and interconnect, and tools for RTL synthesis, placement, routing, and bitstream generation.
  • An Eclipse-based integrated development environment (IDE) to create and manage application projects and workflows.
  • A system performance estimation capability to explore different scenarios for the hardware/software interface.

The SDSoC environment also inherits many of the tools in the Xilinx Software Development Kit (SDK), including GNU toolchains for Zynq-7000 SoCs and Zynq UltraScale+ MPSoCs, standard libraries (for example, glibc), and the Target Communication Framework (TCF) for communicating with embedded processor targets. It also features a performance analysis perspective within the Eclipse/CDT-based IDE.

The sds++ system compiler generates an application-specific system-on-chip for a targeted platform. The environment includes a number of standard base platforms for application development, and other platforms can be developed by third-party partners, or by SDSoC design teams. The SDSoC Environment Platform Development Guide describes how to create a hardware platform design in the Vivado Design Suite, configure platform interfaces, and define the corresponding software runtime environment to build a platform for use in the SDx™ IDE.

The SDx™ IDE lets you customize a target platform with application-specific hardware accelerators, and data motion networks connecting accelerators to the platform. A simplified Zynq and DDR configuration with memory access ports and hardware accelerators is shown below.

Figure: Simplified Zynq + DDR Diagram Showing Memory Access Ports and Memories



Execution Model of an SDSoC Application

The execution model for an SDSoC environment application can be understood in terms of the normal execution of a C++ program running on the target CPU after the platform has booted. It is useful to understand how a C++ binary executable interfaces to the hardware.

The set of declared hardware functions within a program is compiled into hardware accelerators that are accessed with the standard C runtime through calls into these functions. Each hardware function call in effect invokes the accelerator as a task, and each of the arguments to the function is transferred between the CPU and the accelerator, accessible by the program after accelerator task completion. Data transfers between memory and accelerators are accomplished through data movers, such as a DMA engine, automatically inserted into the system by the sds++ system compiler, taking into account user data mover pragmas such as zero_copy.

Figure: Architecture of an SDSoC System

To ensure program correctness, the system compiler intercepts each call to a hardware function, and replaces it with a call to a generated stub function that has an identical signature but with a derived name. The stub function orchestrates all data movement and accelerator operation, synchronizing software and accelerator hardware at the exit of the hardware function call. Within the stub, all accelerator and data mover control is realized through a set of send and receive APIs provided by the sds_lib library.
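
The following sketch shows what a hardware function call looks like from the application's point of view; the madd function and buffer size are carried over from the hypothetical example above. The call site is ordinary C/C++, and the generated stub (not shown) performs the data movement and synchronization. Allocating the buffers with sds_alloc from sds_lib gives the data movers physically contiguous memory.

    #include "sds_lib.h"   // sds_alloc/sds_free: physically contiguous buffers

    #define N 32
    void madd(const int A[N*N], const int B[N*N], int C[N*N]); // hardware function

    int main()
    {
        int *A = (int *)sds_alloc(N * N * sizeof(int));
        int *B = (int *)sds_alloc(N * N * sizeof(int));
        int *C = (int *)sds_alloc(N * N * sizeof(int));
        // ... initialize A and B ...

        // Looks like a normal call; sds++ replaces it with a generated stub
        // that programs the data movers, starts the accelerator, and waits
        // for completion before returning.
        madd(A, B, C);

        // C is safe to read here because the stub synchronized on exit.
        sds_free(A); sds_free(B); sds_free(C);
        return 0;
    }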

When program dataflow between hardware function calls involves array arguments that are not accessed after the function calls have been invoked within the program (other than destructors or free() calls), and when the hardware accelerators can be connected using streams, the system compiler transfers data from one hardware accelerator to the next through direct hardware stream connections, rather than implementing a round trip to and from memory. This optimization can result in significant performance gains and reduction in hardware resources.
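
As a hedged illustration of this optimization (the stage1/stage2 function names and sizes are hypothetical), consider two hardware functions chained through an intermediate array that the CPU never reads back. When both accelerators support streaming interfaces, the compiler can connect them directly rather than buffering tmp in memory.

    #include "sds_lib.h"
    #define N 32

    void stage1(const int in[N*N], int tmp[N*N]);  // hardware function
    void stage2(const int tmp[N*N], int out[N*N]); // hardware function

    void run_pipeline(const int *in, int *out)
    {
        int *tmp = (int *)sds_alloc(N * N * sizeof(int));
        stage1(in, tmp);
        stage2(tmp, out);  // tmp is consumed only by stage2 ...
        sds_free(tmp);     // ... and then freed, never read by the CPU,
                           // so a direct stream connection is possible
    }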

The SDSoC program execution model includes the following steps:
  1. Initialization of the sds_lib library occurs during the program constructor before entering main().
  2. Within a program, every call to a hardware function is intercepted by a function call into a stub function with the same function signature (other than name) as the original function. Within the stub function, the following steps occur:
    1. A synchronous accelerator task control command is sent to the hardware.
    2. For each argument to the hardware function, an asynchronous data transfer request is sent to the appropriate data mover, with an associated wait() handle. A non-void return value is treated as an implicit output scalar argument.
    3. A barrier wait() is issued for each transfer request. If a data transfer between accelerators is implemented as a direct hardware stream, the barrier wait() for this transfer occurs in the stub function for the last accelerator function in the chain for this argument (see the conceptual sketch following this list).
  3. Cleanup of the sds_lib library occurs during the program destructor, upon exiting main().
TIP: Steps 2a–2c ensure that program correctness is preserved at the entrance and exit of accelerator pipelines while enabling concurrent execution within the pipelines.
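
The following compilable sketch models steps 2a through 2c. The helper functions and the handle type are hypothetical stand-ins for the accelerator and data mover control calls issued by a generated stub; they are not the real sds_lib API, and the printed messages merely trace the sequence of operations.

    #include <stdio.h>

    typedef struct { int id; } xfer_handle_t;   // hypothetical transfer handle

    static void send_accel_cmd(const char *accel) {                         // step 2a
        printf("control: start accelerator task %s\n", accel);
    }
    static void async_transfer(const char *arg, int id, xfer_handle_t *h) { // step 2b
        h->id = id;
        printf("data mover: asynchronous transfer of %s (handle %d)\n", arg, id);
    }
    static void barrier_wait(xfer_handle_t *h) {                            // step 2c
        printf("wait: transfer %d complete\n", h->id);
    }

    // What a generated stub for madd(A, B, C) conceptually does.
    void madd_stub(void)
    {
        xfer_handle_t hA, hB, hC;
        send_accel_cmd("madd");            // 2a: synchronous task control command
        async_transfer("A", 0, &hA);       // 2b: one asynchronous request per argument
        async_transfer("B", 1, &hB);
        async_transfer("C", 2, &hC);
        barrier_wait(&hA);                 // 2c: barrier wait for each request
        barrier_wait(&hB);
        barrier_wait(&hC);
    }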

Sometimes, the programmer has insight into potential concurrent execution of accelerator tasks that cannot be automatically inferred by the system compiler. In this case, the sds++ system compiler supports a #pragma SDS async(ID) that can be inserted immediately preceding a call to a hardware function. This pragma instructs the compiler to generate a stub function without any barrier wait() calls for data transfers. As a result, after issuing all data transfer requests, control returns to the program, enabling concurrent execution of the program while the accelerator is running. In this case, it is your responsibility to insert a #pragma SDS wait(ID) within the program at appropriate synchronization points, which are resolved into sds_wait(ID) API calls to correctly synchronize hardware accelerators, their implicit data movers, and the CPU.

IMPORTANT: Every async(ID) pragma requires a matching wait(ID) pragma.
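
A minimal sketch of this pattern is shown below, reusing the hypothetical madd hardware function from the earlier examples; the buffers are assumed to have been allocated with sds_alloc.

    #define N 32
    void madd(const int A[N*N], const int B[N*N], int C[N*N]); // hardware function

    void two_tasks(int *A0, int *B0, int *C0, int *A1, int *B1, int *C1)
    {
        // Queue two accelerator tasks; control returns to the CPU as soon as
        // the transfer requests are issued, so both tasks can overlap.
        #pragma SDS async(1)
        madd(A0, B0, C0);
        #pragma SDS async(2)
        madd(A1, B1, C1);

        // ... other CPU work can run here while the accelerators execute ...

        // Every async(ID) must be matched by a wait(ID) before results are used.
        #pragma SDS wait(1)
        #pragma SDS wait(2)
    }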

SDSoC Build Process

The SDSoC build process uses a standard compilation and linking process. Similar to g++, the sds++ system compiler invokes sub-processes to accomplish compilation and linking.

As shown in the following figure, compilation is extended not only to object code that runs on the CPU, but also includes compilation and linking of hardware functions into IP blocks using the Vivado High-Level Synthesis (HLS) tool, and creating standard object files (.o) using the target CPU toolchain. System linking consists of program analysis of caller/callee relationships for all hardware functions, and the generation of an application-specific hardware/software network to implement every hardware function call. The sds++ system compiler invokes all necessary tools, including Vivado HLS (function compiler), the Vivado Design Suite to implement the generated hardware system, and the Arm compiler and sds++ linker to create the application binaries that run on the CPU and invoke the accelerators through the generated stubs for each hardware function. The output is a complete bootable system for an SD card.

Figure: SDSoC Build Process

The compilation process includes the following tasks:

  1. Analyzing the code and running a compilation for the main application on the Arm core, as well as a separate compilation for each of the hardware accelerators.
  2. Compiling the application code through standard GNU Arm compilation tools with an object (.o) file produced as final output.
  3. Running the hardware-accelerated functions through the HLS tool to start the process of custom hardware creation with an object (.o) file as output.

After compilation, the linking process includes the following tasks:

  1. Analyzing the data movement through the design and modifying the hardware platform to accept the accelerators.
  2. Implementing the hardware accelerators into the programmable logic (PL) region using the Vivado Design Suite to run synthesis and implementation, and generate the bitstream for the device.
  3. Updating the software images with hardware access APIs to call the hardware functions from the embedded processor application.
  4. Producing an integrated SD card image that can boot the board with the application in an Executable and Linkable Format (ELF) file.

SDSoC Debug Flow Overview

The systems produced by the SDSoC environment are high-performance, complex, and composed of hardware and software components. It can be difficult to understand the execution of applications in such systems, with portions of software running in a processor, hardware accelerators executing in the programmable fabric, and many simultaneous data transfers between them. The SDSoC environment lets you create and debug projects using the Xilinx System Debugger (XSDB), and provides sophisticated hardware/software event tracing, offering an integrated timeline view of data transfers and accelerator tasks, including driver software setup and execution in hardware. Outside the SDx IDE, you can use command-line or scripting options to debug your projects.

The SDSoC development environment lets you direct the build process (compilation and linking commands) to either a system emulation target, or to the hardware target of the specified platform. As an alternative to building a complete system, you can create a system emulation model that consists of the target platform and application binaries. For the emulation target, the sds++ system compiler creates a simulation model using the source files for the accelerator functions.

System emulation is one of the most capable debug features in the SDSoC environment. It can help debug functional issues and determine why an application is hanging. This feature is only available on Xilinx base platforms, including the ZC702, ZC706, ZCU102, ZCU104, ZCU106, and ZedBoard base platforms.

After you identify the hardware functions, you can use system emulation to quickly compile the logic, and verify the entire system. This provides a Quick Emulator (QEMU)-based emulator that runs the cross-compiled Arm code, interacting with the hardware accelerator being run in the Vivado simulator. The RTL simulator can display waveforms, or it can be run without waveforms for faster simulation. The emulator can be run within the SDx IDE or on the command line (sdsoc_emulator), providing accurate visibility of the final hardware implementation without the need to compile the system into a bitstream and program the device on the board.

Figure: System Emulation Flow

When targeting the hardware platform, you can also enable hardware and software event tracing to analyze the execution of events, and identify any issues (see Hardware/Software Event Tracing). If there are problems with respect to the hardware design itself, you can use hardware debug from the Vivado Lab Edition tools by inserting debug cores in the hardware functions implemented in the SDSoC environment. The following flow chart shows a typical hardware build and debug process.

Figure: Hardware Build and Debug Flow

Xilinx base platforms support both system emulation and hardware target builds. Custom and third-party platforms, without emulation capabilities, support only the hardware build and debug flow.

System Emulation

On Xilinx base platforms, you can use system emulation to debug register transfer level (RTL) transactions in the entire system (PS and PL). Running your application on the SDSoC emulator (sdsoc_emulator) gives you visibility of data transfers with a debugger. You can debug system hangs and inspect associated data transfers in the simulation waveform view, which gives you visibility into signals on the hardware blocks associated with the data transfer.

Hardware Execution Flow

During hardware execution, you can use the actual hardware platform to run the accelerated application. You can create a debug configuration of the hardware that includes special debug logic in the accelerators, such as the System Integrated Logic Analyzer (System ILA), Virtual Input/Output (VIO) debug cores, and AXI performance monitors. The SDSoC environment provides specific hardware debug capabilities using the Vivado hardware manager, with waveform analysis, kernel activity reports, and memory access analysis to provide visibility into these critical hardware issues.

In-system debugging lets you debug your design in real time, on your target hardware, and is an essential step in design completion. Invariably, there are situations that are extremely hard to replicate in a simulator, and the problem must be debugged in the running hardware. In this step, you place debug cores into your design to give you the ability to observe and control the design. After the debugging process is complete, you can remove the debug cores to increase performance and reduce resource usage of the device.

The SDx IDE and command-line options provide ways to instrument your design for debugging. The --dk compiler switch lets you add ILA debug cores to the interfaces of your hardware function. To debug C-callable IP used in your application code, you must instantiate the required debug cores in the RTL code of the IP before packaging it as C-callable IP.

IMPORTANT: Debugging the hardware function on the SDSoC platform hardware requires additional logic to be incorporated into the overall hardware model. This means that if hardware debugging is enabled, there is some impact on resource utilization of the Xilinx device, as well as some impact on the performance of the hardware function.

Connecting to the Hardware

The board connection requirements are slightly different depending on the operating system: standalone, FreeRTOS, or Linux.
  • For standalone and FreeRTOS, you must download the ELF file to the board using the USB/JTAG interface. Trace data is read out over the same USB/JTAG interface.
  • For Linux, the SDx environment assumes the OS boots from the SD card. It then copies the .elf file and runs it using the TCP/TCF agent running in Linux over the Ethernet connection between the board and the host PC. The trace data is read out over the USB/JTAG interface. Both the USB/JTAG and TCP/TCF agent interfaces are needed for tracing Linux applications.
The figure below shows the connections required.

Figure: Connections Required When Using Trace with Different Operating Systems



Event Tracing

The event tracing feature provides a detailed view of what is happening in the system during the execution of an application. Trace events are produced and gathered into a timeline view, giving you a perspective of the running application. This detailed view can help you understand the performance of your application given the workload, hardware/software partitioning, and system design choices. This view enables event tracing of software running on the processor, as well as hardware accelerators and data transfer links in the system. Such information helps you to identify problems, optimize the design, and improve system implementation.

Tracing an application produces a log that records information about system execution. Compared to event logging, event tracing shows the correlation between events over their duration, rather than as instantaneous events at particular points in time. The goal of tracing is to help debug execution by observing what happened when, and how long events took. It is best used to analyze performance and to get an indication of whether there is an application hang.