Debug Techniques

This chapter describes the different styles of debugging techniques applicable to SDSoC™ applications. It highlights different approaches for software-based debugging and hardware-oriented techniques. In the software-based approaches, a full understanding of the implementation of the design in the FPGA is not required. However, this concept can only be extended to a certain degree, at which point a hardware-based detailed analysis should be performed. Highlighting pure software debugging techniques is not the intent of this document.

When debugging SDSoC applications, you can use the same methods and techniques used for debugging standard C/C++ applications. Most SDSoC applications consist of specific functions tagged for hardware acceleration and surrounded by standard C/C++ code.

When debugging an SDSoC application with a board attached to the debug host machine, you can right-click on a build configuration in the Assistant view, and select the Debug > Launch on Hardware option to begin a debug session.

You can select options other than the default settings by using the Debug > Debug Configurations command to create a new custom debug configuration. As the debug environment is initialized, Xilinx recommends that you switch to the Debug perspective when prompted. The debug perspective view provides the ability to debug the standard C/C++ portions of the application by single-stepping code, setting and removing breakpoints, displaying variables, dumping registers, viewing memory, and controlling the code flow with “run until” and “jump to” debugging directives. Inputs and outputs can be observed before and after the function call to determine the correct behavior.

You can determine if a hardware accelerated application meets its real-time requirements by placing debug statements to start and stop a counter just before and just after a hardware accelerated function. The SDx™ environment provides the sds_clock_counter() function, which is typically used to calculate the elapsed time for a hardware accelerated function.
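The start/stop counter pattern can be sketched as follows. To keep the sketch compilable off-target, `read_counter()` is a hypothetical stand-in built on `std::chrono`; in an SDSoC build you would presumably call sds_clock_counter() in its place. `accel_stub()` and `timed_call()` are also illustrative names, not part of any Xilinx API.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for sds_clock_counter(): returns a monotonically
// increasing tick count (nanoseconds here). Swap in the real counter on target.
inline std::uint64_t read_counter() {
    using clock = std::chrono::steady_clock;
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            clock::now().time_since_epoch()).count());
}

// Placeholder for a function marked for hardware acceleration.
inline void accel_stub(const int* in, int* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * 2;
    }
}

// Returns the number of counter ticks spent inside the call.
inline std::uint64_t timed_call(const int* in, int* out, int n) {
    assert(in != nullptr && out != nullptr && n >= 0);
    const std::uint64_t start = read_counter();  // just before the call
    accel_stub(in, out, n);
    const std::uint64_t stop = read_counter();   // just after the call
    return stop - start;
}
```

Comparing the returned tick count against the real-time budget for the function gives a first indication of whether the accelerated call is fast enough.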

You can also perform debugging without a target board connected to the debug host by building the SDx project for emulation. During emulation, you can control and observe the software and data just as before through the debug perspective view, but you can also view the hardware accelerated functions through a Vivado® simulator waveform viewer. You can observe accelerator signaling for conditions such as accelerator start and accelerator done, and you can monitor data buses for inputs and outputs. Building a project for emulation also avoids a possibly long Vivado implementation step to generate an FPGA bitstream.

See the SDSoC Environment Debugging Guide for information on using the interactive debuggers in the SDx IDE.

Debugging System Hangs and Runtime Errors

Programs compiled using sds++ can be debugged using the standard debuggers supplied with the SDx environment or Vivado. Typical runtime errors are incorrect results, premature program exits, and program hangs. The first two kinds of error are familiar to C/C++ programmers, and can be debugged by stepping through the code using a debugger.

Note: Applications might hang when you are running on the board. Hangs commonly happen due to a mismatch of data size between the producer and the consumer.

A program hang is a runtime error caused by specifying an incorrect amount of data to be transferred across a streaming connection created using #pragma SDS data access_pattern(A:SEQUENTIAL), by specifying a streaming interface in a synthesizable function within the Vivado High-Level Synthesis (HLS) tool, or by a C-callable hardware function in a pre-built library that has streaming hardware interfaces. A program hangs when the consumer of a stream is waiting for more data from the producer, but the producer has stopped sending data. Consider the following code fragment that results in streaming input/output from a hardware function:

#pragma SDS data access_pattern(in_a:SEQUENTIAL, out_b:SEQUENTIAL)
void f1(int in_a[20], int out_b[20]); // declaration
void f1(int in_a[20], int out_b[20]) { // definition
  int i;
  for (i=0; i < 19; i++) { // reads only 19 of the 20 elements
    out_b[i] = in_a[i];
  }
}

The in_a[] array has 20 elements, but the loop reads only 19 of them. Anything calling f1 would appear to hang, waiting indefinitely for f1 to consume the final element. Program errors that lead to hangs can be detected by using system emulation to ascertain whether the data signals are static by reviewing the associated protocol signals such as TLAST, ap_ready, ap_done, and TREADY. Program errors causing hangs can also be detected by instrumenting the code to flag streaming access errors such as non-sequential access or incorrect access counts within a function and running in software. Streaming access issues are typically flagged as improper streaming access warnings in the log file, and you can determine if these are actual errors. Running your application on the SDSoC emulator is a good way to gain visibility of data transfers with a debugger. You can see where in software the system is hanging (often within a cf_wait() call), and can then inspect associated data transfers in the simulation waveform view, which gives you access to signals on the hardware blocks associated with the data transfer.
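The instrumentation approach can be sketched in plain C++ as follows. `StreamProbe` and `f1_body` are hypothetical names introduced for illustration only. The probe wraps an array, counts accesses, and records whether they arrive in increasing index order, so the 19-of-20 mistake in f1 shows up in a software run rather than as a hang on hardware.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical software-only wrapper that tracks how a function walks an
// array that a SEQUENTIAL access pattern would turn into a stream.
class StreamProbe {
public:
    StreamProbe(int* data, std::size_t expected)
        : data_(data), expected_(expected) {}

    // A well-formed stream touches each element once, in increasing order.
    int& operator[](std::size_t i) {
        assert(i < expected_);
        if (i != next_) in_order_ = false;
        next_ = i + 1;
        ++count_;
        return data_[i];
    }

    bool sequential() const { return in_order_; }          // no out-of-order access
    bool complete() const { return count_ == expected_; }  // access count matches
    std::size_t count() const { return count_; }

private:
    int* data_;
    std::size_t expected_;
    std::size_t next_ = 0;
    std::size_t count_ = 0;
    bool in_order_ = true;
};

// Same loop body as f1, templated so it can run on probes in software.
template <typename In, typename Out>
void f1_body(In& in_a, Out& out_b, int iterations) {
    for (int i = 0; i < iterations; i++) {
        out_b[i] = in_a[i];
    }
}
```

Running `f1_body` with 19 iterations over 20-element probes leaves `complete()` false on both arguments, which is the software-visible symptom of the mismatch.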

System Hang Debugging Example

As another example, consider the following code that results in streaming input/output from the hardware function:

#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:large], out[0:small])
void too_large_copy(int* in, int* out, int small, int large)
{
  for(int i = 0; i < small; i++) {
    out[i] = in[i];
  }
}
int main()
{
  int* temp_var1 = new int[1024 * 1024];
  int* temp_var2 = new int[1024 * 1024];
  // Hangs because the input DMA continues to try to feed data to a halted HLS core.
  too_large_copy(temp_var1, temp_var2, 1024, 1024 * 1024);
}

In this case, the direct memory access (DMA) continues to try to send data to the hardware function, whereas the hardware function is already done and is not accepting any data. This results in a system hang.

To debug this type of issue, build the code for emulation on the base platform. When the application is compiled, start the emulator by selecting Xilinx > Start/Stop Emulator. Alternatively, you can start the emulator from the Assistant window: right-click the active build configuration for the application and select Start/Stop Emulator.
In the Emulation dialog box, ensure that the Show Waveform (Programmable Logic only) check box is checked. This brings up the Vivado Simulator where the state of different interfaces can be viewed in the Waveform window. To monitor the interfaces of the hardware function, right-click on the function and select Add to Wave window. This adds all the I/O ports of the selected function to the Waveform window.
Start the simulator by clicking the Run All icon in the toolbar.
Go back to the SDx IDE, and then launch the application on the debugger. To do this, select the application to be debugged, right-click, and then select Launch on Emulator (SDx Application Debugger).

In the Confirm Perspective Switch dialog box, click Yes. The Debug Perspective opens with the application running on the hardware. The code execution stops at the main program entry.
Click the Resume button on the toolbar to execute the application.

The application is now stuck: a system hang has been encountered.
To determine the cause of the system hang, go back to Vivado Design Suite. Look at the state of the ap_done, ap_start, ap_idle, and ap_ready signals for the function. The state of these signals indicates that a transaction started when the ap_start signal went High and ended when the ap_done signal went High. The ap_ready and ap_idle signals likewise indicate the state of the function.

Analyzing the state of the DMA at the same point in time, you can see that while the hardware function has finished accepting data, the DMA is still trying to write to it, as indicated by the M00_AXIS_tready and M00_AXIS_tvalid signals: the DMA keeps asserting valid data that the function no longer accepts.
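The condition visible in the waveform can be expressed as a small check over sampled handshake values. This is only an illustrative sketch: `AxisSample`, `trailing_stall_cycles`, and `looks_hung` are hypothetical names, and the samples would have to be transcribed or exported from the simulator by whatever means you have available.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One sampled clock cycle of an AXI4-Stream handshake.
struct AxisSample {
    bool tvalid;  // producer (here, the DMA) is offering data
    bool tready;  // consumer (here, the hardware function) can accept data
};

// Counts the trailing cycles in which the producer offered data that the
// consumer did not accept. A long trailing run suggests a stalled transfer.
inline std::size_t trailing_stall_cycles(const std::vector<AxisSample>& trace) {
    std::size_t stalled = 0;
    for (auto it = trace.rbegin(); it != trace.rend(); ++it) {
        if (it->tvalid && !it->tready) {
            ++stalled;
        } else {
            break;
        }
    }
    return stalled;
}

// Hypothetical threshold check; choose a limit that suits the design.
inline bool looks_hung(const std::vector<AxisSample>& trace, std::size_t limit) {
    assert(limit > 0);
    return trailing_stall_cycles(trace) >= limit;
}
```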

Now that you know the cause of the system hang, you can go back to the hardware function code and fix any outstanding issues.
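One possible fix for too_large_copy, assuming the function only ever needs the first `small` elements, is to size both transfers to what the loop actually consumes. The sketch drops the `large` parameter. `sized_copy` and `run_sized_copy` are hypothetical names. A standard compiler ignores the SDS pragmas (possibly with a warning), so the same code also runs in software.

```cpp
#include <cassert>
#include <vector>

// Both transfers are sized to the number of elements the loop touches, so
// the input DMA has nothing left to send once the function finishes.
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:small], out[0:small])
void sized_copy(int* in, int* out, int small) {
    for (int i = 0; i < small; i++) {
        out[i] = in[i];
    }
}

// Software-side caller: allocates only what is transferred and verifies it.
inline bool run_sized_copy(int small) {
    assert(small > 0);
    std::vector<int> src(small), dst(small, -1);
    for (int i = 0; i < small; i++) {
        src[i] = i;
    }
    sized_copy(src.data(), dst.data(), small);
    return src == dst;
}
```

Whether shrinking the copy size or enlarging the loop is the right fix depends on what the function is meant to compute.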

Causes of System Hangs

There are other situations where a system hang can occur as listed below:

If you can Ctrl+C out of the application, there was probably not enough data from the accelerator. The Arm® processor is expecting more data than the accelerator is sending. Review latencies if there is more than one path from a producer to a consumer. Designs with multiple paths of unequal latency between two accelerators (for example, A -> B ... -> Z, while there is also A -> Z direct) need to be fixed at the design level by equalizing the branches.
If Ctrl+C does not work, but you can ping or ssh into the board, there is not enough data in a Scatter Gather DMA (SGDMA) operation. Review the data movers (copy or zero-copy) and the access pattern.
If you cannot ping the board and it has hard locked, only coming back to life after a power cycle, common causes are interactions between the following:
1. The SDSoC environment design and IP on the platform. Debug with the ChipScope™ feature and peeking and poking of registers; see Hardware Debugging in SDSoC Using ChipScope and Peeking and Poking IP Registers.
2. The SDSoC environment design and C-callable IP libraries. Debug with the ChipScope feature and peeking and poking of registers; see Hardware Debugging in SDSoC Using ChipScope and Peeking and Poking IP Registers.
3. The RTL or the SW driver generated in the SDSoC flow. If you have enough Vivado Design Suite or C driver experience, you might be able to debug this; otherwise, contact the Xilinx forums.

Causes of Runtime Errors

The following list shows other sources of runtime errors:

Improper placement of wait() statements could result in the following issues:
- The software might read invalid data before a hardware accelerator has written the correct value.
- A blocking wait() might be called before a related accelerator is started, resulting in a system hang.
Inconsistent use of the memory consistency SDS data mem_attribute pragma can result in incorrect results.
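The two wait() ordering problems above can be illustrated with a small software model. `AccelJob` is a hypothetical class, not part of the SDSoC API. It models only the ordering rules: a wait with nothing started reports failure where real hardware would hang, and a read before a completed wait is rejected as invalid.

```cpp
#include <cassert>

// Hypothetical software model of an asynchronous accelerator call, used only
// to illustrate call ordering.
class AccelJob {
public:
    enum class State { Idle, Started, Done };

    // Models launching the accelerator without blocking.
    void start(int input) {
        pending_ = input * 2;
        state_ = State::Started;
    }

    // Models the blocking wait. Returns false where real hardware would hang
    // because no accelerator was started.
    bool wait() {
        if (state_ != State::Started) return false;
        result_ = pending_;
        state_ = State::Done;
        return true;
    }

    // The result is valid only after a successful wait().
    bool read(int& out) const {
        if (state_ != State::Done) return false;
        out = result_;
        return true;
    }

private:
    State state_ = State::Idle;
    int pending_ = 0;
    int result_ = 0;
};
```

The safe order is start, then wait, then read. Any other order trips one of the two checks.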

Unexpected Data Values

When the application is running, it is possible to get unexpected data. The hardware function might not be returning the expected data, or it might be returning expected data at the wrong time. This can be caused by hardware and/or software issues. If hardware is the suspected root cause, check data inputs to your board using the ChipScope feature if needed. If software is the suspected root cause, perform the following steps:

Go back to software debug and confirm that your software is good.
If the software debug is good, you need to visually inspect the code. Two common causes of unexpected data are the use of the #SDS data or the #SDS zero copy pragmas.
If you are using #SDS data pragmas, the tools trust what you write. Confirm that the data access pattern in the code matches the data access pattern specified by the pragma.
An incorrectly sized (normally too large) #SDS zero copy can pull invalid data from cache. This is seen in hardware. Emulation is likely to pass as there is no cache controller in software.

Peeking and Poking IP Registers

With the Xilinx® System Debugger tool (XSDB), you can understand what is happening with the IP blocks included with the platform or the various C-callable IP blocks. From the Xilinx Software Command Line Tool (XSCT) console, you can read and write registers within various IP blocks in the integrated design. Registers can be read by typing the memory read command, mrd. Likewise, a writable register in any IP in the design can be written to by typing the mwr command in the XSCT console. For help with commands, type -help.

You need to be familiar with the memory map of the various IP blocks within the design to be able to perform reads and writes to the registers. You can access this information by opening the Vivado project and looking at the address editor. The Vivado project can be found at //_sds/p0/vivado/prj/prj.xpr. Double-clicking prj.xpr opens the project in Vivado. In the Vivado IDE, click IP Integrator > Open Block Design under Flow Navigator. Click the Address Editor tab to view the memory map information.

For details on XSDB, refer to SDK Online Help (UG782).

CAUTION: Trying to access an address that is not mapped results in a BUS ERROR. Addresses that are mapped, but lack proper backing, result in a system hang.
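Because an access to an unmapped address fails, it can help to check an address against the memory map before issuing mrd or mwr. The sketch below assumes you have copied the base offsets and ranges out of the Address Editor by hand. `AddressRange` and `is_mapped` are hypothetical names, and the check cannot detect the second case in the caution (mapped but without proper backing).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One row of the Address Editor: base offset and range in bytes.
struct AddressRange {
    std::uint64_t base;
    std::uint64_t size;
};

// True if a 32-bit register access at addr falls entirely inside one range.
inline bool is_mapped(const std::vector<AddressRange>& map, std::uint64_t addr) {
    for (const AddressRange& r : map) {
        if (addr >= r.base && addr - r.base + 4 <= r.size) {
            return true;
        }
    }
    return false;
}
```

For example, with a single hypothetical 64 KB range at 0x43C00000, `is_mapped` accepts 0x43C00000 and rejects 0x50000000.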

Event Tracing

This section describes how traces are collected and displayed in the SDSoC environment.

Runtime Trace Collection

Software traces are inserted into the same storage path as the hardware traces and receive a time stamp using the same timer/counter as hardware traces. This single-trace data stream is buffered in the hardware system and accessed over JTAG by the host PC.

In the SDSoC environment, traces are read back constantly while the program executes, in an attempt to empty the hardware buffer as quickly as possible and prevent buffer overflow. However, trace data is displayed only after the application finishes.

Trace data is collected in real time when you are running on the hardware. For information about connecting to the hardware, refer to Connecting to the Hardware.

Trace Visualization

The SDSoC environment displays a graphical rendering of the hardware and software trace stream. Each trace point in the user application is given a unique name, and its own axis on the timeline. In general, a trace point can create multiple trace events throughout the execution of the application, for example, if the same block of code is executed in a loop, or if an accelerator is invoked more than once.

Each trace event has a few different attributes: name, type, start time, stop time, and duration. This data is shown as a tool-tip when the cursor hovers above one of the event rectangles in the view.
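As an illustration of these attributes, a trace event could be represented as follows. This is a hypothetical in-memory form, not the tool's actual data format. The field types, the `tooltip` helper, and the example type string are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical representation of one trace event. Field names follow the
// attributes listed above; the types are assumptions.
struct TraceEvent {
    std::string name;     // name of the trace point that produced the event
    std::string type;     // event type, as reported by the tool
    std::uint64_t start;  // start time stamp
    std::uint64_t stop;   // stop time stamp

    // Duration is derived from the two time stamps.
    std::uint64_t duration() const {
        assert(stop >= start);
        return stop - start;
    }
};

// Formats the attributes in the spirit of the tool-tip text.
inline std::string tooltip(const TraceEvent& e) {
    return e.name + " (" + e.type + "): start=" + std::to_string(e.start) +
           ", stop=" + std::to_string(e.stop) +
           ", duration=" + std::to_string(e.duration());
}
```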

Troubleshooting

The following section provides general information on troubleshooting the different conditions encountered during event tracing.

Incremental build flow: The SDSoC environment does not support any incremental build flow using the trace feature. To ensure the correct build of your application and correct trace collection, do a project clean first, followed by a build after making any changes to your source code. Even if the source code you change does not relate to or impact any function marked for hardware, you can see incorrect results.

Programming and bitstream: The trace functionality is a single-use type of analysis. The timer used for time-stamping events is not started until the first event occurs, and runs indefinitely afterward. If you run your software application once after programming the bitstream, the timer is in an unknown state after your program is finished running. Running your software for a second time results in incorrect timestamps for events. Be sure to program the bitstream first, followed by downloading your software application, each and every time you run your application to take advantage of the trace feature. Your application will run correctly a second time, but the trace data will not be correct. For Linux, you need to reboot because the bitstream is loaded during boot time by U-Boot.

Buffering up traces: In the SDSoC environment, traces are buffered up and read out in real time as the application executes (although at a slower speed than they are created on the device), but are displayed after the application finishes in a post-processing fashion. This relies on having enough buffer space to store traces until they can be read out by the host PC. By default, there is enough buffer space for 1024 traces. After the buffer fills up, subsequent traces that are produced are dropped and lost. An error condition is set when the buffer overflows. Any traces created after the buffer overflows are not collected, and traces just prior to the overflow might be displayed incorrectly.

Errors: In the SDSoC environment, traces are buffered up in hardware before being read out over JTAG by the host PC. If traces are produced faster than they are consumed, a buffer overflow event might occur. The trace infrastructure recognizes this and sets an error flag that is detected during the collection on the host PC. After the error flag is parsed during trace data collection, collection is halted and the trace data that was read successfully is prepared for display. However, some data read successfully just prior to the buffer overflow might appear incorrectly in the visualization.
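The overflow behavior can be explored with a simplified software model. `TraceBufferModel` is a hypothetical class that captures only what the text states: a fixed capacity, traces dropped once the buffer has overflowed, a latched error flag, and collection that halts after the flag is set. Production and drain rates are inputs you choose.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the hardware trace buffer.
class TraceBufferModel {
public:
    explicit TraceBufferModel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Device side: n new traces are produced.
    void produce(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (overflowed_ || fill_ == capacity_) {
                overflowed_ = true;  // latched until the device is reprogrammed
                ++dropped_;
            } else {
                ++fill_;
            }
        }
    }

    // Host side: read up to n traces. Collection halts once the flag is set.
    void drain(std::size_t n) {
        if (overflowed_) return;
        const std::size_t taken = n < fill_ ? n : fill_;
        fill_ -= taken;
        collected_ += taken;
    }

    bool overflowed() const { return overflowed_; }
    std::size_t dropped() const { return dropped_; }
    std::size_t collected() const { return collected_; }

private:
    std::size_t capacity_;
    std::size_t fill_ = 0;
    std::size_t dropped_ = 0;
    std::size_t collected_ = 0;
    bool overflowed_ = false;
};
```

With the default capacity of 1024 and a producer that outpaces the host by 8 traces per step, the model overflows on step 128, which shows how quickly a tight loop around a trace point can exhaust the buffer.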

After an overflow occurs, an error file is created in the /_sds/trace directory with the name in the following format: archive_DAY_MON_DD_HH_MM_SS_-GMT_YEAR_ERROR. You must reprogram the device (reboot Linux and so on) prior to running the application and collecting trace data again. The only way to reset the trace hardware in the design is with reprogramming.

Debugging with Software/Hardware Cross Probing

After an SDx environment application has been created and functions are marked for hardware acceleration, build the design with the appropriate settings. Then, connect to the target board (see Connecting to the Hardware).

Setting Debug Configurations

In the Project Explorer view, click the ELF (.elf) file in the Debug folder in the project.
In the toolbar, click Debug, or use the Debug drop-down list to select Debug As > Launch on Hardware (SDx Application Debugger).
Alternatively, right-click the project and select Debug As > Launch on Hardware (SDx Application Debugger). The Confirm Perspective Switch dialog box appears.
Ensure that the board is switched on before debugging the project. Click Yes to switch to the debug perspective. You are now in the Debug Perspective of the SDx IDE.
Note: The debugger resets the system, programs and initializes the device, and then breaks at the main function. The source code is shown in the center panel, and local variables are shown in the top right corner panel. The SDx environment log at the bottom right panel shows the Debug Configuration log.
Before you run the application, connect a serial terminal to the board so that you can see the output from your program. As an example, the following settings can be used:
- Connection Type: Serial
- Port: COM
- Baud Rate: 115200

Running the Application

Click Resume to run your application and observe the output in the terminal window. The source code window shows the _exit function, and the Terminal tab shows the output from the application.

The code stops execution at the main function, as can be seen in the Debug tab. Additional breakpoints can be set in the code at specific points to stop the execution of the code at that specific point. Breakpoints can be enabled or disabled by double-clicking on the vertical blue bar adjacent to the line numbers in the code. Execution of the code can be resumed by clicking theResumeicon on the toolbar.

Tips for Debugging Performance

The SDSoC environment provides some basic performance monitoring capabilities with the following functions:

sds_clock_counter(): Use this function to determine how much time different code sections, such as the accelerated code and the non-accelerated code, take to execute.
sds_clock_frequency(): This function returns the number of CPU cycles per second.

You can estimate the actual hardware acceleration time by looking at the latency numbers in the Vivado Design Suite High-Level Synthesis (HLS) tool report files (_sds/vhls/…/*.rpt) or in the IDE under Reports > HLS Report. A latency of X accelerator clock cycles equals X * (processor_clock_freq/accelerator_clock_freq) processor clock cycles. Compare this with the time spent on the actual function call to determine the overhead of setup and data transfers.
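The conversion can be written as a small helper. `accel_to_cpu_cycles` is a hypothetical name, and the frequencies in the usage note are example values rather than figures for any particular platform.

```cpp
#include <cassert>
#include <cstdint>

// Converts a latency in accelerator clock cycles into processor clock cycles:
// X * (processor_clock_freq / accelerator_clock_freq).
inline std::uint64_t accel_to_cpu_cycles(std::uint64_t accel_cycles,
                                         std::uint64_t cpu_hz,
                                         std::uint64_t accel_hz) {
    assert(accel_hz > 0);
    // Multiply before dividing so integer division does not truncate the
    // frequency ratio. This assumes the product fits in 64 bits.
    return accel_cycles * cpu_hz / accel_hz;
}
```

For example, a reported latency of 1000 cycles on a 100 MHz accelerator clock corresponds to 6000 cycles of a 600 MHz processor clock, which can be compared directly with a difference of two sds_clock_counter() readings.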

For best performance improvement, the time required for executing the accelerated function must be much smaller than the time required for executing the original software function. If this is not true, try to run the accelerator at a higher frequency by selecting a different clkid on the sds++ command line. If that does not work, try to determine whether the data transfer overhead is a significant part of the accelerated function execution time, and reduce the data transfer overhead. Note that the default clkid is 100 MHz for all platforms. More details about the clkid values for the given platform can be obtained by running sds++ -sds-pf-info for that platform.

If the data transfer overhead is large, the following changes might help:

Move more code into the accelerated function so that the computation time increases, and the ratio of computation to data transfer time is improved.
Reduce the amount of data to be transferred by modifying the code or using pragmas to transfer only the required data.
Sequentialize the access pattern as observed from the accelerator code, because it is more efficient to burst transfers than to make a series of unrelated random accesses.
Ensure that data transfers make use of system ports that are appropriate for the cache-ability of the data being transferred. Cache flushing can be a resource-intensive procedure, and using coherent ports to access coherent data, and non-coherent ports to access non-coherent data, makes a significant impact.
Use sds_alloc() instead of malloc(), where possible. The memory that sds_alloc() issues is physically contiguous, which enables the use of data movers that require physically contiguous memory and are faster to configure. Also, pinning virtual pages, which is necessary when transferring data allocated by malloc(), is very costly.

Troubleshooting Compile and Link Time Errors

Typical compile/link time errors are indicated by error messages issued when running make. To analyze further, look at the log files and rpt files in the _sds/reports sub-directory created by the SDSoC environment in the build directory. The most recently generated log file usually indicates the cause of the error, such as a syntax error in the corresponding input file, or an error generated by the tool chain while synthesizing accelerator hardware or the data motion network.

The following are tips and strategies to address errors specific to the SDSoC environment.

Tool Errors Are Reported by Tools in the SDSoC Environment Chain

Try the following troubleshooting steps:

Check whether the corresponding code adheres to the Coding Guidelines in SDSoC Environment Programmers Guide.
Check the syntax of pragmas. See the for more details.
Check for typos in pragmas that might prevent them from being applied as intended.

Vivado Design Suite High-Level Synthesis (HLS) Cannot Meet Timing Requirement

Try the following troubleshooting steps:

Select a slower clock frequency for the accelerator in the SDx IDE (or with the sdscc/sds++ command line parameter).
Modify the code structure to allow HLS to generate a faster implementation. See the Improving Hardware Function Parallelism section in SDSoC Profiling and Optimization Guide for more information on how to do this.

Vivado Tools Cannot Meet Timing

Try the following troubleshooting steps:

In the SDx IDE, select a slower clock frequency for the data motion network or accelerator, or both (from the command line, use sdscc/sds++ command line parameters).
Use the -xp option to specify a Vivado implementation strategy to improve results. For example:
```
-impl-strategy Performance_Explore
```
Synthesize the HLS block at a higher clock frequency so that the synthesis and implementation tools have a larger timing margin.
Modify the C/C++ code passed to HLS, or add more HLS directives to make the HLS block go faster.
Reduce the size of the design in cases where the resource usage exceeds 80%. Refer to the Vivado tools reports in the _sds folder.

The Design Is Too Large to Fit

Try the following troubleshooting steps:

Reduce the number of accelerated functions.
Change the coding style for an accelerator function to produce a more compact accelerator. You can reduce the amount of parallelism using the mechanisms described in the Improving Hardware Function Parallelism section in SDSoC Profiling and Optimization Guide.
Modify pragmas and coding styles (pipelining) that cause multiple instances of accelerators to be created.
Use pragmas to select smaller data movers such as AXIFIFO instead of AXIDMA_SG.
Rewrite hardware functions to have fewer input and output parameters/arguments, especially in cases where the inputs/outputs are continuous stream (sequential access array argument) types that prevent the sharing of data mover hardware.

Troubleshooting Performance Issues

The SDSoC environment provides some basic performance monitoring capabilities in the form of the sds_clock_counter() function. Use this function to determine how much time different code sections, such as the accelerated and the non-accelerated code, take to execute.

To estimate the actual hardware acceleration time, you need to know the latency numbers from the Vivado HLS report, the clock frequency for the accelerator, and the Arm CPU clock frequency. To open the Vivado HLS report for the latency numbers, in the Assistant view, go to>>>HLS report. To view the clock frequency for the accelerator, go to the Hardware Functions section of the Project Settings. Click on the Platform link in the Project Overview to open the Platform Summary dialog. The CPU frequency is shown under Clock Frequencies. A latency of X accelerator clock cycles is equal to X * (processor_clock_freq/accelerator_clock_freq) processor clock cycles. Compare this with the time spent on the actual function call to determine the data transfer overhead.
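The comparison in the last step can be sketched as follows. `CallBreakdown` and `split_call_time` are hypothetical names. The measured value is assumed to come from two sds_clock_counter() readings taken around the function call, and the latency from the HLS report.

```cpp
#include <cassert>
#include <cstdint>

// Result of splitting a measured call into compute time and overhead,
// both expressed in processor clock cycles.
struct CallBreakdown {
    std::uint64_t hw_cpu_cycles;    // estimated accelerator execution time
    std::uint64_t overhead_cycles;  // remainder: setup and data transfer
};

inline CallBreakdown split_call_time(std::uint64_t measured_cpu_cycles,
                                     std::uint64_t hls_latency_cycles,
                                     std::uint64_t cpu_hz,
                                     std::uint64_t accel_hz) {
    assert(accel_hz > 0);
    // Accelerator latency converted to processor cycles.
    const std::uint64_t hw = hls_latency_cycles * cpu_hz / accel_hz;
    // Clamp at zero in case the estimate exceeds the measurement.
    const std::uint64_t overhead =
        measured_cpu_cycles > hw ? measured_cpu_cycles - hw : 0;
    return CallBreakdown{hw, overhead};
}
```

With example values of 10000 measured cycles, a 1000-cycle HLS latency, a 600 MHz CPU, and a 100 MHz accelerator, the estimate is 6000 cycles of compute and 4000 cycles of overhead.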

For best performance improvement, the time required for executing the accelerated function must be much smaller than the time required for executing the original software function. If this is not true, try to run the accelerator at a higher frequency by selecting a different clkid on the sdscc/sds++ command line. If that does not work, try to determine whether the data transfer overhead is a significant part of the accelerated function execution time, and reduce the data transfer overhead.

Note: More details about the clkid values for a given platform can be obtained by running the following command:

sds++ -sds-pf-info

If the data transfer overhead is large, the following changes might help:

Move more code into the accelerated function so that the computation time increases, and the ratio of computation to data transfer time is improved.
Reduce the amount of data to be transferred by modifying the code or using pragmas to transfer only the required data.