Estimating Performance

Profiling and Instrumenting Code to Measure Performance

The first major task in profiling and instrumenting code is to identify portions of application code that are suitable for implementation in hardware and that significantly improve overall performance when run in hardware. Compute-intensive regions of code are good candidates for hardware acceleration, especially when it is possible to stream data between hardware, the CPU, and memory to overlap the computation with the communication. Software profiling is a standard way to identify the most CPU-intensive portions of your program. A poor candidate for acceleration is a function that takes more time to transfer data to and from the accelerator than to compute the result. The SDSoC™ environment includes all performance and profiling capabilities that are included in the Xilinx® SDK tool, including gprof, the non-intrusive Target Communication Framework (TCF) profiler, and the Performance Analysis perspective within Eclipse.
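To make the benefit of overlapping computation with communication concrete, the following is a minimal sketch of a two-stage timing model. It is an illustration only, not part of the SDSoC tools: the function names and the assumption that every chunk has the same transfer and compute cost (in arbitrary time units) are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Total time when each chunk is first transferred and then processed,
// with no overlap between the two activities.
inline uint64_t sequential_time(uint64_t chunks, uint64_t transfer, uint64_t compute)
{
    return chunks * (transfer + compute);
}

// Total time when the transfer of chunk i+1 overlaps the computation on
// chunk i (a two-stage pipeline). The first chunk passes through both
// stages; after that, one chunk completes every max(transfer, compute).
inline uint64_t overlapped_time(uint64_t chunks, uint64_t transfer, uint64_t compute)
{
    if (chunks == 0) {
        return 0;
    }
    return transfer + compute + (chunks - 1) * std::max(transfer, compute);
}
```

Under this model, the slower of the two stages dominates the overlapped time. When transfer is the slower stage, making the computation faster does not shorten the total, which is why transfer-bound functions are poor candidates.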

To run the TCF Profiler for a standalone application, use the following steps:

  1. Set the active build configuration to Debug by right-clicking the project in the Project Explorer and selecting Build Configurations > Set Active > Debug.
  2. Launch the debugger by right-clicking the project name in the Project Explorer and selecting Debug As > Launch on hardware (SDx Application Debugger).
    Note: The board must be connected to your computer and powered on. The application automatically breaks at the entry to main().
  3. Launch the TCF Profiler by selecting Window > Show View > Other. In the window that opens, expand Debug, and select TCF Profiler.
  4. To start the TCF Profiler, click the green Start button at the top of the TCF Profiler tab.
  5. Enable Aggregate per function in the Profiler Configuration dialog box.
  6. To start the profiling, click the Resume button or press F8. The program runs to completion and breaks at the exit() function.
  7. View the results in the TCF Profiler tab.

Profiling provides a statistical method for finding highly used regions of code based on sampling the CPU program counter and correlating to the program in execution. Another way to measure program performance is to instrument the application to determine the actual duration between different parts of a program in execution.
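The sampling idea described above can be sketched as follows. This is a simplified illustration of how program counter samples might be attributed to functions, not the TCF Profiler implementation; the `FunctionRange` type and `aggregate_samples` function are hypothetical names, and the symbol table is assumed to be a list of non-overlapping address ranges.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol table entry: a function occupying [start, end).
struct FunctionRange {
    std::string name;
    uint64_t start;
    uint64_t end;
};

// Count how many sampled program counter values fall inside each function.
// Samples that match no known function are ignored.
inline std::map<std::string, std::size_t> aggregate_samples(
    const std::vector<FunctionRange>& symbols,
    const std::vector<uint64_t>& pc_samples)
{
    std::map<std::string, std::size_t> hits;
    for (uint64_t pc : pc_samples) {
        for (const FunctionRange& fn : symbols) {
            if (pc >= fn.start && pc < fn.end) {
                ++hits[fn.name];
                break;
            }
        }
    }
    return hits;
}
```

Functions with the largest share of samples are the ones where the CPU spent most of its time. The result is statistical, so short functions that are called rarely might not appear at all.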

The TCF Profiler provides more in-depth information for either a standalone or a Linux OS application. As the previous steps show, no additional compilation flags are needed to use the Profiler.

Note: This type of profiling for hardware requires a JTAG connection.

The sds_lib library included in the SDSoC environment provides a simple, source code annotation-based, time-stamping API that can be used to measure application performance, as shown in the following example:

    /*
     * @return value of free-running 64-bit Zynq(TM) global counter
     */
    unsigned long long sds_clock_counter(void);

By using this API to collect timestamps and compute the differences between them, you can determine the duration of key parts of your program. For example, you can measure data transfer or overall round-trip execution time for hardware functions, as shown in the following code snippet:

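    #include <cstdint>
    #include <iostream>
    #include "sds_lib.h"  // declares sds_clock_counter()

    class perf_counter
    {
    public:
        uint64_t tot, cnt, calls;
        perf_counter() : tot(0), cnt(0), calls(0) {}
        // Clear the accumulated cycles and the call count.
        inline void reset() { tot = cnt = calls = 0; }
        // Record the starting timestamp and count one more call.
        inline void start() { cnt = sds_clock_counter(); calls++; }
        // Add the cycles elapsed since the matching start().
        inline void stop() { tot += (sds_clock_counter() - cnt); }
        // Average cycles per call; call start() at least once first.
        inline uint64_t avg_cpu_cycles() { return (tot / calls); }
    };

    extern void f();

    void measure_f_runtime()
    {
        perf_counter f_ctr;
        f_ctr.start();
        f();
        f_ctr.stop();
        std::cout << "Cpu cycles f(): " << f_ctr.avg_cpu_cycles() << std::endl;
    }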
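The same timestamp-difference technique can be tried on a host machine before moving to the board. The sketch below is an assumption-laden illustration: `tick_fn`, `host_ticks`, and `average_ticks` are hypothetical names, and `host_ticks` is only a stand-in that returns nanoseconds from `std::chrono::steady_clock`, not Zynq global counter cycles.

```cpp
#include <chrono>
#include <cstdint>

// Tick source with the same signature as sds_clock_counter().
using tick_fn = unsigned long long (*)();

// Host-only stand-in (assumption): nanoseconds from a monotonic clock.
inline unsigned long long host_ticks()
{
    using namespace std::chrono;
    return static_cast<unsigned long long>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}

// Call fn() `iterations` times and return the average number of ticks
// per call, measured as the difference between two timestamps.
template <typename Fn>
uint64_t average_ticks(tick_fn now, Fn fn, unsigned iterations)
{
    if (iterations == 0) {
        return 0;  // nothing measured; avoid dividing by zero
    }
    uint64_t total = 0;
    for (unsigned i = 0; i < iterations; ++i) {
        const uint64_t begin = now();
        fn();
        total += now() - begin;
    }
    return total / iterations;
}
```

On the target, passing `sds_clock_counter` as the tick source in place of `host_ticks` would report counter cycles instead of nanoseconds. Averaging over several calls smooths out one-time effects such as cold caches.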

The performance estimation feature within the SDSoC environment employs this API by automatically instrumenting functions selected for hardware implementation, measuring actual runtimes by running the application on the target, and then comparing actual times with estimated times for the hardware functions.

Note: While off-loading CPU-intensive functions is one of the most reliable heuristics to partition your application, it is not guaranteed to improve system performance without algorithmic modification to optimize memory accesses. A CPU almost always has much faster random access to external memory than you can achieve from programmable logic, due to multi-level caching and a faster clock speed (typically 2x to 8x faster than programmable logic). Extensive manipulation of pointer variables over a large address range (for example, a sort routine that sorts indices over a large index set), while very well suited to a CPU, could become a liability when moving a function into programmable logic. This does not mean that such compute functions are not good candidates for hardware, only that code or algorithm restructuring could be required. This is also a known issue for DSP and GPU coprocessors.
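One common form of the restructuring mentioned in the note is to replace scattered reads of external memory with a sequential copy into a local buffer, followed by random access to that buffer. The sketch below is only an illustration of that idea in plain C++: the function names and the fixed table size `kTableSize` are hypothetical, and whether the local array maps to on-chip memory depends on the tool flow.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed problem size for this illustration.
const std::size_t kTableSize = 1024;

// Original form: each iteration reads data[] at an arbitrary index, which
// amounts to random access to external memory.
inline void gather_direct(const int32_t* data, const uint16_t* idx, int32_t* out)
{
    for (std::size_t i = 0; i < kTableSize; ++i) {
        out[i] = data[idx[i]];
    }
}

// Restructured form: read data[] once in address order into a local
// buffer, then perform the random accesses on the local copy.
// Every idx[i] is assumed to be less than kTableSize.
inline void gather_buffered(const int32_t* data, const uint16_t* idx, int32_t* out)
{
    int32_t local[kTableSize];
    for (std::size_t i = 0; i < kTableSize; ++i) {
        local[i] = data[i];  // sequential, burst-friendly read
    }
    for (std::size_t i = 0; i < kTableSize; ++i) {
        out[i] = local[idx[i]];
    }
}
```

Both functions produce the same output. The second one only changes the order in which external memory is read, which is the kind of change that can matter for a function moved into programmable logic.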

SDSCC/SDS++ Performance Estimation Flow Options

A full bitstream compile can take much more time than a software compile, so the sds++/sdscc (referred to as sds++) applications provide performance estimation options to compute the estimated runtime improvement for a set of hardware function calls.

To invoke the estimator, select the Estimate Performance check box in the Application Project Settings pane. This enables performance estimation for the current build configuration and builds the project.

Figure: Setting Estimate Performance in Application Project Settings

Estimating the speed-up is a two-phase process:

  1. The SDSoC environment compiles the hardware functions and generates the system. Instead of synthesizing the system to bitstream, sds++ computes an estimate of the performance based on estimated latencies for the hardware functions and data transfer time estimates for the callers of hardware functions.
  2. In the generated Performance Report, select Click Here to run an instrumented version of the software on the target. This run determines the performance baseline and the performance estimate.
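The relationship between the measured baseline and the estimated hardware figures can be illustrated with a simple model. This is a hedged sketch, not the algorithm sds++ uses: the `HwCallEstimate` type and `estimated_speedup` function are hypothetical, all values are assumed to be expressed in the same cycle units, and the hardware calls are assumed not to overlap with other work.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-call figures, all in the same cycle units.
struct HwCallEstimate {
    uint64_t sw_cycles;        // measured software runtime of the call
    uint64_t hw_cycles;        // estimated hardware latency
    uint64_t transfer_cycles;  // estimated data transfer time for the caller
};

// Estimate the speed-up of the root function when each listed call is
// replaced by hardware latency plus data transfer time.
// Returns 0.0 if the inputs are inconsistent.
inline double estimated_speedup(uint64_t root_sw_cycles,
                                const std::vector<HwCallEstimate>& calls)
{
    uint64_t replaced = 0;  // software time removed from the root
    uint64_t added = 0;     // hardware plus transfer time added back
    for (const HwCallEstimate& c : calls) {
        replaced += c.sw_cycles;
        added += c.hw_cycles + c.transfer_cycles;
    }
    if (replaced > root_sw_cycles) {
        return 0.0;  // calls cannot take longer than the root that contains them
    }
    const uint64_t accelerated = root_sw_cycles - replaced + added;
    if (accelerated == 0) {
        return 0.0;
    }
    return static_cast<double>(root_sw_cycles) / static_cast<double>(accelerated);
}
```

A result below 1.0 indicates an estimated slowdown, which is what happens when the transfer time for a call outweighs the software time it replaces.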

See the SDSoC Environment Getting Started Tutorial (UG1028) for a tutorial on how to use the Performance Report.

You can also generate a performance estimate from the command line. As a first pass to gather data about software runtime, use the -perf-funcs option to specify functions to profile and -perf-root to specify the root function encompassing calls to the profiled functions.

The sds++ system compiler then automatically instruments these functions to collect runtime data when the application is run on a board. When you run an instrumented application on the target, the program creates a file on the SD card called swdata.xml, which contains the runtime performance data for the run.

Copy swdata.xml to the host, and run a build that estimates the performance gain for each caller of a hardware function and for the top-level function specified by the -perf-root option in the first pass. Use the -perf-est option to specify swdata.xml as input data for this build.

The following table lists the sds++ system compiler options commonly used for performance estimation.

Table 1. Commonly used sds++ options

Option | Description
-perf-funcs function_name_list | Specifies a comma-separated list of all functions to be profiled in the instrumented software application.
-perf-root function_name | Specifies the root function encompassing all calls to the profiled functions. The default is the function main.
-perf-est data_file | Specifies the file containing runtime data generated by the instrumented software application when run on the target, and estimates performance gains for hardware accelerated functions. The default name for this file is swdata.xml.
-perf-est-hw-only | Runs the estimation flow without running the first pass to collect software run data. Using this option provides hardware latency and resource estimates without providing a comparison against a baseline.
CAUTION:
After running the sd_card image on the board to collect profile data, type cd /; sync; umount /mnt; at the command prompt. This ensures that the swdata.xml file is written out to the SD card.