Programming the Application

Creating an application for an embedded processor in the SDSoC™ environment is similar to creating an application for any other SoC or embedded platform. However, accelerating an embedded processor application (one developed in SDK, for example) involves some added considerations. Programming for the SDSoC environment should include the following tasks:

Identifying the appropriate function(s) in the processor application for acceleration in the programmable logic (PL) region.
Allocating memory for the embedded processor application code and software functions running on the processing system (PS), and for the accelerated function(s) running on the PL regions of the device.
Enabling task-level parallelism with multiple accelerators running concurrently to optimize system performance.
Validating the software-to-hardware function code conversion to ensure that it works as intended.

To identify the functions that should become accelerators (hardware functions), you should determine what kind of computational load they carry. For example, functions that compute or modify large amounts of data are good candidates for hardware functions. However, functions written for a typical processor application might not benefit from hardware acceleration in the SDSoC environment, and might need to be restructured to get real performance improvements from acceleration.

Although you might not typically allocate memory for standard processor applications, leaving it to the compiler to define, when programming for hardware acceleration you should manually define the memory allocation in the code. The hardware functions require physically contiguous memory to meet the performance requirements of the hardware.

In addition, when you have defined more than one software function for acceleration, you can also manage the scheduling of these accelerators, deciding if they can run concurrently, or need to be sequential. Understanding and managing the dataflow between the hardware functions and the processor application is a key element of this process.

While converting software functions into hardware functions, you should test early and test often by checking the results of the algorithm in the hardware function. Validate the data returned by the hardware function with the expected results returned by the original software function. Restructuring code into hardware functions might yield better performance, but you must check for equivalent results.

As for the remaining application functions, it is a matter of determining if the application is running on Linux, FreeRTOS, or standalone. Each type has its own pros and cons; for example, standalone is the easiest to use because only the Arm® processor is running the application host, but using features that are only available on Linux or FreeRTOS is not allowed.

Memory Allocation

Knowing what data is going to be processed by the accelerators can help you write the application code to better allocate the memory being used. Generally, allocating memory using malloc/free in the main function is suggested and can be beneficial for overall runtime; malloc/free can be used anywhere except in the functions designated to be accelerators. However, allocating memory specific to an accelerator using sds_alloc/sds_free yields better performance because the data is allocated and stored in physically contiguous memory, which allows faster reads and writes. The compiler generally uses the scatter-gather approach when it cannot safely infer that the allocated data is physically contiguous, which can occur when local variables and buffers use malloc. Although the scatter-gather data mover has been highly optimized to minimize the software overhead associated with scatter-gather transfers, Xilinx strongly recommends that you allocate memory using sds_alloc for data going to the hardware functions.

The types of memory used are classified as contiguous/non-contiguous, and cacheable/non-cacheable. For contiguous memory, an array has all of its elements allocated physically next to each other, allowing for faster access times (think sequential reads/writes). Using non-cacheable memory means that the data being transferred is not intended to be used by the PS, allowing for higher transaction speeds. When using a cached memory allocation, there is a performance hit for flushing the cache, as well as CPU access latencies.
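
The choice between cacheable and non-cacheable contiguous memory can be sketched as a small helper. This is a minimal sketch, not part of the SDSoC examples: the names accel_buffer_alloc and accel_buffer_free are hypothetical, and the code assumes that sds_lib.h declares sds_alloc_non_cacheable alongside sds_alloc and that the sds++ compiler defines the __SDSCC__ macro. On any other compiler it falls back to malloc, which gives no contiguity guarantee and is only there so the code can be built and tested on a host machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"  // assumed to declare sds_alloc, sds_alloc_non_cacheable, sds_free
#endif

// Hypothetical helper: allocate a buffer destined for a hardware function.
//   cpu_touches_data == true  -> cacheable, physically contiguous
//   cpu_touches_data == false -> non-cacheable, physically contiguous
void *accel_buffer_alloc(std::size_t bytes, bool cpu_touches_data) {
#ifdef __SDSCC__
    return cpu_touches_data ? sds_alloc(bytes)
                            : sds_alloc_non_cacheable(bytes);
#else
    (void)cpu_touches_data;     // host fallback: plain heap memory
    return std::malloc(bytes);
#endif
}

// Release a buffer obtained from accel_buffer_alloc.
void accel_buffer_free(void *p) {
    if (!p) return;             // nothing to release
#ifdef __SDSCC__
    sds_free(p);
#else
    std::free(p);
#endif
}
```

One trade-off to keep in mind: the compiler falls back to scatter-gather when it cannot safely infer that data is physically contiguous, and routing allocations through a wrapper like this may hide that information from it. Calling sds_alloc directly at the allocation site is the form the recommendation above describes.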

You must allocate the data before calling the hardware function. The runtime sets up data movers for you, with consideration for how memory is allocated. For example, in a matrix multiplication design that contains 1024 elements (32 x 32), you must explicitly allocate memory for the hardware function in the main function. The following code directs the compiler to allocate memory on the heap in a physically contiguous, cacheable fashion:

    int *MatA = (int*)sds_alloc(1024*sizeof(int));
    int *MatB = (int*)sds_alloc(1024*sizeof(int));
    int *MatC = (int*)sds_alloc(1024*sizeof(int));

Allocating the memory on the heap allows for a lot more data to be processed, and to be executed with better performance. When execution of this code is complete, you can release the memory using sds_free.

Examples of memory allocation can be found in the SDSoC Examples available on the Xilinx GitHub repository. The following code is from the mmultadd example available in the /SDx//samples folder. The code allocates the memory in the main function, performs a quick check to make sure it was properly allocated, and releases the allocated memory if there was a problem:

    int main(int argc, char* argv[]){
        int test_passed = 0;
        float *A, *B, *C, *D, *D_sw;

        A = (float *)sds_alloc(N * N * sizeof(float));
        B = (float *)sds_alloc(N * N * sizeof(float));
        C = (float *)sds_alloc(N * N * sizeof(float));
        D = (float *)sds_alloc(N * N * sizeof(float));
        D_sw = (float *)malloc(N * N * sizeof(float));

        if (!A || !B || !C || !D || !D_sw) {
            if (A) sds_free(A);
            if (B) sds_free(B);
            if (C) sds_free(C);
            if (D) sds_free(D);
            if (D_sw) free(D_sw);
            return 2;
        }
        ...
    }

In the example above, you can see that variables used by the hardware functions are allocated using the sds_alloc function to ensure physically contiguous memory is allocated, while the software-only variable (D_sw) is allocated using malloc.

At the end of the main() function, all of the allocated memory is released using sds_free or free as appropriate:

    sds_free(A);
    sds_free(B);
    sds_free(C);
    sds_free(D);
    free(D_sw);
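
Pairing every allocation with the matching release by hand is easy to get wrong on early-return paths. One possible alternative, sketched below, ties the release to a std::unique_ptr with a custom deleter. This is not how the mmultadd example is written: AccelDeleter, AccelBuffer, and make_accel_buffer are hypothetical names, and the code assumes the sds++ compiler defines __SDSCC__. Elsewhere it falls back to malloc/free so that it can be compiled and tested on a host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

#ifdef __SDSCC__
#include "sds_lib.h"  // sds_alloc / sds_free on the target
#endif

// Hypothetical deleter: releases with the function that matches the allocator.
struct AccelDeleter {
    void operator()(float *p) const {
#ifdef __SDSCC__
        sds_free(p);
#else
        std::free(p);   // host fallback
#endif
    }
};

// Owning handle for a float buffer that a hardware function will read or write.
using AccelBuffer = std::unique_ptr<float[], AccelDeleter>;

// Allocate `count` floats; the returned handle is empty if allocation failed.
AccelBuffer make_accel_buffer(std::size_t count) {
#ifdef __SDSCC__
    void *raw = sds_alloc(count * sizeof(float));
#else
    void *raw = std::malloc(count * sizeof(float));
#endif
    return AccelBuffer(static_cast<float *>(raw));
}
```

A caller would pass A.get() to the hardware function and let the handle release the memory when it goes out of scope. Because the allocation is hidden behind a helper, the compiler might not be able to infer that the pointer is physically contiguous, so this convenience could cost a scatter-gather data mover; treat it as a design option to evaluate, not a recommendation.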

The sds_alloc function, and other SDSoC-specific functions for memory allocation/deallocation, can be found in sds_lib.h. More information on these APIs can be found in the SDSoC Environment API.

Sequential/Parallel Accelerator Execution

After defining the memory allocations needed for the accelerators, you should determine how to call the accelerators from the application code. There are multiple ways for the accelerators to operate in the context of the main application. For example, in an application with only one accelerator, calling the hardware function like any other function achieves the desired result of a sequential dataflow. However, for multiple accelerators, knowing whether and how the data is shared between the accelerators lets you choose between two distinct flows:

Sequential (synchronous): Accelerators operate in sequence with one execution followed by the next, providing some benefit of acceleration in the hardware implementation.

Parallel (asynchronous): Both accelerators can operate concurrently, granting your application task-level parallelism for significant performance improvement.

See the for more information on the pragmas discussed here.

To implement asynchronous dataflow, you must specify #pragma SDS async(id) and #pragma SDS wait(id) in your embedded processor application. You must place these pragmas in the application code, before and after the hardware function call, as shown in the following example:

    #pragma SDS async(1)
    mmult(A, B, C);
    #pragma SDS async(2)
    madd(D, E, F);
    // Do other SW functions
    #pragma SDS wait(1)
    #pragma SDS wait(2)

TIP: The advantage of using async/wait is that it lets the application perform other operations while the hardware functions are running, and lets you hold the application at the appropriate point to wait for a hardware function to return.

The preceding code example demonstrates a typical asynchronous method. Here, the provided IDs correspond to their respective functions (id = 1 for mmult, id = 2 for madd). The mmult function is loaded with the input values of A and B, and processed. Notice that in this case, where the accelerators are data independent (data is not being shared between the accelerators), asynchronous execution is beneficial. If you determine that the data for an accelerator is not needed by other functions on either the CPU or another accelerator, then async execution with non-cacheable physically contiguous data can provide the best performance.

IMPORTANT: In cases where the data from one accelerator is required by a second accelerator, you should not use async/wait. The async pragma forgoes compiler-driven syncing, and thus you could end up with incorrect results if one accelerator requires syncing prior to the start of another.
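
The distinction between data-independent and data-dependent calls can be sketched as follows. This is an illustrative sketch only: vadd and vmul are hypothetical software stand-ins, not tuned or annotated for hardware, and run_independent and run_dependent are made-up names. A compiler that does not recognize the SDS pragmas ignores them (possibly with a warning), so the same source also runs sequentially on a host for functional testing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for two hardware functions.
void vadd(const int *a, const int *b, int *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

void vmul(const int *a, const int *b, int *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i];
}

// Data independent: neither call consumes the other's output, so each
// call is marked async and the application waits for both at the end.
void run_independent(const int *a, const int *b,
                     int *sum, int *prod, std::size_t n) {
#pragma SDS async(1)
    vadd(a, b, sum, n);
#pragma SDS async(2)
    vmul(a, b, prod, n);
    // Other software work could run here while both calls are in flight.
#pragma SDS wait(1)
#pragma SDS wait(2)
}

// Data dependent: vmul reads tmp, which vadd writes. No async/wait is
// used, leaving synchronization between the two calls to the compiler.
void run_dependent(const int *a, const int *b,
                   int *tmp, int *out, std::size_t n) {
    vadd(a, b, tmp, n);
    vmul(tmp, b, out, n);
}
```

In run_independent the wait pragmas mark the point where sum and prod are safe to read; reading either buffer before its wait would be a synchronization bug on the target.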

An example of direct connection is provided with the SDSoC Examples available on the Xilinx GitHub repository. The parallel_accel code offers a simple example of two hardware functions, matrix addition and matrix multiplication, to demonstrate how async and wait help achieve greater performance through system parallelism and concurrency.

The parallel_accel example provides both a sequential dataflow form and a parallel dataflow form of the two accelerators, and uses performance monitor functions (seq_hw_ctr, par_hw_ctr) from the included sds_utils.h to measure the performance difference. The relevant code is provided below for examination:

    //Two hw functions are called back to back. First the
    //vadd_accel is executed, then vmul_accel is executed.
    //The execution of both accelerators is sequential here.
    //To prevent automatic dataflow between calls to the two
    //hw functions, async and wait pragma is used here so as
    //to ensure that the two hw functions will be running sequentially.
    seq_hw_ctr.start();
    // Launch Hardware Solution
    for(int itr = 0; itr < MAX_NUM_TIMES; itr++)
    {
        #pragma SDS async(1)
        vadd_accel(source_in1, source_in2, source_vadd_hw_results, size);
        #pragma SDS wait(1)
        #pragma SDS async(2)
        vmul_accel(source_in1, source_in2, source_vmul_hw_results, size);
        #pragma SDS wait(2)
    }
    seq_hw_ctr.stop();

    //Two hw functions are called back to back.
    //The program running on the hardware first transfers in1 and in2
    //to the vadd_accel hardware and returns immediately. Then the program
    //transfers in1 and in2 to the vmul_accel hardware and returns
    //immediately. When the program later executes to the point of
    //#pragma SDS wait(id), it waits for the particular output to be ready.
    par_hw_ctr.start();
    // Launch Hardware Solution
    #pragma SDS async(1)
    vadd_accel(source_in1, source_in2, source_vadd_hw_results, size);
    #pragma SDS async(2)
    vmul_accel(source_in1, source_in2, source_vmul_hw_results, size);
    for(int itr = 0; itr < MAX_NUM_TIMES; itr++)
    {
        #pragma SDS wait(1)
        #pragma SDS async(1)
        vadd_accel(source_in1, source_in2, source_vadd_hw_results, size);
        #pragma SDS wait(2)
        #pragma SDS async(2)
        vmul_accel(source_in1, source_in2, source_vmul_hw_results, size);
    }
    #pragma SDS wait(1)
    #pragma SDS wait(2)
    par_hw_ctr.stop();

In the sequential dataflow example, the async and wait pragmas are used to ensure that the two hardware functions are run sequentially. The key is the use of the wait pragma before the call to the multiplier function, vmul_accel, which ensures that the addition function, vadd_accel, completes before matrix multiplication begins. Notice also the use of the async(2) and wait(2) pragmas to ensure that the application waits for the completion of the vmul_accel hardware function before proceeding.

TIP: The async/wait pragmas are not actually needed in the preceding example, as the compiler automatically synchronizes these functions in the manner described.

In the parallel dataflow example, the vadd_accel and vmul_accel functions are started in sequence, not waiting for one to complete before calling the next. This results in nearly parallel execution of the two hardware functions. These function calls are labeled async(1) and async(2). Then the for loop is called to repeat the functions a number of times (MAX_NUM_TIMES), but wait(1) and wait(2) are used to wait for the prior executions to complete before calling the functions again.

As with parallel code, you must explicitly synchronize the function calls so that the data is available for the application to complete the function. Failure to program this properly can result in deadlocks, or non-deterministic behavior. However, in some instances running concurrent accelerators might not provide the best performance compared to other means. An example of this is pipelining concurrent accelerators that are data dependent on each other. This would require the data to be synced on a pipeline stage before the accelerator can begin to process the data.

Validating the Software to Hardware Conversion

Testing accelerators in the SDSoC environment is similar to testing any other function on a software platform. Generally, you can write a test bench to exercise and validate the application code, or you can implement the testing as a function call from the main function that runs a golden dataset and compares the outputs. Converting the C/C++ code of the software function to the HDL code of the hardware function can cause the behavior of the hardware function to change. It is a good idea to always run a verification test between the converted hardware code and the known good software code to make sure the algorithm is maintained through the complete build process.

TIP: For an application that has multiple accelerators, it is best to take a bottom-up testing approach, testing each accelerator individually, and then testing all accelerators together. This should shorten debug time. See the SDSoC Environment Debugging Guide for more information.

Examples of verification code can be found in the SDSoC Examples available on the Xilinx GitHub repository. The following code is from the mmultadd example available in the /SDx//samples folder. The main.cpp file defines methods to calculate golden data for the matrix addition (madd_golden) and multiplication (mmult_golden). The code for mmult_golden is provided below:

    void mmult_golden(float *A, float *B, float *C)
    {
        for (int row = 0; row < N; row++) {
            for (int col = 0; col < N; col++) {
                float result = 0.0;
                for (int k = 0; k < N; k++) {
                    result += A[row*N+k] * B[k*N+col];
                }
                C[row*N+col] = result;
            }
        }
    }

Note that the function is essentially the same as the hardware function, mmult, which accelerates the matrix multiplication in the PL region of the device and adds a few techniques such as array partitioning and pipelining to achieve optimal performance. The mmult_golden function simply calculates the expected value as golden data to be compared against the results returned by the accelerated function.

Finally, within the mmult_test function, the verification process is called to generate the golden data and compare it to the results generated by the accelerated hardware functions. This section of the code is provided below:

    int mmult_test(float *A, float *B, float *C, float *D, float *D_sw)
    {
        std::cout << "Testing " << NUM_TESTS << " iterations of " << N << "x" << N
                  << " floating point mmultadd..." << std::endl;

        perf_counter hw_ctr, sw_ctr;

        for (int i = 0; i < NUM_TESTS; i++)
        {
            init_arrays(A, B, C, D, D_sw);

            float tmp[N*N], tmp1[N*N];
            sw_ctr.start();
            mmult_golden(A, B, tmp);
            madd_golden(tmp, C, D_sw);
            sw_ctr.stop();

            hw_ctr.start();
            mmult(A, B, tmp1);
            madd(tmp1, C, D);
            hw_ctr.stop();

            if (result_check(D, D_sw))
                return 1;
        }
        //Example performance measurement code removed
    }
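
The body of result_check is not shown above. A comparison of this kind might look like the sketch below, which is an assumption and not the code from the mmultadd example: check_results is a hypothetical name, and the relative tolerance of 1e-4 is an arbitrary illustrative value. A tolerance is used because floating-point results from a restructured hardware implementation can differ from the software results in the last few bits without being wrong.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>

// Hypothetical checker: returns 0 when every element of `hw` matches
// `golden` within a relative tolerance, and 1 at the first mismatch.
int check_results(const float *hw, const float *golden,
                  std::size_t count, float rel_tol = 1e-4f) {
    for (std::size_t i = 0; i < count; ++i) {
        float diff = std::fabs(hw[i] - golden[i]);
        // Scale the tolerance by the expected magnitude, with a floor of 1.
        float scale = std::fmax(std::fabs(golden[i]), 1.0f);
        // Written as !(a <= b) so that a NaN result is reported as a mismatch.
        if (!(diff <= rel_tol * scale)) {
            std::cout << "Mismatch at index " << i
                      << ": hw = " << hw[i]
                      << ", golden = " << golden[i] << std::endl;
            return 1;
        }
    }
    return 0;
}
```

The nonzero-on-failure convention matches how mmult_test uses the return value of result_check. For integer data an exact comparison would be more appropriate than a tolerance.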

Performance Estimation

In some cases, knowing the wall-clock time of the functions that can be turned into hardware functions might be necessary. You can accurately measure the execution time of functions by using special SDSoC API calls that measure activity based on the free-running clock of the Arm processor. The API functions include sds_clock_counter() and sds_clock_frequency(). These functions can be used to log the start and end times of a function. The function sds_clock_counter() returns the value of the free-running clock register, while the function sds_clock_frequency() returns the speed in ticks/second of the Arm processor. See the SDSoC Environment API for more information on these functions.

Note: sds_clock_frequency() is a high-performance counter and offers a fine-grained measurement of events.
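
Converting a pair of counter readings into seconds is a matter of dividing the tick difference by the tick rate. The sketch below illustrates that arithmetic under stated assumptions: ticks_to_seconds and time_call_seconds are hypothetical helpers, the code assumes the sds++ compiler defines __SDSCC__, and on any other compiler it substitutes std::chrono-based stand-ins for the two SDSoC calls so that it can be built and tested on a host.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#ifdef __SDSCC__
#include "sds_lib.h"  // sds_clock_counter(), sds_clock_frequency()
#else
// Host stand-ins (assumption): a nanosecond counter ticking at 1 GHz.
static unsigned long long sds_clock_counter() {
    using namespace std::chrono;
    return static_cast<unsigned long long>(
        duration_cast<nanoseconds>(
            steady_clock::now().time_since_epoch()).count());
}
static unsigned long long sds_clock_frequency() { return 1000000000ULL; }
#endif

// Elapsed seconds between two counter readings at the given tick rate.
double ticks_to_seconds(std::uint64_t start, std::uint64_t stop,
                        std::uint64_t ticks_per_second) {
    return static_cast<double>(stop - start) /
           static_cast<double>(ticks_per_second);
}

// Time a single call of `fn` and return the elapsed time in seconds.
template <typename Fn>
double time_call_seconds(Fn &&fn) {
    std::uint64_t start = sds_clock_counter();
    fn();
    std::uint64_t stop = sds_clock_counter();
    return ticks_to_seconds(start, stop, sds_clock_frequency());
}
```

For example, time_call_seconds([&] { mmult_golden(A, B, tmp); }) would time one software run. A single measurement can be noisy, so averaging over several calls usually gives a more stable figure.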

A performance counter class is provided in the sds_utils.h file available with the SDSoC Examples on the Xilinx GitHub repository. The perf_counter class includes methods for capturing the start and stop clock times, and the number of function calls, as shown below:

    #include "sds_lib.h"

    class perf_counter
    {
    public:
        uint64_t tot, cnt, calls;
        perf_counter() : tot(0), cnt(0), calls(0) {};
        inline void reset() { tot = cnt = calls = 0; }
        inline void start() { cnt = sds_clock_counter(); calls++; };
        inline void stop() { tot += (sds_clock_counter() - cnt); };
        inline uint64_t avg_cpu_cycles() { return ((tot+(calls>>1)) / calls); };
    };

You can also use the avg_cpu_cycles() method to return the average number of CPU cycles the task took per call.
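
The expression in avg_cpu_cycles(), (tot + (calls >> 1)) / calls, is integer division rounded to the nearest whole cycle: adding half the divisor before dividing turns truncation into rounding. As written, the method divides by calls, so it must not be used on a counter that was never started. The sketch below isolates the arithmetic; rounded_average is a hypothetical name, and returning 0 when there are no calls is an assumption added here, not behavior of the original class.

```cpp
#include <cassert>
#include <cstdint>

// Average of `total` over `calls`, rounded to the nearest integer.
// Returns 0 when calls == 0 (an added guard, not in the original class).
std::uint64_t rounded_average(std::uint64_t total, std::uint64_t calls) {
    if (calls == 0) return 0;
    return (total + (calls >> 1)) / calls;
}
```

For instance, 10 cycles over 4 calls gives (10 + 2) / 4 = 3, where plain truncating division would give 2.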