Programming for SDAccel

The custom processing architecture generated by the SDAccel™ environment for a kernel running on a Xilinx® FPGA provides opportunities for significant performance gains. However, you must take advantage of these opportunities by writing your host and kernel code specifically for acceleration on an FPGA.

The host application runs on x86 servers and uses the SDAccel runtime to manage interactions with the FPGA kernels. The host application is written in C/C++ using OpenCL™ APIs. The custom kernels run within a Xilinx® FPGA on an SDAccel platform.

The SDAccel hardware platform contains global memory banks that are used to transfer data between the host and kernel. In addition, on supported platforms, direct streaming between the host and kernel can also be used. Communication between the host x86 machine and the SDAccel accelerator board occurs across the PCIe® bus.

The following topics discuss how to write code for the host application to set up the Xilinx Runtime (XRT), load the kernel binary into the SDAccel platform, pass data efficiently between the host application and the kernel, and trigger the kernel on the FPGA at the appropriate time in the host application.

The FPGA fabric can support multiple kernels running simultaneously. Therefore, you can create multiple instances of a single kernel, or configure multiple kernels on the same device, to increase the performance of the host application. Kernels running on the FPGA can have one or more interfaces to connect to the platform or other kernels. The number of kernel instances running on the FPGA, the memory bank connections accessed by each kernel, and the streaming connections between the host and kernel, or between kernels, are all specified using xocc linking options during the build process, as sketched below.
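
For example, a hedged sketch of such link-time options (the kernel, argument, and platform names here are illustrative, and the exact syntax depends on the tool version): the --nk option creates multiple compute units of a kernel, and the --sp option maps a kernel argument to a specific global memory bank.

# Link two compute units of kernel mmult, and assign the "a" argument
# of each instance to a different DDR bank
xocc --link --platform xilinx_u200_xdma_201830_2 \
     --nk mmult:2:mmult_1.mmult_2 \
     --sp mmult_1.a:DDR[0] \
     --sp mmult_2.a:DDR[1] \
     -o mmult.xclbin mmult.xo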

For more information, see the Building the Hardware section. For greater detail on the host application, kernel code, and the interactions between them, refer to the SDAccel Environment Programmers Guide (UG1277).

Coding the Host Application

When creating the host application, you must manage the overhead required to set up and configure the SDAccel runtime, program and launch the kernel, and pass data back and forth between the host application and the kernel, as well as address the primary function of the application.

Setting Up the Runtime

Within every host application, you must set up the environment to identify the OpenCL platform and the device IDs, specify a context, create a command queue, build a program, and spawn one or more kernels. The program identifies and configures the kernel, and transfers data between the host code and the kernel. In the host code, this process can use the following steps.
TIP: The following code examples are taken from the IDCT example design.
  1. To set up the OpenCL runtime environment, you need to identify the Xilinx platform using the clGetPlatformIDs and clGetPlatformInfo commands. For example:
    cl_platform_id platform_id;   // platform id
    cl_platform_id platforms[16]; // declarations added here for completeness
    cl_uint platform_count;
    char cl_platform_vendor[1001];
    cl_int err;

    err = clGetPlatformIDs(16, platforms, &platform_count);

    // Find Xilinx platform (the loop body, truncated in the original text,
    // is completed here by checking each platform's vendor string)
    for (unsigned int iplat = 0; iplat < platform_count; iplat++) {
      err = clGetPlatformInfo(platforms[iplat], CL_PLATFORM_VENDOR, 1000,
                              (void *)cl_platform_vendor, NULL);
      if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
        // Xilinx platform found
        platform_id = platforms[iplat];
      }
    }
  2. Identify the Xilinx devices on the platform available for enqueuing kernels, using the clGetDeviceIDs command. Finding the device IDs requires the platform ID discovered in the prior step. For example:
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 1, &device_id, NULL);
  3. Set up the context using clCreateContext. The context is the environment in which work-items execute, and it identifies the devices to be assigned transactions from the command queue. The example below shows the creation of the context:
    cl_context cntxt = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
  4. Define the command queue using clCreateCommandQueue. The command queue is a list of commands waiting to be executed on a device. You can set up the command queue to handle commands in the order submitted, or out-of-order so that a command can be executed as soon as possible. Use an out-of-order command queue, or multiple in-order command queues, for concurrent kernel execution on the FPGA. An example follows:
    // Create out-of-order command queue
    cl_command_queue commands = clCreateCommandQueue(context, device_id,
        CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
        &err);
  5. Finally, in the host code you need to set up the program, which contains and configures the kernels to be passed to the command queue by the host application. The load_file_to_memory function is used to load the file contents into host memory (a sketch of this helper follows the list). The clCreateProgramWithBinary command downloads the FPGA binary (.xclbin) to the device and returns a cl_program handle. The following example shows the creation of the program using these API calls:
    char *fpga_bin;
    size_t fpga_bin_size;
    fpga_bin_size = load_file_to_memory(binaryName, &fpga_bin);

    cl_program program = clCreateProgramWithBinary(context, 1,
        (const cl_device_id *)&device_id, &fpga_bin_size,
        (const unsigned char **)&fpga_bin, NULL, &err);
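
The load_file_to_memory helper is not part of the OpenCL API; it is defined in the host code itself. A minimal sketch of such a helper (the error-code convention here is illustrative) is:

#include <stdio.h>
#include <stdlib.h>

// Read the FPGA binary file into a malloc'ed buffer; returns the number
// of bytes read, or a negative value on error.
static int load_file_to_memory(const char *filename, char **result)
{
  *result = NULL;
  FILE *f = fopen(filename, "rb");
  if (f == NULL)
    return -1; // file could not be opened
  fseek(f, 0, SEEK_END);
  long size = ftell(f);
  fseek(f, 0, SEEK_SET);
  *result = (char *)malloc(size + 1);
  if (fread(*result, sizeof(char), size, f) != (size_t)size) {
    fclose(f);
    free(*result);
    return -2; // read failed
  }
  fclose(f);
  (*result)[size] = 0; // NUL-terminate for convenience
  return (int)size;
}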

Transferring Data to/from the FPGA Device

With the program established, you can transfer the data required by the kernel to the SDAccel platform prior to triggering the kernel. The simplest way to send data back and forth is using the clCreateBuffer, clEnqueueReadBuffer, and clEnqueueWriteBuffer commands. However, to transfer the data ahead of the transaction, use the clEnqueueMigrateMemObjects command. Using this command results in reduced latency in the application. The following code example demonstrates this:
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input (the loop body, truncated in the original text, is
// completed here with illustrative values)
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i;
}

cl_mem dev_mem_ptr = clCreateBuffer(context,
    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    sizeof(int) * number_of_words, host_mem_ptr, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_ptr);

// clEnqueueMigrateMemObjects expects a pointer to the cl_mem object
err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);
TIP: By default, all the memory interfaces from all the kernels are connected to a single global memory bank. You can customize the global memory bank connections by modifying the default connection. This improves kernel performance by enabling multiple kernels to concurrently read and write data from separate global memory banks. See Mapping Kernel Interfaces to Memory Resources for more information.

Setting Up the Kernel

With the program established, you can set up the kernel, execute the kernel, and manage event synchronization between the host application and the kernel.
  1. Create a kernel from the program and the loaded FPGA binary using the clCreateKernel command:
    // Create kernel
    cl_kernel krnl = clCreateKernel(program, "krnl_idct", &err);
  2. Set the kernel arguments using the clSetKernelArg command, once for each argument of the kernel:
    // Set the kernel arguments
    clSetKernelArg(mKernel, 0, sizeof(cl_mem), &mInBuffer[0]);
    clSetKernelArg(mKernel, 1, sizeof(cl_mem), &mInBuffer[1]);
    clSetKernelArg(mKernel, 2, sizeof(cl_mem), &mOutBuffer[0]);
    clSetKernelArg(mKernel, 3, sizeof(int), &m_dev_ignore_dc);
    clSetKernelArg(mKernel, 4, sizeof(unsigned int), &mNumBlocks64);
  3. Schedule the kernel to run on the FPGA using the clEnqueueTask command. The request to execute the kernel is placed into the command queue and either waits for its turn or is executed when ready, depending on the nature of the queue.
    clEnqueueTask(mQ, mKernel, 1, &inEvVec[mCount], &runEvVec[mCount]);
  4. Because the clEnqueueTask (and clEnqueueMigrateMemObjects) command is asynchronous in nature, and will return immediately after the command is enqueued in the command queue, you might need to manage the scheduling of events within the host application. To resolve the dependencies among the commands in the host application, you can use the clWaitForEvents or clFinish commands to pause or block execution of the host program. For example:
    // Execution waits until all commands in the command queue are finished
    clFinish(command_queue);

    // Wait for the clEnqueueReadBuffer event to finish
    clWaitForEvents(1, &readevent);

For more information on setting up the kernel, see the SDAccel Environment Programmers Guide (UG1277).

Kernel Language Support

The SDAccel environment supports kernels expressed in OpenCL C, C/C++, and RTL (SystemVerilog, Verilog, or VHDL). You can use different kernel types in the same application. However, each kernel type has specific requirements and coding styles.

Kernels created from OpenCL C and C/C++ are well-suited to software and algorithm developers, because these languages make it easier to start from an existing C/C++ application and accelerate portions of it.

All kernels require the following:

  • A single slave AXI4-Lite interface used to access control registers (to pass scalar arguments and to start/stop the kernel)
  • At least one of the following interfaces (can have both interfaces):
    • AXI4 master interface to communicate with global memory.
    • AXI4-Stream interface for transferring data between kernels or directly with the host.

Writing OpenCL C Kernels

The SDAccel environment supports the OpenCL C language constructs and built-in functions from the OpenCL 1.0 embedded profile. The following is an example of an OpenCL C kernel for matrix multiplication that can be compiled with the SDAccel environment.

__kernel __attribute__ ((reqd_work_group_size(16, 16, 1)))
void mult(__global int* a, __global int* b, __global int* output)
{
  int r = get_local_id(0);
  int c = get_local_id(1);
  int rank = get_local_size(0);
  int running = 0;

  for (int index = 0; index < 16; index++) {
    int aIndex = r*rank + index;
    int bIndex = index*rank + c;
    running += a[aIndex] * b[bIndex];
  }

  output[r*rank + c] = running;
  return;
}
IMPORTANT: Standard C libraries such as math.h cannot be used in the OpenCL C kernel. Use OpenCL built-in C functions instead.

In the case of OpenCL C kernels, the AXI4-Lite interface is generated automatically, while the AXI4 memory mapped interfaces are generated based on the __global qualifiers in the function definition.

Writing C/C++ Kernels

Kernels written in C/C++ are supported by the SDAccel environment. The matrix multiplication kernel above can be expressed in C/C++ code as shown below. For kernels captured in this way, the SDAccel environment supports all of the optimization techniques available in Vivado® HLS. Keep in mind, however, that expressing kernels in this way requires compliance with a specific function signature style.

By default, kernels captured in C/C++ for HLS do not have any inherent assumptions about the physical interfaces that will be used to transport the function argument data. HLS uses pragmas embedded in the code to direct the compiler as to which physical interface to generate for a function port. For the function to be treated as a valid HLS C/C++ kernel, each function argument must have a valid HLS interface pragma.

#include <string.h> // for memcpy

void mmult(int *a, int *b, int *output)
{
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=output bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
  const int rank = 16;
  int running = 0;
  int bufa[256];
  int bufb[256];
  int bufc[256];
  memcpy(bufa, (int *) a, 256*4);
  memcpy(bufb, (int *) b, 256*4);

  // Loop nest completed here to match the OpenCL C version above
  // (the original text truncates the code at this point)
  for (unsigned int c = 0; c < rank; c++) {
    for (unsigned int r = 0; r < rank; r++) {
      running = 0;
      for (int index = 0; index < rank; index++) {
        int aIndex = r*rank + index;
        int bIndex = index*rank + c;
        running += bufa[aIndex] * bufb[bIndex];
      }
      bufc[r*rank + c] = running;
    }
  }

  memcpy((int *) output, bufc, 256*4);
  return;
}

When a kernel is defined in C++, use extern "C" { ... } around the functions targeted to be kernels. The use of extern "C" instructs the compiler/linker to use the C naming and calling conventions.
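
For example, a minimal sketch wrapping the mmult kernel from above:

// extern "C" prevents C++ name mangling so the tools can find the kernel
extern "C" {
void mmult(int *a, int *b, int *output)
{
  // ... interface pragmas and kernel body as shown above ...
}
}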

For C/C++ kernels, use interface pragmas to map function arguments to the AXI4-Lite and AXI4 memory mapped interfaces. For RTL kernels, you are responsible for adding these interfaces yourself.

Pointer Arguments

All pointer arguments are mapped to global memory. The data is accessed through AXI interfaces, which can be mapped to different banks. The memory interface specification needs the following two pragmas:

  1. The first pragma defines the AXI memory mapped interface through which the argument is accessed. An offset is always required. The offset=slave option means that the offset of the array will be made available through the AXI slave interface of the kernel.
    #pragma HLS INTERFACE m_axi port=<arg_name> offset=slave bundle=<interface_name>
  2. The second pragma defines the AXI slave interface. Scalars (and pointer offsets) are mapped to a single AXI slave control interface, which must be named control.
    #pragma HLS INTERFACE s_axilite port=<arg_name> bundle=control
Note: On platforms version 4.x or earlier, the interface name M_AXI_ARG_NAME was used: arg_name was made uppercase, irrespective of its original capitalization, and prefixed with M_AXI_.

On current platforms (version 5.x or later), the interface name m_axi_arg_name is used: arg_name must be lowercase and is prefixed with m_axi_.

Scalars

Scalars are considered constant inputs and should also be mapped to s_axilite. The control interface specification is generated by the following pragma:

#pragma HLS INTERFACE s_axilite port=<arg_name> bundle=control

Detailed information on how these pragmas are used is available in the SDx Pragma Reference Guide.

C++ arbitrary precision data types can be used for global memory pointers on a kernel. They are not supported for scalar kernel inputs that are passed by value.
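
As an illustration, the following sketch is a hypothetical kernel (the name scale and its arguments are not from the original text) that uses a 128-bit ap_uint global memory pointer together with plain scalar inputs mapped to the control interface:

#include "ap_int.h"

extern "C" {
// Multiply each 32-bit word of "size" 128-bit memory beats by "factor".
void scale(ap_uint<128> *data, int factor, unsigned int size)
{
#pragma HLS INTERFACE m_axi     port=data   offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=data   bundle=control
#pragma HLS INTERFACE s_axilite port=factor bundle=control
#pragma HLS INTERFACE s_axilite port=size   bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
  for (unsigned int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
    ap_uint<128> beat = data[i];
    for (int w = 0; w < 4; w++) {
#pragma HLS UNROLL
      int word = beat.range(32*w + 31, 32*w);
      beat.range(32*w + 31, 32*w) = word * factor;
    }
    data[i] = beat;
  }
}
}

Note that the scalar arguments factor and size use native C types, consistent with the restriction that arbitrary precision types are not supported for scalars passed by value.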

Streaming

Streaming provides the capability to stream data directly to kernels without using global memory. Because global memory is not used, streaming can provide improved performance and lower power, but it requires additional FPGA memory (block RAM).

Streaming can be broken down into two distinct types:

  1. Host to card (H2C) and card to host (C2H) streaming
  2. Kernel to kernel (K2K) streaming

In H2C and C2H streaming, data is streamed between the host and the kernels on the card. H2C and C2H streaming is only available on select QDMA platforms, such as xilinx_u200_qdma_201910_1. For H2C and C2H streaming, no explicit connections are made by you; the connections are made automatically by the system linker.

In addition to streaming data between host and card, kernel-to-kernel (K2K) streaming is also supported. It provides direct streams between kernels. Unlike H2C and C2H streaming, K2K streaming is supported on all platforms. However, with K2K streaming, you must specify the connections between source and destination kernel stream interfaces. This is done during xocc linking, as sketched below.
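
As a hedged sketch (the kernel instance and stream port names are illustrative, and the exact option syntax depends on the tool version), a K2K connection might be specified at link time with the --sc option:

# Connect the "out" stream of krnl_src_1 to the "in" stream of krnl_dst_1
xocc --link --platform xilinx_u200_qdma_201910_1 \
     --sc krnl_src_1.out:krnl_dst_1.in \
     -o pipeline.xclbin krnl_src.xo krnl_dst.xo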

You must specify the following pragma for each streaming interface:

#pragma HLS interface axis port=<arg_name>
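
As an illustration, the following hedged sketch shows a kernel with two AXI4-Stream ports, assuming the qdma_axis packet type from the ap_axi_sdata.h header shipped with the tools (the kernel name and behavior are hypothetical):

#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef qdma_axis<32, 0, 0, 0> pkt; // 32-bit data beat with TLAST/TKEEP

extern "C" {
// Read one packet from "in", increment each beat, write it to "out".
void krnl_incr(hls::stream<pkt> &in, hls::stream<pkt> &out)
{
#pragma HLS interface axis port=in
#pragma HLS interface axis port=out
#pragma HLS interface s_axilite port=return bundle=control
  while (true) {
#pragma HLS PIPELINE II=1
    pkt p = in.read();            // blocking read of one beat
    pkt q;
    q.set_data(p.get_data() + 1); // increment the payload
    q.set_keep(-1);               // mark all bytes valid
    q.set_last(p.get_last());     // propagate end-of-packet
    out.write(q);
    if (p.get_last())
      break;                      // stop after the last beat
  }
}
}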

The “SDAccel Streaming Platform” appendix in the SDAccel Environment Programmers Guide (UG1277) provides complete details on streaming, including host and kernel coding guidelines for transfers between host/kernel and kernel/kernel. For more information about C/C++ kernels, see the "Programming C/C++ Kernels" chapter in the same guide.

Writing RTL Kernels

RTL kernels have both software and hardware requirements that must be met for them to be used in the SDAccel environment framework. On the software side, the RTL kernel must operate and adhere to the register definitions described in Kernel Software Requirements.

On the hardware side, the RTL kernel requires the interfaces outlined in Kernel Interface Requirements.

For complete details on creating and using RTL kernels, see RTL Kernels.