Programming the Host Application

In the SDAccel™ environment, host code is written in C or C++ using the industry-standard OpenCL™ API. The SDAccel environment provides an OpenCL 1.2 embedded profile conformant runtime API.

IMPORTANT: The SDAccel environment supports the OpenCL Installable Client Driver (ICD) extension (cl_khr_icd). This extension allows multiple implementations of OpenCL to co-exist on the same system. Refer to OpenCL Installable Client Driver Loader for details and installation instructions.

The SDAccel environment consists of a host x86 CPU and compute devices running on a Xilinx® FPGA.

In general, the structure of the host code can be divided into three sections:
  1. Setting up the environment.
  2. Core command execution including executing one or more kernels.
  3. Post processing and FPGA release.
The following sections discuss each of the above topics in detail.
Note: For multithreading the host program, exercise caution when calling the fork() system call from an SDAccel environment application. fork() does not duplicate all of the runtime threads; hence, the child process cannot run as a complete application in the SDAccel environment. It is advisable to use the posix_spawn() system call to launch another process from the SDAccel environment application.
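The following is a minimal sketch, not taken from this guide, showing how a child process might be launched with posix_spawn(); the executable name child_app and the argument list are placeholders:

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int launch_child(void)
{
    pid_t pid;
    // "child_app" is a placeholder executable name for illustration
    char *child_argv[] = {(char *)"child_app", NULL};
    int ret = posix_spawn(&pid, "child_app", NULL, NULL, child_argv, environ);
    if (ret != 0) {
        printf("posix_spawn failed with error %d\n", ret);
        return -1;
    }
    waitpid(pid, NULL, 0); // wait for the child process to complete
    return 0;
}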

Setting Up the OpenCL Environment

The host code in the SDAccel environment follows the OpenCL programming paradigm. To set up the environment properly, the host application should identify the standard OpenCL models: platform, devices, context, command queue, and program.

TIP: The host code examples and API commands used in this document follow the OpenCL C API. The IDCT example referred to in SDAccel Example Designs is also written with the C API. However, the SDAccel runtime environment also supports the OpenCL C++ wrapper API, and many of the examples in the GitHub repository are written using the C++ API. Refer to https://www.khronos.org/registry/OpenCL/specs/opencl-cplusplus-1.2.pdf for more information on this C++ wrapper API.

Platform

From the very beginning, the host code should identify the platform composed of one or more Xilinx FPGA devices. The host code segment below is the standard way to identify the Xilinx device-based platform.
cl_platform_id platform_id; // platform id

err = clGetPlatformIDs(16, platforms, &platform_count);

// Find Xilinx Platform
for (unsigned int iplat = 0; iplat < platform_count; iplat++) {
  err = clGetPlatformInfo(platforms[iplat],
                          CL_PLATFORM_VENDOR,
                          1000,
                          (void *)cl_platform_vendor,
                          NULL);
  if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
    // Xilinx platform found
    platform_id = platforms[iplat];
  }
}

The OpenCL API call clGetPlatformIDs is used to discover the set of available OpenCL platforms for a given system. Thereafter, clGetPlatformInfo is used to identify the Xilinx device-based platform by matching cl_platform_vendor with the string "Xilinx".

Note: Though it is not explicitly shown in the preceding code, or in other host code examples used throughout this chapter, it is always good coding practice to check for errors after each OpenCL API call. This can help with debugging and improve productivity when you are debugging the host and kernel code in the emulation flow, or during hardware execution. Below is an error checking code example for the clGetPlatformIDs command:
err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
  printf("Error: Failed to find an OpenCL platform!\n");
  printf("Test failed\n");
  exit(1);
}
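When many API calls are checked, a small helper macro can reduce the repetition. The following OCL_CHECK macro is an illustrative convention, not part of the OpenCL API:

// Illustrative helper macro: check an OpenCL return code and exit on failure
#define OCL_CHECK(err, msg)                              \
    do {                                                 \
        if ((err) != CL_SUCCESS) {                       \
            printf("Error %d: %s\n", (int)(err), (msg)); \
            exit(EXIT_FAILURE);                          \
        }                                                \
    } while (0)

err = clGetPlatformIDs(16, platforms, &platform_count);
OCL_CHECK(err, "Failed to find an OpenCL platform");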

Devices

After the platform detection, the Xilinx FPGA devices attached to the platform are identified. The SDAccel environment supports one or more Xilinx FPGA devices working together.

The following code demonstrates finding all the Xilinx devices (with an upper limit of 16) by using the clGetDeviceIDs API and printing their names.
cl_device_id devices[16]; // compute device id
char cl_device_name[1001];

err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR,
                     16, devices, &num_devices);

printf("INFO: Found %d devices\n", num_devices);

// iterate over all devices to select the target device
for (uint i = 0; i < num_devices; i++) {
  err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, 1024, cl_device_name, 0);
  printf("CL_DEVICE_NAME %s\n", cl_device_name);
}
IMPORTANT: The clGetDeviceIDs API is called with the device_type CL_DEVICE_TYPE_ACCELERATOR to get all the available Xilinx devices.

Sub-devices

In the SDAccel environment, devices sometimes contain multiple kernel instances, either of a single kernel or of different kernels. The OpenCL API clCreateSubDevices allows the host code to divide the device into multiple sub-devices containing one kernel instance per sub-device. Currently, the SDAccel environment supports equally divided sub-devices, each containing only one kernel instance.

The following example shows:
  1. Creating the sub-devices by equal partition to execute one kernel instance per sub-device.
  2. Iterating over the sub-device list, and using a separate context and command queue to execute the kernel on each of them.
  3. The API calls related to kernel execution (and the corresponding buffer-related code) are not shown for the sake of simplicity; they would appear inside the function run_cu.
cl_uint num_devices = 0;
cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY, 1, 0};

// Get the number of sub-devices
clCreateSubDevices(device, props, 0, nullptr, &num_devices);

// Container to hold the sub-devices
std::vector<cl_device_id> devices(num_devices);

// Second call of clCreateSubDevices
// We get sub-device handles in devices.data()
clCreateSubDevices(device, props, num_devices, devices.data(), nullptr);

// Iterating over sub-devices
std::for_each(devices.begin(), devices.end(), [kernel](cl_device_id sdev) {
  cl_int err;

  // Context for sub-device
  auto context = clCreateContext(0, 1, &sdev, nullptr, nullptr, &err);

  // Command queue for sub-device
  auto queue = clCreateCommandQueue(context, sdev,
                                    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                    &err);

  // Execute the kernel on the sub-device using local context and queue
  run_cu(context, queue, kernel); // Function not shown
});

Currently, if a kernel has multiple hardware instances (which can be specified during the kernel compilation phase), the SDAccel environment execution model assumes all those hardware instances have the same global memory connectivity. If they do not, you must use sub-devices to allocate a separate cl_kernel for each of those hardware instances.

Context

The OpenCL context creation process is straightforward. The API clCreateContext is used to create a context that contains one or more Xilinx devices that will communicate with the host machine.
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

In the code example above, the API clCreateContext is used to create a context that contains one Xilinx device. You can create only one context for a device from a host program. However, the host program should use multiple contexts if sub-devices are used: one context for each sub-device.

Command Queues

One or more command queues for each device are created using the clCreateCommandQueue API. The FPGA device can contain multiple kernels. When developing the host application, there are two main programming approaches to execute kernels on a device:

  1. Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. The SDAccel runtime environment dispatches those kernels as soon as possible, in any order, allowing concurrent kernel execution on the FPGA.
  2. Multiple in-order command queues: Each kernel execution is requested from a different in-order command queue. In such cases, the SDAccel runtime environment can dispatch kernels from any command queue, with the intention of improving performance by running them concurrently on the FPGA.
The following is an example of standard API calls to create in-order and out-of-order command queues.
// Out-of-order command queue
commands = clCreateCommandQueue(context, device_id,
                                CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

// In-order command queue
commands = clCreateCommandQueue(context, device_id, 0, &err);

Program

As described in the SDAccel Build Process, the host and kernel code are compiled separately to create separate executable files: the host application (.exe) and the FPGA binary (.xclbin). When the host application is executed, it must load the .xclbin using the clCreateProgramWithBinary API.

The following code example shows how the standard OpenCL API is used to build the program from the .xclbin file:
unsigned char *kernelbinary;
char *xclbin = argv[1];

printf("INFO: loading xclbin %s\n", xclbin);

int size = load_file_to_memory(xclbin, (char **) &kernelbinary);
size_t size_var = size;

cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
                                               &size_var,
                                               (const unsigned char **) &kernelbinary,
                                               &status, &err);

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Function
int load_file_to_memory(const char *filename, char **result)
{
  uint size = 0;
  FILE *f = fopen(filename, "rb");
  if (f == NULL) {
    *result = NULL;
    return -1; // -1 means file opening fail
  }
  fseek(f, 0, SEEK_END);
  size = ftell(f);
  fseek(f, 0, SEEK_SET);
  *result = (char *)malloc(size + 1);
  if (size != fread(*result, sizeof(char), size, f)) {
    free(*result);
    fclose(f);
    return -2; // -2 means file reading fail
  }
  fclose(f);
  (*result)[size] = 0;
  return size;
}
The above example performs the following steps:
  1. The kernel binary file, .xclbin, is passed in from the command line argument, argv[1].
    TIP: Passing the .xclbin through a command line argument is specific to this example. You can also hardcode the kernel binary file in the application.
  2. The load_file_to_memory function is used to load the file contents into the host machine memory space.
  3. The APIs clCreateProgramWithBinary and clBuildProgram are used to complete the program creation process.

Executing Commands in the FPGA Device

Once the OpenCL environment is initialized, the host application is ready to issue commands to the device and interact with the kernels. Such commands include:
  1. Memory data transfer to and from the FPGA device.
  2. Kernel execution on FPGA.
  3. Event synchronization.

Buffer Transfer to/from the FPGA Device

Interactions between the host application and kernels rely on transferring data to and from global memory in the device. The simplest way to send data back and forth from the FPGA is using the clCreateBuffer, clEnqueueWriteBuffer, and clEnqueueReadBuffer commands. The following code example demonstrates this:

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = <...>; // fill with input data
}

cl_mem dev_mem_ptr = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE,
                                    sizeof(int) * number_of_words,
                                    NULL, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
                           sizeof(int) * number_of_words,
                           host_mem_ptr, 0, NULL, NULL);
IMPORTANT: A single buffer cannot be bigger than 4 GB.
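If an application's data set exceeds this limit, it must be split across multiple buffers. The following is a minimal sketch of that pattern, assuming total_bytes and host_ptr are defined elsewhere in the application:

// Illustrative only: split a large data set into buffers of at most 4 GB each
const size_t MAX_CHUNK = 4ULL * 1024 * 1024 * 1024; // 4 GB
size_t offset = 0;
while (offset < total_bytes) {
  size_t chunk = (total_bytes - offset < MAX_CHUNK) ?
                 (total_bytes - offset) : MAX_CHUNK;
  cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, chunk, NULL, &err);
  err = clEnqueueWriteBuffer(commands, buf, CL_TRUE, 0, chunk,
                             (char *)host_ptr + offset, 0, NULL, NULL);
  // keep track of buf for later kernel arguments and release
  offset += chunk;
}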

For the majority of applications, the example code above is sufficient to transfer data from the host to the device memory. However, there are a number of coding practices you should adopt to maximize performance and fine-grained control.

Using clEnqueueMigrateMemObjects

Another consideration when transferring data is using clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve performance. Typically, memory objects are implicitly migrated to a device for enqueued kernels. Using this API call results in data transfer ahead of kernel execution to reduce latency, particularly when a kernel is called multiple times.

The following code example is modified to use clEnqueueMigrateMemObjects:

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = <...>; // fill with input data
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj   = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                                    sizeof(int) * number_of_words,
                                    &d_bank0_ext, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0,
                                 0, NULL, NULL);

Using posix_memalign for Host Memory Space

The SDAccel runtime allocates memory space on 4K page boundaries for internal memory management. If the host memory pointer is not aligned to a 4K boundary, the runtime performs an extra memcpy to make it aligned. This does not significantly impact performance, but you should align the host memory pointer to a 4K boundary to match the SDAccel runtime memory management.

The following is an example of how posix_memalign is used instead of malloc for the host memory space pointer.

int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int));

// Aligning memory on a 4K boundary
posix_memalign((void **)&host_mem_ptr, 4096, MAX_LENGTH * sizeof(int));

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = <...>; // fill with input data
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj   = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                                    sizeof(int) * number_of_words,
                                    &d_bank0_ext, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0,
                                 0, NULL, NULL);

Enhanced Buffer Allocation

By default, all the memory interfaces from all the kernels are connected to a single global memory bank when kernels are linked. As a result, only one memory interface can transfer data to and from the global memory bank at a time, limiting the overall performance of the application. If the FPGA device contains only one global memory bank, this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the default connection. This improves overall performance by enabling multiple kernel memory interfaces to concurrently read and write data from separate global memory banks. This topic is discussed in greater detail in Customization of DDR Bank to Kernel Connection.

When kernel ports are mapped to memory banks other than the default one, it is necessary to use the enhanced buffer allocation pattern when creating the OpenCL buffers.

The enhanced buffer allocation pattern uses a Xilinx vendor extension, the cl_mem_ext_ptr_t pointer, to help the Xilinx runtime determine in which global memory bank the buffer should be allocated.

The cl_mem_ext_ptr_t type is a struct as defined below:

typedef struct {
  unsigned flags;
  void *obj;
  void *param;
} cl_mem_ext_ptr_t;

Use the explicit bank name method, described below, with cl_mem_ext_ptr_t for enhanced buffer allocation.

Explicit Bank Name Method

In this approach, the struct field flags is used to denote the DDR bank (XCL_MEM_DDR_BANK1, XCL_MEM_DDR_BANK2, and so on). The struct field param should not be used and should be set to NULL.

The following code example uses cl_mem_ext_ptr_t to assign the device buffer to DDR Bank 2.

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = <...>; // fill with input data
}

cl_mem_ext_ptr_t d_bank2_ext;
d_bank2_ext.flags = XCL_MEM_DDR_BANK2;
d_bank2_ext.obj   = NULL;
d_bank2_ext.param = NULL;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                                    sizeof(int) * number_of_words,
                                    &d_bank2_ext, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
                           sizeof(int) * number_of_words,
                           host_mem_ptr, 0, NULL, NULL);
IMPORTANT: Starting from the 2018.3 release, the new method of specifying a bank name is:
var_ext.flags = <bank id> | XCL_MEM_TOPOLOGY;
Where a <bank id> of 0, 1, 2, or 3 stands for a different DDR bank. However, the older naming style of XCL_MEM_DDR_BANK0, and so on, still works for existing platforms.
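For example, the DDR bank 2 assignment from the preceding code could be written in the new style as follows (a sketch based on the naming scheme above):

cl_mem_ext_ptr_t d_bank2_ext;
d_bank2_ext.flags = 2 | XCL_MEM_TOPOLOGY; // 2 selects DDR bank 2
d_bank2_ext.obj   = NULL;
d_bank2_ext.param = NULL;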

Kernel Setup and Execution

This section focuses on how a typical host application performs the following kernel-related tasks in the SDAccel environment:
  1. Identifying the kernels.
  2. Setting kernel arguments.
  3. Executing kernels on the FPGA.

Identifying the Kernels

At the beginning, the individual kernels present in the .xclbin file should be mapped to the kernel handles (denoted by the cl_kernel type) in the host code. This is done by the clCreateKernel command with the kernel name as an argument:
kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err);
// etc

Setting Kernel Arguments

In the SDAccel environment framework, two types of kernel arguments can be set:
  1. Scalar arguments are used for small data transfers, such as constant or configuration-type data. These are write-only arguments.
  2. Buffer arguments are used for large data transfers, as discussed in Buffer Transfer to/from the FPGA Device.
The kernel arguments can be set using the clSetKernelArg command, as shown below. The following example sets kernel arguments for two scalar arguments and three buffer arguments.
int err = 0;

// Setting up scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width);

cl_uint scalar_arg_image_height = 2160;
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height);

// Setting up buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_mem_ptr0);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_mem_ptr1);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &dev_mem_ptr2);

Enqueuing the Kernels

The kernel is enqueued to run on the FPGA by either the clEnqueueTask or clEnqueueNDRangeKernel command. Xilinx recommends using the clEnqueueTask command to execute the kernel over the entire range of the input data set, using the maximum number of work group items:
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
TIP: clEnqueueTask is the same as calling clEnqueueNDRangeKernel with work_dim set to 1, global_work_offset set to NULL, global_work_size[0] set to 1, and local_work_size[0] set to 1.
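For reference, the equivalent clEnqueueNDRangeKernel call described in the TIP looks like the following:

// Same effect as clEnqueueTask: one work-item in one work-group
size_t global_work_size[1] = {1};
size_t local_work_size[1]  = {1};

err = clEnqueueNDRangeKernel(commands, kernel,
                             1,                // work_dim
                             NULL,             // global_work_offset
                             global_work_size, // global_work_size[0] = 1
                             local_work_size,  // local_work_size[0] = 1
                             0, NULL, NULL);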

Like all enqueue commands, clEnqueueTask and clEnqueueNDRangeKernel are asynchronous in nature. The host code continues executing without waiting for the kernel computation to complete on the FPGA device. This allows the host program to enqueue more kernels, either the same kernel multiple times over a different set of data, or a different kernel. After finishing its work, the kernel writes the result data to the global memory bank. This data is read back to the host memory space by using the clEnqueueReadBuffer or clEnqueueMigrateMemObjects command, as shown in the sketch below.
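For example, the result buffer from the earlier examples could be migrated back to host-accessible memory as follows (a sketch using the CL_MIGRATE_MEM_OBJECT_HOST flag):

// Request a device-to-host transfer of the result buffer
err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr,
                                 CL_MIGRATE_MEM_OBJECT_HOST,
                                 0, NULL, NULL);

// Block until the migration (and therefore the transfer) has completed
clFinish(commands);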

Event Synchronization

All OpenCL clEnqueueXXX API calls are asynchronous. In other words, these commands return immediately after the command is enqueued in the command queue. To resolve the dependencies among the commands, an API call such as clWaitForEvents or clFinish can be used to pause or block execution of the host program.

Example usage of the clWaitForEvents and clFinish commands is shown below:
err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);

// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue);

// Read back the results from the device to verify the output
cl_event readevent;
int host_mem_output_ptr[MAX_LENGTH]; // host memory for output vector

clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0,
                    sizeof(int) * number_of_words,
                    host_mem_output_ptr, 0, NULL, &readevent);

// Wait for the clEnqueueReadBuffer event to finish
clWaitForEvents(1, &readevent);

// Check results
// Compare golden values with host_mem_output_ptr
Note how the synchronization APIs have been added in the above example.
  1. The clFinish API has been explicitly used to block host execution until the kernel execution is finished. This is necessary; otherwise, the host might attempt to read back from the FPGA buffer too early and read garbage data.
  2. The data transfer from the FPGA memory to the local host machine is done through clEnqueueReadBuffer. Here, the last argument of clEnqueueReadBuffer returns an event object that identifies this particular read command, and can be used to query the event, or wait for this particular command to complete. The clWaitForEvents call specifies that one event, and waits to ensure the data transfer is finished before checking the data from the host-side memory.

Post Processing and FPGA Cleanup

At the final stage of the host program, it is good practice to confirm FPGA functionality by comparing the output data from the FPGA with golden data. This greatly helps identify and debug any issues with the kernel.
bool failed = false;

// Compare the FPGA results against the expected golden values
// (golden_output is assumed to hold the reference data)
for (int i = 0; i < number_of_words; i++) {
  if (host_mem_output_ptr[i] != golden_output[i]) {
    failed = true;
  }
}
At the end of the host code, all the allocated resources should be released by using the proper release functions. The SDAccel environment might not be able to generate a correct performance-related profile and analysis report if the resources are not properly released.
clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID);
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);

Summary

As discussed in earlier topics, the recommended coding style for the host application in the SDAccel environment includes the following points:

  1. Add error checking after each OpenCL API call for debugging purposes, if required.
  2. In the SDAccel environment, one or more kernels are separately pre-compiled into the .xclbin file. The clCreateProgramWithBinary API is used to build the program from the kernel binary.
  3. Use cl_mem_ext_ptr_t to match the kernel memory interface to the DDR bank connection that was used to build the kernel binary.
  4. Transfer data back and forth between the host code and the FPGA by using clEnqueueMigrateMemObjects.
  5. Use posix_memalign to align the host memory pointer to a 4K boundary.
  6. Use the out-of-order command queue, or multiple in-order command queues, for concurrent kernel execution on the FPGA.
  7. Execute the whole workload with clEnqueueTask, rather than splitting the workload by using clEnqueueNDRangeKernel.
  8. Use synchronization commands to resolve dependencies of the asynchronous OpenCL API calls.