Programming the Host Application
In the SDAccel™ environment, host code is written in C or C++ using the industry-standard OpenCL™ API. The SDAccel environment provides an OpenCL 1.2 embedded profile conformant runtime API.
The SDAccel environment supports the OpenCL Installable Client Driver (ICD) extension (cl_khr_icd). This extension allows multiple implementations of OpenCL to co-exist on the same system. Refer to OpenCL Installable Client Driver Loader for details and installation instructions.
The SDAccel environment consists of a host x86 CPU and compute devices running on a Xilinx® FPGA. The host application typically performs the following steps:
- Setting up the environment.
- Core command execution, including executing one or more kernels.
- Post-processing and FPGA release.
IMPORTANT: Avoid using the fork() system call from an SDAccel environment application. The fork() call does not duplicate all of the runtime threads, so the child process cannot run as a complete application in the SDAccel environment. Use the posix_spawn() system call instead to launch another process from an SDAccel environment application.
Setting Up the OpenCL Environment
The host code in the SDAccel environment follows the OpenCL programming paradigm. To set up the environment properly, the host application should identify the standard OpenCL objects: platform, devices, context, command queue, and program.
Platform
```c
cl_platform_id platform_id;       // platform id

err = clGetPlatformIDs(16, platforms, &platform_count);

// Find the Xilinx platform
for (unsigned int iplat = 0; iplat < platform_count; iplat++) {
  err = clGetPlatformInfo(platforms[iplat], CL_PLATFORM_VENDOR, 1000,
                          (void *)cl_platform_vendor, NULL);
  if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
    // Xilinx platform found
    platform_id = platforms[iplat];
  }
}
```
The OpenCL API call clGetPlatformIDs is used to discover the set of available OpenCL platforms for a given system. Thereafter, clGetPlatformInfo is used to identify the Xilinx device-based platform by matching cl_platform_vendor with the string "Xilinx".
It is good coding practice to check the error code returned by each OpenCL API call, as shown below for the clGetPlatformIDs command:

```c
err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
  printf("Error: Failed to find an OpenCL platform!\n");
  printf("Test failed\n");
  exit(1);
}
```
Devices
After the platform detection, the Xilinx FPGA devices attached to the platform are identified. The SDAccel environment supports one or more Xilinx FPGA devices working together.
The following code demonstrates finding all the Xilinx devices with the API clGetDeviceIDs and printing their names:

```c
cl_device_id devices[16];  // compute device id
char cl_device_name[1001];

err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 16,
                     devices, &num_devices);
printf("INFO: Found %d devices\n", num_devices);

// Iterate over all devices to select the target device
for (uint i = 0; i < num_devices; i++) {
  err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, 1000, cl_device_name, 0);
  printf("CL_DEVICE_NAME %s\n", cl_device_name);
}
```
Note that the clGetDeviceIDs API is called with the device_type CL_DEVICE_TYPE_ACCELERATOR to get all the available Xilinx devices.
Sub-devices
In the SDAccel environment, devices sometimes contain multiple kernel instances, of a single kernel or of different kernels. The OpenCL API clCreateSubDevices allows the host code to divide the device into multiple sub-devices containing one kernel instance per sub-device. Currently, the SDAccel environment supports only equally divided sub-devices, each containing one kernel instance.
The following example shows:
- The sub-devices are created by equal partition to execute one kernel instance per sub-device.
- Iterating over the sub-device list and using a separate context and command queue to execute the kernel on each of them.
- The API related to kernel execution (and the corresponding buffer-related code) is not shown for the sake of simplicity; it would appear inside the function run_cu.
```cpp
cl_uint num_devices = 0;
cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY, 1, 0};

// Get the number of sub-devices
clCreateSubDevices(device, props, 0, nullptr, &num_devices);

// Container to hold the sub-devices
std::vector<cl_device_id> devices(num_devices);

// Second call of clCreateSubDevices
// We get sub-device handles in devices.data()
clCreateSubDevices(device, props, num_devices, devices.data(), nullptr);

// Iterating over sub-devices
std::for_each(devices.begin(), devices.end(), [kernel](cl_device_id sdev) {
  cl_int err;
  // Context for sub-device
  auto context = clCreateContext(0, 1, &sdev, nullptr, nullptr, &err);
  // Command queue for sub-device
  auto queue = clCreateCommandQueue(context, sdev,
                   CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
  // Execute the kernel on the sub-device using the local context and queue
  run_cu(context, queue, kernel);  // Function not shown
});
```
Currently, if a kernel has multiple hardware instances (which can be specified during the kernel compilation phase), the SDAccel environment execution model assumes that all those hardware instances have the same global memory connectivity. If they do not, you must use sub-devices to allocate a separate cl_kernel for each of those hardware instances.
Context
The clCreateContext API is used to create a context that contains one or more Xilinx devices that will communicate with the host machine:

```c
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
```
In the code example above, the API clCreateContext is used to create a context that contains one Xilinx device. You can create only one context for a device from a host program. However, the host program should use multiple contexts if sub-devices are used: one context for each sub-device.
Command Queues
One or more command queues for each device are created using the clCreateCommandQueue API. The FPGA device can contain multiple kernels. When developing the host application, there are two main programming approaches to execute kernels on a device:
- Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. The SDAccel runtime environment dispatches those kernels as soon as possible, in any order, allowing concurrent kernel execution on the FPGA.
- Multiple in-order command queues: Each kernel execution is requested from a different in-order command queue. In such cases, the SDAccel runtime environment can dispatch kernels from any command queue, improving performance by running them concurrently on the FPGA.
```c
// Out-of-order command queue
commands = clCreateCommandQueue(context, device_id,
               CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

// In-order command queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
```
Program
As described in the SDAccel Build Process, the host and kernel code are compiled separately to create separate executable files: the host application (.exe) and the FPGA binary (.xclbin). When the host application is executed, it must load the .xclbin file using the clCreateProgramWithBinary API.
```c
unsigned char *kernelbinary;
char *xclbin = argv[1];

printf("INFO: loading xclbin %s\n", xclbin);
int size = load_file_to_memory(xclbin, (char **) &kernelbinary);
size_t size_var = size;

cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
                         &size_var, (const unsigned char **) &kernelbinary,
                         &status, &err);

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Function
int load_file_to_memory(const char *filename, char **result)
{
  uint size = 0;
  FILE *f = fopen(filename, "rb");
  if (f == NULL) {
    *result = NULL;
    return -1; // -1 means file opening fail
  }
  fseek(f, 0, SEEK_END);
  size = ftell(f);
  fseek(f, 0, SEEK_SET);
  *result = (char *)malloc(size + 1);
  if (size != fread(*result, sizeof(char), size, f)) {
    free(*result);
    fclose(f);
    return -2; // -2 means file reading fail
  }
  fclose(f);
  (*result)[size] = 0;
  return size;
}
```
- The kernel binary file, .xclbin, is passed in from the command line argument, argv[1]. TIP: Passing the .xclbin file through a command line argument is specific to this example. You can also hardcode the kernel binary file in the application.
- The load_file_to_memory function is used to load the file contents into the host machine memory space.
- The APIs clCreateProgramWithBinary and clBuildProgram are used to complete the program creation process.
Executing Commands in the FPGA Device
This section covers the following topics:
- Memory data transfer to and from the FPGA device.
- Kernel execution on FPGA.
- Event synchronization.
Buffer Transfer to/from the FPGA Device
Interactions between the host application and kernels rely on transferring data to and from global memory on the device. The simplest way to send data back and forth is using the clCreateBuffer, clEnqueueWriteBuffer, and clEnqueueReadBuffer commands, as the following code example demonstrates:
```c
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem dev_mem_ptr = clCreateBuffer(context, CL_MEM_READ_WRITE,
                         sizeof(int) * number_of_words, NULL, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
          sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
```
For the majority of applications, the example code above is sufficient to transfer data from the host to the device memory. However, there are a number of coding practices you should adopt to maximize performance and gain fine-grained control.
Using clEnqueueMigrateMemObjects

Another consideration when transferring data is using clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve performance. Typically, memory objects are implicitly migrated to a device for enqueued kernels. Using this API call explicitly transfers the data ahead of kernel execution to reduce latency, particularly when a kernel is called multiple times.
The following code example is modified to use clEnqueueMigrateMemObjects:

```c
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj   = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                         CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                         sizeof(int) * number_of_words, &d_bank0_ext, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);
```
Using posix_memalign for Host Memory Space

The SDAccel runtime allocates memory spaces on 4K boundaries for internal memory management. If the host memory pointer is not aligned to a 4K boundary, the runtime performs an extra memcpy to align it. While this does not significantly impact performance, you should align the host memory pointer on a 4K boundary to cooperate with the SDAccel runtime memory management.
The following example shows how posix_memalign is used instead of malloc for the host memory space pointer:

```c
int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int));

// Align memory on a 4K boundary
posix_memalign((void **)&host_mem_ptr, 4096, MAX_LENGTH * sizeof(int));

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj   = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                         CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                         sizeof(int) * number_of_words, &d_bank0_ext, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);
```
Enhanced Buffer Allocation
By default, all memory interfaces from all kernels are connected to a single global memory bank when kernels are linked. As a result, only one memory interface can transfer data to and from the global memory bank at a time, limiting the overall performance of the application. If the FPGA device contains only one global memory bank, this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the default connection. This improves overall performance by enabling multiple kernel memory interfaces to concurrently read and write data from separate global memory banks. This topic is discussed in greater detail in Customization of DDR Bank to Kernel Connection.
When kernel ports are mapped to memory banks other than the default one, you must use the enhanced buffer allocation pattern when creating the OpenCL buffers.
The enhanced buffer allocation pattern uses a Xilinx vendor extension, cl_mem_ext_ptr_t, a pointer type that helps the Xilinx runtime determine which global memory bank a buffer should be allocated in. The cl_mem_ext_ptr_t type is a struct as defined below:
```c
typedef struct {
  unsigned flags;
  void *obj;
  void *param;
} cl_mem_ext_ptr_t;
```
Use the explicit bank name method to populate cl_mem_ext_ptr_t for enhanced buffer allocation.
Explicit Bank Name Method
In this approach, the struct field flags is used to denote the DDR bank (XCL_MEM_DDR_BANK1, XCL_MEM_DDR_BANK2, and so on). The struct field param should not be used and must be set to NULL.
The following code example uses cl_mem_ext_ptr_t to assign the device buffer to DDR bank 2:

```c
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem_ext_ptr_t d_bank2_ext;
d_bank2_ext.flags = XCL_MEM_DDR_BANK2;
d_bank2_ext.obj   = NULL;
d_bank2_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                         CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                         sizeof(int) * number_of_words, &d_bank2_ext, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
          sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
```
The memory bank can also be specified by its index combined with the XCL_MEM_TOPOLOGY flag:

var_ext.flags = <bank index> | XCL_MEM_TOPOLOGY;

where 0, 1, 2, and 3 stand for different DDR banks. However, the older naming style of XCL_MEM_DDR_BANK0, etc., still works for existing platforms.
Kernel Setup and Execution
This section discusses:
- Identifying the kernels.
- Setting kernel arguments.
- Executing kernels on the FPGA.
Identifying the Kernels

At the beginning of the host code, the kernels must be identified (as cl_kernel types). This is done by the clCreateKernel command with the kernel name as an argument:
```c
kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err);
// etc.
```
Setting Kernel Arguments
There are two types of arguments that can be set for kernel objects:
- Scalar arguments, used for small data transfers such as constant or configuration-type data. These are write-only arguments.
- Buffer arguments, used for large data transfers, as discussed in Buffer Transfer to/from the FPGA Device.
Kernel arguments are set using the clSetKernelArg command, as shown below. The following example sets two scalar arguments and three buffer arguments:
```c
int err = 0;

// Setting up scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width);

cl_uint scalar_arg_image_height = 2160;
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height);

// Setting up buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_mem_ptr0);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_mem_ptr1);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &dev_mem_ptr2);
```
Enqueuing the Kernels

Kernels are executed using the clEnqueueTask or clEnqueueNDRangeKernel commands. Xilinx recommends using the clEnqueueTask command to execute the kernel over the entire range of the input data set using the maximum number of work group items:

```c
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
```
Calling clEnqueueTask is the same as calling clEnqueueNDRangeKernel with work_dim set to 1, global_work_offset set to NULL, global_work_size[0] set to 1, and local_work_size[0] set to 1.
Like all enqueue commands, clEnqueueTask and clEnqueueNDRangeKernel are asynchronous in nature. The host code continues executing without waiting for the kernel computation to complete on the FPGA device. This allows the host program to enqueue more kernels, either the same kernel multiple times over different sets of data, or different kernels. After finishing its work, the kernel writes the result data to the global memory bank. This data is read back to the host memory space using the clEnqueueReadBuffer or clEnqueueMigrateMemObjects commands.
Event Synchronization
All OpenCL clEnqueueXXX API calls are asynchronous. In other words, these commands return immediately after the command is enqueued in the command queue. To resolve dependencies among commands, API calls such as clWaitForEvents or clFinish can be used to pause or block execution of the host program.
Examples of the clWaitForEvents and clFinish commands are shown below:
```c
err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);

// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue);

// Read back the results from the device to verify the output
cl_event readevent;
int host_mem_output_ptr[MAX_LENGTH]; // host memory for output vector

clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0,
    sizeof(int) * number_of_words, host_mem_output_ptr, 0, NULL, &readevent);

clWaitForEvents(1, &readevent); // Wait for clEnqueueReadBuffer event to finish

// Check results
// Compare golden values with host_mem_output_ptr
```
- The clFinish API is used explicitly to block host execution until the kernel execution is finished. This is necessary; otherwise, the host could attempt to read back from the FPGA buffer too early and read garbage data.
- The data transfer from FPGA memory to the local host machine is done through clEnqueueReadBuffer. The last argument of clEnqueueReadBuffer returns an event object that identifies this particular read command; it can be used to query the event, or to wait for this particular command to complete. The clWaitForEvents call specifies that one event, and waits to ensure the data transfer is finished before checking the data from the host-side memory.
Post Processing and FPGA Cleanup
After the kernel execution, the host compares the results read back from the device against the expected golden data, for example:

```c
bool failed = false;
for (i = 0; i < number_of_words; i++) {
  // golden_output: expected results (name illustrative)
  if (host_mem_output_ptr[i] != golden_output[i]) {
    failed = true;
  }
}
```
At the end of the host code, all the allocated OpenCL resources should be released using the proper release functions:

```c
clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID);
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);
```
Summary
As discussed in the earlier topics, the recommended coding style for the host application in the SDAccel environment includes the following points:
- Add error checking after each OpenCL API call for debugging purposes, if required.
- In the SDAccel environment, one or more kernels are separately pre-compiled into the .xclbin file. The clCreateProgramWithBinary API is used to build the program from the kernel binary.
- Use cl_mem_ext_ptr_t to match each custom kernel memory interface to the DDR bank connection that was used to build the kernel binary.
- Transfer data back and forth between the host code and the FPGA by using clEnqueueMigrateMemObjects.
- Use posix_memalign to align the host memory pointer on a 4K boundary.
- Use an out-of-order command queue, or multiple in-order command queues, for concurrent kernel execution on the FPGA.
- Execute the whole workload with clEnqueueTask, rather than splitting the workload by using clEnqueueNDRangeKernel.
- Use synchronization commands to resolve dependencies of the asynchronous OpenCL API calls.