Programming the Host Application
In the SDAccel™ environment, host code is written in C or C++ using the industry-standard OpenCL™ API. The SDAccel environment provides an OpenCL 1.2 embedded profile conformant runtime API.
The SDAccel environment supports the OpenCL Installable Client Driver (ICD) extension (cl_khr_icd). This extension allows multiple implementations of OpenCL to co-exist on the same system. Refer to OpenCL Installable Client Driver Loader for details and installation instructions.
The SDAccel environment consists of a host x86 CPU and compute devices running on a Xilinx® FPGA. The host application typically performs the following steps:
- Setting up the environment.
- Core command execution, including executing one or more kernels.
- Post-processing and FPGA release.
IMPORTANT: Avoid using the fork() system call from an SDAccel environment application. The fork() call does not duplicate all of the runtime threads, so the child process cannot run as a complete application in the SDAccel environment. Use the posix_spawn() system call instead to launch another process from an SDAccel environment application.
Setting Up the OpenCL Environment
The host code in the SDAccel environment follows the OpenCL programming paradigm. To set up the environment properly, the host application should identify the standard OpenCL objects: platform, devices, context, command queue, and program.
Platform
```c
cl_platform_id platform_id;       // platform id

err = clGetPlatformIDs(16, platforms, &platform_count);

// Find the Xilinx platform
for (unsigned int iplat = 0; iplat < platform_count; iplat++) {
  err = clGetPlatformInfo(platforms[iplat], CL_PLATFORM_VENDOR, 1000,
                          (void *)cl_platform_vendor, NULL);
  if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
    // Xilinx platform found
    platform_id = platforms[iplat];
  }
}
```
The OpenCL API call clGetPlatformIDs is used to discover the set of available OpenCL platforms for a given system. Thereafter, clGetPlatformInfo is used to identify the Xilinx device-based platform by matching cl_platform_vendor with the string "Xilinx".
It is good coding practice to check the error code returned by each OpenCL API call, as shown below for the clGetPlatformIDs command:

```c
err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
  printf("Error: Failed to find an OpenCL platform!\n");
  printf("Test failed\n");
  exit(1);
}
```
Devices
After the platform detection, the Xilinx FPGA devices attached to the platform are identified. The SDAccel environment supports one or more Xilinx FPGA devices working together.
The following code demonstrates finding all the Xilinx devices with the API clGetDeviceIDs and printing their names:

```c
cl_device_id devices[16];  // compute device id
char cl_device_name[1001];

err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 16,
                     devices, &num_devices);
printf("INFO: Found %d devices\n", num_devices);

// Iterate over all devices to select the target device
for (uint i = 0; i < num_devices; i++) {
  err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, 1000, cl_device_name, 0);
  printf("CL_DEVICE_NAME %s\n", cl_device_name);
}
```
Note that the clGetDeviceIDs API is called with the device_type CL_DEVICE_TYPE_ACCELERATOR to get all the available Xilinx devices.
Sub-devices
In the SDAccel environment, devices sometimes contain multiple kernel instances, of a single kernel or of different kernels. The OpenCL API clCreateSubDevices allows the host code to divide the device into multiple sub-devices containing one kernel instance per sub-device. Currently, the SDAccel environment supports only equally divided sub-devices, each containing one kernel instance.
The following example shows:
- The sub-devices are created by equal partition to execute one kernel instance per sub-device.
- Iterating over the sub-device list and using a separate context and command queue to execute the kernel on each of them.
- The API related to kernel execution (and the corresponding buffer-related code) is not shown for the sake of simplicity; it would appear inside the function run_cu.
```cpp
cl_uint num_devices = 0;
cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY, 1, 0};

// Get the number of sub-devices
clCreateSubDevices(device, props, 0, nullptr, &num_devices);

// Container to hold the sub-devices
std::vector<cl_device_id> devices(num_devices);

// Second call of clCreateSubDevices
// We get sub-device handles in devices.data()
clCreateSubDevices(device, props, num_devices, devices.data(), nullptr);

// Iterating over sub-devices
std::for_each(devices.begin(), devices.end(), [kernel](cl_device_id sdev) {
  cl_int err;
  // Context for sub-device
  auto context = clCreateContext(0, 1, &sdev, nullptr, nullptr, &err);
  // Command queue for sub-device
  auto queue = clCreateCommandQueue(context, sdev,
                   CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
  // Execute the kernel on the sub-device using the local context and queue
  run_cu(context, queue, kernel);  // Function not shown
});
```
Currently, if a kernel has multiple hardware instances (which can be specified during the kernel compilation phase), the SDAccel environment execution model assumes that all those hardware instances have the same global memory connectivity. If they do not, you must use sub-devices to allocate a separate cl_kernel for each of those hardware instances.
Context
The clCreateContext API is used to create a context that contains one or more Xilinx devices that will communicate with the host machine:

```c
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
```
In the code example above, the API clCreateContext is used to create a context that contains one Xilinx device. You can create only one context for a device from a host program. However, the host program should use multiple contexts if sub-devices are used: one context for each sub-device.
Command Queues
One or more command queues for each device are created using the clCreateCommandQueue API. The FPGA device can contain multiple kernels. When developing the host application, there are two main programming approaches to execute kernels on a device:
- Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. The SDAccel runtime environment dispatches those kernels as soon as possible, in any order, allowing concurrent kernel execution on the FPGA.
- Multiple in-order command queues: Each kernel execution is requested from a different in-order command queue. In such cases, the SDAccel runtime environment can dispatch kernels from any command queue, improving performance by running them concurrently on the FPGA.
```c
// Out-of-order command queue
commands = clCreateCommandQueue(context, device_id,
               CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

// In-order command queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
```
Program
As described in the SDAccel Build Process, the host and kernel code are compiled separately to create separate executable files: the host application (.exe) and the FPGA binary (.xclbin). When the host application is executed, it must load the .xclbin file using the clCreateProgramWithBinary API.
```c
unsigned char *kernelbinary;
char *xclbin = argv[1];

printf("INFO: loading xclbin %s\n", xclbin);
int size = load_file_to_memory(xclbin, (char **) &kernelbinary);
size_t size_var = size;

cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
                         &size_var, (const unsigned char **) &kernelbinary,
                         &status, &err);

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Function
int load_file_to_memory(const char *filename, char **result)
{
  uint size = 0;
  FILE *f = fopen(filename, "rb");
  if (f == NULL) {
    *result = NULL;
    return -1; // -1 means file opening fail
  }
  fseek(f, 0, SEEK_END);
  size = ftell(f);
  fseek(f, 0, SEEK_SET);
  *result = (char *)malloc(size + 1);
  if (size != fread(*result, sizeof(char), size, f)) {
    free(*result);
    fclose(f);
    return -2; // -2 means file reading fail
  }
  fclose(f);
  (*result)[size] = 0;
  return size;
}
```
- The kernel binary file, .xclbin, is passed in from the command line argument, argv[1]. TIP: Passing the .xclbin file through a command line argument is specific to this example. You can also hardcode the kernel binary file in the application.
- The load_file_to_memory function is used to load the file contents into the host machine memory space.
- The APIs clCreateProgramWithBinary and clBuildProgram are used to complete the program creation process.
Executing Commands in the FPGA Device
This section covers the following topics:
- Memory data transfer to and from the FPGA device.
- Kernel execution on FPGA.
- Event synchronization.
Buffer Transfer to/from the FPGA Device
Interactions between the host application and kernels rely on transferring data to and from global memory on the device. The simplest way to send data back and forth is using the clCreateBuffer, clEnqueueWriteBuffer, and clEnqueueReadBuffer commands, as the following code example demonstrates:
```c
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem dev_mem_ptr = clCreateBuffer(context, CL_MEM_READ_WRITE,
                         sizeof(int) * number_of_words, NULL, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
          sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
```
For the majority of applications, the example code above is sufficient to transfer data from the host to the device memory. However, there are a number of coding practices you should adopt to maximize performance and gain fine-grained control.
Using clEnqueueMigrateMemObjects

Another consideration when transferring data is using clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve performance. Typically, memory objects are implicitly migrated to a device for enqueued kernels. Using this API call explicitly transfers the data ahead of kernel execution to reduce latency, particularly when a kernel is called multiple times.
The following code example is modified to use clEnqueueMigrateMemObjects:

```c
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj   = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                         CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                         sizeof(int) * number_of_words, &d_bank0_ext, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);
```
Using posix_memalign for Host Memory Space

The SDAccel runtime allocates memory spaces on 4K boundaries for internal memory management. If the host memory pointer is not aligned to a 4K boundary, the runtime performs an extra memcpy to align it. While this does not significantly impact performance, you should align the host memory pointer on a 4K boundary to cooperate with the SDAccel runtime memory management.
The following example shows how posix_memalign is used instead of malloc for the host memory space pointer:

```c
int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int));

// Align memory on a 4K boundary
posix_memalign((void **)&host_mem_ptr, 4096, MAX_LENGTH * sizeof(int));

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj   = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                         CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                         sizeof(int) * number_of_words, &d_bank0_ext, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);
```
Enhanced Buffer Allocation
By default, all memory interfaces from all kernels are connected to a single global memory bank when kernels are linked. As a result, only one memory interface can transfer data to and from the global memory bank at a time, limiting the overall performance of the application. If the FPGA device contains only one global memory bank, this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the default connection. This improves overall performance by enabling multiple kernel memory interfaces to concurrently read and write data from separate global memory banks. This topic is discussed in greater detail in Customization of DDR Bank to Kernel Connection.
When kernel ports are mapped to memory banks other than the default one, you must use the enhanced buffer allocation pattern when creating the OpenCL buffers.
The enhanced buffer allocation pattern uses a Xilinx vendor extension, cl_mem_ext_ptr_t, a pointer type that helps the Xilinx runtime determine which global memory bank a buffer should be allocated in. The cl_mem_ext_ptr_t type is a struct as defined below:
```c
typedef struct {
  unsigned flags;
  void *obj;
  void *param;
} cl_mem_ext_ptr_t;
```
Use the explicit bank name method to populate cl_mem_ext_ptr_t for enhanced buffer allocation.
Explicit Bank Name Method
In this approach, the struct field flags is used to denote the DDR bank (XCL_MEM_DDR_BANK1, XCL_MEM_DDR_BANK2, and so on). The struct field param should not be used and must be set to NULL.
The following code example uses cl_mem_ext_ptr_t to assign the device buffer to DDR bank 2:

```c
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for (int i = 0; i < MAX_LENGTH; i++) {
  host_mem_ptr[i] = i; // example input data
}

cl_mem_ext_ptr_t d_bank2_ext;
d_bank2_ext.flags = XCL_MEM_DDR_BANK2;
d_bank2_ext.obj   = NULL;
d_bank2_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                         CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                         sizeof(int) * number_of_words, &d_bank2_ext, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
          sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
```
The memory bank can also be specified by its index combined with the XCL_MEM_TOPOLOGY flag:

var_ext.flags = <bank index> | XCL_MEM_TOPOLOGY;

where 0, 1, 2, and 3 stand for different DDR banks. However, the older naming style of XCL_MEM_DDR_BANK0, etc., still works for existing platforms.
Kernel Setup and Execution
This section discusses:
- Identifying the kernels.
- Setting kernel arguments.
- Executing kernels on the FPGA.
Identifying the Kernels

At the beginning of the host code, the kernels must be identified (as cl_kernel types). This is done by the clCreateKernel command with the kernel name as an argument:
```c
kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err);
// etc.
```
Setting Kernel Arguments
There are two types of arguments that can be set for kernel objects:
- Scalar arguments, used for small data transfers such as constant or configuration-type data. These are write-only arguments.
- Buffer arguments, used for large data transfers, as discussed in Buffer Transfer to/from the FPGA Device.
Kernel arguments are set using the clSetKernelArg command, as shown below. The following example sets two scalar arguments and three buffer arguments:
```c
int err = 0;

// Setting up scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width);

cl_uint scalar_arg_image_height = 2160;
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height);

// Setting up buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_mem_ptr0);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_mem_ptr1);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &dev_mem_ptr2);
```
Enqueuing the Kernels

Kernels are executed using the clEnqueueTask or clEnqueueNDRangeKernel commands. Xilinx recommends using the clEnqueueTask command to execute the kernel over the entire range of the input data set using the maximum number of work group items:

```c
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
```
Calling clEnqueueTask is the same as calling clEnqueueNDRangeKernel with work_dim set to 1, global_work_offset set to NULL, global_work_size[0] set to 1, and local_work_size[0] set to 1.
Like all enqueue commands, clEnqueueTask and clEnqueueNDRangeKernel are asynchronous in nature. The host code continues executing without waiting for the kernel computation to complete on the FPGA device. This allows the host program to enqueue more kernels, either the same kernel multiple times over different sets of data, or different kernels. After finishing its work, the kernel writes the result data to the global memory bank. This data is read back to the host memory space using the clEnqueueReadBuffer or clEnqueueMigrateMemObjects commands.
Event Synchronization
All OpenCL clEnqueueXXX API calls are asynchronous. In other words, these commands return immediately after the command is enqueued in the command queue. To resolve dependencies among commands, API calls such as clWaitForEvents or clFinish can be used to pause or block execution of the host program.
Examples of the clWaitForEvents and clFinish commands are shown below:
```c
err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);

// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue);

// Read back the results from the device to verify the output
cl_event readevent;
int host_mem_output_ptr[MAX_LENGTH]; // host memory for output vector

clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0,
    sizeof(int) * number_of_words, host_mem_output_ptr, 0, NULL, &readevent);

clWaitForEvents(1, &readevent); // Wait for clEnqueueReadBuffer event to finish

// Check results
// Compare golden values with host_mem_output_ptr
```
- The clFinish API is used explicitly to block host execution until the kernel execution is finished. This is necessary; otherwise, the host could attempt to read back from the FPGA buffer too early and read garbage data.
- The data transfer from FPGA memory to the local host machine is done through clEnqueueReadBuffer. The last argument of clEnqueueReadBuffer returns an event object that identifies this particular read command; it can be used to query the event, or to wait for this particular command to complete. The clWaitForEvents call specifies that one event, and waits to ensure the data transfer is finished before checking the data from the host-side memory.
Post Processing and FPGA Cleanup
After the kernel execution, the host compares the results read back from the device against the expected golden data, for example:

```c
bool failed = false;
for (i = 0; i < number_of_words; i++) {
  // golden_output: expected results (name illustrative)
  if (host_mem_output_ptr[i] != golden_output[i]) {
    failed = true;
  }
}
```
At the end of the host code, all the allocated OpenCL resources should be released using the proper release functions:

```c
clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID);
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);
```
Summary
As discussed in the earlier topics, the recommended coding style for the host application in the SDAccel environment includes the following points:
- Add error checking after each OpenCL API call for debugging purposes, if required.
- In the SDAccel environment, one or more kernels are separately pre-compiled into the .xclbin file. The clCreateProgramWithBinary API is used to build the program from the kernel binary.
- Use cl_mem_ext_ptr_t to match each custom kernel memory interface to the DDR bank connection that was used to build the kernel binary.
- Transfer data back and forth between the host code and the FPGA by using clEnqueueMigrateMemObjects.
- Use posix_memalign to align the host memory pointer on a 4K boundary.
- Use an out-of-order command queue, or multiple in-order command queues, for concurrent kernel execution on the FPGA.
- Execute the whole workload with clEnqueueTask, rather than splitting the workload by using clEnqueueNDRangeKernel.
- Use synchronization commands to resolve dependencies of the asynchronous OpenCL API calls.