Programming the Host Application

In theSDAccel™environment, host code is written in C or C++ language using the industry standardOpenCL™API. TheSDAccelenvironment provides anOpenCL1.2 embedded profile conformant runtime API.

Note:The SDAccelenvironment supports the OpenCLInstallable Client Driver (ICD) extension ( cl_khr_icd). This extension allows multiple implementations of OpenCLto co-exist on the same system. Refer to OpenCL Installable Client Driver Loaderfor details and installation instructions.

TheSDAccelenvironment consists of a host x86 CPU and compute devices running on aXilinx®FPGA.

In general, the structure of the host code can be divided into three sections:

Setting up the environment.
Core command execution including executing one or more kernels.
Post processing and FPGA release.

The following sections discuss each of the above topics in detail.

Note:For multithreading the host program, exercise caution when calling a fork()system call from an SDAccelenvironment application. The fork()does not duplicate all the runtime threads. Hence the child process cannot run as a complete application in the SDAccelenvironment. It is advisable to use the posix_spawn()system call to launch another process from the SDAccelenvironment application.

Setting Up the OpenCL Environment

The host code in theSDAccelenvironment followsOpenCLprogramming paradigm. To set the environment properly, the host application should identify the standardOpenCLmodels. They are: platform, devices, context, command queue, and program.

TIP:The host code examples and API commands used in this document follow the OpenCLC API. The IDCT examplereferred to in SDAccel Example Designsis also written with the C API. However, the SDAccelruntime environment also supports the OpenCLC++ wrapper API, and many of the examples in the GitHub repositoryare written using the C++ API. Refer to https://www.khronos.org/registry/OpenCL/specs/opencl-cplusplus-1.2.pdffor more information on this C++ wrapper API.

Platform

From the very beginning the host code should identify the platform composed of XilinxFPGA as one or more devices. The host code segment below is standard coding to identify the Xilinxdevice based platform.

cl_platform_id platform_id; // platform id err = clGetPlatformIDs(16, platforms, &platform_count); // Find Xilinx Platform for (unsigned int iplat=0; iplat


             TheOpenCLAPI callclGetPlatformIDsis used to discover the set of availableOpenCLplatforms for a given system. Thereafter,clGetPlatformInfois used to identify theXilinxdevice based platform by matchingcl_platform_vendorwith the string"Xilinx".
             
              Note:Though it is not explicitly shown in the preceding code, or in other host code examples used throughout this chapter, it is always a good coding practice to use error checking after each of the
              OpenCLAPI calls. This can help debugging and improve productivity when you are debugging the host and kernel code in the emulation flow, or during hardware execution. Below is an error checking code example for
              clGetPlatformIDscommand:
              err = clGetPlatformIDs(16, platforms, &platform_count); if (err != CL_SUCCESS) { printf("Error: Failed to find an OpenCL platform!\n"); printf("Test failed\n"); exit(1); }


           
            Devices
            
             After the platform detection, theXilinxFPGA devices attached to the platform are identified. TheSDAccelenvironment supports one or moreXilinxFPGA devices working together.
             
              The following code demonstrates finding all the
              Xilinxdevices (with a upper limit of 16) by using API
              clGetDeviceIDsand printing their names.
              cl_device_id devices[16]; // compute device id char cl_device_name[1001]; err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 16, devices, &num_devices); printf("INFO: Found %d devices\n", num_devices); //iterate all devices to select the target device. for (uint i=0; i

             

             
              IMPORTANT:The
              clGetDeviceIDsAPI is called with the
              device_typeand
              CL_DEVICE_TYPE_ACCELERATORto receive all the available
              Xilinxdevices.
             
            

            
             Sub-Devices
             
              In theSDAccelenvironment, sometimes devices contain multiple kernel instances of a single kernel or of different kernels. TheOpenCLAPIclCreateSubDevicesallows the host code to divide the device into multiple sub-devices containing one kernel instance per sub-device. Currently, theSDAccelenvironment supports equally divided sub-devices each containing only one kernel instance.
              The following example shows:
              
               The sub-devices are created by equal partition to execute one kernel instance per sub-device.
               Iterating over the sub-device list and using a separate context and command queue to execute the kernel on each of them.
               The API related to kernel execution (and corresponding buffer related) code is not shown for the sake of simplicity, but would be described inside the functionrun_cu.
              
              cl_uint num_devices = 0; cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY,1,0}; // Get the number of sub-devices clCreateSubDevices(device,props,0,nullptr,&num_devices); // Container to hold the sub-devices std::vector devices(num_devices); // Second call of clCreateSubDevices // We get sub-device handles in devices.data() clCreateSubDevices(device,props,num_devices,devices.data(),nullptr); // Iterating over sub-devices std::for_each(devices.begin(),devices.end(),[kernel](cl_device_id sdev) { // Context for sub-device auto context = clCreateContext(0,1,&sdev,nullptr,nullptr,&err); // Command-queue for sub-device auto queue = clCreateCommandQueue(context,sdev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&err); // Execute the kernel on the sub-device using local context and queue run_cu(context,queue,kernel); // Function not shown });
              
               IMPORTANT:As shown in the above example, create separate context for each and every sub-devices. Though
               OpenCLallows to create a context that can hold multiple devices and sub-devices, for
               XilinxRuntime it is recommended to separate each device and sub-device into a separate context.
              
             
            
           

           
            Context
            
             
              The
              OpenCLcontext creation process is straightforward. The API
              clCreateContextis used to create a context that contains one or more
              Xilinxdevices that will communicate with the host machine.
              context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
             
             In the code example above, the APIclCreateContextis used to create a context that contains oneXilinxdevice. You can create only one context for a device from a host program. However, the host program should use multiple contexts if sub-devices are used; one context for each sub-device.
            
           
           
            Command Queues
            
             
              One or more command queues for each device is created using theclCreateCommandQueueAPI. The FPGA device can contain multiple kernels. When developing the host application, there are two main programming approaches to execute kernels on a device:
              
               Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. TheSDAccelruntime environment dispatches those kernels as soon as possible in any order allowing concurrent kernel execution on the FPGA.
               Multiple in-order command queue: Each kernel execution will be requested from different in-order command queues. In such cases, theSDAccelruntime environment can dispatch kernels from any command queue with the intention of improving performance by running them concurrently on the FPGA.
              
              
               The following is an example of standard API calls to create in-order and out-of-order command queues.
               // Out-of-order Command queue commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err); // In-order Command Queue commands = clCreateCommandQueue(context, device_id, 0, &err);
              
             
            
           
           
            Program
            
             As described in theSDAccel Build Process, the host and kernel code are compiled separately to create separate executable files: the host application (.exe) and the FPGA binary (.xclbin). When the host application is executed it must load the.xclbinusing theclCreateProgramWithBinaryAPI.
             
              The following code example shows how the standard
              OpenCLAPI is used to build the program from the
              .xclbinfile:
              unsigned char *kernelbinary; char *xclbin = argv[1]; printf("INFO: loading xclbin %s\n", xclbin); int size=load_file_to_memory(xclbin, (char **) &kernelbinary); size_t size_var = size; cl_program program = clCreateProgramWithBinary(context, 1, &device_id, &size_var,(const unsigned char **) &kernelbinary, &status, &err); // Function int load_file_to_memory(const char *filename, char **result) { uint size = 0; FILE *f = fopen(filename, "rb"); if (f == NULL) { *result = NULL; return -1; // -1 means file opening fail } fseek(f, 0, SEEK_END); size = ftell(f); fseek(f, 0, SEEK_SET); *result = (char *)malloc(size+1); if (size != fread(*result, sizeof(char), size, f)) { free(*result); return -2; // -2 means file reading fail } fclose(f); (*result)[size] = 0; return size; }
             
             The above example performs the following steps:
             
              The kernel binary file,.xclbin, is passed in from the command line argument,argv[1].
               
                TIP:Passing the
                .xclbinthrough a command line argument is specific to this example. You can also hardcode the kernel binary file in the application, use an environment variable, read it from a custom initialization file or any other suitable mechanism.
               
              Theload_file_to_memoryfunction is used to load the file contents in the host machine memory space.
              The APIclCreateProgramWithBinaryis used to complete the program creation process.


          
           Executing Commands in the FPGA Device
           
            
             Once the
             OpenCLenvironment is initialized, the host application is ready to issue commands to the device and interact with the kernels. Such commands include:
             
              Setting up the kernels.
              Buffer transfer to/from the FPGA.
              Kernel execution on FPGA.
              Event synchronization.
             
            
           
           
            Setting Up the Kernels
            
             After the initialization of all the preliminaries such as context, command queues, and program, the host application should identify the kernels required to execute on the device and setting up their arguments.
            
            
             Kernels Identification
             
              
               At the beginning, the
               clCreateKernelAPI should be used to access the kernels present inside the
               .xclbinfile. The kernel handle (
               cl_kerneltype) denotes a kernel object that now can be used in the rest of the host program.
               kernel1 = (program, "", &err); kernel2 = clCreateKernel(program, "", &err); // etc
              
             
            
            
             Setting Kernel Arguments
             
              In theSDAccelenvironment framework, two types of kernel arguments can be set:
              
               The scalar arguments are used for small data transfer, such as constant or configuration type data. These are write-only arguments.
               The buffer arguments are used for large data transfer.
              
              
               The kernel arguments can be set using the
               clSetKernelArgcommand as shown below. The following example shows setting kernel arguments for two scalar and two buffer arguments.
               cl_mem dev_buf1 = clCreateBuffer(context, CL_MEM_WRITE, size, &host_mem_ptr1, NULL); cl_mem dev_buf2 = clCreateBuffer(context, CL_MEM_READ, size, &host_mem_ptr2, NULL); int err = 0; // Setting up scalar arguments cl_uint scalar_arg_image_width = 3840; err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width); cl_uint scalar_arg_image_height = 2160; err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height); // Setting up buffer arguments err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_buf1); err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_buf2);
              
              
               IMPORTANT:Though
               OpenCLallows setting kernel arguments any time before the kernel enqueuing,
               Xilinxrecommends setting kernel arguments as early as possible. This helps
               XilinxRuntime determine the buffer location on the device. Therefore, set the kernel arguments before performing any enqueue operation (for example,
               clEmqueueMigrateMemObjects) on any buffer.
              
             
            
           
           
            Buffer Transfer to/from the FPGA Device
            
             
              Interactions between the host application and kernels rely on transferring data to and from global memory in the device. The method to send data back and forth from the FPGA is using
              clCreateBuffer,
              clEnqueueWriteBuffer, and
              clEnqueueReadBuffercommands.
              
               Note:
               Xilinxrecommends using
               clEnqueueMigrateMemObjectsinstead of
               clEnqueueReadBufferand
               clEnqueueWriteBuffer.
              
             
             The following code example demonstrates this:
             int host_mem_ptr[MAX_LENGTH]; // host memory for input vector // Fill the memory input for(int i=0; i } cl_mem dev_mem_ptr = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int) * number_of_words, NULL, NULL); err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0, sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
             
              IMPORTANT:A single buffer cannot be bigger than 4 GB.
             
             
              Note:To maximize throughput from host to global memory, sending very small buffers is not very effective.
              Xilinxrecommends keeping the buffer size at least 2 MB if possible.
             
             For simple applications the example code above would be sufficient to transfer data from the host to the device memory. However, there are a number of coding practices you should adopt to maximize performance and fine-grain control.
            
            
             Using clEnqueueMigrateMemObjects
             
              Xilinxrecommends usingclEnqueueMigrateMemObjectsinstead ofclEnqueueWriteBufferorclEnqueueReadBufferto improve the performance. There are two main parts of acl_memobject: host side pointer and device side pointer. Before the kernel starts its operation, the device side pointer is implicitly allocated on the device side memory (for example, on a specific location inside the device global memory) and the buffer becomes a resident on the device. However, by usingclEnqueueMigrateMemObjectsthis allocation and data transfer occur upfront, much ahead of the kernel execution. This especially helps to enablesoftware pipeliningif the host is executing the same kernel multiple times, because data transfer for the next transaction can happen when kernel is still operating on the previous data set, and thus hide the data transfer latency of successive kernel executions.
              The following code example is modified to useclEnqueueMigrateMemObjects:
              int host_mem_ptr[MAX_LENGTH]; // host memory for input vector // Fill the memory input for(int i=0; i } cl_mem dev_mem_ptr = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(int) * number_of_words, host_mem_ptr, NULL); clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_ptr); err = clEnqueueMigrateMemObjects(commands, 1, dev_mem_ptr, 0, 0, NULL, NULL);
             
            
            
             Allocating Page-Aligned Host Memory
             
              XilinxRuntime allocates the memory space in 4K boundary for internal memory management. If the host memory pointer is not aligned to a page boundary, theXilinxRuntime performs extramemcpyto make it aligned. Hence you should align the host memory pointer with the 4K boundary to save the extra memory copy operation.
              The following is an example of howposix_memalignis used instead ofmallocfor the host memory space pointer.
              int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int)); // Aligning memory in 4K boundary posix_memalign(&host_mem_ptr,4096,MAX_LENGTH*sizeof(int)); // Fill the memory input for(int i=0; i } cl_mem dev_mem_ptr = clCreateBuffer(context, CL_MEM_READ_WRITE , sizeof(int) * number_of_words, host_mem_ptr, NULL); err = clEnqueueMigrateMemObjects(commands, 1, dev_mem_ptr, 0, 0, NULL, NULL);
             
            
            
             Using clEnqueueMapBuffer
             
              Another approach for creating and managing buffers is to useclEnqueueMapBuffer. With this approach, it is not necessary to create a host space pointer aligned to 4K boundary. TheclEnqueueMapBufferAPI maps the specified buffer and returns a pointer created by theXilinxRuntime to this mapped region. Then, fill the host side pointer with your data, followed byclEnqueueMigrateMemObjectto transfer the data to and from the device. Below is an example that uses this style. NoteCL_MEM_USE_HOST_PTRis not used forclCreateBuffer.
              // Two cl_mem buffer, for read and write by kernel cl_mem dev_mem_read_ptr = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * number_of_words, NULL, NULL); cl_mem dev_mem_write_ptr = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * number_of_words, NULL, NULL); // Setting arguments clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_read_ptr); clSetKernelArg(kernel, 1, sizeof(cl_mem), &dev_mem_write_ptr); // Get Host side pointer of the cl_mem buffer object auto host_write_ptr = clEnqueueMapBuffer(queue,dev_mem_read_ptr,true,CL_MAP_WRITE,0,bytes,0,nullptr,nullptr,&err); auto host_read_ptr = clEnqueueMapBuffer(queue,dev_mem_write_ptr,true,CL_MAP_READ,0,bytes,0,nullptr,nullptr,&err); // Fill up the host_write_ptr to send the data to the FPGA for(int i=0; i< MAX; i++) { host_write_ptr[i] = <.... > } // Migrate cl_mem mems[2] = {host_write_ptr,host_read_ptr}; clEnqueueMigrateMemObjects(queue,2,mems,0,0,nullptr,&migrate_event)); // Schedule the kernel clEnqueueTask(queue,kernel,1,&migrate_event,&enqueue_event); // Migrate data back to host clEnqueueMigrateMemObjects(queue, 1, &dev_mem_write_ptr, CL_MIGRATE_MEM_OBJECT_HOST,1,&enqueue_event, &data_read_event); clWaitForEvents(1,&data_read_event); // Now use the data from the host_read_ptr
             
            
            
             Buffer Allocation on the Device
             
              By default, all the memory interfaces from all the kernels are connected to a single default global memory bank when kernels are linked. As a result, only one compute unit (CU) can transfer data to and from the global memory bank at a time, limiting the overall performance of the application. If the FPGA contains only one global memory bank, then this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the default connection. This topic is discussed in greater detail inConnecting Kernel Ports to Global Memory. Overall performance is improved by enabling multiple kernel memory interfaces to concurrently read and write data from separate global memory banks.
              As in theSDAccelenvironment the host code and the kernel code are compiled independently.XilinxRuntime needs to detect the kernel's memory connection to send the data to the correct memory location from the host code. The latest 2019.1XilinxRuntime will automatically find the buffer location from the kernel binary files ifclSetKernelArgsis used before any enqueue operation on the buffer, for exampleclEnqueueMigrateMemObject.
              Before the 2019.1 release, theOpenCLhost code required aXilinxextension (cl_mem_ext_ptr) to specify the exact buffer location on the device. Though this method is still supported, it is no longer necessary and is not documented in this version of the guide. For more information on specifying buffer location usingcl_mem_ext_ptr, see the earlier version of this guide.
             
            
            
             Sub-Buffers
             
              Though not very common, using sub-buffer can be very useful in specific situations. The following sections discuss the scenarios where using sub-buffers can be beneficial.
             
             
              Reading a Specific Portion from the Device Buffer
              
               
                Consider a kernel that is producing different amounts of data depending on the input to the kernel. A simple example can be a compression engine where the output size varies depending on the input data pattern and similarity. The host can still read the whole output buffer by using
                clEnqueueMigrateMemObjects, but that is a suboptimal approach as more than required memory transfer would take place. Ideally the host only needs to read the exact amount of data that kernel has written. One of the techniques can be adopted by kernel is to write a size information of the output data at the start of the output written data. If using
                clEnqueueReadBuffer, the host code can use
                clEnqueueReadBuffertwo times, first for reading the size of the data, and the second to read exact amount of data by using the size information from the first read.
                clEnqueueReadBuffer(command_queue,device_write_ptr, CL_FALSE, 0, sizeof(int) * 1, &kernel_write_size, 0, nullptr, &size_read_event); clEnqueueReadBuffer(command_queue,device_write_ptr, CL_FALSE, DATA_READ_OFFSET, kernel_write_size, host_ptr, 1, &size_read_event, &data_read_event);
               
               WithclEnqueueMigrateMemObject, which is recommended overclEnqueueReadBufferorclEnqueueWriteBuffer, you can adopt a similar approach by using sub-buffers. The following sample code is shown below (note that this is not the complete API arguments).
               //Create a small subbuffer cl_buffer_region buffer_info_1={0,1*sizeof(int)}; cl_mem size_info = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_1,&err); // Map the subbuffer into the host space auto size_info_host_ptr = clEnqueueMapBuffer(queue,size_info,,,, ); // Read only the subbuffer portion clEnqueueMigrateMemObjects(queue, 1, &size_info,CL_MIGRATE_MEM_OBJECT_HOST,,,); // Retrive size information from the already mapped size_info_host_ptr kernel_write_size = ........... // Create sub-buffer again for required amount cl_buffer_region buffer_info_2={DATA_READ_OFFSET, kernel_write_size}; cl_mem buffer_seg = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY, CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_2,&err); // Map the subbuffer into the host space auto read_mem_host_ptr = clEnqueueMapBuffer(queue, buffer_seg,,,); // Migrate the subbuffer clEnqueueMigrateMemObjects(queue, 1, &buffer_seg,CL_MIGRATE_MEM_OBJECT_HOST,,,); // Now use the read data from already mapped read_mem_host_ptr
              
             
             
              Device Buffer Shared by Multiple Memory Ports or Multiple Kernels
              
               Sometimes memory ports of the kernels only require small amounts of data. Managing and sending multiple number of small sized buffers can have potential performance issues. Alternatively, the host can create a large size buffer divided into small sub-buffers. Each sub-buffer assigns a kernel argument for each of the memory ports which requires small amounts of data. This can improve performance asXilinxRuntime handles a large buffer instead of several small buffers.
               Once sub-buffers are created they are used in the host code similar to regular buffers.
              
             
            
           
           
            Kernel Execution
            
             Assuming the kernel is compiled to a single hardware instance (or CU) on the FPGA, the simplest method of executing the kernel is usingclEnqueueTaskas shown below.
             err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
             TheXilinxRuntime schedules the workload (the data passed throughOpenCLbuffers through the kernel arguments) and schedules the kernel to compute intensive tasks on the FPGA.
             
              IMPORTANT:Though using
              clEnqueueNDRangeKernelis supported (only for
              OpenCLkernel),
              Xilinxrecommends using
              clEnqueueTask.
             
             There are various methods you can execute the kernel, multiple kernels, or multiple instances of the same kernel on the FPGA. Those are discussed in the next section.
            
            
             Single Kernel Invocation for the Entire Task and Data Parallelism
             
              Often the complete compute intensive task is defined inside a single kernel and the kernel is executed only one time to work on the entire data range. As there is a overhead of multiple kernel executions, this approach certainly helps in many cases. Though the kernel is executed only one time and works on the entire range of the data, the parallelism (and thereby acceleration) is achieved on the FPGA inside the kernel hardware. Most of the time (if properly coded), kernel is capable of achieving parallelism by various technique such as instruction-level parallelism (loop pipeline) and function-level parallelism (dataflow). You will learn about different kernel coding techniques inProgramming C/C++ Kernels.
              However, the above mentioned singleclEnqueueTaskis not always feasible due to various practical reasons. For example, the kernel code can become too big and complex to optimize if it attempts to perform all compute intensive task in a single execution. Another possible case is when the host is receiving data over time and not all the data at the same time. Therefore, depending on the situation and application, there are different ways to useclEnqueueTaskto break the data and the task into multipleclEnqueueTaskcommands as discussed in the next sections.
             
            
            
             Task Parallelism by Using Multiple Different Kernels
             
              Sometimes multiple kernels can be designed performing different task on the FPGA in parallel. By using the multipleclEnqueueTaskcommand (through a out-of-order command queue), it is possible to allow multiple kernels (performing different task) working in parallel on the FPGA. This enables the task parallelism on the FPGA.
             
            
            
             Spatial Data Parallelism: Increase Number of Compute Units
             
              If a single kernel has been compiled into multiple hardware instances (or CUs),clEnqueueTaskcan be called multiple times (using a out-of-order queue) to enable data parallelism. Each call ofclEnqueueTaskwould schedule a workload in different CUs working on the different data sets in parallel.
             
            
            
             Temporal Data Parallelism: Host to Kernel Dataflow
             
              To understand this approach, assume a kernel has only one CU on the FPGA and the host requires to use the CU multiple times on different sets of data. As shown inUsing clEnqueueMigrateMemObjects, by usingclEnqueueMigrateMemObjectsit is possible to send data to the device global memory ahead of time (the data is transferred for the next kernel execution), and thus hiding the data transfer latency by the kernel execution, enablingsoftware pipelining.
              However, by default, the kernel can start operating on the next set of data only when it is finished working on the current set of data. ThoughclEnqueueMigrateMemObjecthides the data transfer execution time, the kernel executions still remain sequential.
              By enabling the host to kernel dataflow, it is even possible to further improve the performance by restarting the kernel while the kernel is still working on the previous sets of data. If the kernel is optimized in a manner such that it is capable of accepting the new data (for the next kernel operation) even when it is still working on the previous data (to achieve this the kernel has to be compiled in a certain manner, seeEnabling Host to Kernel Dataflow), the XRT restarts the kernel as soon as possible, thus overlapping the multiple kernel executions.
              This allowstemporal parallelismbetween host to kernel where each section of the kernel hardware is working on a specific data set from the differentclEnqueueTaskcommand in a pipelined manner. However, the host still needs to fill the command queue ahead of the time (by software pipelining) so that kernel can restart as soon as it is ready to accept the new set of data.
              The following is a conceptual diagram for the host to kernel dataflow.
              
               Figure:Host to Kernel Dataflow
               
              
              For advanced designs, you can effectively use both the spatial parallelism (using more hardware resources or CUs) and software pipeline (clEnqueueMigrateMemObjects) combined with temporal parallelism (by host to kernel dataflow, particularly overlapping kernel executions on each CU). If needed, you can potentially combine all the techniques together.
             
            
            
             Symmetrical and Asymmetrical Compute Units
             
              During the kernel linking process, a kernel can have multiple CUs on the FPGA.
             
             
              Symmetrical Compute Units
              
               CUs are considered asymmetrical when they do not have identical connections to global memory (when they do not have exactly the same--spoptions). As a result, theXilinxRuntime can use them interchangeably. A call toclEnqueueTaskcan result in the invocation of any one instance in a group of symmetrical CUs.
              
             
             
              Asymmetrical Compute Units
              
               CUs are considered asymmetrical when they do not have identical connections to global memory (when they do not have exactly the same--spoptions). Using the same setup of the input (and output) buffers, it is not possible to execute both of these CUs interchangeably. So these are not execution agnostic from theXilinxRuntime perspective.
              
             
             
              Kernel Handle and Compute Units
              
               The first timeclSetKernelArgis called for a given kernel object, theXilinxRuntime selects a group of symmetrical CUs for the subsequent executions of this kernel. WhenclEnqueueTaskis called, any of the symmetrical CUs in that group can be used.
               If all CUs for a given kernel are symmetrical, a single kernel object is sufficient to access any of these CUs. If there are asymmetrical CUs, the application will need to create as many kernel objects as there are groups of asymmetrical CUs to ensure all of them can be used.
              
              
               Creating Kernel Objects for Specific Compute Units
               
                
                 From the 2019.1 release, the
                 XilinxRuntime provides the capability to get kernel handles for specific CUs or a group of CUs. The syntax of this style is shown below:
                 // Create kernel object only for a specific compute unit kernel1 = clCreateKernel(program, ":{comp_unit_name_1}", &err); // Create kernel object for two specific compute units kernel1 = clCreateKernel(program, ":{comp_unit_name_1,comp_unit_name_2}", &err);
                
                This gives control within the application over which specific CU instance is used. This can be useful in the case of asymmetrical CUs or to perform explicit load and priority management of CUs.
               
              
             
             
              Using Compute Unit Name to Get Handle of All Asymmetrical Compute Units
              
               If a kernel has CUs that are not all symmetrical, the enhancedclCreateKernelwith the CU name can be used. In this case, the host needs to manage each symmetrical CU group separately with differentcl_kernelhandle. The following shows a hypothetical example.
               Assume the kernelmykernelhas five CUs: K1, K2, K3, K4, and K5. Also consider the CUs K1, K2, and K3 are having symmetrical connection on the device and can be considered as a group of symmetrical CUs. Similarly, CUs named K4 and K5 form another group of symmetrical CU. The code segment below shows how twocl_kernelhandles are used to manage the two groups of symmetrical CUs.
               // Kernel handle for Symmetrical compute unit group 1: K1,K2,K3 cl_kernel kernel_handle1 = clCreateKernel(program,"mykernel:{K1,K2,K3}",&err); for(i=0; i<3; i++) { // Creating buffers for the kernel_handle1 ..... // Setting kernel arguments for kernel_handle1 ..... // Enqueue buffers for the kernel_handle1 ..... // Possible candidates of the executions K1,K2 or K3 clEnqueueTask(commands, kernel_handle1, 0, NULL, NULL); // } // Kernel handle for Symmetrical compute unit group 1: K4, K5 cl_kernel kernel_handle2 = clCreateKernel(program,"mykernel:{K4,K5}",&err); for(int i=0; i<2; i++) { // Creating buffers for the kernel_handle2 ..... // Setting kernel arguments for kernel_handle2 ..... // Enqueue buffers for the kernel_handle2 ..... // Possible candidates of the executions K4 or K5 clEnqueueTask(commands, kernel_handle2, 0, NULL, NULL); }
              
             
            
           
           
            Event Synchronization
            
             AllOpenCLclEnqueueXXXAPI calls are asynchronous. These commands will return immediately after the command is enqueued in the command queue. To resolve the dependencies among the commands, an API call such asclWaitForEventsorclFinishcan be used to pause or block execution of the host program.
             
              Example usage of
              clWaitForEventsand
              clFinishcommands are shown below:
              err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL); // Execution will wait here until all commands in the command queue are finished clFinish(command_queue); // Read back the results from the device to verify the output cl_event readevent; int host_mem_output_ptr[MAX_LENGTH]; // host memory for output vector clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0, sizeof(int) * number_of_words, host_mem_output_ptr, 0, NULL, &readevent ); clWaitForEvents(1, &readevent); // Wait for clEnqueueReadBuffer event to finish // Check Results // Compare Golden values with host_mem_output_ptr
             
             
              Note how the synchronization APIs have been added in the above example.
              
               TheclFinishAPI has been explicitly used to block the host execution until the Kernel execution is finished. This is necessary otherwise the host can attempt to read back from the FPGA buffer too early and may read garbage data.
               The data transfer from FPGA memory to the local host machine is done throughclEnqueueReadBuffer. Here the last argument ofclEnqueueReadBufferreturns an event object that identifies this particular read command and can be used to query the event, or wait for this particular command to complete. TheclWaitForEventsspecifies that one event, and waits to ensure the data transfer is finished before checking the data from the host side memory.
              
             
            
           
          
          
           Post Processing and FPGA Cleanup
           
            
             At the end of the host code, all the allocated resources should be released by using proper release functions. The
             SDAccelenvironment may not able to generate a correct performance related profile and analysis report if resources are not properly released.
             clReleaseCommandQueue(Command_Queue); clReleaseContext(Context); clReleaseDevice(Target_Device_ID); clReleaseKernel(Kernel); clReleaseProgram(Program); free(Platform_IDs); free(Device_IDs);
            
           
          
          
           Summary
           
            As discussed in earlier topics, the recommended coding style for the host application in theSDAccelenvironment includes the following points:
            
             Add error checking after eachOpenCLAPI call for debugging purpose, if required.
             In theSDAccelenvironment, one or more kernels are separately pre-compiled to the.xclbinfile. The APIclCreateProgramWithBinaryis used to build the program from the kernel binary.
             Use buffer for setting the kernel argument (clSetKernelArg) before any enqueue operation on the buffer.
             Transfer data back and forth from the host code to the FPGAs by usingclEnqueueMigrateMemObjects.
             Useposix_memalignto align the host memory pointer at 4K boundary.
             Preferably use the out-of-order command queue for concurrent command execution on the FPGA.
             Use synchronization commands to resolve dependencies of the asynchronousOpenCLAPI calls.