Host Optimization

This chapter focuses on host code optimization. The host code uses theOpenCL™API to schedule the individual compute unit executions and data transfers from and to the FPGA board. As a result, you need to think about concurrent execution through theOpenCLqueue(s). This section discusses in detail common pitfalls and how to recognize and address them.

Reducing Overhead of Kernel Enqueing

TheOpenCLAPI execution model supports data parallel and task parallel programming models. Kernels are usually enqueued by theOpenCLruntime multiple times and then scheduled for execution on the device. You must send the command to start the kernel in one of two ways:

For the data parallel case, use theclEnqueueNDRangeAPI.
For the task parallel case, use theclEnqueueTaskAPI.

The dispatching process is executed on the host processor, the actual commands, and kernel arguments must to be sent to the FPGA through thePCIe®link. In the currentXilinxruntime (XRT), the overhead of dispatching the command and arguments to the FPGA is between 30us and 60us, depending the number of arguments on the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed.

For the data parallel case,Xilinxrecommends that you carefully choose the global and local work sizes for your host code and kernel so that the global work size is a small multiple of the local work size. Ideally, the global work size is the same as the local work size as shown in the following code snippet:

size_t global = 1; size_t local = 1; clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr, &global, &local, 2, write_events.data(), &kernel_events[0]));

For the task parallel case,Xilinxrecommends that you minimize the calls toclEnqueueTask. Ideally, you should finish all the work load in a single call toclEnqueueTask.

Data Transfers

Memory Data Transfer Types

The FPGA memory hierarchy, as well as thePCIe-based host to compute unit data transfers, provide a vast set of different data transfer options. To achieve maximum performance, these options should be reviewed. The basic trade-offs are between:

Kind of memory to be deployed
Streaming data transfer
No host memory data transfers between compute units

Note:For more information about using the --spoption, the --scoption, xlcbin --infocommand, and the platforminfocommand, described in the sections below, refer to SDx Command and Utility Reference Guide(UG1279).

Memory Hierarchy

FPGA accelerator cards include several different memory hierarchies available for communication between host and kernels (CUs). For each memory hierarchy, there are different advantages and disadvantages to consider for optimization purposes.

DDR Memory

This memory is external to the FPGA. It is the largest of the memory options but as a result has the longest latency to access.

Usage (link flag): -sp [kernel|cu].[arg|port]:sptag
Values forsptagcan be looked up throughxclbin --infoor theplatforminfocommand.

PL Memory

This memory is internal to the FPGA. It is usually smaller than the external DDR memory, however, if supported by the platform, it has lower latency than the DDR memory.

Usage (link flag): -sp [kernel|cu].[arg|port]:sptag
Values forsptagcan be looked up throughxclbin.infoandplatforminfo.

Streaming Data Transfer

When streams are deployed between compute units or a compute unit and the host, a sequential data path is established. As a result, the transmitted data can only be received in the order sent. The tool distinguishes two types of streaming data transfers, one between compute units and the one between compute units and host.

Kernel-to-Kernel

In case of kernel to kernel streaming communication, the ports are connected through the -scoption:

Usage (link flag):-sc src.port:dst:port

This is supported in theSDAccelenvironment through the use of hls::streams based on theap_axiudata type on the kernel interfaces. This requires theap_axi_sdata.hheader file to be included.

Compute Unit to Host

Streaming data between compute units and host require QDMA queues. As a result, this flow requires a QDMA platform.

This is supported in theSDAccelenvironment through the use of hls::streams based on theqdma_axisdata type on the kernel interfaces. The member field “last” is used to indicate when a specified workload is completed. This allows streaming communication of various length. For more details on host code modeling, refer to theSDAccel Environment Programmers Guide(UG1277).

No Host Memory Data Transfers between Compute Units

In memory mapped data transfers, blocks of data are written to and from the memory associated with the port. This enables the reader and writer to perform random access within the memory buffer used for communication. TheSDAccelenvironment allows efficient buffer-based communication between kernels by using shared buffers between processes, which requires the same memory, or with the help of theclEnqueueCopyBufferAPI, which implements efficient copying between buffers without host memory interaction.

Peer-to-Peer

TheSDAcceldevelopment environment can support systems with multiple accelerator cards on a single host to work together to accelerate large software systems. Specifically, for communication between different accelerator cards, the development environment supports direct peer-to-peer communication. This can be achieved by enabling the direct access of one of the cards DDR memory space which enables direct communication (without host memory).

Enable Direct Access to DDR Memory Space

The following steps describe the setup from a local source buffer to an exported destination buffer. However, before actually modifying the host code, it is important to enable peer-to-peer communication. This is performed through the use of xbutil "xbutil p2p --enable" for the receiving device. The whole DDR address space of the device will be mapped to the host I/O memory space. For more information, refer to theSDx Command and Utility Reference Guide(UG1279).

After that it is necessary to modify the host code to prepare for direct peer-to-peer communication:

Create buf_dst with XCL_MEM_EXT_P2P_BUFFER flag.
There should not be any associated user space buffer (host buffer) for buf_dst.
Export and import the buf_dst, using thexclGetMemObjectFdandxclGetMemObjectFromFdAPIs, in context: buf_dst_exported.
Use the regularclEnqueueCopyBufferAPI (src_command_queue, buf_src, buf_dst_exported, 0,0, buffer_size,,,,) command to copy the buffers between the devices.

Note:As mentioned in the Step 1 and Step 2, the exported buffer should be a p2p buffer. If a non-p2p buffer is exported and a copy operation is executed, the buffer would still be copied but through the host memory and not directly from card-to-card. The card-to-card transfers will be reported separately in the Profile Summary report.

More information about peer-to-peer connectivity can be found in theXRTdocumentation on GitHub.

Overlapping Data Transfers with Kernel Computation

Applications, such as database analytics, have a much larger data set than the available memory on the acceleration device. They require the complete data to be transferred and processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieve high performance for these applications.

The following code snippet shows the vector add kernel from theOpenCLOverlap Data Transfers with Kernel Computation Example in thehostcategory fromXilinx On-boarding Example GitHub.

kernel __attribute__((reqd_work_group_size(1, 1, 1))) void vadd(global int* c, global const int* a, global const int* b, const int offset, const int elements) { int end = offset + elements; vadd_loop: for (int x=offset; x


             For this example, there are four tasks to perform in the host application:
             
              Write buffer a (Wa)
              Write buffer b (Wb)
              Execute vadd kernel
              Read buffer c (Rc)
             
             The asynchronous nature ofOpenCLdata transfer and kernel execution APIs allows overlap of data transfers and kernel execution, as shown in the following figure. In this example, double buffering is used for all buffers so that the compute unit can process one set of buffers while the host can operate on the other set of buffers. TheOpenCLevent object provides an easy way to set up complex operation dependencies and synchronize host threads and device operations. The arrows in the following figure show how event triggering can be set up to achieve optimal performance.
             
              Figure:Event Triggering Set Up
              
             
             The following host code snippet enqueues the four tasks in a loop. It also sets up event synchronization between different tasks to ensure that data dependencies are met for each task. The double buffering is set up by passing different memory objects values to theclEnqueueMigrateMemObjectsAPI. The event synchronization is achieved by having each API call wait for other event as well as trigger its own event when the API completes.
             for (size_t iteration_idx = 0; iteration_idx < num_iterations; iteration_idx++) { int flag = iteration_idx % 2; if (iteration_idx >= 2) { clWaitForEvents(1, &map_events[flag]); OCL_CHECK(clReleaseMemObject(buffer_a[flag])); OCL_CHECK(clReleaseMemObject(buffer_b[flag])); OCL_CHECK(clReleaseMemObject(buffer_c[flag])); OCL_CHECK(clReleaseEvent(read_events[flag])); OCL_CHECK(clReleaseEvent(kernel_events[flag])); } buffer_a[flag] = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, bytes_per_iteration, &A[iteration_idx * elements_per_iteration], NULL); buffer_b[flag] = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, bytes_per_iteration, &B[iteration_idx * elements_per_iteration], NULL); buffer_c[flag] = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, bytes_per_iteration, &device_result[iteration_idx * elements_per_iteration], NULL); array write_events; printf("Enqueueing Migrate Mem Object (Host to Device) calls\n"); // These calls are asynchronous with respect to the main thread // because are passing the CL_FALSE as the third parameter. // Because we are passing the events from the previous kernel call // into the wait list, it will wait for the previous operations // to complete before continuing OCL_CHECK(clEnqueueMigrateMemObjects( world.command_queue, 1, &buffer_a[iteration_idx % 2], 0 /* flags, 0 means from host */, 0, NULL, &write_events[0])); set_callback(write_events[0], "ooo_queue"); OCL_CHECK(clEnqueueMigrateMemObjects( world.command_queue, 1, &buffer_b[iteration_idx % 2], 0 /* flags, 0 means from host */, 0, NULL, &write_events[1])); set_callback(write_events[1], "ooo_queue"); OCL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_c[iteration_idx % 2])); OCL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_a[iteration_idx % 2])); OCL_CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &buffer_b[iteration_idx % 2])); OCL_CHECK(clSetKernelArg(kernel, 3, sizeof(int), &elements_per_iteration)); printf("Enqueueing NDRange kernel.\n"); // This event needs to wait for the write buffer operations to complete // before executing. We are sending the write_events into its wait list // to ensure that the order of operations is correct. OCL_CHECK(clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr, &global, &local, 2 , write_events.data(), &kernel_events[flag])); set_callback(kernel_events[flag], "ooo_queue"); printf("Enqueueing Migrate Mem Object (Device to Host) calls\n"); // This operation only needs to wait for the kernel call. This call will // potentially overlap the next kernel call as well as the next read // operations OCL_CHECK( clEnqueueMigrateMemObjects(world.command_queue, 1, &buffer_c[iteration_idx % 2], CL_MIGRATE_MEM_OBJECT_HOST, 1, &kernel_events[flag], &read_events[flag])); set_callback(read_events[flag], "ooo_queue"); clEnqueueMapBuffer(world.command_queue, buffer_c[flag], CL_FALSE, CL_MAP_READ, 0, bytes_per_iteration, 1, &read_events[flag], &map_events[flag], 0); set_callback(map_events[flag], "ooo_queue"); OCL_CHECK(clReleaseEvent(write_events[0])); OCL_CHECK(clReleaseEvent(write_events[1])); }
             The Application Timeline view clearly shows that the data transfer time is completely hidden, while thevadd_1compute unit is running constantly.
             
              Figure:Data Transfer Time Hidden in Application Timeline View


           
            Buffer Memory Segmentation
            
             Allocation and deallocation of memory buffers can lead to memory segmentation in the DDRs. This might result in sub-optimal performance of compute units, even if they could theoretically execute in parallel.
             This problem occurs most often when multiple pthreads for different compute units are used, and the threads allocate and release many device buffers with different sizes every time they enqueue the kernels. In this case, the timeline trace will exhibit gaps between kernel executions and it just seems the processes are sleeping.
             Each buffer allocated by runtime should be continuous in hardware. For large memory, it might take a lot of time to wait for that space to be freed, when many buffers are allocated and deallocated. This can be resolved by allocating device buffer, and reusing it between different enqueues of a kernel.


          
           Compute Unit Scheduling
           
            Scheduling kernel operations is key to overall system performance. This becomes even more important when implementing multiple compute units (of the same kernel or of different kernels). This section examines the different command queues responsible for scheduling the kernels.
           
           
            Multiple In-Order Command Queues
            
             The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1, if required.
             
              Figure:Example with Two In-Order Command Queues
              
             
             The following code snippet from theConcurrent Kernel Execution Exampleinhostcategory fromSDAccel Getting Started Exampleson GitHub sets up multiple in-order command queues and enqueues commands into each queue:
             cl_command_queue ordered_queue1 = clCreateCommandQueue( world.context, world.device_id, CL_QUEUE_PROFILING_ENABLE, &err) cl_command_queue ordered_queue2 = clCreateCommandQueue( world.context, world.device_id, CL_QUEUE_PROFILING_ENABLE, &err); clEnqueueNDRangeKernel(ordered_queue1, kernel_mscale, 1, offset, global, local, 0, nullptr, &kernel_events[0])); clEnqueueNDRangeKernel(ordered_queue1, kernel_madd, 1, offset, global, local, 0, nullptr, &kernel_events[1]); clEnqueueNDRangeKernel(ordered_queue2, kernel_mmult, 1, offset, global, local, 0, nullptr, &kernel_events[2]);
            
           
           
            Single Out-of-Order Command Queue
            
             The following figure shows an example with a single out-of-order command queue. The scheduler can dispatch commands from the queue in any order. You must set up event dependencies and synchronizations explicitly, if required.
             
              Figure:Example with Single Out-of-Order Command Queue
              
             
             The following code snippet from theConcurrent Kernel Execution ExamplefromSDAccel Getting Started Exampleson GitHub sets up a single out-of-order command queue and enqueues commands:
             cl_command_queue ooo_queue = clCreateCommandQueue( world.context, world.device_id, CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err); clEnqueueNDRangeKernel(ooo_queue, kernel_mscale, 1, offset, global, local, 0, nullptr, &ooo_events[0]); clEnqueueNDRangeKernel(ooo_queue, kernel_madd, 1, offset, global, local, 1, &ooo_events[0], // Event from previous call &ooo_events[1]); clEnqueueNDRangeKernel(ooo_queue, kernel_mmult, 1, offset, global, local, 0, nullptr, // Does not depend on previous call &ooo_events[2])
             The following figure shows the Application Timeline view where that the compute unitmmult_1is running in parallel with the compute unitsmscale_1andmadd_1, using both multiple in-order queues and single out-of-order queue methods.
             
              Figure:Application Timeline View Showing mult_1 Running with mscale_1 and madd_1
              
             
            
           
          
          
           Using the clEnqueueMigrateMemObjects API to Transfer Data
           
            TheOpenCLframework provides a number of APIs for transferring data between the host and the device. Typically, data movement APIs, such asclEnqueueWriteBufferandclEnqueueReadBuffer, implicitly migrate memory objects to the device after they are enqueued. They do not guarantee when the data is transferred. This makes it difficult for the host application to overlap the placements of the memory objects onto the device with the computation carried out by kernels.
            TheOpenCL1.2 framework introduced a new API,clEnqueueMigrateMemObjects. Using this API, memory migration can be explicitly performed ahead of the dependent commands. This allows the application to preemptively change the association of a memory object, through regular command queue scheduling, to prepare for another upcoming command. This also permits an application to overlap the placement of memory objects with other unrelated operations before these memory objects are needed, potentially hiding transfer latencies. After the event associated by theclEnqueueMigrateMemObjectsAPI are marked CL_COMPLETE, the memory objects specified in mem_objects are successfully migrated to the device associated with command_queue.
            TheclEnqueueMigrateMemObjectsAPI can also be used to direct the initial placement of a memory object after creation, possibly avoiding the initial overhead of instantiating the object on the first enqueued command to use it.
            Another advantage of using theclEnqueueMigrateMemObjectsAPI is that it can migrate multiple memory objects in a single API call. This reduces the overhead of scheduling and calling functions for transferring data for more than one memory object.
            The following code snippet shows the usage of theclEnqueueMigrateMemObjectsAPI from theVector Multiplication for XPR Deviceexample in the host category fromSDAccel Getting Started Exampleson GitHub.
            int err = clEnqueueMigrateMemObjects( world.command_queue, 1, &d_mul_c, CL_MIGRATE_MEM_OBJECT_HOST, 0, NULL, NULL);