SDAccel Streaming Platform

Streaming Data Transfers Between Host and Kernel

Starting with the SDAccel™ 2019.1 release, SDAccel provides a programming model that supports direct streaming of data from host to kernel and from kernel to host without going through global memory. This feature is an addition to the existing host-to-kernel and kernel-to-host data transfers through global memory. Streams offer advantages such as:

  • The host application does not necessarily need to know the size of the data coming from the kernel.
  • Data residing in host memory can be transferred to the kernel as soon as it is needed. Similarly, the processed data can be transferred back as soon as it is required.

This programming model uses minimal storage compared to the larger and slower global memory bank, which improves performance and power.

Host Coding Guidelines

Xilinx® provides new OpenCL™ APIs for streaming operation as extension APIs.

clCreateStream()
Creates a read or write stream.
clReleaseStream()
Frees the created stream and its associated memory.
clWriteStream()
Writes data to stream.
clReadStream()
Gets data from stream.
clPollStreams()
Polls for any stream on the device to finish. Required only for non-blocking stream operation.

The typical API flow is described below:

  • Create the required number of read/write streams with clCreateStream.
    • Streams should be attached directly to the OpenCL device object because they do not use any command queue. A stream itself is a command queue that passes data in only one direction, either from host to kernel or from kernel to host.
    • An appropriate flag should be used to denote stream write/read operation (from the host perspective).
    • To specify how the stream is connected to the device, a predefined extension pointer (cl_mem_ext_ptr_t) should be used to denote the kernel and the kernel argument with which the stream is associated.

      In the code block below, a read stream (named read_stream) and a write stream (named write_stream) are created.

      #include <CL/cl_ext_xilinx.h> // Required for Xilinx Extension

      // Device connection specification of the stream through extension pointer
      cl_mem_ext_ptr_t ext;  // Extension pointer
      ext.param = kernel;    // The .param should be set to kernel (cl_kernel type)
      ext.obj = nullptr;

      // The .flags should be used to denote the kernel argument
      // Create write stream for argument 3 of kernel
      ext.flags = 3;
      cl_stream write_stream = clCreateStream(device_id, CL_STREAM_WRITE_ONLY, CL_STREAM, &ext, &ret);

      // Create read stream for argument 4 of kernel
      ext.flags = 4;
      cl_stream read_stream = clCreateStream(device_id, CL_STREAM_READ_ONLY, CL_STREAM, &ext, &ret);
  • Set the remaining non-stream kernel arguments and enqueue the kernel. The following code block shows typical kernel argument (non-stream arguments such as buffer and/or scalar) setting and kernel enqueuing.
    // Set kernel non-stream argument (if any)
    clSetKernelArg(kernel, 0,...,...);
    clSetKernelArg(kernel, 1,...,...);
    clSetKernelArg(kernel, 2,...,...);
    // Argument 3 and 4 are not set as those are already specified during the clCreateStream through extension pointer

    // Schedule kernel enqueue
    clEnqueueTask(commands, kernel, . .. . );
  • Initiate read and write transfers with clReadStream and clWriteStream.
    • Note the usage of the attribute cl_stream_xfer_req associated with the read and write requests.
    • The .flags field is used to denote the transfer mechanism.
      CL_STREAM_EOT
      Currently, a successful stream transfer depends on identifying the end of the transfer by an End of Transfer signal. This flag is mandatory in the current release.
      CL_STREAM_NONBLOCKING
      By default, read and write transfers are blocking. For a non-blocking transfer, CL_STREAM_NONBLOCKING must be set.
    • The .priv_data field is used to specify a string (a name for tagging purposes) associated with the transfer. This helps identify the completion of a specific transfer when polling for stream completion. It is required when using the non-blocking version of the API.

      In the following code block, the stream read and write transfers are executed with the non-blocking approach.

      // Initiate the READ transfer
      cl_stream_xfer_req rd_req {0};
      rd_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      rd_req.priv_data = (void*)"read"; // You can think of this as tagging the transfer with a name
      clReadStream(read_stream, host_read_ptr, max_read_size, &rd_req, &ret);

      // Initiate the WRITE transfer
      cl_stream_xfer_req wr_req {0};
      wr_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      wr_req.priv_data = (void*)"write";
      clWriteStream(write_stream, host_write_ptr, write_size, &wr_req, &ret);
  • Poll all the streams for completion. For the non-blocking transfer, a polling API is provided to ensure the read/write transfers are completed. For the blocking version of the API, polling is not required.
    • The poll requests are passed through an array of cl_streams_poll_req_completions, with one entry per expected completion.
    • clPollStreams is a blocking API. It returns execution to the host code as soon as it receives notification that all stream requests have completed, or when the specified timeout expires.
      // Checking the request completion
      cl_streams_poll_req_completions poll_req[2] {0, 0}; // 2 Requests
      auto num_compl = 2;
      // Blocking API, waits for 2 poll request completion or 5000ms, whichever occurs first
      clPollStreams(device_id, poll_req, 2, 2, &num_compl, 5000, &ret);
  • Read and use the stream data in host.
    • After the successful poll request is completed, the host can read the data from the host pointer.
    • The host can also check the size of the data transferred to the host. To do so, the host finds the correct poll request by matching priv_data and then fetches nbytes (the number of bytes transferred) from the cl_streams_poll_req_completions structure.
      for (auto i=0; i<2; ++i) {
          if(rd_req.priv_data == poll_req[i].priv_data) { // Identifying the read transfer
              // Getting read size, data size from kernel is unknown
              ssize_t result_size=poll_req[i].nbytes;
          }
      }
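
The tag-matching step above can be sketched in plain C++ without the Xilinx runtime. This is only an illustrative model: MockCompletion and find_transfer_size are hypothetical names, and the struct models just the two fields mentioned in the text (priv_data and nbytes), not the real cl_streams_poll_req_completions layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one completion record. Only the two fields
// discussed in the text are modeled; the real structure may hold more.
struct MockCompletion {
    void*       priv_data;  // tag copied from the originating transfer request
    std::size_t nbytes;     // number of bytes actually transferred
};

// Returns the byte count of the completion whose tag pointer equals `tag`,
// or -1 if no completion carries that tag. The comparison is by pointer
// identity, mirroring the rd_req.priv_data == poll_req[i].priv_data check.
long find_transfer_size(const std::vector<MockCompletion>& completions,
                        const void* tag) {
    for (const MockCompletion& c : completions) {
        if (c.priv_data == tag) {
            return static_cast<long>(c.nbytes);
        }
    }
    return -1;
}
```

Because the match is on the pointer value rather than on the string contents, the host should compare against the same pointer it stored in the request, as the original loop does with rd_req.priv_data.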

The header file containing the function prototypes and argument descriptions is available in the Xilinx Runtime GitHub repository.

IMPORTANT: If the streaming kernel has multiple CUs, the host code needs to use a unique cl_kernel object for each CU. The host code must call clCreateKernel with <kernel_name>:{compute_unit_name} to get each CU, create streams for each of them, and enqueue them individually.
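
As a small illustration of the per-CU naming, the following sketch builds the name strings that would be passed to clCreateKernel. It assumes the name takes the form kernel_name:{compute_unit_name}; the helper names cu_kernel_name and cu_kernel_names are hypothetical and not part of any Xilinx API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the per-CU kernel name, assuming the form "kernel:{cu}".
std::string cu_kernel_name(const std::string& kernel, const std::string& cu) {
    return kernel + ":{" + cu + "}";
}

// Builds one name per compute unit, so the host can create one
// cl_kernel object (and one set of streams) for each of them.
std::vector<std::string> cu_kernel_names(const std::string& kernel,
                                         const std::vector<std::string>& cus) {
    std::vector<std::string> names;
    names.reserve(cus.size());
    for (const std::string& cu : cus) {
        names.push_back(cu_kernel_name(kernel, cu));
    }
    return names;
}
```

Each resulting string would then be used in its own clCreateKernel call, followed by the stream creation and enqueue steps for that CU.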

Kernel Coding Guidelines

The basic guidelines for developing a stream-based C kernel are as follows:

  • Use hls::stream with the qdma_axis data type. The qdma_axis data type requires the header file ap_axi_sdata.h.
  • qdma_axis is a special class used for data transfer between host and kernel when using the streaming platform. It is used only in streaming kernel interfaces that interact with the host, not with another kernel. The first template parameter denotes the data width. The remaining three parameters should be set to 0 (they are not used in the current release).
  • The following code block shows a simple kernel interface with one input stream and one output stream.
    #include "ap_axi_sdata.h"
    #include "hls_stream.h"

    // qdma_axis is the HLS class for stream data transfer between host and kernel for streaming platform
    // It contains "data" and two sideband signals (last and keep) exposed to the user via class member functions.
    typedef qdma_axis<64,0,0,0> datap;

    void kernel_top (
        hls::stream<datap> &input,
        hls::stream<datap> &output,
        ..... , // Other Inputs/Outputs if any
        )
    {
        #pragma HLS INTERFACE axis port=input
        #pragma HLS INTERFACE axis port=output
    }
  • The qdma_axis data type contains three variables which should be used inside the kernel code:
    data
    Internally, qdma_axis contains an ap_uint<D> that should be accessed through the .get_data() and .set_data() methods.
    • D must be 8, 16, 32, 64, 128, 256, or 512 bits wide.
    last
    The last variable indicates the last value of an incoming or outgoing stream. When reading from the input stream, last is used to detect the end of the stream. Similarly, when the kernel writes to an output stream transferred to the host, last must be set to indicate the end of the stream.
    • get_last/set_last: Accesses/sets the last variable used to denote the last data in the stream.
    keep
    In some special situations, the keep signal can be used to truncate the last data to a smaller number of bytes. However, keep should not be used for any data other than the last data of the stream. So, in most cases, you should set keep to -1 for all outgoing data from the kernel.
    • get_keep/set_keep: Accesses/sets the keep variable.
    • For all data before the last data, keep must be set to -1 to denote that all bytes of the data are valid.
    • For the last data, the kernel has the flexibility to send fewer bytes. For example, for a four-byte data transfer, the kernel can truncate the last data to one, two, or three bytes by using the set_keep() function as shown below.
      • If the last data is one byte => .set_keep(1)
      • If the last data is two bytes => .set_keep(3)
      • If the last data is three bytes => .set_keep(7)
      • If the last data is all four bytes (similar to all non-last data) => .set_keep(-1)
  • The following code block shows how the stream input is read. Note the use of the last signal to detect the last data.
    // Stream Read
    // Using "last" flag to determine the end of input-stream
    // when kernel does not know the length of the input data
    hls::stream<ap_uint<64> > internal_stream;
    while(true) {
        datap temp = input.read(); // "input" -> Input stream
        internal_stream << temp.get_data(); // Getting data from the stream
        if(temp.get_last()) // Getting last signal to determine the EOT (end of transfer).
            break;
    }
  • The following code block shows how the stream output is written. set_keep is called with -1 for all data (the general case). The kernel also uses set_last() to mark the last data of the stream.
    IMPORTANT: For proper functionality of the host and kernel system, it is very important to set the last bit correctly.
    // Stream Write
    for(int j = 0; j <....; j++) {
        datap t;
        t.set_data(...);
        t.set_keep(-1); // keep flag -1 , all bytes are valid
        if(... ) // check if this is the last data to be written
            t.set_last(1); // Setting last data of the stream
        else
            t.set_last(0);
        output.write(t); // output stream from the kernel
    }
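
The keep and last conventions described in this list can be tried out in a plain C++ model that needs no HLS headers. The sketch below is an assumption-laden stand-in, not the qdma_axis class: MockBeat, keep_mask, produce, and consume_until_last are hypothetical names, the word width is fixed at four bytes, and a std::queue plays the role of the stream. With a four-bit keep field, the value 0xF corresponds to the -1 (all bytes valid) setting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for one 4-byte stream word with its sideband signals.
struct MockBeat {
    std::uint32_t data;
    std::uint8_t  keep;  // one bit per valid byte; 0xF means all four bytes
    bool          last;  // true only on the final word of the transfer
};

// Keep mask for `valid_bytes` valid bytes (1..4): 1 -> 1, 2 -> 3, 3 -> 7, 4 -> 0xF.
std::uint8_t keep_mask(unsigned valid_bytes) {
    return static_cast<std::uint8_t>((1u << valid_bytes) - 1u);
}

// Producer side: every word keeps all bytes except the last one, which may
// be truncated to `last_valid_bytes` and carries the last flag.
void produce(std::queue<MockBeat>& out, const std::vector<std::uint32_t>& words,
             unsigned last_valid_bytes) {
    for (std::size_t j = 0; j < words.size(); ++j) {
        const bool is_last = (j + 1 == words.size());
        MockBeat b;
        b.data = words[j];
        b.keep = is_last ? keep_mask(last_valid_bytes) : keep_mask(4);
        b.last = is_last;
        out.push(b);
    }
}

// Consumer side: reads words until the last flag is seen and returns the
// number of valid bytes received, counted from the keep bits.
std::size_t consume_until_last(std::queue<MockBeat>& in,
                               std::vector<std::uint32_t>& words) {
    std::size_t bytes = 0;
    while (!in.empty()) {
        const MockBeat b = in.front();
        in.pop();
        words.push_back(b.data);
        for (unsigned bit = 0; bit < 4; ++bit) {
            bytes += (b.keep >> bit) & 1u;
        }
        if (b.last) {
            break;  // end of transfer
        }
    }
    return bytes;
}
```

The consumer never needs to know the transfer length in advance; it relies only on the last flag, which is why a missing or misplaced last bit breaks the transfer.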

Streaming Data Transfers Between the Kernels

The SDAccel environment also supports streaming data transfer between two kernels. Consider the situation where one kernel performs part of the computation and a second kernel performs the rest after receiving the output data from the first kernel. Before the SDx™ 2019.1 release, the only way to transfer data from one kernel to another was through global memory. With kernel-to-kernel streaming support, data can move directly from one kernel to another without passing through global memory, which improves performance.

Host Coding Guidelines

There is only one consideration from the host coding perspective for kernel-to-kernel streaming data transfer: the kernel ports involved in kernel-to-kernel data transfer do not need clSetKernelArg calls from the host code. The host code should use clSetKernelArg only for the other kernel arguments, which interact directly with the host.

Kernel Coding Guidelines

A kernel streaming interface that directly sends data to or receives data from another kernel streaming interface should be defined by hls::stream with the ap_axiu data type. The ap_axiu data type requires the header file ap_axi_sdata.h.

IMPORTANT: Xilinx requires using the qdma_axis data type for host-to-kernel and kernel-to-host streaming, as described in the previous section. The ap_axiu data type, on the other hand, should be used for kernel-to-kernel streaming data transfer. Both data types are defined in the ap_axi_sdata.h file distributed with the SDAccel release.
The following example shows the streaming interfaces of the producer and consumer kernels.
// Producer kernel
// Producing stream output to another kernel on the FPGA
// The below code segment ignores all other inputs and outputs, if any
void kernel1 (.... , hls::stream<ap_axiu<32, 0, 0, 0> >& stream_out) {
    #pragma HLS interface axis port=stream_out
    for(int i = 0; i < ...; i++) {
        int a = ...... ;        // Internally generated data
        ap_axiu<32, 0, 0, 0> v; // temporary storage for ap_axiu
        v.data = a;             // Writing the data
        stream_out.write(v);    // Writing to the output stream.
    }
}

// Consumer kernel
// Consuming stream input from another kernel on the FPGA
// The below code segment ignores all other inputs and outputs, if any
void kernel2 (hls::stream<ap_axiu<32, 0, 0, 0> >& stream_in, .... ) {
    #pragma HLS interface axis port=stream_in
    for(int i = 0; i < ....; i++) {
        ap_axiu<32, 0, 0, 0> v = stream_in.read(); // Reading from the Input stream
        int a = v.data;                            // Extract the data
        // Do further processing
    }
}
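
For experimenting with the producer/consumer pattern on a workstation, the same structure can be modeled in standard C++. This is a rough sketch under stated assumptions: MockWord, producer, and consumer are hypothetical names, a std::queue stands in for the hls::stream connection, and the two functions run one after the other rather than concurrently as they would on the FPGA. As in the example above, both sides are assumed to agree on the number of words in advance.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-in for a 32-bit stream word; only `data` is modeled.
struct MockWord {
    std::int32_t data;
};

// Producer: writes `count` internally generated values (here, squares).
void producer(std::queue<MockWord>& stream_out, int count) {
    for (int i = 0; i < count; ++i) {
        MockWord v;
        v.data = i * i;      // internally generated data
        stream_out.push(v);  // write to the output stream
    }
}

// Consumer: reads exactly `count` values and accumulates them.
long consumer(std::queue<MockWord>& stream_in, int count) {
    long sum = 0;
    for (int i = 0; i < count; ++i) {
        const MockWord v = stream_in.front();  // read from the input stream
        stream_in.pop();
        sum += v.data;                         // further processing
    }
    return sum;
}
```

If the producer and consumer trip counts differ, the real design would stall or leave data in the stream, so the counts in both kernels should be derived from the same value.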

Linking the Kernels

Additionally, connect the streaming output port of the producer kernel to the streaming input port of the consumer kernel with the --sc switch applied during the xocc link (-l) stage.

#Syntax:: xocc -l --sc : xocc -l --sc .stream_in:.stream_out