Getting Started with SDAccel

This chapter provides details on using xfOpenCV in the SDAccel™ environment. The following sections describe the methodology to create a kernel, the corresponding host code, and a suitable makefile to compile an xfOpenCV kernel for any of the supported platforms in SDAccel. The subsequent sections also explain the methodology to verify the kernel in the various emulation modes and on hardware.

Prerequisites

  1. A valid installation of SDx™ 2019.1 or a later version, and the corresponding licenses.
  2. Install the xfOpenCV libraries, if you intend to use libraries compiled differently than what is provided in SDx.
  3. Install the card for which the platform is supported in SDx 2019.1 or later versions.
  4. Xilinx® Runtime (XRT) must be installed. XRT provides the software interface to Xilinx FPGAs.
  5. libOpenCL.so must be installed, if not already present along with the platform.

SDAccel Design Methodology

There are three critical components in making a kernel work on a platform using SDAccel™:
  1. Host code with OpenCL constructs
  2. Wrappers around HLS Kernel(s)
  3. Makefile to compile the kernel for emulation or running on hardware.

Host Code with OpenCL

Host code is compiled for the host machine; it runs on the host and provides the data and control signals to the attached hardware with the FPGA. The host code is written using OpenCL constructs and provides capabilities for setting up and running a kernel on the FPGA. The following functions are executed using the host code; a minimal sketch tying them together follows the list:
  1. Loading the kernel binary on the FPGA – xcl::import_binary_file() loads the bitstream and programs the FPGA to enable required processing of data.
  2. Setting up memory buffers for data transfer – Data needs to be sent to and read from the DDR memory on the hardware. cl::Buffer objects are created to allocate the required memory for transferring data to and from the hardware.
  3. Transferring data to and from the hardware – enqueueWriteBuffer() and enqueueReadBuffer() are used to transfer the data to and from the hardware at the required time.
  4. Executing the kernel on the FPGA – There are functions to execute kernels on the FPGA. There can be single or multiple kernel executions, which can be asynchronous or synchronous with each other. The most commonly used command is enqueueTask().
  5. Profiling the performance of kernel execution – The host code in OpenCL also enables measurement of the execution time of a kernel on the FPGA. The function used in our examples for profiling is getProfilingInfo().
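
The following sketch ties these five steps together. It is not the full example host code (a complete example appears later in this chapter); the kernel name, binary file name, and buffer sizes are placeholder assumptions, and error handling is omitted.

// Minimal host-code sketch (placeholder names; error handling omitted)
#include "xcl2.hpp"   // SDAccel host utility helpers providing the xcl:: functions
#include <vector>

int main() {
    size_t in_size  = 1920 * 1080;   // assumed image sizes, in bytes
    size_t out_size = 1920 * 1080;
    std::vector<unsigned char> host_in(in_size), host_out(out_size);

    // 1. Load the kernel binary and program the FPGA
    std::vector<cl::Device> devices = xcl::get_xil_devices();
    cl::Device device = devices[0];
    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    std::string device_name = device.getInfo<CL_DEVICE_NAME>();
    std::string binaryFile = xcl::find_binary_file(device_name, "krnl_example"); // placeholder name
    cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
    devices.resize(1);
    cl::Program program(context, devices, bins);
    cl::Kernel krnl(program, "example_accel"); // placeholder kernel name

    // 2. Set up memory buffers
    cl::Buffer in_buf(context, CL_MEM_READ_ONLY, in_size);
    cl::Buffer out_buf(context, CL_MEM_WRITE_ONLY, out_size);
    krnl.setArg(0, in_buf);
    krnl.setArg(1, out_buf);

    // 3. Transfer input data to the device
    q.enqueueWriteBuffer(in_buf, CL_TRUE, 0, in_size, host_in.data());

    // 4. Execute the kernel, and 5. profile it through the returned event
    cl::Event e;
    q.enqueueTask(krnl, NULL, &e);
    e.wait();
    cl_ulong start = 0, end = 0;
    e.getProfilingInfo(CL_PROFILING_COMMAND_START, &start);
    e.getProfilingInfo(CL_PROFILING_COMMAND_END, &end);

    // Transfer the results back to the host
    q.enqueueReadBuffer(out_buf, CL_TRUE, 0, out_size, host_out.data());
    q.finish();
    return 0;
}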

Wrappers around HLS Kernel(s)

All xfOpenCV kernels are provided as C++ function templates (located at /include) with image containers as objects of the xf::Mat class. In addition, these kernels work either in a stream-based manner (where the complete image is read continuously) or a memory-mapped manner (where image data is accessed in blocks).
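
For example, an image container for an 8-bit, single-channel image processed one pixel per clock might be declared as follows; HEIGHT and WIDTH are assumed compile-time maximums defined elsewhere, while the rows and cols constructor arguments are the runtime image dimensions:

// HEIGHT and WIDTH are assumed compile-time maximum dimensions
#define HEIGHT 1080
#define WIDTH  1920

// 8-bit, single-channel image, 1 pixel computed per clock
xf::Mat<XF_8UC1, HEIGHT, WIDTH, XF_NPPC1> img(rows, cols);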

The SDAccel flow (OpenCL) requires kernel interfaces to be memory pointers with widths that are powers of 2. So glue logic is required to convert the memory pointers to the xf::Mat class data type and vice versa when interacting with xfOpenCV kernel(s). Wrappers are built over the kernel(s) with this glue logic. The examples below provide a methodology to handle the different kernel types (stream-based and memory mapped); the xfOpenCV kernels are located at /include.

Stream Based Kernels

To facilitate the conversion of a pointer to xf::Mat and vice versa, two adapter functions are included as part of xfOpenCV: xf::Array2xfMat() and xf::xfMat2Array(). It is necessary for the xf::Mat objects to be invoked as streams using an HLS pragma with a minimum depth of 2. This results in a top-level (or wrapper) function for the kernel as shown below:

extern "C" {
void func_top (ap_uint<PTR_WIDTH> *gmem_in, ap_uint<PTR_WIDTH> *gmem_out, ...) {
    xf::Mat<…> in_mat(…), out_mat(…);
    #pragma HLS stream variable=in_mat.data depth=2
    #pragma HLS stream variable=out_mat.data depth=2
    #pragma HLS dataflow
    xf::Array2xfMat<…> (gmem_in, in_mat);
    xf::xfopencv-func<…> (in_mat, out_mat…);
    xf::xfMat2Array<…> (gmem_out, out_mat);
}
}

The above illustration assumes that the data in xf::Mat is streamed in and streamed out. You can also create a pipeline with multiple functions instead of just one xfOpenCV function.
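
For instance, a two-function pipeline might look like the following sketch, following the same pattern (the function names and template arguments are placeholders, as in the illustration above):

// inside the wrapper's dataflow region: chain two functions through an
// intermediate xf::Mat that is also streamed (placeholder names/templates)
xf::Mat<…> tmp_mat(…);
#pragma HLS stream variable=tmp_mat.data depth=2
xf::Array2xfMat<…> (gmem_in, in_mat);
xf::xfopencv-func1<…> (in_mat, tmp_mat, …);
xf::xfopencv-func2<…> (tmp_mat, out_mat, …);
xf::xfMat2Array<…> (gmem_out, out_mat);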

For stream-based kernels with multiple inputs of different sizes, multiple instances of the adapter functions are necessary. For example:

extern "C" {
void func_top (ap_uint<PTR_WIDTH> *gmem_in1, ap_uint<PTR_WIDTH> *gmem_in2, ap_uint<PTR_WIDTH> *gmem_in3, ap_uint<PTR_WIDTH> *gmem_out, ...) {
    xf::Mat<...,HEIGHT,WIDTH,…> in_mat1(…), out_mat(…);
    xf::Mat<...,HEIGHT/4,WIDTH,…> in_mat2(…), in_mat3(…);
    #pragma HLS stream variable=in_mat1.data depth=2
    #pragma HLS stream variable=in_mat2.data depth=2
    #pragma HLS stream variable=in_mat3.data depth=2
    #pragma HLS stream variable=out_mat.data depth=2
    #pragma HLS dataflow
    xf::accel_utils obj_a, obj_b;
    obj_a.Array2xfMat<…,HEIGHT,WIDTH,…> (gmem_in1, in_mat1);
    obj_b.Array2xfMat<…,HEIGHT/4,WIDTH,…> (gmem_in2, in_mat2);
    obj_b.Array2xfMat<…,HEIGHT/4,WIDTH,…> (gmem_in3, in_mat3);
    xf::xfopencv-func(in_mat1, in_mat2, in_mat3, out_mat…);
    xf::xfMat2Array<…> (gmem_out, out_mat);
}
}

For the stream-based implementations, the data must be fetched from the input AXI interface and pushed into an xf::Mat as required by the xfOpenCV kernels for that particular configuration. Likewise, the same operations must be performed for the output of the xfOpenCV kernel. To perform this, two utility functions are provided: xf::Array2xfMat() and xf::xfMat2Array().
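
For reference, a sketch of how these two adapters are plausibly declared. Only the xfMat2Array signature is confirmed in the section below; Array2xfMat is assumed here to mirror it with the transfer direction reversed:

// assumed declarations; Array2xfMat mirrors xfMat2Array with direction reversed
template <int PTR_WIDTH, int MAT_T, int ROWS, int COLS, int NPC>
void Array2xfMat(ap_uint<PTR_WIDTH>* srcPtr, xf::Mat<MAT_T, ROWS, COLS, NPC>& dstMat);

template <int PTR_WIDTH, int MAT_T, int ROWS, int COLS, int NPC>
void xfMat2Array(xf::Mat<MAT_T, ROWS, COLS, NPC>& srcMat, ap_uint<PTR_WIDTH>* dstPtr);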

xfMat2Array

This function converts the input xf::Mat to an output array. The output of an xfOpenCV kernel function is an xf::Mat, which must be converted to an output pointer (see the usage sketch after Table 1).

template <int PTR_WIDTH, int MAT_T, int ROWS, int COLS, int NPC>
void xfMat2Array(xf::Mat<MAT_T, ROWS, COLS, NPC>& srcMat, ap_uint<PTR_WIDTH>* dstPtr)
Table 1. xfMat2Array Parameter Description
Parameter Description
PTR_WIDTH Data width of the output pointer. The value must be a power of 2, from 8 to 512.
MAT_T Input Mat type. Examples: XF_8UC1, XF_16UC1, XF_8UC3, and XF_8UC4.
ROWS Maximum height of the image.
COLS Maximum width of the image.
NPC Number of pixels computed in parallel. Examples: XF_NPPC1, XF_NPPC8.
dstPtr Output pointer. The type of the pointer is based on PTR_WIDTH.
srcMat Input image of type xf::Mat.
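
A usage sketch of xfMat2Array for an 8-bit, single-channel, 8-pixel-parallel configuration. HEIGHT and WIDTH are assumed compile-time maximums, and the pointer width of 64 is the minimum for this configuration per Table 2 below:

// assumed configuration: XF_8UC1 with XF_NPPC8 -> minimum PTR_WIDTH of 64 (Table 2)
void wrapper_top(ap_uint<64>* gmem_out /*, ... */) {
    xf::Mat<XF_8UC1, HEIGHT, WIDTH, XF_NPPC8> out_mat(rows, cols);
    // ... run the xfOpenCV kernel that produces out_mat ...
    xf::xfMat2Array<64, XF_8UC1, HEIGHT, WIDTH, XF_NPPC8>(out_mat, gmem_out);
}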

Interface pointer widths

The minimum pointer widths for different configurations are shown in the following table. The pattern is that the minimum width is the number of bits consumed per clock cycle (bits per channel × channels × pixels per clock) rounded up to the next power of 2; for example, XF_8UC3 with XF_NPPC8 needs 8 × 3 × 8 = 192 bits, rounded up to 256. A small sketch of that arithmetic follows the table.
Table 2. Minimum and maximum pointer widths for different Mat types
MAT type Parallelism Min PTR_WIDTH Max PTR_WIDTH
XF_8UC1 XF_NPPC1 8 512
XF_16UC1 XF_NPPC1 16 512
XF_8UC1 XF_NPPC8 64 512
XF_16UC1 XF_NPPC8 128 512
XF_8UC3 XF_NPPC1 32 512
XF_8UC3 XF_NPPC8 256 512
XF_8UC4 XF_NPPC8 256 512
XF_8UC3 XF_NPPC16 512 512
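
The following standalone sketch reproduces that arithmetic. The helper and its name are illustrative, not part of xfOpenCV:

#include <cstdint>
#include <iostream>

// hypothetical helper (not part of xfOpenCV): round up to the next power of 2
uint32_t next_pow2(uint32_t v) {
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

// minimum pointer width = bits/channel * channels * pixels-per-clock, rounded up
uint32_t min_ptr_width(uint32_t bits_per_channel, uint32_t channels, uint32_t nppc) {
    return next_pow2(bits_per_channel * channels * nppc);
}

int main() {
    std::cout << min_ptr_width(8, 3, 8) << "\n";   // XF_8UC3, XF_NPPC8  -> 256 (matches Table 2)
    std::cout << min_ptr_width(8, 3, 16) << "\n";  // XF_8UC3, XF_NPPC16 -> 512 (matches Table 2)
}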

Design Example Using the Library on SDAccel

The following is a multi-kernel example, where different kernels run sequentially in a pipeline to form an application. This example performs Canny edge detection using two kernels: Canny and edge tracing. The Canny function takes a gray-scale image as input and provides the edge information in three states (weak edge (1), strong edge (3), and background (0)), which is fed into the edge tracing kernel, which filters out the weak edges. The former works as a stream-based implementation and the latter in a memory-mapped manner.

Host code

The following is the host code for the Canny edge detection example. The host code sets up the OpenCL platform with the FPGA for processing the required data. In the case of the xfOpenCV example, the data is an image. Reading and writing of images are enabled using calls to functions from xfOpenCV.
// setting up device and platform
std::vector<cl::Device> devices = xcl::get_xil_devices();
cl::Device device = devices[0];
cl::Context context(device);
cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
std::string device_name = device.getInfo<CL_DEVICE_NAME>();

// Kernel 1: Canny
std::string binaryFile = xcl::find_binary_file(device_name, "krnl_canny");
cl::Program::Binaries bins = xcl::import_binary_file(binaryFile);
devices.resize(1);
cl::Program program(context, devices, bins);
cl::Kernel krnl(program, "canny_accel");

// creating necessary cl buffers for input and output
cl::Buffer imageToDevice(context, CL_MEM_READ_ONLY, (height*width));
cl::Buffer imageFromDevice(context, CL_MEM_WRITE_ONLY, (height*width/4));

// Set the kernel arguments
krnl.setArg(0, imageToDevice);
krnl.setArg(1, imageFromDevice);
krnl.setArg(2, height);
krnl.setArg(3, width);
krnl.setArg(4, low_threshold);
krnl.setArg(5, high_threshold);

// write the input image data from host to device memory
q.enqueueWriteBuffer(imageToDevice, CL_TRUE, 0, (height*width), img_gray.data);

// Profiling Objects
cl_ulong start = 0;
cl_ulong end = 0;
double diff_prof = 0.0f;
cl::Event event_sp;

// Launch the kernel
q.enqueueTask(krnl, NULL, &event_sp);
clWaitForEvents(1, (const cl_event*)&event_sp);

// profiling
event_sp.getProfilingInfo(CL_PROFILING_COMMAND_START, &start);
event_sp.getProfilingInfo(CL_PROFILING_COMMAND_END, &end);
diff_prof = end - start;
std::cout << (diff_prof/1000000) << "ms" << std::endl;
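
The snippet above drives only the first kernel (canny_accel). The following is a sketch of how the second kernel might be set up and the result read back, assuming both kernels are linked into the same binary; the buffer name, output size, and argument order are illustrative assumptions:

// Kernel 2: edge tracing (sketch; names and sizes are assumptions)
cl::Kernel krnl2(program, "edgetracing_accel");
cl::Buffer edgeOut(context, CL_MEM_WRITE_ONLY, (height * width));

krnl2.setArg(0, imageFromDevice); // packed output of canny_accel becomes the input
krnl2.setArg(1, edgeOut);
krnl2.setArg(2, height);
krnl2.setArg(3, width);

cl::Event event_sp2;
q.enqueueTask(krnl2, NULL, &event_sp2);
clWaitForEvents(1, (const cl_event*)&event_sp2);

// transfer the final edge image back to the host
q.enqueueReadBuffer(edgeOut, CL_TRUE, 0, (height * width), out_img.data);
q.finish();
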
Top level kernel

Below is the top-level/wrapper function with all necessary glue logic.

// streaming based kernel
#include "xf_canny_config.h"
extern "C" {
void canny_accel(ap_uint<INPUT_PTR_WIDTH> *img_inp, ap_uint<OUTPUT_PTR_WIDTH> *img_out, int rows, int cols, int low_threshold, int high_threshold)
{
#pragma HLS INTERFACE m_axi port=img_inp offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=img_out offset=slave bundle=gmem2
#pragma HLS INTERFACE s_axilite port=img_inp bundle=control
#pragma HLS INTERFACE s_axilite port=img_out bundle=control
#pragma HLS INTERFACE s_axilite port=rows bundle=control
#pragma HLS INTERFACE s_axilite port=cols bundle=control
#pragma HLS INTERFACE s_axilite port=low_threshold bundle=control
#pragma HLS INTERFACE s_axilite port=high_threshold bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    xf::Mat<…> in_mat(rows, cols);
#pragma HLS stream variable=in_mat.data depth=2
    xf::Mat<…> dst_mat(rows, cols);
#pragma HLS stream variable=dst_mat.data depth=2

#pragma HLS DATAFLOW
    xf::Array2xfMat(img_inp, in_mat);
    xf::Canny<…>(in_mat, dst_mat, low_threshold, high_threshold);
    xf::xfMat2Array(dst_mat, img_out);
}
}

// memory mapped kernel
#include "xf_canny_config.h"
extern "C" {
void edgetracing_accel(ap_uint<INPUT_PTR_WIDTH> *img_inp, ap_uint<OUTPUT_PTR_WIDTH> *img_out, int rows, int cols)
{
#pragma HLS INTERFACE m_axi port=img_inp offset=slave bundle=gmem3
#pragma HLS INTERFACE m_axi port=img_out offset=slave bundle=gmem4
#pragma HLS INTERFACE s_axilite port=img_inp bundle=control
#pragma HLS INTERFACE s_axilite port=img_out bundle=control
#pragma HLS INTERFACE s_axilite port=rows bundle=control
#pragma HLS INTERFACE s_axilite port=cols bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    xf::Mat<…> _dst1(rows, cols, img_inp);
    xf::Mat<…> _dst2(rows, cols, img_out);
    xf::EdgeTracing(_dst1, _dst2);
}
}

Evaluating the Functionality

You can build the kernels and test the functionality through software emulation, hardware emulation, or by running directly on supported hardware with the FPGA. For PCIe based platforms, use the following commands to set up the environment:

$ cd 
$ source /SDx//settings64.sh
$ source /packages/setenv.sh
$ export PLATFORM_PATH=
$ export XLNX_SRC_PATH=
$ export XILINX_CL_PATH=/usr

Software Emulation

Software emulation is equivalent to running a C simulation of the kernel. Compilation time is minimal, so software emulation is recommended as the first step in testing the kernel. Following are the steps to build and run for software emulation:

$ make all TARGETS=sw_emu
$ export XCL_EMULATION_MODE=sw_emu
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/SDx/2019.1/lnx64/tools/opencv:/usr/lib64
$ ./ 

Hardware Emulation

Hardware emulation runs the test on the RTL generated after synthesis of the C/C++ code. Because the simulation is performed on RTL, it takes longer to complete than software emulation. Following are the steps to build and run for hardware emulation:

$ make all TARGETS=hw_emu
$ export XCL_EMULATION_MODE=hw_emu
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/SDx/2019.1/lnx64/tools/opencv:/usr/lib64
$ ./ 

Testing on the Hardware

To test on the hardware, the kernel must be compiled into a bitstream (building for hardware).

$ make all TARGETS=hw

This takes some time, because the C/C++ code must be converted to RTL and run through the synthesis and implementation processes before a bitstream is created. As a prerequisite, the drivers must be installed for the corresponding DSA for which the example was built. Following are the steps to run the kernel on hardware:

$ source /opt/xilinx/xrt/setup.sh
$ export XILINX_XRT=/opt/xilinx/xrt
$ cd 
$ ./