Real-World Examples

This chapter describes some real-world examples and shows the following:

  • How these examples are optimized using both the top-down flow and bottom-up flow
    • The top-down flow is demonstrated using a Lucas-Kanade (LK) Optical Flow algorithm.
    • The bottom-up flow is demonstrated using a stereo vision block matching algorithm.
  • What optimization directives were applied
  • Why those directives were chosen

Top-Down: Optical Flow Algorithm

The Lucas-Kanade (LK) method is a widely used, differential method for optical flow estimation, that is, the estimation of the movement of pixels between two related images. In this example system, the related images are the current and previous images of a video stream. The LK method is a compute-intensive algorithm that works over a window of neighboring pixels, using a least-squares difference to find matching pixels.
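
For reference, the standard LK formulation, stated here as background rather than taken from the design source, solves the following least-squares system for the flow vector (u, v) at each pixel, with the sums taken over the neighborhood window. These sums appear to correspond to the ixix, ixiy, iyiy, dix, and diy terms computed later in the example:

$$
\begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
= -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}
$$

where I_x and I_y are the spatial image gradients and I_t is the temporal difference between the two frames.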

The following code example shows how the algorithm is implemented: two input frames are read from a file, processed through the function fpga_optflow, and the results are written to an output file.

int main() {
  FILE *f;
  pix_t *inY1 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
  yuv_t *inCY1 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
  pix_t *inY2 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
  yuv_t *inCY2 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
  yuv_t *outCY = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
  printf("allocated buffers\n");

  f = fopen(FILEINAME, "rb");
  if (f == NULL) {
    printf("failed to open file %s\n", FILEINAME);
    return -1;
  }
  printf("opened file %s\n", FILEINAME);

  read_yuv_frame(inY1, WIDTH, WIDTH, HEIGHT, f);
  printf("read 1st %dx%d frame\n", WIDTH, HEIGHT);
  read_yuv_frame(inY2, WIDTH, WIDTH, HEIGHT, f);
  printf("read 2nd %dx%d frame\n", WIDTH, HEIGHT);
  fclose(f);
  printf("closed file %s\n", FILEINAME);

  convert_Y8toCY16(inY1, inCY1, HEIGHT*WIDTH);
  printf("converted 1st frame to 16bit\n");
  convert_Y8toCY16(inY2, inCY2, HEIGHT*WIDTH);
  printf("converted 2nd frame to 16bit\n");

  fpga_optflow(inCY1, inCY2, outCY, HEIGHT, WIDTH, WIDTH, 10.0);
  printf("computed optical flow\n");

  // write optical flow data image to disk
  write_yuv_file(outCY, WIDTH, WIDTH, HEIGHT, ONAME);

  sds_free(inY1);
  sds_free(inCY1);
  sds_free(inY2);
  sds_free(inCY2);
  sds_free(outCY);
  printf("freed buffers\n");
  return 0;
}

This method is typical for a top-down design flow using standard C/C++ data types.

The function fpga_optflow is shown in the following code example and contains the following sub-functions:

  • readMatRows
  • computeSum
  • computeFlow
  • getOutPix
  • writeMatRows
int fpga_optflow (yuv_t *frame0, yuv_t *frame1, yuv_t *framef,
                  int height, int width, int stride, float clip_flowmag)
{
#ifdef COMPILEFORSW
  int img_pix_count = height*width;
#else
  int img_pix_count = 10;
#endif
  if (f0Stream == NULL) f0Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (f1Stream == NULL) f1Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (ffStream == NULL) ffStream = (yuv_t *) malloc(sizeof(yuv_t) * img_pix_count);
  if (ixix == NULL) ixix = (int *) malloc(sizeof(int) * img_pix_count);
  if (ixiy == NULL) ixiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (iyiy == NULL) iyiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (dix == NULL) dix = (int *) malloc(sizeof(int) * img_pix_count);
  if (diy == NULL) diy = (int *) malloc(sizeof(int) * img_pix_count);
  if (fx == NULL) fx = (float *) malloc(sizeof(float) * img_pix_count);
  if (fy == NULL) fy = (float *) malloc(sizeof(float) * img_pix_count);

  readMatRows (frame0, f0Stream, height, width, stride);
  readMatRows (frame1, f1Stream, height, width, stride);
  computeSum (f0Stream, f1Stream, ixix, ixiy, iyiy, dix, diy, height, width);
  computeFlow (ixix, ixiy, iyiy, dix, diy, fx, fy, height, width);
  getOutPix (fx, fy, ffStream, height, width, clip_flowmag);
  writeMatRows (ffStream, framef, height, width, stride);
  return 0;
}

In this example, all of the functions in fpga_optflow process live video data and can benefit from hardware acceleration, with DMAs used to transfer the data to and from the PS. If all five functions are annotated as hardware functions, the topology of the system is shown in the following figure:

Figure: System Topology

The system can be compiled into hardware, and event tracing can be used to analyze the performance in detail.

The issue here is that the system takes a long time to complete: approximately 15 seconds for a single frame. To process HD video, the system must process 60 frames per second, or one frame every 16.7 ms, which is roughly 900 times faster than this initial implementation. You can use optimization directives, as described below, to ensure the system meets the target performance.

Optical Flow Memory Access Optimization

The first task is to optimize the transfer of data. In this case, because the system processes streaming video, where each sample is processed in consecutive order, the memory transfer optimization is used to ensure the SDSoC™ environment interprets all accesses as sequential in nature.

This is performed by adding SDS pragmas before the function signatures for all functions involved.

#pragma SDS data access_pattern(matB:SEQUENTIAL, pixStream:SEQUENTIAL)
#pragma SDS data mem_attribute(matB:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(matB[0:stride*height])
void readMatRows (yuv_t *matB, pix_t* pixStream, int height, int width, int stride);

#pragma SDS data access_pattern(pixStream:SEQUENTIAL, dst:SEQUENTIAL)
#pragma SDS data mem_attribute(dst:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(dst[0:stride*height])
void writeMatRows (yuv_t* pixStream, yuv_t *dst, int height, int width, int stride);

#pragma SDS data access_pattern(f0Stream:SEQUENTIAL, f1Stream:SEQUENTIAL)
#pragma SDS data access_pattern(ixix_out:SEQUENTIAL, ixiy_out:SEQUENTIAL, iyiy_out:SEQUENTIAL)
#pragma SDS data access_pattern(dix_out:SEQUENTIAL, diy_out:SEQUENTIAL)
void computeSum(pix_t* f0Stream, pix_t* f1Stream, int* ixix_out, int* ixiy_out,
                int* iyiy_out, int* dix_out, int* diy_out, int height, int width);

#pragma SDS data access_pattern(ixix:SEQUENTIAL, ixiy:SEQUENTIAL, iyiy:SEQUENTIAL)
#pragma SDS data access_pattern(dix:SEQUENTIAL, diy:SEQUENTIAL)
#pragma SDS data access_pattern(fx_out:SEQUENTIAL, fy_out:SEQUENTIAL)
void computeFlow(int* ixix, int* ixiy, int* iyiy, int* dix, int* diy,
                 float* fx_out, float* fy_out, int height, int width);

#pragma SDS data access_pattern(fx:SEQUENTIAL, fy:SEQUENTIAL, out_pix:SEQUENTIAL)
void getOutPix (float* fx, float* fy, yuv_t* out_pix, int height, int width, float clip_flowmag);

For the readMatRows and writeMatRows function arguments, which interface with the processor, the memory transfers are specified as sequential accesses from physically contiguous memory, and the data is copied to and from the hardware function rather than simply accessed in place by the accelerator. This ensures the data is copied efficiently. The following options are available (a combined sketch follows the list):

Sequential
The data is transferred in the same sequential manner as it is processed. This type of transfer requires the least hardware overhead for high data processing rates and means an area-efficient datamover is used.
Contiguous
The data is accessed from contiguous memory. This ensures there is no scatter-gather overhead in the data transfer and that an efficient, fast hardware datamover is used. This directive is supported by the associated sds_alloc library call in the main() function, which ensures data for these arguments is stored in contiguous memory.
Copy
The data is copied to and from the accelerator, negating the need for data accesses back to the CPU or DDR memory. Because pointers are used, the size of the data to be copied must be specified.
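
The following minimal sketch, using a hypothetical function foo, shows how the three options combine on a single pointer argument; the pragma syntax matches the readMatRows example above:

#pragma SDS data access_pattern(buf:SEQUENTIAL)         // Sequential: streaming datamover
#pragma SDS data mem_attribute(buf:PHYSICAL_CONTIGUOUS) // Contiguous: no scatter-gather
#pragma SDS data copy(buf[0:size])                      // Copy: a pointer requires an explicit size
void foo(pix_t *buf, int size);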

For the remaining hardware functions, the data transfers are specified as sequential, allowing the most efficient hardware to be used to connect the functions in the programmable logic (PL) fabric.

Optical Flow Hardware Function Optimization

The hardware functions also require optimization directives to execute at the highest level of performance. These are already present in the design example. Reviewing them highlights the lessons learned from Understanding the Hardware Function Optimization Methodology. Most of the hardware functions in this design example are optimized primarily with the PIPELINE directive, in a manner similar to the getOutPix function.

Review of the getOutPix function shows:

  • The sub-functions have an INLINE optimization applied to ensure the logic from these functions is merged with the function above. This occurs automatically for small functions, but use of this directive ensures the sub-functions are always inlined, and there is no need to pipeline the sub-functions.
  • The inner loop of the getOutPix function is the loop that processes data at the level of each pixel and is optimized with the PIPELINE directive to ensure it processes one pixel per clock.
pix_t getLuma (float fx, float fy, float clip_flowmag)
{
#pragma HLS inline
  float rad = sqrtf (fx*fx + fy*fy);
  if (rad > clip_flowmag) rad = clip_flowmag; // clamp to MAX
  rad /= clip_flowmag;                        // convert 0..MAX to 0.0..1.0
  pix_t pix = (pix_t) (255.0f * rad);
  return pix;
}

pix_t getChroma (float f, float clip_flowmag)
{
#pragma HLS inline
  if (f > clip_flowmag)    f = clip_flowmag;  // clamp big positive f to  MAX
  if (f < (-clip_flowmag)) f = -clip_flowmag; // clamp big negative f to -MAX
  f /= clip_flowmag;                          // convert -MAX..MAX to -1.0..1.0
  pix_t pix = (pix_t) (127.0f * f + 128.0f);  // convert -1.0..1.0 to 1..255
  return pix;
}

void getOutPix (float* fx, float* fy, yuv_t* out_pix, int height, int width, float clip_flowmag)
{
  int pix_index = 0;
  for (int r = 0; r < height; r++) {
    for (int c = 0; c < width; c++) {
#pragma HLS PIPELINE
      float fx_ = fx[pix_index];
      float fy_ = fy[pix_index];
      pix_t outLuma = getLuma (fx_, fy_, clip_flowmag);
      pix_t outChroma = (c&1) ? getChroma (fy_, clip_flowmag) : getChroma (fx_, clip_flowmag);
      yuv_t yuvpix;
      yuvpix = ((yuv_t)outChroma << 8) | outLuma;
      out_pix[pix_index++] = yuvpix;
    }
  }
}

If you examine the computeSum function, you will find examples of the ARRAY_PARTITION and DEPENDENCE directives. In this function, the ARRAY_PARTITION directive is used on the array img1Win. Because img1Win is an array, it is implemented by default in a block RAM, which has a maximum of two ports. As the following code summary shows, img1Win is:

  • Used in a for-loop that is pipelined to process one sample per clock cycle.
  • Read from 8 + (KMEDP1-1) + (KMEDP1-1) times within the for-loop.
  • Written to (KMEDP1-1) + (KMEDP1-1) times within the for-loop.
void computeSum(pix_t* f0Stream, pix_t* f1Stream, int* ixix_out, int* ixiy_out,
                int* iyiy_out, int* dix_out, int* diy_out)
{
  static pix_t img1Win [2 * KMEDP1], img2Win [1 * KMEDP1];
#pragma HLS ARRAY_PARTITION variable=img1Win complete dim=0
  ...
  for (int r = 0; r < MAX_HEIGHT; r++) {
    for (int c = 0; c < MAX_WIDTH; c++) {
#pragma HLS PIPELINE
      ...
      int cIxTopR = (img1Col_ [wrt] - img1Win [wrt*2 + 2-2]) / 2;
      int cIyTopR = (img1Win [(wrt+1)*2 + 2-1] - img1Win [(wrt-1)*2 + 2-1]) / 2;
      int delTopR = img1Win [wrt*2 + 2-1] - img2Win [wrt*1 + 1-1];
      ...
      int cIxBotR = (img1Col_ [wrb] - img1Win [wrb*2 + 2-2]) / 2;
      int cIyBotR = (img1Win [(wrb+1)*2 + 2-1] - img1Win [(wrb-1)*2 + 2-1]) / 2;
      int delBotR = img1Win [wrb*2 + 2-1] - img2Win [wrb*1 + 1-1];
      ...
      // shift windows
      for (int i = 0; i < KMEDP1; i++) {
        img1Win [i*2] = img1Win [i*2 + 1];
      }
      for (int i = 0; i < KMEDP1; ++i) {
        img1Win [i*2 + 1] = img1Col_ [i];
        ...
      }
      ...
    } // for c
  } // for r
  ...
}

Because a block RAM supports a maximum of two accesses per clock cycle, these accesses cannot all be made in one clock cycle. As noted previously in the methodology, the ARRAY_PARTITION directive is used to partition the array into smaller blocks, in this case into individual elements, by using the complete option. This enables parallel access to all elements of the array at the same time and ensures that the for-loop processes data every clock cycle.
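
As a minimal sketch of the effect, using a hypothetical function rather than the design code, the loop below makes five accesses to buf in every iteration, which a two-port block RAM cannot service in one clock cycle; complete partitioning turns each element into a register so all five accesses can occur in the same cycle and the pipeline reaches II = 1:

void sum4(int in[64], int out[64]) {
  int buf[4] = {0, 0, 0, 0};
#pragma HLS ARRAY_PARTITION variable=buf complete dim=0
  for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE
    buf[i & 3] = in[i];                         // one write per iteration
    out[i] = buf[0] + buf[1] + buf[2] + buf[3]; // four reads per iteration
  }
}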

The final optimization directive worth reviewing is the DEPENDENCE directive. The csIxix array has a DEPENDENCE directive applied to it. The array is read from and then written to using different indices, as shown in the following code example, which performs these reads and writes within a pipelined loop.

void computeSum(pix_t* f0Stream, pix_t* f1Stream, int* ixix_out, int* ixiy_out,
                int* iyiy_out, int* dix_out, int* diy_out)
{
  ...
  static int csIxix [MAX_WIDTH], csIxiy [MAX_WIDTH], csIyiy [MAX_WIDTH],
             csDix [MAX_WIDTH], csDiy [MAX_WIDTH];
  ...
#pragma HLS DEPENDENCE variable=csIxix inter WAR false
  ...
  int zIdx = -(KMED-2);
  int nIdx = zIdx + KMED-2;

  for (int r = 0; r < MAX_HEIGHT; r++) {
    for (int c = 0; c < MAX_WIDTH; c++) {
#pragma HLS PIPELINE
      ...
      if (zIdx >= 0) {
        csIxixL = csIxix [zIdx];
        ...
      }
      ...
      csIxix [nIdx] = csIxixR;
      ...
      zIdx++; if (zIdx == MAX_WIDTH) zIdx = 0;
      nIdx++; if (nIdx == MAX_WIDTH) nIdx = 0;
      ...
    } // for c
  } // for r
  ...
}

When a loop is pipelined in hardware, the accesses to the array overlap in time. The compiler analyzes all accesses to an array and issues a warning if any condition exists where the write in iteration N could overwrite the data needed by iteration N + K, thus changing the value. This conservative assumption prevents the loop from being implemented as a pipeline with II = 1.

The following example shows read and write operations for a loop over multiple iterations for an array with indices 0 through 9. As in the code above, the address counters can differ between the read and write operations and can wrap back to zero before all loop iterations are complete. The operations are shown overlapped in time, just as in a pipelined implementation.

R4---------W8
 R5---------W9
  R6---------W0
   R7---------W1
    R8---------W2
     R9---------W3
      R0---------W4
       R1---------W5
        R2---------W6

In sequential C code, where each iteration completes before the next starts, the order in which the reads and writes occur is clear. However, in a concurrent hardware pipeline, the accesses can overlap and occur in different orders. As can be seen above, it is possible for the read from index 8 (R8) to occur in time before the write to index 8 (W8), which is meant to complete some iterations before R8.

The compiler warns of this condition, and the DEPENDENCE directive is used with the setting false to tell the compiler that no such dependence exists between these reads and writes, allowing the compiler to create the pipelined hardware that performs with II = 1.

The DEPENDENCE directive is typically used to inform the compiler of algorithm behaviors and conditions, external to the function, of which it is unaware from static analysis of the code. If a DEPENDENCE directive is set incorrectly, the issue will be discovered in hardware emulation when the results from the hardware differ from those achieved with the software.
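
A minimal sketch of the same pattern, with a hypothetical function and a hypothetical compile-time constant LAG standing in for the zIdx/nIdx logic above: the write index always leads the read index, so the write-after-read hazard flagged by static analysis can never occur, and the directive declares it false:

void ring_delay(int *in, int *out) {
  static int buf[MAX_WIDTH];
#pragma HLS DEPENDENCE variable=buf inter WAR false
  int rd = 0, wr = LAG; // the write index leads the read index by LAG positions
  for (int c = 0; c < MAX_WIDTH; c++) {
#pragma HLS PIPELINE
    out[c] = buf[rd]; // read the old value at the lagging index
    buf[wr] = in[c];  // write the new value at the leading index
    rd++; if (rd == MAX_WIDTH) rd = 0;
    wr++; if (wr == MAX_WIDTH) wr = 0;
  }
}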

Optical Flow Results

With both the data transfers and hardware functions optimized, the hardware functions are recompiled, and the performance is analyzed using event traces. The figure below shows the start of the event traces and clearly shows that the pipelined hardware functions do not wait for the previous function to complete; each hardware function begins to process data as soon as data becomes available.

Figure: Trace Result

The complete view of the event traces shows all hardware functions and data transfers executing in parallel for the highest performing system, as shown in the following figure.

Figure: Event Traces

To get the duration, hover over one of the lanes to display a popup window that shows the duration of the accelerator runtime. The execution time is just under 15.5 ms, which meets the 16.7 ms target necessary to achieve 60 frames per second. The following figure shows the trace legend for the AXI State View:

Figure: AXI State View Trace Legend

Software
Execution done on the Arm® processor core.
Accelerator
Execution done in the accelerator(s).
Transfer
Data being transferred from the Arm processor core.
Receive
Data being received by the Arm processor core.

Bottom-Up: Stereo Vision Algorithm

The stereo vision algorithm uses images from two cameras horizontally displaced from each other. This provides two different views of the scene from different vantage points, similar to human vision. To obtain relative depth information about the scene, the two images are compared to build a disparity map. The disparity map encodes the relative positions of objects in the horizontal coordinates such that the values are inversely proportional to the scene depth at the corresponding pixel location.
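
For reference, in the standard pinhole stereo geometry (background context, not part of the design source), the disparity d of a point at depth Z is

$$ d = \frac{f\,B}{Z} $$

where f is the focal length and B is the baseline between the two cameras, so a larger disparity indicates a closer object.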

The bottom-up methodology starts with a fully optimized hardware design that has already been synthesized using the Vivado® High-Level Synthesis (HLS) tool, and then integrates the pre-optimized hardware function with software in the SDSoC environment.

This flow allows hardware designers who are already familiar with the HLS tool to build and optimize the entire hardware function first, using advanced HLS tool features, and then allows software programmers to leverage this existing work.

The following section uses the stereo vision design example to take you through the steps of starting with an optimized hardware function in the HLS tool and building an application that integrates the full system, with hardware and software running on the board, using the SDSoC environment. The following figure shows the final system to be realized and highlights the existing stereo_remap_bm hardware function to be incorporated into the SDSoC environment.

Figure: Block Diagram of System



In the bottom-up flow, the general optimization methodology for the SDSoC environment, as detailed in this guide, is reversed. By definition, you would start with an optimized hardware function, and then seek to incorporate it into the SDSoC environment and optimize the data transfers.

Stereo Vision Hardware Function Optimization

The following code example shows the existing stereo_remap_bm hardware function with the optimization pragmas. Before reviewing the optimization directives, note the following details about the function:

  • The hardware function contains the sub-functions readLRinput, writeDispOut, and SaveAsGray, which have also been optimized.
  • The hardware function also uses pre-optimized functions, prefixed with the namespace hls, from the Vivado HLS tool video library, hls_video.h. These sub-functions use the library's own hls::Mat data type.
#include "hls_video.h" #include "top.h" #include "transform.h" void readLRinput (yuv_t *inLR, hls::Mat& img_l, hls::Mat& img_r, int height, int dual_width, int width, int stride) { for (int i=0; i < height; ++i) {#pragma HLS loop_tripcount min=1080 max=1080 avg=1080for (int j=0; j < stride; ++j) {#pragma HLS loop_tripcount min=1920 max=1920 avg=1920#pragma HLS PIPELINEyuv_t tmpData = inLR [i*stride + j]; // from yuv_t array: consume height*stride if (j < width) img_l.write (tmpData & 0x00FF); // to HLS_8UC1 stream else if (j < dual_width) img_r.write (tmpData & 0x00FF); // to HLS_8UC1 stream } } } void writeDispOut(hls::Mat& img_d, yuv_t *dst, int height, int width, int stride) { pix_t tmpOut; yuv_t outData; for (int i=0; i < height; ++i) {#pragma HLS loop_tripcount min=1080 max=1080 avg=1080for (int j=0; j < stride; ++j) {#pragma HLS loop_tripcount min=960 max=960 avg=960#pragma HLS PIPELINEif (j < width) { tmpOut = img_d.read().val[0]; outData = ((yuv_t) 0x8000) | ((yuv_t)tmpOut); dst [i*stride +j] = outData; } else { outData = (yuv_t) 0x8000; dst [i*stride +j] = outData; } } } } namespace hls { void SaveAsGray( Mat& src, Mat& dst) { int height = src.rows; int width = src.cols; for (int i = 0; i < height; i++) {#pragma HLS loop_tripcount min=1080 max=1080 avg=1080for (int j = 0; j < width; j++) {#pragma HLS loop_tripcount min=960 max=960 avg=960#pragma HLS pipeline II=1Scalar<1, short> s; Scalar<1, unsigned char> d; src >> s; short uval = (short) (abs ((int)s.val[0])); // Scale to avoid overflow. The right scaling here for a // good picture depends on the NDISP parameter during // block matching. d.val[0] = (unsigned char)(uval >> 1); //d.val[0] = (unsigned char)(s.val[0] >> 1); dst << d; } } } } // namespace hls int stereo_remap_bm_new( yuv_t *img_data_lr, yuv_t *img_data_disp, hls::Window<3, 3, param_T > &lcameraMA_l, hls::Window<3, 3, param_T > &lcameraMA_r, hls::Window<3, 3, param_T > &lirA_l, hls::Window<3, 3, param_T > &lirA_r, param_T (&ldistC_l)[5], param_T (&ldistC_r)[5], int height, // 1080 int dual_width, // 1920 (two 960x1080 images side by side) int stride_in, // 1920 (two 960x1080 images side by side) int stride_out) // 960 { int width = dual_width/2; // 960#pragma HLS DATAFLOWhls::Mat img_l(height, width); hls::Mat img_r(height, width); hls::Mat img_l_remap(height, width); // remapped left image hls::Mat img_r_remap(height, width); // remapped left image hls::Mat img_d(height, width); hls::Mat map1_l(height, width); hls::Mat map1_r(height, width); hls::Mat map2_l(height, width); hls::Mat map2_r(height, width); hls::Mat img_disp(height, width); hls::StereoBMState<15, 32, 32> state; // ddr -> kernel streams: extract luma from left and right yuv images // store it in single channel HLS_8UC1 left and right Mat's readLRinput (img_data_lr, img_l, img_r, height, dual_width, width, stride_in); //////////////////////// remap left and right images, all types are HLS_8UC1 ////////// hls::InitUndistortRectifyMapInverse(lcameraMA_l, ldistC_l, lirA_l, map1_l, map2_l); hls::Remap<8>(img_l, img_l_remap, map1_l, map2_l, HLS_INTER_LINEAR); hls::InitUndistortRectifyMapInverse(lcameraMA_r, ldistC_r, lirA_r, map1_r, map2_r); hls::Remap<8>(img_r, img_r_remap, map1_r, map2_r, HLS_INTER_LINEAR); ////////// find disparity of remapped images ////////// hls::FindStereoCorrespondenceBM(img_l_remap, img_r_remap, img_disp, state); hls::SaveAsGray(img_disp, img_d); // kernel stream -> ddr : output single wide writeDispOut (img_d, img_data_disp, height, width, stride_out); return 0; } int 
stereo_remap_bm( yuv_t *img_data_lr, yuv_t *img_data_disp, int height, // 1080 int dual_width, // 1920 (two 960x1080 images side by side) int stride_in, // 1920 (two 960x1080 images side by side) int stride_out) // 960 { //1920*1080 //#pragma HLS interface m_axi port=img_data_lr depth=2073600 //#pragma HLS interface m_axi port=img_data_disp depth=2073600 hls::Window<3, 3, param_T > lcameraMA_l; hls::Window<3, 3, param_T > lcameraMA_r; hls::Window<3, 3, param_T > lirA_l; hls::Window<3, 3, param_T > lirA_r; param_T ldistC_l[5]; param_T ldistC_r[5]; for (int i=0; i<3; i++) { for (int j=0; j<3; j++) { lcameraMA_l.val[i][j]=cameraMA_l[i*3+j]; lcameraMA_r.val[i][j]=cameraMA_r[i*3+j]; lirA_l.val[i][j]=irA_l[i*3+j]; lirA_r.val[i][j]=irA_r[i*3+j]; } } for (int i=0; i<5; i++) { ldistC_l[i] = distC_l[i]; ldistC_r[i] = distC_r[i]; } int ret = stereo_remap_bm_new(img_data_lr, img_data_disp, lcameraMA_l, lcameraMA_r, lirA_l, lirA_r, ldistC_l, ldistC_r, height, dual_width, stride_in, stride_out); return ret; }

As noted in Understanding the Hardware Function Optimization Methodology, the primary optimization directives used are the PIPELINE and DATAFLOW directives. Additionally, the LOOP_TRIPCOUNT directive is used.

Based on the recommendations for optimizing hardware functions, which process frames of data, the PIPELINE directives are all applied to for-loops that process data at the sample level, or in this case, the pixel level. This ensures hardware pipelining is used to achieve the highest performing design.

The LOOP_TRIPCOUNT directives are used on for-loops for which the upper bound of the loop index is defined by a variable whose exact value is unknown at compile time. The estimated tripcount, or loop iteration count, allows the reports generated by the HLS tool to include expected values for latency and initiation interval (II) instead of unknowns. This directive has no impact on the hardware created; it only affects reporting.
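
A minimal sketch with a hypothetical function, using the same pragma syntax as the design: both loop bounds are runtime variables, so the LOOP_TRIPCOUNT estimates let the HLS report state expected latency and II values instead of unknowns:

void scale_rows(pix_t *src, pix_t *dst, int height, int width) {
  for (int i = 0; i < height; i++) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
    for (int j = 0; j < width; j++) {
#pragma HLS loop_tripcount min=1920 max=1920 avg=1920
#pragma HLS PIPELINE
      dst[i*width + j] = src[i*width + j] >> 1; // halve each pixel value
    }
  }
}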

The top-level stereo_remap_bm function is composed of the optimized sub-functions and a number of functions from the HLS tool video library (hls_video.h). For details about the library functions provided by the HLS tool video library, refer to the Vivado Design Suite User Guide: High-Level Synthesis (UG902).

The functions provided in the HLS tool video library are already pre-optimized and contain all the optimization directives to ensure they are implemented with the highest possible performance. The top-level function is therefore composed of sub-functions that are all optimized, and it only requires the DATAFLOW directive to ensure each sub-function starts to execute in hardware as soon as data becomes available.

int stereo_remap_bm(..)
{
#pragma HLS DATAFLOW
  readLRinput (img_data_lr, img_l, img_r, height, dual_width, width, stride);
  hls::InitUndistortRectifyMapInverse(lcameraMA_l, ldistC_l, lirA_l, map1_l, map2_l);
  hls::Remap<8>(img_l, img_l_remap, map1_l, map2_l, HLS_INTER_LINEAR);
  hls::InitUndistortRectifyMapInverse(lcameraMA_r, ldistC_r, lirA_r, map1_r, map2_r);
  hls::Remap<8>(img_r, img_r_remap, map1_r, map2_r, HLS_INTER_LINEAR);
  hls::Duplicate(img_l_remap, img_l_remap_bm, img_l_remap_pt);
  hls::FindStereoCorrespondenceBM(img_l_remap_bm, img_r_remap, img_disp, state);
  hls::SaveAsGray(img_disp, img_d);
  writeDispOut (img_l_remap_pt, img_d, img_data_disp, height, dual_width, width, stride);
}

In general, the DATAFLOW optimization is not required because the SDSoC™ environment automatically ensures that data is passed from one hardware function to the next as soon as it becomes available. In this example, however, the functions within stereo_remap_bm use the HLS tool data type hls::stream, which cannot be compiled on the Arm® processor and cannot be used in the hardware function interface in the SDSoC environment. For this reason, the top-level hardware function must be stereo_remap_bm, and the DATAFLOW directive is used to achieve high-performance transfers between the sub-functions. If this were not the case, the DATAFLOW directive could be removed, and each sub-function within stereo_remap_bm could be specified as a hardware function.

The hardware functions in this design example use the Mat data type, which is based on the HLS tool data type hls::stream. The hls::stream data type can only be accessed in a sequential manner: data is pushed on and popped off (see the sketch after the list below).

  • In software simulation, the hls::stream data type has infinite size.
  • In hardware, the hls::stream data type is implemented as a single register and can only store one data value at a time, because it is expected that the streaming data is consumed before the previous value is overwritten.
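
The following minimal sketch, with a hypothetical function, shows the sequential push/pop access pattern; hls::stream supports read() and write() calls as well as the >> and << operators used by the Mat-based library functions above:

#include "hls_stream.h"

void double_stream(hls::stream<int> &in, hls::stream<int> &out, int n) {
  for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
    int v = in.read(); // pop one value from the input stream
    out.write(2 * v);  // push the result onto the output stream
  }
}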

By specifying the top-level stereo_remap_bm function as the hardware function, the effects of these hardware types can be ignored in the software environment; however, when these functions are incorporated into the SDSoC environment, they cannot be compiled on the Arm processor, and the system can only be verified through hardware emulation, executing on the target platform, or both.

IMPORTANT: When incorporating hardware functions that contain HLS tool hardware data types into the SDSoC environment, ensure the functions have been fully verified through C compilation and hardware simulation within the HLS tool environment.
IMPORTANT: The hls::stream data type is designed for use within the HLS tool, but is unsuitable for running software on embedded CPUs. Therefore, this type should not be part of the top-level hardware function interface.

If any of the arguments of the hardware function use any HLS tool specific data types, the function must be enclosed by a top-level C/C++ wrapper function that exposes only native C/C++ types in the function argument list.
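
A minimal sketch of this wrapper pattern with hypothetical names, mirroring how stereo_remap_bm builds hls::Window objects before calling stereo_remap_bm_new: the inner function takes an HLS-specific type, while the wrapper exposes only native C types in its argument list:

int filter_hw(yuv_t *src, yuv_t *dst,
              hls::Window<3, 3, param_T> &coeffs,
              int height, int width); // HLS type in the argument list

int filter_wrapper(yuv_t *src, yuv_t *dst,
                   const param_T coeffs[9], int height, int width) {
  hls::Window<3, 3, param_T> w;
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      w.val[i][j] = coeffs[i*3 + j]; // copy the plain C array into the HLS type
  return filter_hw(src, dst, w, height, width);
}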

Optimizing the Data Motion Network

After importing the pre-optimized hardware function into a project in the SDSoC environment, the first task is to remove any interface optimizations. Based on the data types of the hardware function and the data access patterns, the interface between the PS and the hardware function is managed and automatically optimized. See Data Motion Optimization.

  • Remove any INTERFACE directives present in the hardware function.
  • Remove any DATA_PACK directives that reference variables present in the hardware function argument list.
  • Remove any of the Vivado HLS tool hardware data types by enclosing the top-level function in wrappers that use only native C/C++ types for the function arguments.

In this example, the functions to be accelerated are captured inside a single top-level hardware function, stereo_remap_bm.

int main() {
  unsigned char *inY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);
  unsigned short *inCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
  unsigned short *outCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
  unsigned char *outY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);

  // read double wide image from disk
  if (read_yuv_file(inY, DUALWIDTH, DUALWIDTH, HEIGHT, FILEINAME) != 0)
    return -1;
  convert_Y8toCY16(inY, inCY, HEIGHT*DUALWIDTH);

  stereo_remap_bm(inCY, outCY, HEIGHT, DUALWIDTH, DUALWIDTH);

  // write single wide image to disk
  convert_CY16toY8(outCY, outY, HEIGHT*DUALWIDTH);
  write_yuv_file(outY, DUALWIDTH, DUALWIDTH, HEIGHT, ONAME);

  sds_free(inY);
  sds_free(inCY);
  sds_free(outCY);
  sds_free(outY);
  return 0;
}

The key to optimizing the memory accesses to the hardware is to review the data types passed into the hardware function. Reviewing the function signature shows the key variable names to optimize: the input and output data streams img_data_lr and img_data_disp.

int stereo_remap_bm( yuv_t *img_data_lr, yuv_t *img_data_disp, int height, int dual_width, int stride);

Because the data is transferred in a sequential manner, first ensure that the access pattern is defined as SEQUENTIAL for both arguments. For the next optimization, ensure the data transfer is not interrupted by a scatter-gather DMA operation by specifying the mem_attribute as PHYSICAL_CONTIGUOUS|NON_CACHEABLE. This also requires that the memory is allocated with sds_alloc from sds_lib.

#include "sds_lib.h" int main() { unsigned char *inY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH); unsigned short *inCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2); unsigned short *outCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2); unsigned char *outY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH); }

Finally, the copy directive is used to ensure the data is explicitly copied to the accelerator, and that the data is not accessed from shared memory.

#pragma SDS data access_pattern(img_data_lr:SEQUENTIAL)
#pragma SDS data mem_attribute(img_data_lr:PHYSICAL_CONTIGUOUS|NON_CACHEABLE)
#pragma SDS data copy(img_data_lr[0:stride*height])
#pragma SDS data access_pattern(img_data_disp:SEQUENTIAL)
#pragma SDS data mem_attribute(img_data_disp:PHYSICAL_CONTIGUOUS|NON_CACHEABLE)
#pragma SDS data copy(img_data_disp[0:stride*height])
int stereo_remap_bm( yuv_t *img_data_lr, yuv_t *img_data_disp,
                     int height, int dual_width, int stride);

With these optimization directives, the memory access between the PS and PL is optimized for the most efficient transfers.

Stereo Vision Results

After the hardware function optimized with the Vivado HLS tool has been wrapped, as in this example, so that no HLS tool hardware data types are exposed at its interface, the interface directives removed, and the data transfers optimized, the hardware functions are recompiled and the performance is analyzed using event traces.

The following figure shows the complete view of the event traces, and all hardware functions and data transfers executing in parallel for the highest performing system.

Figure: Event Traces



To get the duration, hover over one of the lanes to display a popup window that shows the duration of the accelerator runtime. The execution time is 15.86 ms, which meets the 16.7 ms target necessary to achieve 60 frames per second for live video.