Optical Flow Hardware Function Optimization

The hardware functions also require optimization directives to execute at the highest level of performance. These are already present in the design example. Reviewing them highlights the lessons learned from Hardware Function Optimization Methodology. Most of the hardware functions in this design example are optimized primarily with the PIPELINE directive, in a manner similar to the function getOutPix. A review of the getOutPix function shows:

  • The sub-functions have an INLINE optimization applied to ensure that their logic is merged into the calling function. This happens automatically for small functions, but using the directive ensures that the sub-functions are always inlined, so there is no need to pipeline them separately.
  • The inner loop of the function getOutPix is the loop that processes data at the level of each pixel. It is optimized with the PIPELINE directive to ensure that it processes one pixel per clock cycle.

    pix_t getLuma (float fx, float fy, float clip_flowmag) {
    #pragma HLS inline
      float rad = sqrtf (fx*fx + fy*fy);
      if (rad > clip_flowmag) rad = clip_flowmag; // clamp to MAX
      rad /= clip_flowmag;                        // convert 0..MAX to 0.0..1.0
      pix_t pix = (pix_t) (255.0f * rad);
      return pix;
    }

    pix_t getChroma (float f, float clip_flowmag) {
    #pragma HLS inline
      if (f > clip_flowmag ) f = clip_flowmag;    // clamp big positive f to MAX
      if (f < (-clip_flowmag)) f = -clip_flowmag; // clamp big negative f to -MAX
      f /= clip_flowmag;                          // convert -MAX..MAX to -1.0..1.0
      pix_t pix = (pix_t) (127.0f * f + 128.0f);  // convert -1.0..1.0 to -127..127 to 1..255
      return pix;
    }

    void getOutPix (float* fx, float* fy, yuv_t* out_pix,
                    int height, int width, float clip_flowmag) {
      int pix_index = 0;
      for (int r = 0; r < height; r++) {
        for (int c = 0; c < width; c++) {
    #pragma HLS PIPELINE
          float fx_ = fx[pix_index];
          float fy_ = fy[pix_index];
          pix_t outLuma = getLuma (fx_, fy_, clip_flowmag);
          pix_t outChroma = (c&1)? getChroma (fy_, clip_flowmag) : getChroma (fx_, clip_flowmag);
          yuv_t yuvpix;
          yuvpix = ((yuv_t)outChroma << 8) | outLuma;
          out_pix[pix_index++] = yuvpix;
        }
      }
    }
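
The chroma mapping and the pixel packing performed in the inner loop can be checked in plain software before synthesis. The following is a minimal sketch rather than the design example's test bench. It assumes that pix_t is an 8-bit unsigned type and yuv_t a 16-bit unsigned type (neither typedef is shown in this section), the helper name pack_pixel is hypothetical, and the luma value is passed in directly so that the sketch does not depend on the math library. The HLS pragmas are omitted because they do not change the C behavior.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  pix_t;  /* assumption: 8-bit pixel component */
typedef uint16_t yuv_t;  /* assumption: 16-bit packed pixel   */

/* Same arithmetic as getChroma in the listing, without the pragma. */
pix_t getChroma(float f, float clip_flowmag) {
    assert(clip_flowmag > 0.0f);                 /* avoid dividing by zero */
    if (f > clip_flowmag)  f = clip_flowmag;     /* clamp to +MAX */
    if (f < -clip_flowmag) f = -clip_flowmag;    /* clamp to -MAX */
    f /= clip_flowmag;                           /* -1.0..1.0 */
    return (pix_t)(127.0f * f + 128.0f);         /* 1..255 */
}

/* Hypothetical helper: the packing step of the inner loop for column c.
 * Even columns carry chroma derived from fx, odd columns from fy. */
yuv_t pack_pixel(pix_t luma, float fx, float fy, int c, float clip_flowmag) {
    pix_t chroma = (c & 1) ? getChroma(fy, clip_flowmag)
                           : getChroma(fx, clip_flowmag);
    return (yuv_t)(((yuv_t)chroma << 8) | luma); /* chroma high byte, luma low byte */
}
```

For example, with clip_flowmag = 10 a flow component of 5 maps to chroma 191, and a component of 0 maps to the midpoint 128.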

If you examine the computeSum function, you will find examples of the ARRAY_PARTITION and DEPENDENCE directives. In this function, the ARRAY_PARTITION directive is used on array img1Win. Because img1Win is an array, it is implemented by default in a block RAM, which has a maximum of two ports. The following code summary shows that:

  • img1Win is used in a for-loop that is pipelined to process one sample per clock cycle.
  • img1Win is read from 8 + (KMEDP1-1) + (KMEDP1-1) times within the for-loop.
  • img1Win is written to (KMEDP1-1) + (KMEDP1-1) times within the for-loop.

    void computeSum(pix_t* f0Stream, pix_t* f1Stream, int* ixix_out, int* ixiy_out,
                    int* iyiy_out, int* dix_out, int* diy_out) {
      static pix_t img1Win [2 * KMEDP1], img2Win [1 * KMEDP1];
    #pragma HLS ARRAY_PARTITION variable=img1Win complete dim=0
      ...
      for (int r = 0; r < MAX_HEIGHT; r++) {
        for (int c = 0; c < MAX_WIDTH; c++) {
    #pragma HLS PIPELINE
          ...
          int cIxTopR = (img1Col_ [wrt] - img1Win [wrt*2 + 2-2]) / 2;
          int cIyTopR = (img1Win [(wrt+1)*2 + 2-1] - img1Win [(wrt-1)*2 + 2-1]) / 2;
          int delTopR = img1Win [wrt*2 + 2-1] - img2Win [wrt*1 + 1-1];
          ...
          int cIxBotR = (img1Col_ [wrb] - img1Win [wrb*2 + 2-2]) / 2;
          int cIyBotR = (img1Win [(wrb+1)*2 + 2-1] - img1Win [(wrb-1)*2 + 2-1]) / 2;
          int delBotR = img1Win [wrb*2 + 2-1] - img2Win [wrb*1 + 1-1];
          ...
          // shift windows
          for (int i = 0; i < KMEDP1; i++) {
            img1Win [i * 2] = img1Win [i * 2 + 1];
          }
          for (int i = 0; i < KMEDP1; ++i) {
            img1Win [i*2 + 1] = img1Col_ [i];
            ...
          }
          ...
        } // for c
      } // for r
      ...
    }

Because a block RAM supports a maximum of two accesses per clock cycle, these accesses cannot all be made in one clock cycle. As noted previously in the methodology, the ARRAY_PARTITION directive is used to partition the array into smaller blocks, in this case into individual elements by using the complete option. This enables parallel access to all elements of the array at the same time and ensures that the for-loop processes data every clock cycle.
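
The same pattern can be seen in a much smaller setting. The following is a minimal sketch, not code from the design example: the function window_sum and the constant WIN_LEN are hypothetical, and the pragmas follow the syntax used in the listing above. Each iteration of the pipelined loop shifts a small window and then reads every element, which is more than two accesses to the array per iteration, so the sketch partitions the window completely into individual elements.

```c
#include <assert.h>

#define WIN_LEN 4  /* hypothetical window length */

/* Running sum over the last WIN_LEN samples (zeros before the start). */
void window_sum(const int *in, int *out, int n) {
    int win[WIN_LEN] = {0};
#pragma HLS ARRAY_PARTITION variable=win complete dim=0
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
        /* Shift: WIN_LEN-1 reads and WIN_LEN-1 writes of win. */
        for (int k = 0; k < WIN_LEN - 1; k++) {
            win[k] = win[k + 1];
        }
        win[WIN_LEN - 1] = in[i];      /* one more write */

        /* Sum: WIN_LEN further reads of win in the same iteration. */
        int sum = 0;
        for (int k = 0; k < WIN_LEN; k++) {
            sum += win[k];
        }
        out[i] = sum;
    }
}
```

A standard C compiler ignores the HLS pragmas, so the function can be exercised in software first: for the input 1, 2, 3, 4, 5, 6 it produces 1, 3, 6, 10, 14, 18.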

The final optimization directive worth reviewing is the DEPENDENCE directive, which is applied to the array csIxix. The array is read from and then written to through different indices, as shown below, and these reads and writes occur within a pipelined loop.

    void computeSum(pix_t* f0Stream, pix_t* f1Stream, int* ixix_out, int* ixiy_out,
                    int* iyiy_out, int* dix_out, int* diy_out) {
      ...
      static int csIxix [MAX_WIDTH], csIxiy [MAX_WIDTH], csIyiy [MAX_WIDTH],
                 csDix [MAX_WIDTH], csDiy [MAX_WIDTH];
      ...
    #pragma HLS DEPENDENCE variable=csIxix inter WAR false
      ...
      int zIdx = - (KMED-2);
      int nIdx = zIdx + KMED-2;
      for (int r = 0; r < MAX_HEIGHT; r++) {
        for (int c = 0; c < MAX_WIDTH; c++) {
    #pragma HLS PIPELINE
          ...
          if (zIdx >= 0) {
            csIxixL = csIxix [zIdx];
            ...
          }
          ...
          csIxix [nIdx] = csIxixR;
          ...
          zIdx++;
          if (zIdx == MAX_WIDTH) zIdx = 0;
          nIdx++;
          if (nIdx == MAX_WIDTH) nIdx = 0;
          ...
        } // for c
      } // for r
      ...
    }

When a loop is pipelined in hardware, the accesses to the array overlap in time. The compiler analyzes all accesses to an array and issues a warning if any condition exists in which the write in iteration N overwrites the data for iteration N + K, thus changing the value. This potential conflict prevents the compiler from implementing the pipeline with an initiation interval (II) of 1.

The following shows an example of the read and write operations for a loop over multiple iterations for an array with indices 0 through 9. As in the code above, it is possible for the address counters to differ between the read and write operations and to return to zero before all loop iterations are complete. The operations are shown overlapped in time, just like a pipelined implementation.

    R4---------W8
      R5---------W9
        R6---------W0
          R7---------W1
            R8---------W2
              R9---------W3
                R0---------W4
                  R1---------W5
                    R2---------W6

In sequential C code, where each iteration completes before the next starts, the order in which the reads and writes occur is clear. In a concurrent hardware pipeline, however, the accesses can overlap and occur in a different order. As the example above shows, the read from index 8 (R8) can occur in time before the write to index 8 (W8), even though W8 belongs to an iteration that starts some iterations before R8.

The compiler warns of this, and the DEPENDENCE directive is used with the setting false to tell the compiler, "I, the user, state that it is okay to ignore this." This removes the write-after-read anti-dependence and allows the compiler to create pipelined hardware that performs with II = 1.
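
The access pattern can be reproduced in a small delay line. The following is a minimal sketch under stated assumptions, not code from the design example: the function delay_line and the constants BUF_LEN and DELAY are hypothetical, and the DEPENDENCE pragma copies the syntax of the listing above. The read index trails the write index by a fixed distance, and both wrap to zero like zIdx and nIdx, so an element is always read several iterations before it is overwritten. Whether asserting false is safe in a real design depends on that distance relative to the pipeline depth, which is exactly the knowledge the tool cannot derive on its own.

```c
#include <assert.h>

#define BUF_LEN 8  /* hypothetical circular buffer length   */
#define DELAY   3  /* hypothetical delay, must be < BUF_LEN */

/* out[i] = in[i - DELAY], or 0 while fewer than DELAY samples have been seen. */
void delay_line(const int *in, int *out, int n) {
    int buf[BUF_LEN] = {0};
    /* User assertion: a write never overtakes the earlier read of the same element. */
#pragma HLS DEPENDENCE variable=buf inter WAR false
    int rdIdx = -DELAY;  /* plays the role of zIdx: negative until data is valid */
    int wrIdx = 0;       /* plays the role of nIdx */
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
        int delayed = 0;
        if (rdIdx >= 0) {
            delayed = buf[rdIdx];      /* read through one index ...     */
        }
        buf[wrIdx] = in[i];            /* ... write through another index */
        out[i] = delayed;

        rdIdx++;
        if (rdIdx == BUF_LEN) rdIdx = 0;
        wrIdx++;
        if (wrIdx == BUF_LEN) wrIdx = 0;
    }
}
```

Because the pragma has no effect in a standard C compiler, running this function in software gives the reference behavior against which the pipelined hardware can be compared.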

The DEPENDENCE directive is typically used to inform the compiler of algorithm behaviors and conditions external to the function, of which it is unaware from static analysis of the code. If a DEPENDENCE directive is set incorrectly, the issue will be discovered in hardware emulation if the results from the hardware differ from those achieved with the software.
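
Such a comparison can be as simple as an element-by-element check of the two result buffers. The following is a minimal sketch; the function count_mismatches and its parameters are hypothetical and are not part of the design example or of any tool API.

```c
#include <assert.h>
#include <stddef.h>

/* Compare hardware results against the software reference.
 * Returns the number of differing elements; if first_bad is not NULL and
 * at least one element differs, *first_bad receives the first such index. */
size_t count_mismatches(const int *hw, const int *sw, size_t n, size_t *first_bad) {
    assert(hw != NULL && sw != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < n; i++) {
        if (hw[i] != sw[i]) {
            if (mismatches == 0 && first_bad != NULL) {
                *first_bad = i;        /* remember where the results diverge */
            }
            mismatches++;
        }
    }
    return mismatches;
}
```

A nonzero return value after adding or changing a DEPENDENCE directive is a hint to revisit the assertion made to the compiler.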