Migrating to a New Target Platform

This migration guide is intended for users who need to migrate their accelerated SDAccel™ environment application from one target platform to another. For example, moving an application from a Virtex® UltraScale+™ VCU1525 Acceleration Development Board to a U200 Acceleration Development Board.

The following topics are addressed as part of this guide:

  • An overview of the Design Migration Process including the physical aspects of FPGA devices.
  • Any changes to the host code and design constraints if a new release is used.
  • Controlling kernel placements and DDR interface connections.
  • Timing issues in the new shell, which might require additional options to achieve performance.

Design Migration

When migrating an application implemented in one target platform to another, it is important to understand the differences between the target platforms, and the impact those differences have on the design.

Key considerations:

  • Is there a change in the release?
  • Does the new target platform contain a different shell?
  • Do the kernels need to be redistributed across the Super Logic Regions (SLRs)?
  • Does the design meet the required frequency (timing) performance in the new platform?

The following diagram summarizes the migration flow described in this guide, and the topics to consider during the migration process.

Figure: Shell Migration Flowchart



IMPORTANT: Before starting to migrate a design, it is important to understand the architecture of an FPGA and the shell.

Understanding an FPGA Architecture

Before migrating any design to a new target platform, you should have a fundamental understanding of the FPGA architecture. The following diagram shows the floorplan of a Xilinx® FPGA device. The concepts to understand are:

  • SSI Devices
  • SLRs
  • SLR routing resources
  • Memory interfaces

Figure: Physical View of a Xilinx FPGA with Four SLR Regions



TIP: The FPGA floorplan shown above is for an SSI device with four SLRs, where each SLR contains a DDR memory interface.

Stacked Silicon Interconnect Devices

An SSI device is one in which multiple silicon dies are connected together via silicon interconnect and packaged into a single device. An SSI device enables high-bandwidth connectivity between multiple dies by providing a much greater number of connections. It also imposes much lower latency and consumes dramatically lower power than either a multiple-FPGA or a multi-chip module approach, while enabling the integration of massive quantities of interconnect logic, transceivers, and on-chip resources within a single package. The advantages of SSI devices are detailed in Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency.

Super Logic Region

An SLR is a single FPGA die slice contained in an SSI device. Multiple SLR components are assembled to make up an SSI device. Each SLR contains the active circuitry common to most Xilinx FPGA devices. This circuitry includes large numbers of:

  • LUTs
  • Registers
  • I/O Components
  • Gigabit Transceivers
  • Block Memory
  • DSP Blocks

One or more kernels may be implemented within an SLR. A single kernel may not be implemented across multiple SLRs.

SLR Routing Resources

The custom hardware implemented on the FPGA is connected via on-chip routing resources. There are two types of routing resources in an SSI device:

Intra-SLR Resources
Intra-SLR routing resources are the fast resources used to connect the hardware logic. The SDAccel environment automatically uses the optimal resources to connect the hardware elements when implementing kernels.
Super Long Line (SLL) Resources
SLLs are routing resources that run between SLRs and are used to connect logic from one region to the next. These routing resources are slower than intra-SLR routes. However, when a kernel is placed in one SLR and the DDR it connects to is in another, the SDAccel environment automatically implements dedicated hardware to use SLL routing resources without any impact on performance. More details on managing placement are provided in Modifying Kernel Placement.

Memory Interfaces

Each SLR contains one or more memory interfaces. These memory interfaces are used to connect to the DDR memory where the data in the host buffers is copied before kernel execution. Each kernel will read data from the DDR memory and write the results back to the same DDR memory. The memory interface connects to the pins on the FPGA and includes the memory controller logic.

Understanding Shells

In the SDAccel development environment, a shell is the hardware design that is implemented on the FPGA before any custom logic or accelerators are added. The shell defines the attributes of the FPGA used in the target platform and is composed of two regions:

  • Static region which contains kernel and device management logic.
  • Dynamic region where the custom logic of the accelerated kernels is placed.

The figure below shows an FPGA with the shell applied.

Figure: Shell on an FPGA with Four SLR Regions



The shell, which is a static region that cannot be modified by the user, contains the logic required to operate the FPGA and transfer data to and from the dynamic region. The static region, shown above in gray, might exist within a single SLR or, as in the above example, might span multiple SLRs. The static region contains:

  • DDR memory interface controllers
  • PCIe® interface logic
  • XDMA logic
  • Firewall logic, etc.

The dynamic region is the area shown in white above. This region contains all the reconfigurable components of the shell and is the region where all the accelerator kernels are placed.

Because the static region consumes some of the hardware resources available on the device, the custom logic to be implemented in the dynamic region can only use the remaining resources. In the example shown above, the shell defines that all four DDR memory interfaces on the FPGA can be used, which requires resources for the memory controller of each DDR interface.

Details on how much logic may be implemented in the dynamic region of each shell are provided in the SDx Environments Release Notes, Installation, and Licensing Guide. This topic is also addressed in Modifying Kernel Placement, later in this guide.

Migrating Releases

Before migrating to a new target platform, you should also determine if you need to target a different release of the SDAccel environment. If you do intend to target a new release, it is highly recommended to first target the existing platform using the new software release to confirm there are no changes required, and then migrate to the new target platform.

There are two steps to follow when targeting a new release with an existing platform:

  • Host Code Migration
  • Release Migration

IMPORTANT: Before migrating to a new release, it is recommended that you review the SDx Environments Release Notes, Installation, and Licensing Guide.

Host Code Migration

In the 2018.3 release of the SDAccel environment, there are some fundamental changes to how the Xilinx runtime (XRT) environment and shell(s) are installed. In previous releases, both the XRT environment and shell(s) were automatically installed with the SDAccel environment. This change has implications for the setup required to compile the host code.

Refer to the SDx Environments Release Notes, Installation, and Licensing Guide for details on the 2018.3 installation.

The XILINX_XRT environment variable is used to specify the location of the XRT environment and must be set before you compile the host code. When the XRT environment has been installed, the XILINX_XRT environment variable can be set by sourcing the /opt/xilinx/xrt/setup.csh or /opt/xilinx/xrt/setup.sh file, as appropriate. Secondly, ensure that your LD_LIBRARY_PATH variable also points to the XRT installation area.

To compile and run the host code, make sure you source the settings64.csh or settings64.sh file from the SDAccel installation.
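
The following is a minimal sketch of this setup for a bash shell, assuming the default /opt/xilinx/xrt XRT location; the SDAccel installation path shown is an illustrative assumption only:

# Set up the XRT environment; this defines XILINX_XRT and extends LD_LIBRARY_PATH
source /opt/xilinx/xrt/setup.sh
# Set up the SDAccel tools (installation path is hypothetical)
source /tools/Xilinx/SDx/2018.3/settings64.sh
# Confirm the variables point at the expected locations
echo $XILINX_XRT
echo $LD_LIBRARY_PATH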

If you are using the GUI, it will automatically incorporate the new XRT location and generate the makefile when you build your project.

However, if you are using your own custom makefile, you need to make the following changes (a sketch follows this list):

  • In your makefile, do not use the XILINX_SDX environment variable, which was used in prior releases.
  • The XILINX_SDX variables and paths must be updated to the XILINX_XRT environment variable:
    • Include directories are now specified as: -I${XILINX_XRT}/include and -I${XILINX_XRT}/include/CL
    • Library path is now: -L${XILINX_XRT}/lib
    • The OpenCL™ library is libxilinxopencl.so, so use -lxilinxopencl in your makefile
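
As an example, a minimal host compile line using the flags above might look as follows; the compiler choice, source file, and output name are illustrative assumptions, not part of this guide:

# Hypothetical host build line; only the -I/-L/-l flags come from this guide
g++ host.cpp -o host.exe \
    -I${XILINX_XRT}/include \
    -I${XILINX_XRT}/include/CL \
    -L${XILINX_XRT}/lib \
    -lxilinxopencl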

Release Migration

After migrating the host code, build the code on the existing target platform using the new release of the SDAccel development environment. Verify that you can run the project in the SDAccel environment using the new release, make sure it completes successfully, and confirm that it meets the timing requirements.

Issues which can occur when using a new release are:

  • Changes to C libraries or library files.
  • Changes to kernel path names.
  • Changes to the HLS pragmas or pragma options embedded in the kernel code.
  • Changes to C/C++/OpenCL compiler support.
  • Changes to the performance of kernels: this may require adjustments to the pragmas in the existing kernel code.

Address these issues using the same techniques you would use during the development of any kernel. At this stage, ensure the throughput performance of the target platform using the new release meets your requirements. If there are changes to the final timing (the maximum clock frequency), you can address these when you have moved to the new target platform. This is covered in Address Timing.

Modifying Kernel Placement

The primary issue when targeting a new platform is ensuring that an existing kernel placement will work in the new target platform. Each target platform has an FPGA defined by a shell. As shown in the figure below, the shell(s) can be different.

  • The shell of the original platform on the left has four SLRs, and the static region is spread across all four SLRs.
  • The shell of the target platform on the right has only three SLRs, and the static region is fully contained in SLR1.

Figure: Comparison of Shells of the Hardware Platform



This section explains how to modify the placement of the kernels.

Implications of a New Hardware Platform

The figure below highlights the issue of kernel placement when migrating to a new target platform, or shell. In the example below:

  • The existing kernel, kernel_B, is too large to fit into SLR1 of the new target platform because most of that SLR is consumed by the static region.
  • The existing kernel, kernel_D, must be relocated to a new SLR because the new target platform does not have four SLRs like the existing platform.

Figure: Migrating Platforms – Kernel Placement



When migrating to a new platform, you need to take the following actions:

  • Understand the resources available in each SLR of the new target platform, as documented in the SDx Environments Release Notes, Installation, and Licensing Guide.
  • Understand the resources required by each kernel in the design.
  • Use the xocc linker options (--slr and --sp) to specify which SLR each kernel is placed in, and which DDR bank each kernel connects to.

These items are addressed in the remainder of this section.

Determining Where to Place the Kernels

To determine where to place kernels, two pieces of information are required:

  • Resources available in each SLR of the shell of the hardware platform (.dsa).
  • Resources required for each kernel.

With these two pieces of information you will then determine which kernel or kernels can be placed in each SLR of the shell.

Keep in mind when performing these calculations that 10% of the available resources can be used by system infrastructure (a worked example follows this list):

  • Infrastructure logic can be used to connect a kernel to a DDR interface if it has to cross an SLR boundary.
  • In an FPGA, signal routing also requires resources, so it is never possible to use 100% of the available resources.
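
As a worked example, assuming the SLR 0 figures from Table 1 below, a 10% infrastructure budget leaves roughly:

388K CLB LUTs × 0.9 ≈ 349K LUTs for kernel logic
720 block RAM tiles × 0.9 = 648 tiles for kernel logic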

Available SLR Resources

The resources available in each SLR of the target platforms provided by Xilinx can be found in the SDx Environments Release Notes, Installation, and Licensing Guide. The table below shows an example shell. In this example you can see:

  • The SLR description indicates which SLR contains static and/or dynamic regions.
  • The resources available in each SLR (LUTs, Registers, RAM, etc.) are listed.

This allows you to determine what resources are available in each SLR.

Table 1. SLR Resources of a Hardware Platform

| Area | SLR 0 | SLR 1 | SLR 2 |
|---|---|---|---|
| SLR description | Bottom of device; dedicated to dynamic region. | Middle of device; shared by dynamic and static region resources. | Top of device; dedicated to dynamic region. |
| Dynamic region pblock name | pfa_top_i_dynamic_region_pblock_dynamic_SLR0 | pfa_top_i_dynamic_region_pblock_dynamic_SLR1 | pfa_top_i_dynamic_region_pblock_dynamic_SLR2 |
| Compute unit placement syntax | set_property CONFIG.SLR_ASSIGNMENTS SLR0 [get_bd_cells] | set_property CONFIG.SLR_ASSIGNMENTS SLR1 [get_bd_cells] | set_property CONFIG.SLR_ASSIGNMENTS SLR2 [get_bd_cells] |
| Global memory resources available in dynamic region | | | |
| Memory channels; system port name | bank0 (16 GB DDR4) | bank1 (16 GB DDR4, in static region); bank2 (16 GB DDR4, in dynamic region) | bank3 (16 GB DDR4) |
| Approximate available fabric resources in dynamic region | | | |
| CLB LUT | 388K | 199K | 388K |
| CLB Register | 776K | 399K | 776K |
| Block RAM Tile | 720 | 420 | 720 |
| UltraRAM | 320 | 160 | 320 |
| DSP | 2280 | 1320 | 2280 |

Kernel Resources

The resources for each kernel can be obtained from the System Estimate report.

The System Estimate report is available in the Assistant view after either the Hardware Emulation or System run completes. An example of this report is shown below.

Figure: System Estimate Report



  • FF refers to the CLB Registers noted in the platform resources for each SLR.
  • LUT refers to the CLB LUTs noted in the platform resources for each SLR.
  • DSP refers to the DSPs noted in the platform resources for each SLR.
  • BRAM refers to the block RAM Tile noted in the platform resources for each SLR.

This information can help you determine the proper SLR assignments for each kernel.

Assigning Kernels to SLRs

Each kernel in a design can be assigned to an SLR region using the xocc --slr command line option. When placing kernels, it is recommended to also assign the specific DDR memory bank that the kernel will connect to, using the xocc --sp command line option. The following example demonstrates these two command line options.

The figure below shows an example where the existing target platform shell has four SLRs, and the new target platform has a shell with three SLRs, and the static region is also structured differently between the target platforms. In this migration example:

  • Kernel_A is mapped to SLR0.
  • Kernel_B, which no longer fits in SLR1, is remapped to SLR0, where there are available resources.
  • Kernel_C is mapped to SLR2.
  • Kernel_D is remapped to SLR2, where there are available resources.

The kernel mappings are illustrated in the figure below.

Figure: Mapping of Kernels Across SLRs



Specifying Kernel Placement

For the above example, the kernels are placed using the following xocc command options:

xocc --slr kernel_A:SLR0 \
     --slr kernel_B:SLR0 \
     --slr kernel_C:SLR2 \
     --slr kernel_D:SLR2
With these command line options, each of the kernels is placed as shown in the figure above.

Specifying Kernel DDR Interfaces

You should also specify the kernel DDR memory interface when specifying kernel placements. Specifying the DDR interface ensures the automatic pipelining of kernel connections to a DDR interface in a different SLR. This avoids any degradation in timing that could reduce the maximum clock frequency.

In this example, using the kernel placements in the above figure:

  • Kernel_A is connected to Memory Bank 0.
  • Kernel_B is connected to Memory Bank 1.
  • Kernel_C is connected to Memory Bank 2.
  • Kernel_D is connected to Memory Bank 1.

The following xocc command line makes these connections:

xocc --sp kernel_A.arg1:bank0 \
     --sp kernel_B.arg1:bank1 \
     --sp kernel_C.arg1:bank2 \
     --sp kernel_D.arg1:bank1

IMPORTANT: When using the --sp option to assign kernel ports to memory banks, you must specify the --sp option for all interfaces/ports of the kernel, as illustrated below. Refer to "Customization of DDR Bank to Kernel Connection" in the SDAccel Environment Programmers Guide for more information.
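
For example, if kernel_A had a second memory-mapped argument, arg2 (a hypothetical port name used here only for illustration), both ports would need an explicit assignment:

# Every memory-mapped port of kernel_A gets its own --sp entry
xocc --sp kernel_A.arg1:bank0 \
     --sp kernel_A.arg2:bank0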

Address Timing

Perform a system run; if it completes with no timing violations, the migration is successful.

If timing has not been met, you might need to specify custom constraints to help close timing. Refer to the UltraFast Design Methodology Guide for the Vivado Design Suite (UG949) for more information on meeting timing.

Custom Constraints

Custom placement and timing constraints are passed to the Vivado® tools using the xocc --xp option. Custom Tcl constraints for floorplanning of the kernels will need to be reviewed in the context of the new target platform (.dsa). For example, if a kernel was moved to a different SLR in the new shell, the corresponding placement constraints for that kernel will also need to be modified.
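
As a sketch of how such constraints can be applied, a Tcl file can be hooked into the Vivado implementation run through a run property; the file name below is hypothetical, and the exact property to use should be checked against the xocc documentation for your release:

# Apply constraints.tcl (hypothetical file) before opt_design during linking
xocc --xp "vivado_prop:run.impl_1.STEPS.OPT_DESIGN.TCL.PRE=constraints.tcl" ...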

In general, timing is expected to be comparable between different target platforms that are based on the same Virtex UltraScale+ 9P device. Any custom Tcl constraints for timing closure will need to be evaluated and might need to be modified for the new platform.

Additionally, any non-default options that are passed to xocc, or to the Vivado tools using the xocc --xp switch, will need to be updated for the new shell.

Timing Closure Considerations

Design performance and timing closure can vary when moving across SDx™ releases or shell(s), especially when one of the following conditions is true:

  • Floorplan constraints were needed to close timing.
  • Device or SLR resource utilization was higher than the typical guideline:
    • LUT utilization was higher than 70%
    • DSP, RAMB, and UltraRAM utilization was higher than 80%
    • FD (flip-flop) utilization was higher than 50%
  • High effort compilation strategies were needed to close timing.

The utilization guidelines provide a threshold above which compilation of the design can take longer, or performance can be lower than initially estimated. For larger designs, which usually require more than one SLR, specify the kernel/DDR association on the xocc command line, while verifying that any floorplan constraint ensures the following:
  • The utilization of each SLR is below the recommended guidelines.
  • The utilization is balanced across SLRs if one type of hardware resource needs to be higher than the guideline.

For designs with overall high utilization, increasing the amount of pipelining in the kernels, at the cost of higher latency, can greatly help timing closure and achieve higher performance.

To quickly review all the aspects listed above, use the fail-fast reports generated throughout the SDx flow when using one of the following two options:
  • xocc -R 1
    • report_failfast is run at the end of each kernel synthesis step
    • report_failfast is run after opt_design on the entire design
    • The opt_design DCP is saved
  • xocc -R 2
    • Same reports as with -R 1, plus:
    • report_failfast is run post-placement for each SLR
    • Additional reports and intermediate DCPs are generated
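
For example, the fail-fast reports can be requested during the hardware link step; the platform, kernel object, and output names below are placeholders used only for illustration:

# -R 2 enables the most detailed fail-fast reporting during linking
xocc -l -R 2 --platform <target_platform> -o binary.xclbin kernel.xo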

All reports and DCPs can be found in the implementation directory, including kernel synthesis reports:

/_x/link/vivado/prj/prj.runs/impl_1

For more information about timing closure and the fail-fast report, see the UltraFast Design Methodology Timing Closure Quick Reference Guide (UG1292).