RTL Kernels

Many hardware engineers have existing RTL IP (including Vivado® IP integrator based designs), or just feel comfortable implementing a kernel in RTL and developing it using Vivado. SDAccel™ allows RTL designs to be used; however, they must adhere to the software and hardware requirements to be used within the tool flow and runtime library.

TIP: RTL kernels should be written, designed, and tested using the recommendations in the UltraFast Design Methodology Guide for the Vivado Design Suite (UG949).

Requirements for Using an RTL Design as an RTL Kernel

An RTL design must meet both interface and software requirements to be used as an RTL kernel within the SDAccel framework.

It might be necessary to add to or modify the original RTL design to meet these requirements, which are outlined in the following sections.

Kernel Interface Requirements

To satisfy the SDAccel execution model, an RTL kernel must adhere to the following interface requirements:

  • One and only one AXI4-Lite interface is used to access control signals and pass arguments.
  • At least one of the following interfaces (can have both):
    • AXI4 master interface to communicate with memory.
    • AXI4-Stream interface to communicate between the kernels and/or with the host.
      Note: With the ap_ctrl_none kernel control interface option, AXI4 master interfaces cannot be used.
  • At least one clock interface port.

The various interface requirements are summarized in the following table.

Note: In some instances the port names must be written exactly.

Table 1. RTL Kernel Interface and Port Requirements

Port or Interface | Description | Comment
ap_clk | Primary clock input port | Name must be exact. Required port.
ap_clk_2 | Secondary optional clock input port | Name must be exact. Optional port.
ap_rst_n | Primary active-Low reset input port | Name must be exact. Optional port. This signal should be internally pipelined to improve timing. This signal is driven by a synchronous reset in the ap_clk clock domain.
ap_rst_n_2 | Secondary optional active-Low reset input port | Name must be exact. Optional port. This signal should be internally pipelined to improve timing. This signal is driven by a synchronous reset in the ap_clk_2 clock domain.
interrupt | Active-High interrupt | Name must be exact. Optional port. Port must be omitted if it is unused.
s_axi_control | One and only one AXI4-Lite slave control interface | Name must be exact; case sensitive. Required port interface.
AXI4_MASTER | One or more AXI4 master interfaces for global memory access | All AXI4 master interfaces must have 64-bit addresses. The kernel developer is responsible for partitioning global memory spaces. Each partition in the global memory becomes a kernel argument. The memory offset for each partition must be set by a control register programmable via the AXI4-Lite slave interface. AXI4 masters must not use Wrap or Fixed burst types and must not use narrow (sub-size) bursts, meaning AxSIZE should match the width of the AXI data bus. Any user logic or RTL code that does not conform to the requirements above must be wrapped or bridged to satisfy these requirements.
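
The AXI4 master burst rules above can be expressed as a small check. The following is a minimal sketch in C, not part of the SDAccel tools: the function names are hypothetical, and the AxBURST and AxSIZE encodings are taken from the AMBA AXI4 protocol (AxSIZE is the base-2 logarithm of the bytes per beat).

```c
#include <stdbool.h>
#include <stdint.h>

/* AxBURST encodings as defined by the AMBA AXI4 protocol. */
enum axi_burst { AXI_BURST_FIXED = 0, AXI_BURST_INCR = 1, AXI_BURST_WRAP = 2 };

/* AxSIZE encodes the bytes per beat as a power of two: bytes = 1 << AxSIZE.
   Returns the AxSIZE value that matches a data bus of bus_bytes bytes. */
static unsigned axsize_for_bus_bytes(unsigned bus_bytes)
{
    unsigned axsize = 0;
    while ((1u << axsize) < bus_bytes)
        axsize++;
    return axsize;
}

/* Hypothetical checker: true when a burst follows the rules listed above,
   that is, INCR bursts only, full-width beats, and 64-bit addresses. */
static bool burst_is_allowed(enum axi_burst burst, unsigned axsize,
                             unsigned bus_bytes, unsigned addr_bits)
{
    return burst == AXI_BURST_INCR &&
           axsize == axsize_for_bus_bytes(bus_bytes) &&
           addr_bits == 64;
}
```

For a 64-byte (512-bit) data bus, the only accepted AxSIZE in this sketch is 6.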

Kernel Software Requirements

RTL kernels have the same software interface model as OpenCL™ and C/C++ kernels. That is, they are seen by the host application as functions with a void return value, scalar arguments, and pointer arguments. For instance:

void mmult(unsigned int length, int *a, int *b, int *output)

The SDAccel execution model dictates the following:

  • Scalar arguments are directly written to the kernel through an AXI4-Lite slave interface.
  • Pointer arguments are transferred to/from memory.
  • Kernels are expected to read/write data in global memory through one or more AXI4 memory-mapped interfaces and/or stream data directly between the host and kernel.
  • Kernels are controlled by the host application through the control register (shown below), which is accessed through the AXI4-Lite slave interface.

If the RTL design has a different execution model, it must be adapted so that it can be operated in this manner.

The following table defines the required register map for a kernel to be used within the SDAccel environment. The control register is required by all kernels, while the interrupt-related registers are only required for designs with interrupts. All user-defined registers must begin at location 0x10; locations below this are reserved.

Table 2. Address Map

Address | Name | Description
0x0 | Control | Controls and provides kernel status.
0x4 | Global Interrupt Enable | Used to enable interrupt to the host.
0x8 | IP Interrupt Enable | Used to control which IP-generated signals are used to generate an interrupt.
0xC | IP Interrupt Status | Provides interrupt status.
0x10 | Kernel arguments start at address 0x10 | Includes scalars and global memory arguments.
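
As an illustration, the address map can be captured as constants in a C header. This is a minimal sketch: the macro and function names are hypothetical, and the 4-byte alignment check is an assumption based on 32-bit AXI4-Lite registers rather than something stated in the table.

```c
#include <stdint.h>

/* Register offsets taken from the address map above. */
#define KRNL_REG_CTRL      0x00u  /* Control: controls and provides kernel status */
#define KRNL_REG_GIE       0x04u  /* Global Interrupt Enable */
#define KRNL_REG_IP_IER    0x08u  /* IP Interrupt Enable */
#define KRNL_REG_IP_ISR    0x0Cu  /* IP Interrupt Status */
#define KRNL_REG_ARGS_BASE 0x10u  /* first location available for kernel arguments */

/* Hypothetical helper: nonzero when an offset may hold a user-defined register.
   Locations below 0x10 are reserved; word alignment is an assumption. */
static int krnl_offset_is_user_register(uint32_t offset)
{
    return offset >= KRNL_REG_ARGS_BASE && (offset % 4u) == 0u;
}
```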

The definition of the Control register bits differs depending on which of the following three modes of operation the kernel uses (see Vivado Design Suite User Guide: High-Level Synthesis (UG902) for detailed descriptions of these modes of operation).

  • ap_ctrl_none (that is, free-running kernels)
  • ap_ctrl_hs (that is, sequential kernels)
  • ap_ctrl_chain (that is, pipelined kernels)

The developer chooses the mode of operation of the kernel by complying with the definition of the respective Control registers defined below.

ap_ctrl_none

For ap_ctrl_none mode, the kernel starts as soon as it is out of reset and never stops. This mode is for streaming kernels only (see Streaming Interfaces).

Table 3. Control (0x0) in ap_ctrl_none Mode

Bit | Name | Description
31:0 | Reserved | Reserved

ap_ctrl_hs

In ap_ctrl_hs mode, the driver writes a 1 to ap_start and waits for both ap_start to be deasserted (guaranteeing the input data is fully processed) and ap_done to be asserted (guaranteeing the output data is fully produced). The definition of the Control register bits under this mode of operation is given in the following table.

The ap_ctrl_hs kernels can and should only be restarted after they are done. They do not allow pipelined/overlapping execution.

Table 4. Control (0x0) in ap_ctrl_hs Mode

Bit | Name | Description
0 | ap_start | Asserted by host when kernel can start processing data. Cleared by kernel on handshake with ap_ready.
1 | ap_done | Asserted by kernel when it has finished producing output data. Cleared on read by host.
2 | ap_idle | Asserted by kernel when it is idle (deprecated).
3 | ap_ready | Asserted by kernel when it has finished processing input data. Self-cleared immediately.
6:4 | Reserved | Reserved
7 | auto_restart (see note 1) | If asserted, ap_start is held asserted by the kernel. Read/write access by the host.
31:8 | Reserved | Reserved

  1. The auto_restart bit is not used by the Xilinx Runtime.
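
The ap_ctrl_hs handshake can be sketched in C against a software stand-in for the Control register. Everything here is hypothetical and for illustration only: the struct, the function names, and the simplification that the mock kernel raises ap_ready and ap_done at the same step. A real driver would access the register over the AXI4-Lite interface instead.

```c
#include <stdint.h>

/* Control register (0x0) bits used in ap_ctrl_hs mode. */
#define AP_START (1u << 0)
#define AP_DONE  (1u << 1)

/* Hypothetical software stand-in for a kernel's Control register. */
struct hs_kernel_model {
    uint32_t ctrl;        /* currently asserted Control bits */
    unsigned polls_left;  /* host reads remaining until the mock kernel finishes */
};

/* Host write: setting ap_start launches the mock kernel. */
static void hs_write_ctrl(struct hs_kernel_model *k, uint32_t value)
{
    if (value & AP_START)
        k->ctrl |= AP_START;
}

/* Host read: advances the mock kernel one step, then returns Control. */
static uint32_t hs_read_ctrl(struct hs_kernel_model *k)
{
    uint32_t value;

    if (k->ctrl & AP_START) {
        if (k->polls_left > 0)
            k->polls_left--;
        if (k->polls_left == 0) {
            k->ctrl &= ~AP_START;  /* cleared by kernel on the ap_ready handshake */
            k->ctrl |= AP_DONE;    /* output data fully produced */
        }
    }
    value = k->ctrl;
    k->ctrl &= ~AP_DONE;           /* ap_done is cleared on read by the host */
    return value;
}

/* Driver sequence: write ap_start, then poll until ap_start is deasserted
   and ap_done is asserted. Returns the number of polls that were needed. */
static unsigned hs_run_once(unsigned latency_polls)
{
    struct hs_kernel_model k = { 0u, latency_polls };
    unsigned polls = 0;
    uint32_t status;

    hs_write_ctrl(&k, AP_START);
    do {
        status = hs_read_ctrl(&k);
        polls++;
    } while ((status & AP_START) || !(status & AP_DONE));
    return polls;
}
```

Because ap_done is cleared on read, the driver in this sketch tests both conditions on the same read value rather than reading the register twice.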

ap_ctrl_chain

In the ap_ctrl_chain mode of operation, the driver asserts ap_start and waits for either:

  • ap_start to be deasserted (guaranteeing the input data is fully processed), which allows the next batch to be started, or
  • ap_done to be asserted (guaranteeing the output data is fully produced), after which the driver asserts ap_continue (to allow the kernel to continue operation).

This mode is recommended if pipelined execution is desired. The definition of the Control register bits for ap_ctrl_chain mode is given in the following table.

Table 5. Control (0x0) in ap_ctrl_chain Mode

Bit | Name | Description
0 | ap_start | Asserted by host when kernel can start processing data. Cleared by kernel on handshake with ap_ready.
1 | ap_done | Asserted by kernel when it has finished producing output data. Cleared by kernel on handshake with ap_continue.
2 | ap_idle | Asserted by kernel when it is idle (deprecated).
3 | ap_ready | Asserted by kernel when it has finished processing input data. Self-cleared immediately.
4 | ap_continue | Asserted by host when kernel can proceed with operation. Cleared immediately by kernel.
6:5 | Reserved | Reserved
7 | auto_restart (see note 1) | If asserted, ap_start and ap_continue are held asserted. Read/write access by the host.
31:8 | Reserved | Reserved

  1. The auto_restart bit is not used by the Xilinx Runtime.
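
The two wait conditions in ap_ctrl_chain mode can be sketched as a host-side decision on each read of the Control register. The enum, the function names, and the choice to service ap_done before starting a new batch are assumptions made for illustration; they are not taken from the Xilinx Runtime.

```c
#include <stdint.h>

/* Control register (0x0) bits used in ap_ctrl_chain mode. */
#define AP_START    (1u << 0)
#define AP_DONE     (1u << 1)
#define AP_CONTINUE (1u << 4)

/* Hypothetical host-side actions after reading the Control register. */
enum chain_action { CHAIN_WAIT, CHAIN_START_NEXT, CHAIN_CONTINUE };

/* Decide what the driver may do next, given one Control register value. */
static enum chain_action chain_next_action(uint32_t ctrl)
{
    if (ctrl & AP_DONE)
        return CHAIN_CONTINUE;    /* output produced: acknowledge with ap_continue */
    if (!(ctrl & AP_START))
        return CHAIN_START_NEXT;  /* input consumed: the next batch can be started */
    return CHAIN_WAIT;            /* kernel is still consuming the current input */
}

/* Value the driver would write to the Control register for an action. */
static uint32_t chain_ctrl_write_value(enum chain_action action)
{
    switch (action) {
    case CHAIN_CONTINUE:   return AP_CONTINUE;
    case CHAIN_START_NEXT: return AP_START;
    default:               return 0u;
    }
}
```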

Interrupt Registers

The following interrupt-related registers are only required if the kernel has an interrupt.

Table 6. Global Interrupt Enable (0x4)

Bit | Name | Description
0 | Global Interrupt Enable | When asserted by the host along with any of the IP Interrupt Enable bits, the interrupt is enabled. Read/write access by the host.
31:1 | Reserved | Reserved

Table 7. IP Interrupt Enable (0x8)

Bit | Name | Description
0 | Channel 0 (ap_done) | When asserted along with the Global Interrupt Enable bit, the interrupt is asserted on ap_done assertion. Read/write access by host, and read only by kernel.
1 | Channel 1 (ap_ready) | When asserted along with the Global Interrupt Enable bit, the interrupt is asserted on ap_ready assertion. Read/write access by host, and read only by kernel.
31:2 | Reserved | Reserved

Table 8. IP Interrupt Status (0xC)

Bit | Name | Description
0 | Channel 0 (ap_done) | Kernel asserts this interrupt status bit when an interrupt is asserted due to ap_done. If you disable interrupt on ap_done, this bit is never asserted by the kernel. Host must clear this bit by writing 1.
1 | Channel 1 (ap_ready) | Kernel asserts this interrupt status bit when an interrupt is asserted due to ap_ready. If you disable interrupt on ap_ready, this bit is never asserted by the kernel. Host must clear this bit by writing 1.
31:2 | Reserved | Reserved

Interrupt

RTL kernels can optionally have an interrupt port containing a single interrupt. The port must be named interrupt and be active-High. It is enabled when both the Global Interrupt Enable (GIE) and Interrupt Enable Register (IER) bits are asserted. Further, the interrupt is cleared only when writing a one to asserted bits of the IP Interrupt Status Register.
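
The interrupt behavior described above can be modeled in a few lines of C. This is a minimal sketch with hypothetical names. It assumes that a status bit is latched only when its enable bit is set, and it models only the documented case of writing a one to an asserted status bit.

```c
#include <stdint.h>

#define IRQ_CH_AP_DONE  (1u << 0)  /* channel 0 */
#define IRQ_CH_AP_READY (1u << 1)  /* channel 1 */
#define IRQ_CH_MASK     (IRQ_CH_AP_DONE | IRQ_CH_AP_READY)

/* Hypothetical model of the three interrupt registers. */
struct irq_regs {
    uint32_t gie;  /* Global Interrupt Enable (0x4), bit 0 */
    uint32_t ier;  /* IP Interrupt Enable (0x8) */
    uint32_t isr;  /* IP Interrupt Status (0xC) */
};

/* Kernel-side event: a status bit is set only if its channel is enabled. */
static void irq_raise(struct irq_regs *r, uint32_t channel)
{
    r->isr |= channel & r->ier & IRQ_CH_MASK;
}

/* Host write to the status register: a 1 clears an asserted bit. */
static void irq_write_isr(struct irq_regs *r, uint32_t value)
{
    r->isr &= ~(value & IRQ_CH_MASK);
}

/* Level of the active-High interrupt port. */
static int irq_level(const struct irq_regs *r)
{
    return (r->gie & 1u) && (r->isr & r->ier & IRQ_CH_MASK) != 0u;
}

/* Demo: raise ap_done, sample the port, clear the status, sample again.
   Bit 0 of the result is the level after the event, bit 1 after the clear. */
static int irq_demo_sequence(uint32_t gie, uint32_t ier)
{
    struct irq_regs r = { gie, ier, 0u };
    int result;

    irq_raise(&r, IRQ_CH_AP_DONE);
    result = irq_level(&r);
    irq_write_isr(&r, IRQ_CH_AP_DONE);
    result |= irq_level(&r) << 1;
    return result;
}
```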

If adding an interrupt port to the kernel, the kernel.xml file needs to be updated with this information. The kernel.xml file is generated automatically when using the RTL Kernel Wizard. For details on updating the file, see Create Kernel Description XML File.

RTL Kernel Wizard

The RTL Kernel Wizard automates some of the steps that need to be taken to ensure that the RTL IP is packaged into a kernel that can be integrated into a system in SDAccel.

The benefits of the wizard are:

  • Automates some of the steps that must be taken to ensure that the RTL IP is packaged into a kernel that can be integrated into a system in SDAccel.
  • Steps you through the process of specifying your software function model and interface model for the RTL kernel.
  • Generates an RTL wrapper for the kernel that meets the RTL kernel interface requirements, based on the interface information provided.
  • Automatically generates the AXI4-Lite interface module, including the control logic and register file. The AXI4-Lite interface module is included in the generated top-level RTL kernel wrapper.
  • Includes in the wrapper an example kernel IP module that you need to replace with your RTL IP design. The RTL IP developer must ensure correct connectivity between the RTL IP and the wrapper template.
  • Generates a kernel.xml file to match the software function prototype and behavior specified in the wizard.

The RTL Kernel Wizard generates a Vivado project containing an example design consisting of a simple adder RTL IP, called VADD. In addition, it generates an associated RTL wrapper matching the desired interface, control logic, and register map (described above) based on the user input to the wizard. You can use this wrapper to wrap your RTL IP into an RTL kernel accessible by the SDAccel framework.

Note: It is not required to use the code generated by the wizard. You can completely generate your own RTL kernel as long as it meets the software and interface requirements outlined above.

If you do use the generated wrapper, you need to replace the generated RTL IP (VADD) with your RTL IP and connect it to the wrapper.

The connections include clock(s), reset(s), the AXI4-Lite interface, memory interfaces, and optionally streaming interfaces. The number of connections is based on the interface information provided to the kernel wizard (for example, choosing two AXI4 memory interfaces). It is necessary to manually make these connections to your IP and validate the design.

The RTL Kernel Wizard generates a Vivado project for the top-level RTL kernel wrapper and the generated files. This enables you to easily update and optimize the RTL kernel.

Furthermore, the RTL Kernel Wizard also generates a simple test bench for the generated RTL kernel wrapper and sample host code to exercise the example RTL kernel. This example test bench and host code must be modified to test your RTL IP design accordingly.

Using the RTL Kernel Wizard is described in the following subsections.

Launching the RTL Kernel Wizard

The RTL Kernel Wizard can be launched with two different methods: from the SDx™ Development Environment or from the Vivado Integrated Design Environment (IDE). The SDx Development Environment provides a more seamless experience by automatically importing the generated kernel/example host code back into the SDx project.

To launch the RTL Kernel Wizard from the SDx Development Environment, perform the following:
  1. Launch the SDx Development Environment.
  2. Create an SDx Project (Application Project Type).
  3. Click Xilinx > RTL Kernel Wizard.

To launch the RTL Kernel Wizard from the Vivado IDE, perform the following:

  1. Create a new Vivado project, choosing the same device as exists on the platform you intend to target. If you do not know your target device, choose the default part.
  2. Go to the IP catalog by clicking the IP catalog button.
  3. Type wizard in the IP catalog search box.
  4. Double-click SDx Kernel Wizard to launch the wizard.
Note: Use Vivado from the SDx install so the tool versions are the same.

Using the RTL Kernel Wizard

The wizard is organized into pages that break down the process of creating a kernel into smaller steps. To navigate between pages, click Next and Back. To finalize the kernel and build a project based on the inputs of the wizard, click OK. The following sections describe each page and its input options.

RTL Kernel Wizard General Settings

The following graphic shows the three settings in the General Settings tab.

Figure:RTL Kernel Wizard General Settings

Kernel Identification

The following are three settings in the General Settings tab.

Kernel name
The kernel name. This will be the name of the IP, top-level module name, kernel, and C/C++ functional model. This identifier must conform to C and Verilog identifier naming rules. It must also conform to Vivado IP integrator naming rules, which prohibit underscores except when placed between alphanumeric characters.
Kernel vendor
The name of the vendor. Used in the Vendor/Library/Name/Version (VLNV) format described in the Vivado Design Suite User Guide: Designing with IP (UG896).
Kernel library
The name of the library. Used in the VLNV. Must conform to the same identifier rules.
Kernel Options
Kernel type
An RTL kernel type consists of a Verilog RTL top-level module with a Verilog control register module and a Verilog kernel example inside the top-level module. The block design kernel type also delivers a Verilog RTL top-level module, but instead it instantiates an IP integrator block diagram inside of a Verilog RTL top-level module. The block design consists of a MicroBlaze™ subsystem that uses a block RAM exchange memory to emulate the control registers. Example MicroBlaze software is delivered with the project to demonstrate using the MicroBlaze to control the kernel.
Kernel control interface
Selects the kernel mode of operation. Choices include ap_ctrl_hs, ap_ctrl_none, and ap_ctrl_chain. For more information, see Kernel Software Requirements.
Enable MicroBlaze debug (Only available on select configurations)
Adds a MicroBlaze Debug Module (MDM) to a Block Design kernel type example. The boundary scan interface of the MDM module is connected to the top level of the kernel. The debug interface is connected to the MicroBlaze instance. This option is only available for platforms that support system debug over the Xilinx® Virtual Cable and if the Kernel type is set as Block Design.
Clock and Reset Options
Number of clocks
Sets the number of clocks used by the kernel. Every kernel has a primary clock and reset called ap_clk and ap_rst_n. All AXI interfaces on the kernel are driven with this clock and reset. When setting Number of clocks to 2, a secondary clock and related reset are provided to be used by the kernel internally. The secondary clock and reset are called ap_clk_2 and ap_rst_n_2, respectively. This secondary clock supports independent frequency scaling and is independent from the primary clock. The secondary clock is useful if the kernel clock needs to run at a faster or slower rate than the AXI4 interfaces, which must be clocked on the primary clock. When designing with multiple clocks, proper clock domain crossing techniques must be used to ensure data integrity across all clock frequency scenarios.
Has reset
Specifies whether to include a top-level reset input port to the kernel. Omitting a reset can be useful to improve routing congestion of large designs. Any registers that would normally have a reset in the design should have proper initial values to ensure correctness. If enabled, there is a reset port included with each clock. Block Design type kernels must have a reset input.

Scalar Arguments

Scalar arguments are used to pass control type information to the kernels. Scalar arguments cannot be read back from the host. For each argument that is specified, a corresponding register is created to facilitate passing the argument from software to hardware. See the following figure.

Figure:Kernel Wizard Scalars

Number of scalar kernel input arguments
Specifies the number of scalar input arguments to pass to the kernel. For each number specified, a table row is generated that allows customization of the argument name and argument type. There is no required minimum number of scalars and the maximum allowed by the wizard is 64.
Scalar Input Argument Definition

The following is the scalar input argument definition:

Argument name
The argument name is used in the generated Verilog control register module as an output signal. Each argument is assigned an ID value. This ID value is used to access the argument from the host software. The ID value assignments can be found on the summary page of this wizard. To ensure maximum compatibility, the argument name follows the same identifier rules as the kernel name.
Argument type
Specifies the data type, and hence bit-width, of the argument. This affects the register width in the generated Verilog module. The data types available are limited to the ones specified by the OpenCL C Specification Version 2.0 in the "6.1.1 Built-in Scalar Data Types" section. The specification provides the associated bit-widths for each data type. The RTL wizard reserves 64 bits for all scalars in the register map regardless of their argument type. If the argument type is 32 bits or less, the RTL Wizard sets the upper 32 bits (of the 64 bits allocated) as a reserved address location. Data types that represent a bit width greater than 32 bits require two write operations to the control registers.
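
The 64-bit reservation per scalar implies a simple offset calculation. The sketch below assumes that scalars are packed back to back starting at 0x10 in declaration order and that the lower 32 bits of a wide scalar sit at the lower address; both are assumptions, so the offsets on the wizard summary page remain the authority. All names are hypothetical.

```c
#include <stdint.h>

#define ARGS_BASE        0x10u  /* first user-defined register in the address map */
#define SCALAR_SLOT_SIZE 8u     /* 64 bits reserved per scalar argument */

/* Assumed register offset of the scalar argument with the given index. */
static uint32_t scalar_offset(unsigned index)
{
    return ARGS_BASE + index * SCALAR_SLOT_SIZE;
}

/* Number of 32-bit register writes needed for a scalar of the given width. */
static unsigned scalar_write_count(unsigned bits)
{
    return bits > 32u ? 2u : 1u;
}

/* 32-bit word of a scalar value; word 0 is assumed to go to the lower address. */
static uint32_t scalar_word(uint64_t value, unsigned word)
{
    return (uint32_t)(value >> (32u * word));
}
```

Under these assumptions, a 64-bit scalar at index 1 would be written as two words at 0x18 and 0x1C.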

Global Memory

Global memory is accessed by the kernel through AXI4 master interfaces (see the following figure).

Figure:Global Memory

Each AXI4 interface operates independently of the others, and each AXI4 interface can be connected to one or more memory controllers for off-chip memory such as DDR4. Global memory is primarily used to pass large data sets between the host and the kernel. It can also be used to pass data between kernels. See the Memory Performance Optimizations for AXI4 Interface section for recommendations on how to design these interfaces for optimal performance. For each interface, example AXI master logic is generated in the RTL kernel to provide a starting point and can be discarded if not used.

In the Global Memory dialog box, you can specify the Number of AXI master interfaces present on the kernel. The maximum is 16 interfaces. For each interface, you can customize an interface name, data width, and the number of associated arguments. Each interface contains all read and write channels. The default names proposed by the RTL kernel wizard are m00_axi and m01_axi. If not changed, these names will have to be used when assigning a DDR bank through the --sp option.

AXI Master Definition (Table Columns)
Interface name
Specifies the name of the interface. To ensure maximum compatibility, the interface name follows the same identifier rules as the kernel name.
Width (in bytes)
Specifies the data width of the AXI data channels. Xilinx recommends matching to the native data width of the memory controller AXI4 slave interface. The memory controller slave interface is typically 64 bytes (512 bits) wide.
Number of arguments
Specifies the number of arguments to associate with this interface. Each argument represents a data pointer to global memory that the kernel can access.
Argument Definition
Interface
Specifies the name of the AXI interface that the corresponding columns in the current row are associated with. This value is not directly modifiable; it is copied from the interface name defined in the previous table.
Argument name
Specifies the name of the pointer argument as it appears on the function prototype signature. Each argument is assigned an ID value. This ID value is used to access the argument from the host software. The ID value assignments can be found on the summary page of this wizard. To ensure maximum compatibility, the argument name follows the same identifier rules as the kernel name. The argument name is used in the generated Verilog control register module as an output signal.

Streaming Interfaces

The streaming interfaces page allows configuration of AXI4-Stream interfaces on the kernel. Streaming interfaces can be used for bus connections between kernels or to/from the host (on certain QDMA-based platforms only). For kernel-to-kernel communication, the AXI4-Stream signal set and protocol should match between kernels. Streaming interfaces used for direct host-to-kernel and kernel-to-host communication must follow a strict protocol and signal declaration. The QDMA AXI4-Stream protocol uses the TDATA/TKEEP/TLAST signals of the AXI4-Stream protocol. Stream transactions consist of a series of transfers where the final transfer is terminated with the assertion of the TLAST signal. The following figure shows the configuration options. Stream transfers to/from the host must adhere to the following:

  • An AXI4-Stream transfer occurs when TVALID and TREADY are both asserted.
  • TDATA must be 8, 16, 32, 64, 128, 256, or 512 bits wide.
  • TKEEP (per byte) must be all 1s when TLAST is 0.
  • TKEEP can be used to signal a ragged tail when TLAST is 1. For example, on a 4-byte interface, TKEEP can only be 0b0001, 0b0011, 0b0111, or 0b1111 to specify that the last transfer is 1 byte, 2 bytes, 3 bytes, or 4 bytes in size, respectively.
  • TKEEP cannot be all zeros (even if TLAST is 1).
  • TLAST must be asserted at the end of a packet.
  • The TREADY input and TVALID output should be low if the kernel is not started, to avoid lost transfers.
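
The TDATA width and TKEEP rules above can be checked mechanically, for example in a test bench model. The following C sketch uses hypothetical function names and covers only the per-transfer rules; it does not model TVALID/TREADY timing or packet boundaries.

```c
#include <stdbool.h>
#include <stdint.h>

/* TDATA width in bytes must be a power of two from 1 to 64 (8 to 512 bits). */
static bool tdata_bytes_valid(unsigned bytes)
{
    return bytes >= 1u && bytes <= 64u && (bytes & (bytes - 1u)) == 0u;
}

/* Checks TKEEP for one transfer on an interface that is `bytes` bytes wide. */
static bool tkeep_valid(uint64_t tkeep, bool tlast, unsigned bytes)
{
    uint64_t all_ones;

    if (!tdata_bytes_valid(bytes))
        return false;
    all_ones = (bytes == 64u) ? UINT64_MAX : ((UINT64_C(1) << bytes) - 1u);

    if (tkeep == 0u || (tkeep & ~all_ones) != 0u)
        return false;              /* never all zeros, no bits beyond the width */
    if (!tlast)
        return tkeep == all_ones;  /* every byte is kept before the last transfer */

    /* Ragged tail: contiguous ones starting at bit 0, so tkeep + 1 is a power
       of two (or wraps to zero when all 64 bits are set). */
    return (tkeep & (tkeep + 1u)) == 0u;
}
```

On a 4-byte interface this accepts 0b0111 with TLAST asserted and rejects 0b0101, which would leave a gap in the kept bytes.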

Figure:Streaming Interfaces



Number of AXI4-Stream interfaces
Specifies the number of AXI4-Stream interfaces that exist on the kernel. A maximum of 32 interfaces can be enabled per kernel. Xilinx recommends keeping the number of interfaces as low as possible to reduce the amount of area consumed.
Stream Settings
Name
Specifies the name of the interface. To ensure maximum compatibility, the interface name follows the same identifier rules as the kernel name.
Mode
Specifies the direction of the interface. A read only interface is an AXI4-Stream slave interface and can be sent data with the clWriteStream API. A write only interface is an AXI4-Stream master interface, and the host can receive data from the interface with the clReadStream API.
Width (bytes)
Specifies the TDATA width (in bytes) of the AXI4-Stream interface. This interface width is limited to 1 to 64 bytes in powers of 2.

Summary

This section summarizes the VLNV, the software function prototype, and the hardware control registers created from options selected in the previous pages. The function prototype conveys what a kernel call would be like if it were a C function. See the generated host code example for how to set the kernel arguments for the kernel call. The register map shows the relationship between the host software ID, argument name, hardware register offset, type, and associated interface. Review this section for correctness before proceeding to generate the kernel.

Figure:Kernel Wizard Summary

Finalizing and Generating the Kernel from the RTL Wizard

If the RTL Kernel Wizard was launched from SDx, after clicking OK, the example Vivado project opens.

If the RTL Kernel Wizard was launched from Vivado, after clicking OK do the following:

  1. When the Generate Output Products window appears, select Global synthesis options and click Generate, then click OK.
  2. Right-click the .xci file in the Design Sources view in Vivado, and select Open IP Example Design.
  3. In the open example design window, select an output directory (or accept the default) and click OK. This opens a new Vivado project with the example design in it.
  4. You can now close the current Vivado project from which the RTL Kernel Wizard was invoked.

Interrupt

By default, the RTL Kernel Wizard creates a single interrupt port, named interrupt, along with the interrupt logic in the Control Register block. This is reflected in the generated Verilog code and the associated component.xml and kernel.xml files.

The interrupt is active-High and is enabled by setting both the Global Interrupt Enable (GIE) and Interrupt Enable (IER) registers. By default, the IER uses the internal ap_done and ap_ready signals to trigger an interrupt.

An interrupt is cleared when all the defined bits of the ISR register are zero as triggered by a toggle on write command.

RTL Kernel Wizard Vivado Project

The RTL Kernel Wizard configuration dialog box customizes the specification of an RTL kernel by specifying its I/O, control registers, and AXI4 interfaces. The next step in the process customizes the contents of the kernel and then packages those contents into a Xilinx Object (xo) file. After the RTL Kernel Wizard configuration GUI has completed, a Vivado kernel project is generated and populated with the files necessary to create an RTL kernel.

The top-level Verilog file contains the expected input/output signals and parameters. These top-level ports are matched to the kernel specification file (kernel.xml) and, when combined with the rest of the RTL/block design, become the acceleration kernel. The AXI4 interfaces defined in the top-level file contain a minimum subset of AXI4 signals required to generate an efficient, high-throughput interface. Omitted signals inherit optimized defaults when connected to the rest of the AXI system. These optimized defaults allow the system to omit AXI features that are not required, saving area and reducing complexity. If starting with existing code that contains AXI signals not listed in the port list, it is possible to add these signals to the top-level ports, and the IP packager will adapt to them appropriately.

Depending on the selected Kernel Type, the top-level file is populated either with a Verilog example and control registers or with an instantiated IP integrator block design.

RTL Kernel Type Project Flow

The RTL kernel type delivers a top-level Verilog design consisting of control register and Vadd sub-modules example design. The Vadd sub-module, shown in the following figure, consists of a simple adder function, an AXI4 read master, two AXI4-Stream interfaces, and an AXI4 write master. Each defined AXI4 interface has independent example adder code. The first associated argument of each interface is used as the data pointer for the example. Each example reads 16 KB of data, performs a 32-bit add one operation, and then writes out 16 KB of data back in place (the read and write address are the same). Care should be taken if the Control Register module is modified to ensure that it still aligns with the kernel.xml file located in the imports directory of the Vivado kernel project. The example sub-module can be replaced with your custom logic or used as a starting point for your design.

Figure:Kernel Type RTL Top



The Vadd sub-module, shown in the following figure, consists of a simple adder function, an AXI4 read master, and an AXI4 write master. Each defined AXI4 interface has independent example adder code. The first associated argument of each interface is used as the data pointer for the example. Each example reads 16 KB of data, performs a 32-bit add one operation, and then writes out 16 KB of data back in place (the read and write address are the same).

Figure:Kernel Type RTL Example



The following table describes important files relative to the root of the Vivado project for the kernel, where <kernel_name> is the name of the kernel chosen in the wizard.

Table 9. RTL Kernel Wizard Source and Test Bench File

Filename | Description | Delivered with Kernel Type
<kernel_name>_ex.xpr | Vivado project file | All
imports directory
<kernel_name>.v | Kernel top-level module | All
<kernel_name>_control_s_axi.v | RTL control register module | RTL
<kernel_name>_example.sv | RTL example block | RTL
<kernel_name>_example_vadd.sv | RTL example AXI4 vector add block | RTL
<kernel_name>_example_axi_read_master.sv | RTL example AXI4 read master | RTL
<kernel_name>_example_axi_write_master.sv | RTL example AXI4 write master | RTL
<kernel_name>_example_adder.sv | RTL example AXI4-Stream adder block | RTL
<kernel_name>_example_counter.sv | RTL example counter | RTL
<kernel_name>_exdes_tb_basic.sv | Simulation test bench | All
<kernel_name>_cmodel.cpp | Software C-Model example for software emulation | All
<kernel_name>_ooc.xdc | Out-of-context Xilinx constraints file | All
<kernel_name>_user.xdc | Xilinx constraints file for kernel user constraints | All
kernel.xml | Kernel description file | All
package_kernel.tcl | Kernel packaging script proc definitions | All
post_synth_impl.tcl | Tcl post-implementation file | All
sdx_imports directory
src/host_example.cpp | Host code example | All
makefile | Makefile example | All
<kernel_name>_ex.sdk/<kernel_name>_control/src directory
kernel_control.h | MicroBlaze C header file | Block Design
kernel_control.c | MicroBlaze C file | Block Design
<kernel_name>_ex.sdk/<kernel_name>_control/Debug directory
<kernel_name>_control.elf | MicroBlaze ELF file | Block Design
<kernel_name>_ex.src/sources_1/<kernel_name>_bd directory
<kernel_name>_bd.bd | Vivado Block Diagram file | Block Design

Block Design Kernel Type Project Flow

The block design kernel type delivers an IP integrator block design (BD) as the basis of the kernel. A MicroBlaze processor subsystem is used to sample the control registers and to control the flow of the kernel. The MicroBlaze processor system uses a block RAM as an exchange memory between the host and the kernel instead of a register file.

For each AXI interface, DMA and math operation sub-blocks are created to provide an example of how to control the kernel execution. The example uses the MicroBlaze AXI4-Stream interfaces to control the AXI DataMover IP to create an example identical to the one in the RTL kernel type. Also included is an SDK project to compile and link an ELF file for the MicroBlaze core. This ELF file is loaded into the Vivado kernel project and initialized directly into the MicroBlaze instruction memory. The following steps can be used to modify the MicroBlaze processor program:
  1. If the design has been updated, you might need to run the Export Hardware option. The option can be found in the File > Export > Export Hardware menu location. When the Export Hardware dialog opens, click OK.
  2. The software development kit (SDK) application can now be invoked. Select File > Launch > SDK from the Vivado menu.
  3. When the Xilinx SDK GUI opens, click X just to the right of the text on the Welcome tab to close the welcome dialog box. This shows an already loaded SDK project underneath.
  4. From the Project Explorer, the source files are under the <kernel_name>_control/src section. Modify these as appropriate.
  5. When updates are complete, compile the source by selecting the menu option Project > Build All. Check for errors/warnings and resolve if necessary. The ELF file is automatically updated in the GUI.
  6. Run simulation to test the updated program and debug if necessary.

Simulation Test Bench

A SystemVerilog simulation test bench is generated to exercise the kernel and confirm that it operates correctly. It is populated with a checker function to verify the add one operation. This generated test bench can be used as a starting point for verifying the kernel functionality. It writes to and reads from the control registers and executes the kernel multiple times, and it also includes a simple reset test. It is useful for debugging AXI issues, reset issues, bugs during multiple iterations, and kernel functionality. Compared to hardware emulation, it tests the hardware corner cases more rigorously, but it does not test the interaction between host code and kernel.

To run a simulation, click Vivado Flow Navigator > Run Simulation, located on the left-hand side of the GUI, and select Run Behavioral Simulation. If behavioral simulation is working as expected, a post-synthesis functional simulation can be run to ensure that the synthesized design matches the behavioral model.

Out-of-Context Synthesis

The Vivado kernel project is configured to run synthesis and implementation in out-of-context (OOC) mode. A Xilinx Design Constraints (XDC) file is populated in the design to provide default clock frequencies for this purpose. Running synthesis is useful to determine whether the kernel synthesizes without errors. It also provides estimates of resource utilization and frequency. The kernel should be able to run through synthesis successfully before it is packaged; otherwise, errors occur during linking, where they can be harder to debug.

The synthesized outputs can be used when packaging the kernel as a netlist instead of RTL. If a block design is used within the kernel, the kernel must be packaged as a netlist. To run OOC synthesis, click Run Synthesis from the Vivado Flow Navigator > Synthesis menu.

Software Model and Host Code Example

A C++ software model of the example add one operation is provided in the imports directory. It has the same name as the kernel and has a cpp file extension. This software model can be modified to model the function of the kernel. In the packaging step, this model can be included with the kernel. When using SDx, this allows software emulation to be performed with the kernel. Hardware emulation and the system linker always use the hardware description of the kernel.

In the sdx_imports directory, example C host code is provided and is called main.c. The host code expects the binary container as the argument to the program. This can be automatically specified by selecting Automatically add binary container(s) to arguments in Run Configuration > Arguments after the host code is loaded into the SDx GUI. The host code then loads the binary as part of the init function. The host code instantiates the kernel, allocates the buffers, sets the kernel arguments, executes the kernel, and then collects and checks the results for the example add one function.
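
The result check that the software model and host code perform for the add one example can be sketched as follows. This is a minimal illustration in Python rather than the delivered C++ model or main.c; the function names are hypothetical, and the 32-bit word width with wraparound is an assumption, not something the generated example is confirmed to use.

```python
# Hypothetical reference model for the "add one" example.
# Assumption: data words are 32 bits wide and wrap around on overflow.
MASK32 = 0xFFFFFFFF


def add_one_model(words):
    """Return the expected kernel output: each input word plus one, modulo 2**32."""
    return [(w + 1) & MASK32 for w in words]


def check_results(inputs, outputs):
    """Return the indices where the kernel output differs from the model."""
    expected = add_one_model(inputs)
    if len(outputs) != len(expected):
        raise ValueError("output buffer length does not match input buffer length")
    return [i for i, (got, want) in enumerate(zip(outputs, expected)) if got != want]
```

An empty list from check_results means every output word matched the model; any returned index points at the first place to look when debugging.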

Package RTL Kernel

After the kernel is designed and tested in Vivado, the final step for generating the RTL kernel is to package the Vivado kernel project for use with SDx.

To begin the process, click Generate RTL Kernel from the Vivado Flow Navigator > Project Manager menu. A pop-up dialog box opens with three main packaging options:
  • A source-only kernel packages the kernel using the RTL design sources directly.
  • The pre-synthesized kernel packages the kernel with the RTL design sources and a synthesized cached output that can be used later in the flow to avoid re-synthesizing. If the target platform changes, the packaged kernel might fall back to the RTL design sources instead of using the cached output.
  • The netlist (design checkpoint, DCP) based kernel packages the kernel as a black box, using the netlist generated by the synthesized output of the kernel. This output can optionally be encrypted. If the target platform changes, the kernel might not be able to re-target the new device and must be regenerated from the source. If the design contains a block design, the netlist (DCP) based kernel is the only packaging option available.

Optionally, all kernel packaging types can be packaged with the software model that can be used in software emulation. If the software model contains multiple files, separate the file names with a space in the Source files list, or use the GUI to select multiple files by holding the CTRL key while selecting the files.

After you click OK, the kernel output products are generated. If the pre-synthesized kernel or netlist kernel option is chosen, synthesis might run. If synthesis has previously run, the flow uses those outputs, regardless of whether they are stale. The kernel Xilinx Object (.xo) file is generated in the sdx_imports directory of the Vivado kernel project.

At this point, you can close the Vivado kernel project. If the Vivado kernel project was invoked from the SDx GUI, the example host code called main.c and the kernel Xilinx Object (.xo) files are automatically imported into the SDx source folder.

Modifying an Existing RTL Kernel Generated from the Wizard

From the SDx GUI, it is possible to modify an existing generated kernel. By invoking the Xilinx RTL Kernel Wizard menu option after a kernel has been generated, a dialog box opens that gives you the option to modify an existing kernel. Selecting Edit Existing Kernel Contents re-opens the Vivado project, and you can then modify and generate the kernel contents again. Selecting Re-customize Existing Kernel Interfaces revisits the RTL Kernel Wizard configuration dialog box. Options other than Kernel Name can be modified and the previous Vivado project is replaced.

IMPORTANT: All files and changes in the previous project are lost when the updated Vivado kernel project is generated.

Manual Development Flow for RTL Kernels

Using the RTL Kernel Wizard to create RTL kernels is highly recommended; however, RTL kernels can be created without using the wizard. This section provides details on each step of the manual development flow. The three steps to package an RTL design as an RTL kernel for SDAccel applications are:

  1. Package the RTL block as Vivado IP.
  2. Create a kernel description XML file.
  3. Package the RTL kernel into a Xilinx Object (.xo) file.

The RTL Kernel Wizard automates these same steps. A fully packaged RTL kernel is delivered as a Xilinx Object file with a file extension of .xo. This file is a container encapsulating the Vivado IP object (including source files) and the associated kernel XML file. The .xo file can be compiled into the platform and run in hardware or hardware emulation flows.

Packaging an RTL Block as Vivado IP

RTL kernels must be packaged as a Vivado IP suitable for use in the IP integrator. See the Vivado Design Suite User Guide: Creating and Packaging Custom IP (UG1118) for details on IP packaging in Vivado.

The following interface packaging is required for the RTL Kernel:

  • The AXI4-Lite interface name must be packaged as S_AXI_CONTROL, but the underlying AXI ports can be named differently.
  • The AXI4 interfaces must be packaged as AXI4 master endpoints with 64-bit address support.
    Note: Xilinx strongly recommends that AXI4 interfaces be packaged with the AXI metadata HAS_BURST=0 and SUPPORTS_NARROW_BURST=0. These properties can be set in an IP-level bd.tcl file. They indicate that wrap and fixed burst types are not used and that narrow (sub-size) bursts are not used.
  • ap_clk and ap_clk_2 must be packaged as clock interfaces.
  • ap_rst_n and ap_rst_n_2 must be packaged as active-Low reset interfaces.
  • ap_clk must be packaged to be associated with all AXI4-Lite, AXI4, and AXI4-Stream interfaces.

To test whether the RTL kernel is packaged correctly for the IP integrator, try to instantiate the packaged kernel in the IP integrator. In the GUI, it should show up as having interfaces for clock, reset, AXI4-Lite slave, AXI4 master, and AXI4 slave only. No other ports should be present in the canvas view. The properties of an AXI interface can be viewed by selecting the interface on the canvas. Then, in the Block Interface Properties window, select the Properties tab and expand the CONFIG table entry. If an interface is to be read-only or write-only, the unused AXI channels can be removed and READ_WRITE_MODE set to read-only or write-only.

IMPORTANT:If the RTL kernel has constraints which refer to constraints in the static area such as clocks, then the RTL kernel constraint file needs to be marked as late processing order to ensure RTL kernel constraints are correctly applied.

There are two methods to mark constraints as late processing order:

  1. If the constraints are given in a .ttcl file, add <: setFileProcessingOrder "late" :> to the .ttcl preamble section of the file as shown below:
    <: set ComponentName [getComponentNameString] :>
    <: setOutputDirectory "./" :>
    <: setFileName $ComponentName :>
    <: setFileExtension ".xdc" :>
    <: setFileProcessingOrder "late" :>
  2. If the constraints are given in a .xdc file, add the four lines starting at <spirit:define> below to the component.xml. The four lines need to be next to the area where the .xdc file is called. In the following example, the my_ip_constraint.xdc file is being called with the subsequent late processing order defined.
    <spirit:file>
      <spirit:name>ttcl/my_ip_constraint.xdc</spirit:name>
      <spirit:userFileType>ttcl</spirit:userFileType>
      <spirit:userFileType>USED_IN_implementation</spirit:userFileType>
      <spirit:userFileType>USED_IN_synthesis</spirit:userFileType>
      <spirit:define>
        <spirit:name>processing_order</spirit:name>
        <spirit:value>late</spirit:value>
      </spirit:define>
    </spirit:file>

Create Kernel Description XML File

A kernel description XML file needs to be created for each RTL kernel so that it can be used in the SDAccel environment. The file must be called kernel.xml. The XML file specifies kernel attributes, such as the register map and ports, that are needed by the runtime and SDAccel flows. The following is an example of a kernel.xml file.

           

The following table describes the format of the kernel XML in detail:

Table 10.Kernel XML Format
Tag Attribute Description
versionMajor Set to 1 for the current release ofSDAccel.
versionMinor Set to 6 for the current release ofSDAccel.
name Kernel name
language Always set to ip_c for RTL kernels.
vlnv Must match the vendor, library, name, and version attributes in the component.xml of an IP. For example, if component.xml has the following tags:

<spirit:vendor>xilinx.com</spirit:vendor>

<spirit:library>hls</spirit:library>

<spirit:name>test_sincos</spirit:name>

<spirit:version>1.0</spirit:version>

The vlnv attribute in kernel XML must be set to:

xilinx.com:hls:test_sincos:1.0

attributes Reserved. Set it to empty string.
preferredWorkGroupSizeMultiple Reserved. Set it to 0.
workGroupSize Reserved. Set it to 1.
interrupt Set equal to "true" (that is, interrupt="true") if an interrupt is present; otherwise omit.
name Port name. At least an AXI4 master port and an AXI4-Lite slave port are required. The AXI4-Stream port can be optionally specified to stream data between kernels. The AXI4-Lite interface name must be S_AXI_CONTROL.
mode
  • For an AXI4 master port, set it to "master".
  • For an AXI4 slave port, set it to "slave".
  • For an AXI4-Stream master port, set it to "write_only".
  • For an AXI4-Stream slave port, set it to "read_only".
range The range of the address space for the port.
dataWidth The width of the data that goes through the port. The default is 32 bits.
portType Indicates whether the port is addressable or streaming.
  • For AXI4 master and slave ports, set it to "addressable".
  • For AXI4-Stream ports, set it to "stream".
base For AXI4 master and slave ports, set to 0x0. This tag is not applicable to AXI4-Stream ports.
name Kernel argument name.
addressQualifier Valid values:
  • 0: Scalar kernel input argument
  • 1: global memory
  • 2: local memory
  • 3: constant memory
  • 4: pipe
id Only applicable for AXI4 master and slave ports. The ID needs to be sequential. It is used to determine the order of kernel arguments.

Not applicable for AXI4-Stream ports.

port Indicates the port to which the arg is connected.
size Size of the argument. The default is 4 bytes.
offset Indicates the register memory address.
type The C data type for the argument. For example, int*, float*.
hostOffset Reserved. Set to 0x0.
hostSize Size of the argument. The default is 4 bytes.
memSize Not applicable to AXI4 master and slave ports.

For AXI4-Stream ports, memSize sets the depth of the created FIFO.

The following tags specify additional information for AXI4-Stream ports. They are not applicable to AXI4 master or slave ports.
For each pipe in the compute unit, the compiler inserts a FIFO for buffering the data. The pipe tag describes the configuration of the FIFO.
name This specifies the name for the FIFO inserted for the AXI4-Stream port. This name must be unique among all pipes used in the same compute unit.
width This specifies the width of the FIFO in bytes. For example, 0x4 for a 32-bit FIFO.
depth This specifies the depth of the FIFO in number of words.
linkage Always set to internal.
The connection tag describes the actual connection in hardware either from the kernel to the FIFO inserted for the PIPE or from the FIFO to the kernel.
srcInst Specifies the source instance of the connection.
srcPort Specifies the port on the source instance for the connection.
dstInst Specifies the destination instance of the connection.
dstPort Specifies the port on the destination instance of the connection.
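
The attributes in the table can be assembled into a skeleton kernel.xml with a short script. The following is a minimal sketch, not the format's authoritative definition: the element names root, kernel, ports, port, and args are assumptions (only arg, pipe, and connection are named in the text above), and the port name M_AXI_GMEM, the range values, the argument name scalar00, its type, and its offset are hypothetical placeholders. A pointer argument would be added the same way with addressQualifier set to 1.

```python
import xml.etree.ElementTree as ET


def build_kernel_xml(kernel_name, vendor, library, version):
    """Return a skeleton kernel description as an XML string."""
    # vlnv joins the vendor, library, name, and version fields with colons.
    vlnv = ":".join([vendor, library, kernel_name, version])

    # Element names below are assumptions; attribute names come from the table.
    root = ET.Element("root", versionMajor="1", versionMinor="6")
    kernel = ET.SubElement(root, "kernel", {
        "name": kernel_name,
        "language": "ip_c",                     # always ip_c for RTL kernels
        "vlnv": vlnv,
        "attributes": "",                       # reserved
        "preferredWorkGroupSizeMultiple": "0",  # reserved
        "workGroupSize": "1",                   # reserved
    })

    ports = ET.SubElement(kernel, "ports")
    # Hypothetical AXI4 master port name and range.
    ET.SubElement(ports, "port", {
        "name": "M_AXI_GMEM", "mode": "master", "range": "0xFFFFFFFF",
        "dataWidth": "32", "portType": "addressable", "base": "0x0"})
    # The AXI4-Lite interface name must be S_AXI_CONTROL; the range is hypothetical.
    ET.SubElement(ports, "port", {
        "name": "S_AXI_CONTROL", "mode": "slave", "range": "0x1000",
        "dataWidth": "32", "portType": "addressable", "base": "0x0"})

    args = ET.SubElement(kernel, "args")
    # Hypothetical scalar argument (addressQualifier 0) on the control port.
    ET.SubElement(args, "arg", {
        "name": "scalar00", "addressQualifier": "0", "id": "0",
        "port": "S_AXI_CONTROL", "size": "0x4", "offset": "0x010",
        "type": "uint", "hostOffset": "0x0", "hostSize": "0x4"})

    return ET.tostring(root, encoding="unicode")


print(build_kernel_xml("test_sincos", "xilinx.com", "hls", "1.0"))
```

Generating the file from the same values used to package the IP keeps the vlnv attribute and the component.xml fields from drifting apart.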

Package RTL Kernel into Xilinx Object File

The final step is to package the RTL IP and the associated kernel XML file together into a Xilinx object file (.xo) so it can be used by the SDAccel compiler. The following example command line packages the test_sincos RTL IP and kernel.xml into an object file named test.xo.

package_xo -xo_path test.xo -kernel_name test_sincos -kernel_xml kernel.xml -ip_directory ./ip/

For additional information on package_xo, see the "package_xo Command" section in the SDx Command and Utility Reference Guide (UG1279). Also, for examples of using the package_xo command, see the Xilinx GitHub repository.

Designing RTL Recommendations

While the RTL Kernel Wizard assists in packaging RTL designs for use within the SDx flow, the underlying RTL kernels should be designed with recommendations from the UltraFast Design Methodology Guide for the Vivado Design Suite (UG949).

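In addition to adhering to the interface and packaging requirements, the kernels should be designed with performance goals in mind. Specifically:

  • Memory Performance Optimizations for AXI4 Interface
  • Managing Clocks in an RTL Kernel
  • Quality of Results Considerations
  • Debug and Verification Considerations

These topics are described in the following subsections.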

Memory Performance Optimizations for AXI4 Interface

The AXI4 interfaces typically connect to DDR memory controllers in the platform.

Note: For optimal frequency and resource usage, it is recommended that one interface is used per memory controller.

For best performance from the memory controller, the following is the recommended AXI interface behavior:

  • Use an AXI data width that matches the native memory controller AXI data width, typically 512 bits.
  • Do not use WRAP, FIXED, or sub-sized bursts.
  • Use burst transfers that are as large as possible (up to the 4 KB AXI4 protocol limit).
  • Avoid use of deasserted write strobes. Deasserted write strobes can cause error-correction code (ECC) logic in the DDR memory controller to perform read-modify-write operations.
  • Use pipelined AXI transactions.
  • Avoid using threads if an AXI interface is only connected to one DDR controller.
  • Avoid generating write address commands if the kernel does not have the ability to deliver the full write transaction (non-blocking write requests).
  • Avoid generating read address commands if the kernel does not have the capacity to accept all the read data without back pressure (non-blocking read requests).
  • If read-only or write-only interfaces are desired, the ports of the unused channels can be commented out in the top-level RTL file before the project is packaged into a kernel.
  • Using multiple threads can cause larger resource requirements in the infrastructure IP between the kernel and the memory controllers.
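
The burst-sizing recommendations above can be illustrated with a small planning model. This is a sketch under stated assumptions, not part of the tool flow: the function name is hypothetical, the 64-byte default corresponds to a 512-bit data width, and the 256-beat limit and 4 KB boundary rule are taken from the AXI4 protocol for INCR bursts. Requiring the address and length to be multiples of the data width keeps every beat full width, so no write strobes need to be deasserted.

```python
def split_into_bursts(start_addr, num_bytes, data_width_bytes=64,
                      max_beats=256, boundary=4096):
    """Split a transfer into INCR bursts that never cross a 4 KB boundary.

    Returns a list of (address, beats) tuples.
    """
    if start_addr % data_width_bytes or num_bytes % data_width_bytes:
        raise ValueError("address and length must be multiples of the data width")

    bursts = []
    addr, remaining = start_addr, num_bytes
    while remaining > 0:
        # Bytes left before the next 4 KB boundary.
        to_boundary = boundary - (addr % boundary)
        # Largest burst allowed by the remaining data, the boundary,
        # and the maximum burst length.
        chunk = min(remaining, to_boundary, max_beats * data_width_bytes)
        bursts.append((addr, chunk // data_width_bytes))
        addr += chunk
        remaining -= chunk
    return bursts


# An 8 KB transfer at a 4 KB-aligned address becomes two 64-beat bursts.
print(split_into_bursts(0, 8192))
```

With a 512-bit interface, a 4 KB burst is 64 beats, so the boundary rule rather than the 256-beat limit is what caps the burst length.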

Managing Clocks in an RTL Kernel

An RTL kernel can have up to two external clock interfaces: a primary clock, ap_clk, and an optional secondary clock, ap_clk_2. Both clocks can be used for clocking internal logic. However, all external RTL kernel interfaces must be clocked on the primary clock. Both primary and secondary clocks support independent automatic frequency scaling.

If you require additional clocks within the RTL kernel, a frequency synthesizer such as the Clocking Wizard IP or MMCM/PLL primitive can be instantiated within the RTL kernel.

Thus your RTL kernel can use just the primary clock, both primary and secondary clock, or primary and secondary clock along with an internal frequency synthesizer. The following shows the advantages and disadvantages of using these three RTL kernel clocking methods:

  • Single input clock: ap_clk
    • External interfaces and internal kernel logic run at the same frequency.
    • No clock domain crossing (CDC) issues.
    • Frequency of ap_clk can automatically be scaled to allow the kernel to meet timing.
  • Two input clocks: ap_clk and ap_clk_2
    • Kernel logic can run at either clock frequency.
    • Need proper CDC technique to move from one frequency to another.
    • Both ap_clk and ap_clk_2 can automatically scale their frequencies independently to allow the kernel to meet timing.
  • Using a frequency synthesizer inside the kernel:
    • Additional device resources required to generate clocks.
    • Must have ap_clk and optionally ap_clk_2 interfaces.
    • Generated clocks can have different frequencies for different CUs.
    • Kernel logic can run at any available clock frequency.
    • Need proper CDC technique to move from one frequency to another.

When using a frequency synthesizer in the RTL kernel there are some constraints you should be aware of:

  1. RTL external interfaces are clocked at ap_clk.
  2. The frequency synthesizer can have multiple output clocks that are used as internal clocks to the RTL kernel.
  3. You must provide a Tcl script to downgrade the clock resource placement DRCs in Vivado placement to stop a Vivado DRC error from occurring. An example of the Tcl command follows:
    set_property CLOCK_DEDICATED_ROUTE ANY_CMT_COLUMN [get_nets pfm_top_i/static_region/base_clocking/clkwiz_kernel/inst/CLK_CORE_DRP_I/clk_inst/clk_out1]
    Note: This constraint should be edited to reflect the shell clock structure of your platform.
  4. Use the xocc --xp option to specify the above Tcl script for use by Vivado implementation, after optimization. For example:
    --xp vivado_prop:run.impl_1.STEPS.OPT_DESIGN.TCL.POST={/}
  5. Specify the two global clock input frequencies which can be used by the kernels (RTL or HLS-based). Use the xocc --kernel_frequency option to ensure the kernel input clock frequency is as expected. For example, to specify one clock use:
    xocc --kernel_frequency 250
    For two clocks, you can specify multiple frequencies based on the clock ID. The primary clock has clock ID 0 and the secondary has clock ID 1.
    xocc --kernel_frequency 0:250|1:500
    TIP: Ensure that the PLL or MMCM output clock is locked before RTL kernel operations. Use the locked signal in the RTL kernel to ensure the clock is operating correctly.
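
When the compile command is assembled by a build script, the --kernel_frequency value can be generated from the two clock IDs described in step 5. The following is a minimal sketch; the helper name is hypothetical, and the quoting step assumes the command is later pasted into a POSIX shell, where an unquoted | character would otherwise be interpreted as a pipe.

```python
import shlex


def kernel_frequency_option(primary_mhz, secondary_mhz=None):
    """Return the --kernel_frequency option as an argument list.

    Clock ID 0 is the primary clock and clock ID 1 is the secondary clock.
    """
    if secondary_mhz is None:
        value = str(primary_mhz)
    else:
        value = f"0:{primary_mhz}|1:{secondary_mhz}"
    return ["--kernel_frequency", value]


# shlex.join quotes the value so that a shell does not treat "|" as a pipe.
print(shlex.join(["xocc"] + kernel_frequency_option(250, 500)))
# prints: xocc --kernel_frequency '0:250|1:500'
```

Passing the list form directly to a process launcher avoids the quoting question entirely, because no shell parses the arguments.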
After adding the frequency synthesizer to an RTL kernel, the generated clocks are not automatically scalable. Ensure the RTL kernel passes timing requirements, or xocc will return an error like the following:
ERROR: [VPL-1] design did not meet timing - Design did not meet timing. One or more unscalable system clocks did not meet their required target frequency. Please try specifying a clock frequency lower than 300 MHz using the '--kernel_frequency' switch for the next compilation. For all system clocks, this design is using 0 nanoseconds as the threshold worst negative slack (WNS) value. List of system clocks with timing failure.

In this case you will need to change the internal clock frequency, or optimize the kernel logic to meet timing.

Quality of Results Considerations

The following recommendations help improve results for timing and area:

  • Pipeline all reset inputs and internally distribute resets avoiding high fanout nets.
  • Reset only essential control logic flip-flops (FFs).
  • Consider registering input and output signals to the extent possible.
  • Understand the size of the kernel relative to the capacity of the target platforms to ensure fit, especially if multiple kernels will be instantiated.
  • Recognize platforms that use Stacked Silicon Interconnect (SSI) technology. These devices have multiple dies, and any logic that must cross between them should use flip-flop to flip-flop timing paths.

Debug and Verification Considerations

  • RTL kernels should be verified in their own test bench using advanced verification techniques including verification components, randomization, and protocol checkers. The AXI Verification IP (VIP) is available in the Vivado IP catalog and can help with the verification of AXI interfaces. The RTL kernel example designs contain an AXI VIP-based test bench with sample stimulus files.
  • The hardware emulation flow should not be used for functional verification because it does not accurately represent the range of possible protocol signaling conditions that real AXI traffic in hardware can incur. Hardware emulation should be used to test the host code software integration or to view the interaction between multiple kernels.