Vivado Hardware Design for Deep Learning Unit

This section is part of a series focused on utilizing the DPUCZDX8G Deep Learning Processor Unit (DPU), a programmable engine optimized for convolutional neural networks (CNNs), within the Vitis AI environment. In this section, I will describe the process of integrating the DPU with configuration B1152 into a Vivado hardware design for the Zynq UltraScale+ 2CG device. The setup used includes the Trenz Module TE0820-03-2AI21FA (ZU+ 2CG) and the TE0703-06 carrier board.

This series covers the key steps for enabling efficient AI acceleration on embedded platforms.

By following this guide, you can gain practical insight into designing, deploying, and running AI workloads on Zynq UltraScale+ MPSoCs using Vitis AI.

All sources are freely available in my repository. The repository contains examples for two ZU+ devices (2CG and 4EV, Trenz modules TE0820-03-2AI21FA and TE0820-05-4DE21MA) and lets you create the Vivado hardware design and deploy Linux with Petalinux for the Vitis AI environment.

The structure of the repository:

board/ - Vivado block design and project configuration tcl

petalinux/ - Petalinux project configuration

ip/ - DPU IP core sources

scripts/ - some helper scripts

My host environment:

1. Ubuntu 20.04 on WSL2

2. Vitis 2022.2 installed in Ubuntu 20.04

3. Petalinux 2022.2

4. Vitis AI 3.0 repository

5. DPU DPUCZDX8G IP core

How to start. Clone the repository and build the Vivado project for the te0820-2cg board. The hardware design is exported as a Xilinx Support Archive (XSA) into the build/ folder for Linux deployment with Petalinux.

[host]:~$ git clone https://github.com/farbius/edu-vitis-ai
[host]:~$ cd edu-vitis-ai
[host]:~/edu-vitis-ai$ make te0820-2cg-xsa

The following figures show the configuration page of the DPUCZDX8G.

The design uses one DPU core with low RAM usage. To reduce power consumption, dpu_2x clock gating is enabled. The resource usage and the AXI port configuration are shown on the Summary tab. Because the Softmax function is enabled, the number of DSP slices and RAM blocks is higher than the figures listed for the B1152 architecture, but the design still fits on ZU+ 2CG devices.
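Once the make te0820-2cg-xsa step from the quick start has finished, a quick way to confirm that the export worked is to look for .xsa files under build/. The snippet below is a minimal sketch, not part of the repository: the helper name find_xsa is made up for illustration, and since the text does not state the exact file name of the archive, it simply searches for the extension.

```shell
#!/bin/sh
# Sketch: list exported XSA files below a build directory.
# Assumption: the archive ends in .xsa and lives somewhere under build/;
# the exact file name is not given in the article.
find_xsa() {
    build_dir="${1:-build}"
    if [ ! -d "$build_dir" ]; then
        echo "no build directory: $build_dir" >&2
        return 1
    fi
    # Print every regular file with the .xsa extension, sorted for stable output.
    find "$build_dir" -type f -name '*.xsa' | sort
}

# Example call from the repository root (prints nothing if no XSA exists):
# find_xsa build
```

An XSA is a zip container, so running unzip -l on a file reported by find_xsa lists what Vivado packed into it.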

The DPUCZDX8G IP core supports one AXI slave interface for accessing configuration and status registers, and AXI master interfaces for instruction and data fetch. To optimize performance, all interfaces should be directly connected to the Processing System (PS), when possible, to minimize latency and maximize data throughput.

Xilinx Recommendations for optimal DPUCZDX8G integration:

  • Direct Connections to PS: Connect each master interface of the DPUCZDX8G directly to the PS instead of using an AXI Interconnect IP, provided there are sufficient AXI slave ports on the PS.

  • Instruction Fetch Interface: All master ports responsible for instruction fetching should connect to the S_AXI_LPD of the PS, either directly or via an AXI interconnect (if only one master port is present).

  • Data Fetching Ports: To ensure high bandwidth, the master ports used for data fetching should be directly connected to the PS whenever possible.

  • Port Prioritization: Master ports with higher priority (such as DPU0, which has a smaller port number) should be connected to higher-priority slave ports on the PS (like S_AXI_HP0_FPD), to ensure efficient data access.

  • Slave Interface Connection: The AXI slave port of the DPUCZDX8G should be connected to the M_AXI_HPM0_LPD of the PS to manage register access efficiently.
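The recommendations above can be summarized as a small port map. The following sketch prints Vivado Tcl connect_bd_intf_net commands for a single-core design. Several names here are assumptions and need to be checked against the actual block design: the instance names dpu_0 and zynq_ultra_ps_e_0 are hypothetical, the DPU interface names (DPU0_M_AXI_DATA0, DPU0_M_AXI_DATA1, DPU0_M_AXI_INSTR, S_AXI) are assumed from the IP's port naming, and the extra Softmax master port is left out. The script only prints text and does not need Vivado to run.

```shell
#!/bin/sh
# Hypothetical instance names; adjust to the names in your block design.
DPU=dpu_0
PS=zynq_ultra_ps_e_0

# DPU master port -> PS slave port, following the recommendations:
# data ports go to high-performance ports, instruction fetch goes to S_AXI_LPD.
port_map() {
    cat <<'EOF'
DPU0_M_AXI_DATA0 S_AXI_HP0_FPD
DPU0_M_AXI_DATA1 S_AXI_HP1_FPD
DPU0_M_AXI_INSTR S_AXI_LPD
EOF
}

# Print one Tcl command per connection.
emit_tcl() {
    port_map | while read -r src dst; do
        printf 'connect_bd_intf_net [get_bd_intf_pins %s/%s] [get_bd_intf_pins %s/%s]\n' \
            "$DPU" "$src" "$PS" "$dst"
    done
    # Register access: PS master port to the DPU slave port.
    printf 'connect_bd_intf_net [get_bd_intf_pins %s/M_AXI_HPM0_LPD] [get_bd_intf_pins %s/S_AXI]\n' \
        "$PS" "$DPU"
}

emit_tcl
```

Keeping the mapping in one table makes it easy to see which PS port each DPU master uses and to swap in an interconnect where the PS runs out of slave ports.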

Below is an example of the hardware design for the Zynq UltraScale+ 2CG device. The DPUCZDX8G IP requires three input clocks, each with a corresponding reset signal. It’s important to ensure that each reset is synchronous to its respective clock to maintain proper system operation.

DPU configuration. Different configurations can be selected for DSP slice, LUT, block RAM, and UltraRAM usage, depending on the amount of available programmable logic resources. The ZU+ 2CG device provides 150 RAM blocks, 240 DSP slices, and no UltraRAM blocks, so the DPUCZDX8G IP with the B1152 architecture fits.

Clock Wizard for DPU. A DSP Double Data Rate (DDR) technique can be used to improve the performance achieved with the DPU. In this configuration the DPU logic needs two input clocks: a 1x clock for general logic and a 2x clock for the DSP slices. Gating the dpu_2x clock is an option for reducing the power consumption of the DPUCZDX8G. Together with the register clock, the ZU+ 2CG design uses three clocks for the DPUCZDX8G: 100 MHz for the AXI slave port, 325 MHz for the AXI master ports, and a gated 650 MHz clock for the DSP slices.
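The relation between the 1x and 2x clocks can be checked with simple arithmetic before the values go into the Clock Wizard. The sketch below uses the frequencies quoted above and assumes, as the 1x/2x naming suggests, that the DSP clock must be exactly twice the general logic clock. The function name check_ddr_ratio is made up for this example.

```shell
#!/bin/sh
# Clock frequencies from the text, in MHz.
S_AXI_CLK=100     # AXI slave (register) clock
DPU_CLK=325       # 1x clock for general logic and AXI master ports
DPU_2X_CLK=650    # 2x gated clock for the DSP slices

# $1 = 1x clock in MHz, $2 = 2x clock in MHz
check_ddr_ratio() {
    if [ $(( $1 * 2 )) -eq "$2" ]; then
        echo "ok"
    else
        echo "mismatch"
    fi
}

echo "s_axi: ${S_AXI_CLK} MHz, 1x: ${DPU_CLK} MHz, 2x: ${DPU_2X_CLK} MHz"
check_ddr_ratio "$DPU_CLK" "$DPU_2X_CLK"
```

For the values used in this design (325 and 650) the check prints ok. A pair such as 300 and 650 would print mismatch.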

Address space. The minimum space needed for the DPUCZDX8G is 16 MB. The DPUCZDX8G slave interface can be assigned to any starting address accessible by the host CPU.
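As a worked example of the 16 MB requirement, the sketch below computes the address window that the DPU slave interface would occupy for a given base address. The base 0x80000000 is only an illustrative value, not the address used in the repository, and the alignment check reflects the assumption that the address editor expects the base to be a multiple of the range size.

```shell
#!/bin/sh
# Minimum DPU register space from the text: 16 MB.
DPU_RANGE=$(( 16 * 1024 * 1024 ))

# $1 = base address (hex or decimal); prints "start-end" of the window.
dpu_window() {
    base=$(( $1 ))
    # Assumption: the base should be aligned to the 16 MB range size.
    if [ $(( base % DPU_RANGE )) -ne 0 ]; then
        echo "base not aligned to 16 MB" >&2
        return 1
    fi
    printf '0x%08X-0x%08X\n' "$base" $(( base + DPU_RANGE - 1 ))
}

# Illustrative base address only.
dpu_window 0x80000000
```

With the illustrative base, the script prints 0x80000000-0x80FFFFFF, which is a 16 MB window.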

Conclusion. The proposed Vivado hardware design, tailored for the Vitis AI Vivado flow, integrates the DPUCZDX8G IP core with architecture B1152, specifically optimized for Zynq UltraScale+ MPSoC (ZU+ 2CG) devices. This design enables efficient acceleration of AI workloads on the programmable logic, leveraging the high-performance DPU core for deep learning inference. By utilizing the Vitis AI stack, the hardware accelerates AI model deployment, offering seamless integration with the software development flow while achieving optimal performance on embedded platforms.