Author: Site Editor | Publish Time: 2022-07-12
Each of these FPGAs focuses on a different aspect of improving performance: the Xilinx VU57P works around memory bandwidth bottlenecks in demanding applications; Intel Stratix 10 NX FPGAs integrate AI-optimized DSP blocks to implement large AI models with low latency; and Lattice Nexus FPGAs aim to redefine low-power, small-form-factor FPGAs.
Over the past decade, computing bandwidth has grown exponentially in many application areas. For example, the number of DSP slices offered by Xilinx FPGAs for machine learning applications has increased from about 2,000 slices for the largest Virtex 6 FPGAs to about 12,000 slices for modern Virtex UltraScale+ devices. Similar trends are observed in other application areas such as network technology and video applications, as shown below.
Requirements for memory bandwidth
The graph above shows that the memory bandwidth of DDR technology has only increased slightly over the past decade - roughly a 2x increase from DDR3 to DDR4. (It's worth noting that the leap from DDR4 to DDR5 may have been more impactful.)
The bandwidth gap in the graph means that the limited data transfer rate between the FPGA and memory is a bottleneck in these applications. To solve this problem, designers often use multiple DDR chips in parallel to increase memory bandwidth (not necessarily memory capacity). However, high power consumption, form-factor and cost constraints, and PCB design challenges make this approach impractical at memory bandwidths greater than about 85 GB/s.
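The arithmetic behind that ceiling is easy to sketch. The figures below (a 64-bit DDR4-3200 interface) are illustrative assumptions, not taken from any specific board design:

```python
import math

# Back-of-the-envelope DDR bandwidth arithmetic. The interface width and
# transfer rate below are illustrative assumptions, not board specifics.

def ddr_bandwidth_gbps(transfer_rate_mtps: float, bus_width_bits: int) -> float:
    """Peak bandwidth of one DDR interface in GB/s."""
    return transfer_rate_mtps * 1e6 * bus_width_bits / 8 / 1e9

# One 64-bit DDR4-3200 interface: 3200 MT/s * 8 bytes = 25.6 GB/s peak.
single = ddr_bandwidth_gbps(3200, 64)

# Parallel interfaces needed to reach the ~85 GB/s point where power,
# pin count, and PCB routing make the approach impractical:
channels = math.ceil(85 / single)

print(f"{single:.1f} GB/s per channel, {channels} channels for 85 GB/s")
```

Each extra channel costs hundreds of FPGA pins and its own PCB routing, which is why the parallel-DDR approach runs out of steam well before the bandwidth HBM offers.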
Alternatively, an effective solution to the memory bandwidth problem is a type of DRAM-based memory called High Bandwidth Memory (HBM for short). In this case, the DRAM and the FPGA are integrated in the same package using silicon-stacking technology, as shown in the figure below.
Silicon stacking facilitates parallel implementation of DRAM memory and FPGA
HBM technology allows us to eliminate the relatively long PCB traces connecting the DDR chip to the FPGA. Using an integrated HBM interface with a large number of pins can significantly increase memory bandwidth with latency similar to DDR-based technologies.
Xilinx recently released the VU57P FPGA (from the Virtex UltraScale+ family), which integrates 16 GB of HBM and offers up to 460 GB/s of memory bandwidth. The device features an integrated AXI port switch that allows us to access any HBM memory location from any memory port.
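The 460 GB/s figure follows from the sheer width of the in-package interface. The sketch below uses typical HBM2 numbers (1024-bit bus per stack; the two-stack count and 1.8 Gbps pin rate are illustrative assumptions, not official VU57P specifications):

```python
# Why a wide in-package interface yields such high bandwidth: each HBM2
# stack exposes a 1024-bit data bus, far wider than a 64-bit DDR channel.
# Stack count and per-pin rate here are assumptions for illustration.

def hbm_bandwidth_gbs(stacks: int, bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Aggregate peak bandwidth in GB/s."""
    return stacks * bus_width_bits * pin_rate_gbps / 8

# Two 1024-bit HBM2 stacks at 1.8 Gbps per pin:
print(hbm_bandwidth_gbs(2, 1024, 1.8))  # → 460.8, close to the quoted 460 GB/s
```

A bus that wide is only feasible because the connections are short in-package microbumps rather than PCB traces, which is exactly the point of silicon stacking.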
In addition to the energy-efficient computing features and large memory bandwidth discussed above, the VU57P offers high-speed interfaces such as 100G Ethernet with RS-FEC, 150G Interlaken, and PCIe Gen4. The new device's 58G PAM4 transceiver supports connectivity to the latest optical standards. This is useful in different applications such as next-generation firewalls and switches and routers with QoS.
Many routine applications of digital signal processing (DSP) require high-precision arithmetic. This is why FPGAs typically have DSP blocks with high precision multipliers and adders. For example, the XC7A50T (Xilinx) and 5CGXC4 (Intel) have 120 and 140 18×18 multipliers, respectively.
It turns out that many deep learning applications can be implemented using fewer bits without significantly sacrificing accuracy. Lower precision approximations reduce the amount of computational resources and the required memory bandwidth.
Another advantage of reducing the bit width is the power savings due to lower-precision computations and fewer bits transferred per memory transaction. In fact, according to UC Davis researchers, INT8 or even lower-precision calculations can yield acceptable results in many deep learning applications.
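A minimal sketch of the precision reduction described above, using symmetric per-tensor INT8 quantization (one common scheme among several; no framework assumed):

```python
# Minimal symmetric INT8 quantization sketch: floats are mapped to 8-bit
# integer codes sharing a single scale factor. Illustrative only.

def quantize_int8(values):
    """Map floats to int8 codes with one per-tensor scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [code * scale for code in q]

weights = [0.42, -1.27, 0.05, 0.9, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now needs 1 byte instead of 4, and the round-trip error
# stays below half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_err:.4f}")
```

The 4x reduction in bytes per weight translates directly into the memory bandwidth and power savings the text mentions, since every memory transaction moves four times as many weights.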
The Intel Stratix 10 NX FPGA is the first AI-optimized FPGA from Intel. The devices integrate arithmetic blocks called AI Tensor Blocks, which contain dense arrays of low-precision multipliers. The base precision of these blocks is INT8 and INT4, although they also support FP16 and FP12 numeric formats via shared-exponent hardware support.
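The shared-exponent idea can be sketched as block floating point: every value in a block shares one exponent while each element keeps only a small integer mantissa, so the integer multiplier array does the heavy lifting. This is a conceptual model of the technique, not Intel's exact FP16/FP12 block format:

```python
import math

# Conceptual block-floating-point sketch of the "shared exponent" idea:
# one exponent per block, small integer mantissas per element.
# Edge-case clamping (e.g. mantissa overflow at exact powers of two)
# is omitted for brevity.

def to_block_fp(values, mantissa_bits=8):
    """Encode a block of floats as (shared_exponent, integer mantissas)."""
    top = max(abs(v) for v in values)
    # Pick the exponent so the largest mantissa fits in mantissa_bits - 1
    # magnitude bits (sign takes the remaining bit).
    exp = math.ceil(math.log2(top)) - (mantissa_bits - 1)
    mantissas = [round(v / 2 ** exp) for v in values]
    return exp, mantissas

def from_block_fp(exp, mantissas):
    return [m * 2 ** exp for m in mantissas]

block = [0.75, -0.5, 0.124, 0.03]
exp, m = to_block_fp(block)
print(exp, m, from_block_fp(exp, m))
```

Because only the mantissas enter the multipliers, the same INT8 array that serves integer workloads can approximate floating-point arithmetic at full throughput.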
The AI Tensor block (used in the Stratix 10 NX FPGA) can increase INT8 throughput by a factor of 15 compared to the DSP block of a standard Intel Stratix 10 FPGA. The high-level block diagram of the AI Tensor Block is shown below.
Block diagram of AI Tensor Block
The most notable feature of the Intel Stratix 10 NX FPGA is the high computational density provided by its AI-optimized compute blocks. However, the new device also integrates two other features that help designers implement large AI models with low latency: near-compute memory (integrated HBM) and high-bandwidth networking (up to 57.8 Gbps PAM4 transceivers).
Lattice Semiconductor recently announced its Certus-NX FPGA family, which utilizes 28nm fully depleted silicon-on-insulator (FD-SOI) process technology. Originally developed by Samsung, FD-SOI is somewhat similar to conventional CMOS processes; however, it provides programmable biasing for most transistors, as conceptually explained below.
Programmable body biasing greatly reduces chip area and power consumption. Compared to other FPGAs with similar logic cell counts, Certus-NX consumes up to four times less power.
It's important to note that the new device supports AES for bulk encryption and elliptic-curve digital signatures (ECDSA) for authentication, so it can provide stronger security for connected devices. In addition, it has high resistance to soft errors, which makes the device suitable for aerospace applications.
By examining these newly released FPGAs from Xilinx, Intel, and Lattice Semiconductor, we get a clearer picture of how FPGAs are evolving: toward higher memory bandwidth, AI optimization, lower power consumption, and smaller size.