FPGA Roundup: New Contenders Home in on Memory, Size, Power, and Even AI

July 02, 2020 by Dr. Steve Arar

Three newly released FPGAs can tell us a lot about where these devices are heading in the industry.

The past month has been a busy one for the FPGA market. In this article, we'll briefly examine three recently released FPGAs from Xilinx, Intel, and Lattice Semiconductor.

Each of these devices concentrates on improving a different aspect of performance. The Xilinx VU57P tackles the memory bandwidth bottleneck in demanding applications. The Intel Stratix 10 NX FPGA incorporates AI-optimized DSP blocks to help implement large AI models at low latency. And the Lattice Nexus FPGAs aim to redefine low-power, small-form-factor FPGAs.

What can each of these devices tell us about the direction of FPGAs?

 

The Xilinx VU57P FPGA—High-Bandwidth Memory

Over the last decade, the compute bandwidth demanded by many application areas has grown sharply. For example, the number of DSP slices that a Xilinx FPGA provides for a machine learning application has increased from about 2,000 in the largest Virtex-6 FPGA to about 12,000 in a modern Virtex UltraScale+ device. A similar trend can be seen in other application areas, such as networking and video, as shown below.

 

The requirements of memory bandwidth. Image used courtesy of Xilinx

 

The figure above shows that the memory bandwidth of DDR technology has increased only modestly over the last decade, by a factor of about 2 from DDR3 to DDR4. (It's worth noting that the leap from DDR4 to DDR5 may be more impactful.)

The gap depicted in the figure means that the limited data transfer rate between the FPGA and memory becomes the bottleneck in these applications. To address this issue, designers usually employ several DDR chips in parallel to increase the memory bandwidth, even when the extra memory capacity is not needed. However, this approach becomes impractical above a memory bandwidth of about 85 GB/s because of high power consumption, board area, and cost, as well as PCB design challenges.
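To get a feel for why the parallel-DDR approach runs out of steam, here is a minimal back-of-the-envelope sketch. It assumes DDR4-3200 on a standard 64-bit channel (neither the speed grade nor the channel width comes from the article), and the function names are hypothetical. The peak rate of one channel is simply the transfer rate times the bus width in bytes.

```python
import math


def ddr_channel_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth of one DDR channel in GB/s (decimal units).

    transfer_rate_mts: transfers per second in millions (e.g. 3200 for DDR4-3200).
    bus_width_bits:    data bus width of the channel (64 is an assumption here).
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


def channels_needed(target_gbs: float, per_channel_gbs: float) -> int:
    """Number of parallel channels required to reach a target peak bandwidth."""
    return math.ceil(target_gbs / per_channel_gbs)


# Assumed example: DDR4-3200 with a 64-bit data bus.
per_channel = ddr_channel_bandwidth_gbs(3200)
print(f"One channel: {per_channel:.1f} GB/s")
print(f"Channels for 85 GB/s: {channels_needed(85, per_channel)}")
```

Under these assumptions, reaching the 85 GB/s mark already takes several independent channels, each of which needs its own data, address, and control routing on the PCB. Real sustained bandwidth will also be lower than this theoretical peak.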

A more efficient solution to the memory bandwidth problem is a DRAM-based memory type called high-bandwidth memory (HBM). Here, silicon stacking technologies are used to place the DRAM and the FPGA side by side in the same package, as depicted below.

 

Silicon stacking helps implement DRAM memory and the FPGA side-by-side. Image used courtesy of Xilinx

 

The HBM technology allows us to eliminate the relatively long PCB traces that connect a DDR chip to the FPGA. Employing an integrated HBM interface with a large number of pins leads to a drastically improved memory bandwidth with a latency similar to that of the DDR-based technique.

Xilinx has recently released the VU57P FPGA (from the Virtex UltraScale+ series), which incorporates 16 GB of HBM with a memory bandwidth as high as 460 GB/s. The device employs an integrated AXI port switch that lets us access any HBM memory location from any memory port.
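The 460 GB/s figure is easier to appreciate with a rough calculation. The sketch below assumes two HBM stacks, each with a 1024-bit interface running at 1.8 Gb/s per pin. These parameters are illustrative assumptions chosen to be consistent with the quoted bandwidth, not values taken from the article, and the function name is hypothetical.

```python
def hbm_bandwidth_gbs(stacks: int, bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Theoretical peak HBM bandwidth in GB/s.

    stacks:         number of HBM stacks in the package (assumed).
    bus_width_bits: interface width per stack in bits (assumed).
    pin_rate_gbps:  data rate per pin in Gb/s (assumed).
    """
    total_gbits_per_s = stacks * bus_width_bits * pin_rate_gbps
    return total_gbits_per_s / 8  # convert bits to bytes


# Assumed configuration: 2 stacks x 1024 bits x 1.8 Gb/s per pin.
peak = hbm_bandwidth_gbs(stacks=2, bus_width_bits=1024, pin_rate_gbps=1.8)
print(f"Peak HBM bandwidth: {peak:.1f} GB/s")
```

The takeaway is that the bandwidth comes from a very wide interface rather than a very fast one. A bus with thousands of data lines is practical only because the connections stay inside the package instead of crossing the PCB.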

In addition to power-efficient compute capabilities and the large memory bandwidth discussed above, the VU57P provides high-speed interfaces such as 100G Ethernet with RS-FEC, 150G Interlaken, and PCIe Gen4. The 58G PAM4 transceiver of the new device supports connectivity to the latest optical standards. This can be helpful in different applications such as next-generation firewalls and switches and routers with QoS. 

 

Intel Stratix 10 NX FPGA—AI-Optimized DSP Blocks

Many conventional applications of digital signal processing (DSP) need high-precision arithmetic. That’s why FPGAs commonly have DSP blocks with high-precision multipliers and adders. For example, the XC7A50T (Xilinx) and the 5CGXC4 (Intel) provide 120 and 140 multipliers (18 x 18), respectively.

It turns out that many deep learning applications can be implemented with fewer bits without significantly sacrificing accuracy. A lower-precision approximation reduces both the computational resources and the memory bandwidth required.

Lowering the bit width also saves power, both because lower-precision computations are cheaper and because fewer bits are transferred in each memory transaction. In fact, in many deep learning applications, INT8 or even lower-precision computations can produce acceptable results, according to UC Davis researchers.
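As a rough illustration of what lower precision means in practice, the sketch below applies simple symmetric INT8 quantization to a handful of made-up weights and measures the round-trip error. This is one common scheme, shown under our own assumptions. It is not the method used by any particular vendor tool, and the function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127].

    Returns the integer codes and the scale needed to recover the floats.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


# Made-up example weights (not from any real model).
weights = [0.82, -0.41, 0.05, -1.27, 0.63]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("INT8 codes:", codes)
print("Scale:", scale)
print("Worst-case round-trip error:", max_error)
```

Each weight now occupies 8 bits instead of the 32 bits of a single-precision float, so four times as many values fit in each memory transaction. The rounding error is bounded by half of the scale factor.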

The Intel Stratix 10 NX FPGAs are the first AI-optimized FPGAs from Intel. These devices incorporate arithmetic blocks, called AI Tensor Blocks, that contain a dense array of low-precision multipliers. The base precisions for these blocks are INT8 and INT4, and they also support the FP16 and FP12 numerical formats through shared-exponent support hardware.
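The general idea behind a shared exponent can be sketched as follows: a block of values stores one common exponent plus a small integer mantissa per value, so the multipliers only ever see narrow integers. The code below is a conceptual illustration under our own assumptions (8-bit signed mantissas, power-of-two scaling, hypothetical function names). It does not describe the actual FP16 or FP12 encodings used in the Intel hardware.

```python
import math


def to_shared_exponent(values, mantissa_bits=8):
    """Encode a block of floats as integer mantissas plus one shared exponent.

    Each value is approximated as mantissa * 2**exponent, where the exponent
    is chosen so that the largest magnitude in the block fits the mantissa range.
    """
    limit = 2 ** (mantissa_bits - 1) - 1  # e.g. 127 for 8-bit signed mantissas
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    exponent = math.ceil(math.log2(max_abs / limit))
    step = 2.0 ** exponent
    mantissas = [max(-limit, min(limit, round(v / step))) for v in values]
    return mantissas, exponent


def from_shared_exponent(mantissas, exponent):
    """Decode the block back to floats."""
    return [m * 2.0 ** exponent for m in mantissas]


# Made-up example block.
block = [0.5, -0.25, 0.125, 0.75]
mantissas, exponent = to_shared_exponent(block)
decoded = from_shared_exponent(mantissas, exponent)

print("Mantissas:", mantissas)
print("Shared exponent:", exponent)
print("Decoded:", decoded)
```

Because the exponent is stored once per block rather than once per value, the per-value cost stays close to that of a plain integer. The trade-off is that small values sharing a block with a much larger one lose precision.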

An AI Tensor Block (employed in a Stratix 10 NX FPGA) can increase the INT8 throughput by a factor of 15 as compared with the DSP block of a standard Intel Stratix 10 FPGA. The high-level block diagram of the AI Tensor Block is shown below. 

 

Block diagram of the AI Tensor Block. Image used courtesy of Intel

 

The most distinctive feature of the Intel Stratix 10 NX FPGA is the high compute density provided by its AI-optimized compute blocks. However, the new device incorporates two other features that further help designers implement large AI models at low latency: abundant near-compute memory (integrated HBM) and high-bandwidth networking (PAM4 transceivers at up to 57.8 Gbps).

 

Lattice Nexus—Low-Power, Small Form Factor FPGAs

Lattice Semiconductor has recently released its Certus-NX FPGA family, built on the Lattice Nexus platform, which uses a 28 nm fully depleted silicon-on-insulator (FD-SOI) process technology manufactured by Samsung. FD-SOI is somewhat similar to the conventional CMOS process; however, it allows a programmable bias to be applied to the bulk of the transistors, as conceptually illustrated below.

 

The circuit architecture of the Lattice Nexus platform. Image (modified) used courtesy of Lattice Semiconductor (PDF)

 

A programmable bulk voltage enables significant reductions in chip area and power consumption. The power consumption of the Certus-NX is up to four times lower than that of other FPGAs with a similar number of logic cells.

Thanks to the FD-SOI technology, the new device fits in packages as small as 6 mm x 6 mm and provides up to twice as many I/Os per mm² as similar FPGAs. The following table compares the Certus-NX-40 with similar products from Intel and Xilinx.

 

Comparison of three popular FPGAs for PCIe designs. Image used courtesy of Lattice Semiconductor (PDF)

 

Note that the new device supports AES for bulk encryption and elliptic curve cryptography (ECDSA) for authentication. As a result, it can offer higher security for internet-connected devices. It also exhibits higher immunity to soft errors, which makes it well suited to aerospace applications.

 

How FPGAs are Being Optimized

By examining these recently released FPGAs from Xilinx, Intel, and Lattice Semiconductor, we get a clearer picture of how FPGAs are developing: the focus is on higher memory bandwidth, AI optimization, low power consumption, and small form factors.

 


 

Do you work directly with FPGAs? How have you seen this technology evolve over the years? Share your thoughts in the comments below.