Introducing the Alveo U55C: Xilinx’s Answer To Modern HPC Data Center Demands
Distributed high-performance computing (HPC) is a cornerstone technology for climate forecasting, signal processing, and other data-intensive applications. To meet that demand, Xilinx has introduced its Alveo U55C HPC accelerator.
In advance of SC21, the international conference for HPC, our team sat down with members of the Xilinx Data Center Group (DCG), including Nathan Chang, HPC product manager at Xilinx DCG.
Our goal was to understand the Xilinx DCG team's moves in the HPC cluster space and get a better view of the exciting capabilities and applications of their new Alveo U55C HPC cluster solution.
The Alveo U55C accelerator card from Xilinx
Significant advancements are on offer as Xilinx moves from the Alveo U280 data center accelerator card to the Alveo U55C.
Nathan Chang sums up the core value of the U55C clustering solution this way: "HPC problems, they are not single card or single server problems."
In this article, we look at the hardware changes between these two generations of cards, then turn to two HPC applications for the new card: radio astronomy and the finite element method (FEM).
The Smaller U55C Optimizes Out DDR4
The two most significant hardware changes between the U280 and the U55C are to the card's dimensions and its thermal envelope.
The form factor has been reduced from dual-slot, full-height, ¾-length to single-slot, full-height, half-length. The TDP drops to 150 W (from 225 W), and the card is now specified for purely passive cooling rather than active/passive.
The U55C also doubles the HBM2 (high-bandwidth memory) capacity to 16 GB (from 8 GB) and drops the external DDR4 interface in favor of a second HBM2 chiplet on the FPGA itself.
Advances in Alveo hardware design specifications from U280 to U55C. Image [modified] used courtesy of Xilinx
The PCIe interface has also seen an upgrade and now supports a dual Gen4 x8 configuration in addition to the Gen3 x16 configuration.
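To put the Gen4 x8 and Gen3 x16 options in perspective, here is a rough back-of-the-envelope calculation in Python. The per-lane transfer rates (8 GT/s for Gen3, 16 GT/s for Gen4) and the 128b/130b line encoding come from the PCIe specifications rather than from Xilinx, and the result ignores packet and protocol overhead, so treat it as an upper bound and not a measured figure.

```python
# Approximate one-direction PCIe bandwidth, before protocol overhead.
# Per-lane rates and 128b/130b encoding are taken from the PCIe Gen3/Gen4
# specifications, not from the article.
GT_PER_S = {3: 8.0, 4: 16.0}   # giga-transfers per second, per lane
ENCODING = 128 / 130           # 128b/130b line-encoding efficiency


def link_gbytes_per_s(gen: int, lanes: int) -> float:
    """Raw usable bandwidth of one link in GB/s (one direction)."""
    return GT_PER_S[gen] * lanes * ENCODING / 8  # bits -> bytes


gen3_x16 = link_gbytes_per_s(3, 16)
gen4_x8 = link_gbytes_per_s(4, 8)

print(f"Gen3 x16:    {gen3_x16:.2f} GB/s")
print(f"Gen4 x8:     {gen4_x8:.2f} GB/s")
print(f"2 x Gen4 x8: {2 * gen4_x8:.2f} GB/s")
```

A single Gen4 x8 link works out to the same raw bandwidth as Gen3 x16, so any aggregate gain would come from driving both Gen4 x8 links at once, which depends on host support and is an assumption here.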
Changing Hardware Advances HPC Scalability
The immediate effect of these changes is said to allow for more parallelism of data pipelines, superior memory management, optimized data movement, and the 'best' performance-per-watt.
Chang explains, "[you] can actually manipulate the data in transit [with an FPGA] so that you don't have to read and write [data] as often."
From an infrastructure point of view, the move to a single-slot card allows for an immediate increase in rack density (provided the extra heat can be dissipated), potentially doubling the distributed compute capacity of any given U280 solution.
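The density argument, and its thermal caveat, can be sketched with the slot widths and TDP figures quoted earlier. The 8-slot server below is a hypothetical example, not a configuration Xilinx describes, and the calculation counts card TDP only.

```python
# Cards per server for a given number of PCIe slots, using the slot widths
# and TDP figures quoted in the article. The 8-slot server is hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    slot_width: int   # PCIe slots occupied by one card
    tdp_watts: int    # thermal design power per card


U280 = Card("Alveo U280", slot_width=2, tdp_watts=225)
U55C = Card("Alveo U55C", slot_width=1, tdp_watts=150)


def fill_server(card: Card, free_slots: int) -> tuple:
    """Return (number of cards that fit, their combined TDP in watts)."""
    count = free_slots // card.slot_width
    return count, count * card.tdp_watts


SLOTS = 8  # hypothetical server with eight free slots
for card in (U280, U55C):
    count, watts = fill_server(card, SLOTS)
    print(f"{card.name}: {count} cards, {watts} W")
```

Under these assumptions the card count doubles from four to eight while the combined card TDP rises from 900 W to 1,200 W, which is why the doubling is conditional on cooling.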
What kind of HPC scale is the U55C capable of achieving? Long-term, it could help process more data than global internet traffic today (more than 300 petabytes of data per year).
Real-Time Data Processing For CSIRO’s SKA
SKA-Low is the Australian portion of the Square Kilometre Array (SKA), a larger system that also includes a large dish array in South Africa. It comprises 131,072 'Christmas-tree'-shaped antennas that operate at frequencies between 50 MHz and 350 MHz.
An artist's vision of the future of SKA-Mid (Africa) and SKA-Low (Australia). Image used courtesy of CSIRO
The distributed signal processing capability of the U55C cluster for SKA-Low includes 21 nodes and 420 Alveo U55C cards handling more than 15 Tb/s, using only 50% of the FPGA fabric and HBM capacity.
Most incredibly, due to the remote nature of the site, the entire system is solar-powered, with each card consuming only 90 W.
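Breaking the SKA-Low cluster figures down per node and per card gives a feel for the scale. This sketch uses only the numbers quoted above; because the throughput is given as "more than 15 Tb/s", the bandwidth results are lower bounds, and the power total covers the cards alone, not the host servers or networking.

```python
# Per-node and per-card breakdown of the SKA-Low cluster figures.
NODES = 21
CARDS = 420
AGGREGATE_TBPS = 15     # quoted as "more than 15 Tb/s": a lower bound
WATTS_PER_CARD = 90

cards_per_node = CARDS // NODES
gbps_per_card = AGGREGATE_TBPS * 1000 / CARDS
gbps_per_node = AGGREGATE_TBPS * 1000 / NODES
total_card_kw = CARDS * WATTS_PER_CARD / 1000   # cards only

print(f"Cards per node:   {cards_per_node}")
print(f"Per-card traffic: at least {gbps_per_card:.1f} Gb/s")
print(f"Per-node traffic: at least {gbps_per_node:.1f} Gb/s")
print(f"Card power total: {total_card_kw:.1f} kW")
```

That is 20 cards per node, roughly 36 Gb/s per card, and just under 38 kW for the accelerator cards.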
FEM, Real-Time Graphs & Cloud FPGAaaS
Xilinx shows that the U55C can run the finite element method on 700k elements or provide real-time insights by graphing big data. Both use cases demonstrate the versatility of the Vitis Core development kit, along with its APIs, high-level synthesis capabilities, and integration of external frameworks.
LS-DYNA is a FEM program designed to simulate real-world performance, especially crash-test dynamics. ANSYS, which owns LS-DYNA, spoke to Xilinx because it was looking for "a 2 to 3x improvement" in performance over CPUs.
Chang explains that "we got 5x on our first try, and that created a lot of interest."
He explains that they achieved this metric by pipelining the data and optimizing the queries to a sparse matrix, which resulted in getting more relevant results per clock cycle.
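For readers unfamiliar with sparse matrices, the sketch below shows the general idea in plain Python using the common compressed sparse row (CSR) layout. It is a generic illustration, not the LS-DYNA or Xilinx implementation; the matrix, the function names, and the choice of CSR are all assumptions made for the example.

```python
# Sparse matrix-vector product in CSR form: only nonzero entries are stored
# and multiplied, so no work is spent on zeros. Generic illustration only.

def to_csr(dense):
    """Convert a dense row-major matrix to (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Small made-up matrix: 6 nonzeros out of 16 entries.
A = [[4, 0, 0, 1],
     [0, 3, 0, 0],
     [0, 0, 2, 0],
     [1, 0, 0, 5]]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, x)
print(f"stored {len(values)} of {len(A) * len(A[0])} entries")
print(y)
```

FEM stiffness matrices are typically far sparser than this toy example, so skipping zeros saves proportionally more work; the inner loop is also the kind of regular, streamable access pattern that suits a pipelined design.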
Further diversifying, Xilinx showed the U55C's applicability to the big data industry. Xilinx has partnered with TigerGraph to accelerate disparate databases and transform them into graphs that help data scientists find meaning in data. Focusing on the relationships between datasets is said to be key to optimizing recommendation engines.
"Nobody wants to wait on their recommendation," says Chang. "Facebook and Amazon don't want you to wait either."
Chang notes that attention spans are short and very valuable. To that end, Xilinx took the two most prolific algorithms that drive recommendation engines, "and with clustering, we accelerated them on the U55C."
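The article does not name the two algorithms, so the following is only an illustration of the kind of computation a recommendation engine performs. It uses cosine similarity, a widely used similarity measure chosen here as an assumption, on a made-up ratings table; none of the names or data come from Xilinx or TigerGraph.

```python
# Cosine similarity between user rating vectors, as one common building
# block of recommendation engines. Data and names are hypothetical.
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical ratings: one vector of item scores per user.
ratings = {
    "alice": [5, 3, 0, 1],
    "bob":   [4, 0, 0, 1],
    "carol": [0, 0, 5, 4],
}


def most_similar(user):
    """Return the other user whose ratings are closest to `user`."""
    scores = {
        other: cosine_similarity(ratings[user], vec)
        for other, vec in ratings.items()
        if other != user
    }
    return max(scores, key=scores.get)


print(most_similar("alice"))
```

Every pairwise comparison is independent of the others, which is the sort of workload that can be spread across many cards in a cluster.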
The U55C is available now for engineers looking to get started. In the near future, Xilinx plans to offer FPGAs as a service, with Xilinx store access and fixed, managed server configurations. Today, it offers on-site co-location services for partner and customer evaluations.
All images courtesy of Xilinx, unless otherwise noted.