A Circuit-level Assessment of Dojo, Tesla’s Proposed Supercomputer
The star of this year's Tesla AI Day was a newly announced supercomputer, Dojo. But just how remarkable is this project from a design perspective?
One of the highlights of this year's Tesla AI Day was the announcement of the company's in-house AI training computer, a supercomputer called Dojo.
Dojo is based on a custom computing chip, the D1 chip, which is the building block of a large multi-chip module (MCM)-based compute plane. These MCMs will be tiled to create the final supercomputer used for training autonomous driving AI networks.
While a full assessment of such a huge multi-disciplinary project is beyond the scope of a single news piece, here are a few highlights from a circuit design perspective, specifically at the MCM level.
The Smallest Unit of Scale in Dojo
The smallest entity of scale used in Tesla’s proposed supercomputer is called a training node. The block diagram of a training node is shown below.
A training node is the smallest compute element of the Dojo.
The training node is a 64-bit CPU optimized for machine learning workloads. It features matrix multiplication units and SIMD (single-instruction, multiple-data) instructions, and it incorporates 1.25 MB of fast, ECC-protected SRAM.
Although this is the smallest compute element used in Dojo, it is capable of more than 1 TFLOPS of compute. The physical size of the training node is chosen based on the furthest distance a signal can travel in one cycle of the target clock frequency, about 2 GHz in Tesla's design.
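To see roughly what that sizing rule implies, here is a quick back-of-envelope sketch. Note that the speed-of-light figure is only an absolute upper bound, and the ~20 percent effective on-chip velocity assumed below is an illustrative figure for buffered global wires, not a Tesla-published number.

```python
# Rough estimate of how far a signal can travel in one clock cycle,
# which bounds the physical dimensions of a training node.

C = 3.0e8        # speed of light in vacuum, m/s
F_CLK = 2.0e9    # target clock frequency, Hz (from Tesla's talk)

cycle_time = 1.0 / F_CLK                   # 0.5 ns per cycle
free_space_reach = C * cycle_time * 1e3    # mm, absolute upper bound

# RC-limited on-chip interconnect is far slower than c; assume an
# effective velocity of ~20% of c (illustrative assumption only).
effective_reach = 0.2 * C * cycle_time * 1e3   # mm

print(f"cycle time:              {cycle_time * 1e9:.2f} ns")
print(f"free-space upper bound:  {free_space_reach:.0f} mm")
print(f"assumed on-chip reach:   {effective_reach:.0f} mm")
```

Even under these crude assumptions, the reachable distance per cycle shrinks to tens of millimeters, which is why the training node must stay physically small.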
The training node has a modular design. A larger compute plane can be created by employing an array of these training nodes.
The D1 Chip
The D1 chip is built from an array of 354 training nodes, which together deliver 362 TFLOPS of machine learning compute.
The D1 chip consists of 354 training nodes.
The bandwidth for communication between the training nodes (i.e., the on-chip bandwidth of the D1) is 10 TB/s. The chip incorporates 576 high-speed, low-power SerDes units to support an IO bandwidth of 4 TB/s per edge. This IO bandwidth is one of the most important features of the D1 chip.
According to Tesla, the IO bandwidth of the D1 is about twice that of state-of-the-art networking switch chips. The following graph compares the IO bandwidth versus teraflops of compute for the new chip and an unspecified comparable solution.
The IO bandwidth versus teraflops of compute for some high-performance ML solutions.
As far as the basics go, the D1 is manufactured in 7 nm technology and occupies an area of 645 mm². The thermal design power (TDP) of the chip is 400 W.
The D1 chip offers interesting features, such as its high IO bandwidth, and no doubt a lot of effort went into creating it. Up to this point, however, one of the real challenges of the project remains: connecting a large number of D1 chips together to create a supercomputer with maximized bandwidth and minimized latency.
With a normal IC design flow, the D1 dies would be tested at the wafer level, singulated, and packaged. These packaged chips would then be soldered to a PCB to create a larger system. In that case, however, communication between the chips would occur through the chips' IOs and PCB traces, and this is exactly where the design would run into lower bandwidth and increased latency.
Chip-to-Chip Communication: A Serious Challenge
Packages connect a die to the rest of the system, but they do so in a very inefficient way. On-chip interconnect pitches are a few micrometers, while BGA pitches are 400–600 µm and board trace pitches are typically in the 50–200 µm regime. These large off-chip pitches limit the number of IOs a package can have.
Besides, only a limited number of a chip's bumps are allocated to IOs; in a processor with 10,000 bumps, for example, only 1,000 might be IOs. Since package IOs are limited, we cannot have fully parallel communication between two packaged dies. Instead, signals must be serialized, transmitted, and then deserialized by means of SerDes units. In a typical processor chip, SerDes circuits take up a significant area (about 25 percent of the die) and burn considerable power (about 30 percent of the total).
Communication between a processor and off-chip memory faces similar challenges. Moreover, IO circuitry adds to the signal path delay and increases system latency. As you can see, packages adversely affect the design in several different ways. If we could connect dies to each other without packaging them, we could achieve more parallel communication (i.e., higher bandwidth) while reducing latency, area, and power consumption.
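The pitch numbers above translate directly into connection density, which is where the inefficiency becomes obvious. The sketch below uses the representative pitches quoted earlier and assumes a simple square grid of connections:

```python
# Achievable connection density at on-chip, board, and package-level
# pitches, illustrating why packages throttle parallel die-to-die
# communication. Pitches are representative values, not exact specs.

PITCHES_UM = {
    "on-chip interconnect": 2,    # "a few micrometers"
    "board trace": 100,           # 50-200 um regime
    "BGA ball": 500,              # 400-600 um regime
}

def connections_per_mm2(pitch_um: float) -> float:
    """Connections per mm^2 for a square grid with the given pitch."""
    pitch_mm = pitch_um / 1000.0
    return (1.0 / pitch_mm) ** 2

for name, pitch in PITCHES_UM.items():
    density = connections_per_mm2(pitch)
    print(f"{name:22s} {pitch:4d} um -> {density:10.0f} connections/mm^2")
```

Going from a ~2 µm on-chip pitch to a ~500 µm BGA pitch costs roughly four to five orders of magnitude in connection density, which is why serialization through SerDes becomes unavoidable for packaged parts.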
Multi-chip Module (MCM) Assembly
One method to combat these IO issues is the multi-chip module technique, in which multiple dies and/or other discrete components are integrated onto a unifying substrate. Applying this technique, we can implement high-performance processors with maximized die-to-die communication speeds.
Tesla’s Dojo is designed based on this idea; it should be noted, however, that this is not Tesla’s innovation. For instance, NVIDIA has implemented a scalable MCM-based deep neural network accelerator to maximize die-to-die communication speed.
NVIDIA’s MCM-based accelerator. Image used courtesy of R. Venkatesan
Dojo’s Training Tiles: Perhaps the Biggest Organic MCM in the Chip Industry
A training tile is a unit of scale for the Dojo supercomputer. It is an MCM consisting of 25 D1 chips. These D1 chips are tightly integrated using a fan-out wafer process such that the bandwidth between the dies is preserved.
A training tile consists of 25 D1 chips.
What is so special about this MCM? According to Tesla, this is perhaps the biggest organic MCM in the chip industry.
To get a sense of how large this MCM is, consider a typical MCM-based solution such as the NVIDIA processor mentioned above. The NVIDIA MCM occupies an area of about 2256 mm²; in contrast, the Dojo training tile is larger than 25 × 645 mm² (about 16,125 mm²). The Dojo training tile is therefore at least seven times larger than the NVIDIA processor.
The Challenges of a Large MCM
Such a large MCM raises thermal and power delivery concerns. As mentioned, the TDP of the D1 chip is 400 W; with 25 D1 chips packed tightly, the processors alone can burn as much as 10 kW. This doesn't take into account the power dissipated by the voltage regulator modules, which can be significant.
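A quick sanity check of the tile-level figures, using only the per-die numbers quoted in this article (and excluding regulator and cooling overhead):

```python
# Tile-level area and power budget from the per-die figures above.

D1_AREA_MM2 = 645          # D1 die area
D1_TDP_W = 400             # D1 thermal design power
DIES_PER_TILE = 25
NVIDIA_MCM_AREA_MM2 = 2256  # NVIDIA MCM accelerator, for comparison

tile_die_area = DIES_PER_TILE * D1_AREA_MM2         # total silicon area
area_ratio = tile_die_area / NVIDIA_MCM_AREA_MM2    # vs. NVIDIA MCM
tile_power_kw = DIES_PER_TILE * D1_TDP_W / 1000.0   # worst-case, dies only

print(f"tile die area:        {tile_die_area} mm^2 ({area_ratio:.1f}x NVIDIA MCM)")
print(f"processor-only power: {tile_power_kw:.0f} kW")
```

The 10 kW figure is the dies alone at full TDP; as the article notes, regulator losses come on top of that, so the cooling solution must budget for more.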
In a large MCM, the design should be able to safely dissipate such large amounts of power in a relatively small space. Because of the thermal and power delivery concerns, Tesla engineers had to find a new way of applying power to the D1 chips.
Another challenge with such a large MCM is yield: with larger designs, the yield can be lower. The D1 dies are "known-good" chips, meaning they are fully tested before being placed in the MCM. Hence, the wafer's interconnect fabric should be the main yield concern here.
Further, existing CAD tools don't support the design of such a large MCM; even Tesla's computer cluster couldn't handle it. The engineers had to find new ways to address this issue.
Specialized Solutions: A High-bandwidth Connector and Power Supply
In order to preserve the high bandwidth between the tiles, Tesla created a high-density, high-bandwidth connector that surrounds the training tile as shown below.
A Dojo training tile provides 36 TB/s off-tile bandwidth.
The training tile offers 9 PFLOPS of compute and 36 TB/s off-tile bandwidth. To feed the MCM with power, Tesla engineers built custom voltage regulator modules that could be directly reflowed onto the fan-out wafer.
Custom voltage regulator modules directly reflowed onto the fan-out wafer.
This new method of feeding the chips should reduce the number of wafer metal layers required for power distribution, leading to a more cost-effective and compact design. In the next step, the engineers integrated mechanical and thermal pieces to arrive at a so-called fully integrated solution.
The training tile is described as a fully integrated solution.
With the cooling and power supply orthogonal to the compute plane, the engineers created even larger compute planes without losing bandwidth.
The Dojo May Be Operational in 2022
Tesla has yet to put this entire system together. Thus far, only the training tiles, the main building blocks of the Dojo supercomputer, have been implemented. A total of 120 of these training tiles will be arrayed to implement a supercomputer capable of 1.1 EFLOPS.
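Scaling the per-unit figures quoted throughout this article up to the full system (and assuming the per-tile compute simply adds linearly) recovers Tesla's headline number:

```python
# Full-system totals from Tesla's per-unit figures.

TILES = 120            # training tiles in the full Dojo system
D1_PER_TILE = 25       # D1 chips per training tile
NODES_PER_D1 = 354     # training nodes per D1 chip
TILE_PFLOPS = 9        # compute per training tile

d1_chips = TILES * D1_PER_TILE                 # total D1 chips
training_nodes = d1_chips * NODES_PER_D1       # total training nodes
system_eflops = TILES * TILE_PFLOPS / 1000.0   # total compute

print(f"D1 chips:       {d1_chips}")
print(f"training nodes: {training_nodes}")
print(f"compute:        {system_eflops:.2f} EFLOPS")
```

The linear sum lands at 1.08 EFLOPS, consistent with the ~1.1 EFLOPS Tesla quotes, across 3,000 D1 chips and over a million training nodes.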
However, Musk believes that the Dojo supercomputer will be fully operational next year.
What do you think about this project? Do you think the Dojo supercomputer can defeat the existing solutions in terms of bandwidth and latency?
For a more detailed discussion on the challenges cited above, please refer to the following papers.
S. Pal, D. Petrisko, M. Tomei, P. Gupta, S. S. Iyer and R. Kumar, "Architecting Waferscale Processors - A GPU Case Study," 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 250-263, doi: 10.1109/HPCA.2019.00042.
S. -R. Chun et al., "InFO_SoW (System-on-Wafer) for High Performance Computing," 2020 IEEE 70th Electronic Components and Technology Conference (ECTC), 2020, pp. 1-6, doi: 10.1109/ECTC32862.2020.00013.
S. S. Iyer, "Heterogeneous Integration for Performance and Scaling," in IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 6, no. 7, pp. 973-982, July 2016, doi: 10.1109/TCPMT.2015.2511626.
Screenshots used courtesy of Tesla