Meta’s Supercomputer Zeros in on Training AI Models for Areas Like Computer Vision

Packing thousands of NVIDIA GPUs, Meta's newly unveiled AI Research SuperCluster (RSC) supercomputer could train the large AI models needed for new AI-based applications.

News February 12, 2022 by Darshil Patel

Meta, formerly known as Facebook, announced last year that it would focus on the "Metaverse," a shared virtual environment.

As part of the Meta Research program, engineers are developing hardware and software that are immersive and social and that deepen people's connections.

Some of the research areas that Meta is investing in are:

AR (augmented reality)/VR (virtual reality)
Artificial Intelligence (AI)
Blockchain and cryptocurrency
Computer vision
Machine learning

These advanced technologies often require powerful computers capable of performing quadrillions of operations per second.

To help meet the computing requirements of its research, Meta recently announced that it has designed and built an AI Research SuperCluster (RSC).

Meta's RSC

Using the RSC, researchers at Meta can train the large models needed to develop AI for technologies like natural language processing, computer vision, and speech recognition.

This article will look into the need for AI and supercomputers and then dive into Meta's RSC supercomputer.

Supercomputers Further AI Applications

The extensive use of AI and AI-based applications has significantly increased the demand for supercomputers.

AI models are increasing in complexity as they take on next-generation technology challenges. Training them requires massive computational power and scalability, and training matters because learning is the real power of AI: a model is only as reliable as the training it has been given.

Overall, supercomputers can speed up the systems that train AI models. With more speed and capacity, AI models can be trained faster and on larger, more detailed, and more focused datasets.

Applications like computer vision require a system that can process a lot of media with high data sampling rates. Other applications like natural language processing (NLP) require understanding various languages, dialects, and accents. Supercomputers can help accomplish tasks like these in the real world.

Not only would a supercomputer help Meta with its future ventures with AR/VR and AI in general, but it could also help Meta engineers develop various models. For example, they could create models that can identify harmful content on social media websites and pave the road for embodied AI and multimodal AI to help improve user experience.

With that in mind, let's take a look at Meta's RSC.

What is the AI Research SuperCluster?

The RSC will help researchers build new and better AI models capable of learning from trillions of examples, whether images, text, or other media. Meta claims it is among the world's fastest AI supercomputers.

Generally, supercomputers are built by integrating multiple graphics processing units (GPUs) into compute nodes, which are then connected by high-performance and high-speed data lines that allow fast communication between nodes.

Phase 1 of Meta's RSC.

The RSC consists of 760 NVIDIA DGX A100 systems as compute nodes, for a total of 6080 GPUs.
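The GPU total follows from the node count, since each DGX A100 system houses eight A100 GPUs. A minimal sketch of that arithmetic; the function and constant names are illustrative and do not come from Meta:

```python
# Phase-one node count quoted in the article.
DGX_NODES = 760
# Each NVIDIA DGX A100 system contains eight A100 GPUs.
GPUS_PER_DGX_A100 = 8


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_DGX_A100) -> int:
    """Return the number of GPUs in a cluster of identical nodes."""
    return nodes * gpus_per_node


print(total_gpus(DGX_NODES))  # 6080
```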

The NVIDIA DGX A100 is a high-performance system that NVIDIA describes as suitable for all kinds of AI workloads. It is built around one of the most advanced accelerators available, the NVIDIA A100 Tensor Core GPU, which is said to provide up to three times higher throughput for AI training and 83% higher throughput than a CPU.

NVIDIA's A100 tensor core GPU. Image used courtesy of NVIDIA

Additionally, this GPU uses the NVIDIA Ampere architecture, which NVIDIA says delivers up to twenty times the performance of the prior generation.

Each DGX compute node communicates via an NVIDIA 1600 Gb/s InfiniBand fabric with no oversubscription (a condition in which the bandwidth that endpoints can demand exceeds the capacity the network can carry between them).
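In a network fabric, oversubscription is usually expressed as the ratio of the bandwidth a switch offers to its endpoints to the bandwidth it has toward the rest of the network; a ratio of 1 means no oversubscription. The sketch below illustrates the idea only. The port counts and the 200 Gb/s link speed are hypothetical and do not describe the RSC's actual topology:

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of endpoint-facing bandwidth to network-facing bandwidth."""
    if uplink_gbps <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return downlink_gbps / uplink_gbps


LINK_GBPS = 200  # hypothetical per-port speed

# Hypothetical switch with as many uplinks as downlinks: no oversubscription.
non_blocking = oversubscription_ratio(20 * LINK_GBPS, 20 * LINK_GBPS)

# Hypothetical switch with three times more downlinks than uplinks.
oversubscribed = oversubscription_ratio(30 * LINK_GBPS, 10 * LINK_GBPS)

print(non_blocking)    # 1.0
print(oversubscribed)  # 3.0
```

A fabric with a ratio of 1 at every level can, in principle, let all nodes communicate at full speed at the same time, which matters when thousands of GPUs exchange data during training.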

Moreover, when the RSC is completed, it will have over 16000 GPUs as endpoints.

For any data-center solution, there are two types of data storage systems that enable accelerated computing: one optimized to store data and the other optimized to deliver it.

Flash storage solutions that implement this configuration are faster than traditional storage. The RSC's storage comprises 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage, and 10 petabytes of Pure Storage FlashBlade.
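To put the three storage tiers in proportion, the quoted capacities can be summed and compared. This is a simple sketch using only the figures above; the dictionary keys are illustrative labels:

```python
# Capacities in petabytes, as quoted in the article.
storage_pb = {
    "bulk flash": 175,
    "cache": 46,
    "fast flash": 10,
}

total_pb = sum(storage_pb.values())
print(f"total: {total_pb} PB")  # total: 231 PB

# Share of the total held by each tier.
for tier, capacity in storage_pb.items():
    print(f"{tier}: {capacity / total_pb:.1%}")
```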

Meta's researchers also built in safety and privacy controls so that AI models can be trained on encrypted data: the data path from storage to the GPUs is end-to-end encrypted, and the system includes tools for verifying that protection.

Moreover, as the data is decrypted only at one endpoint, information is preserved safely even in the event of a physical breach of the facility.

Future Directions for RSC

The RSC is running today but is still under development. Phase two of the project will increase the number of GPUs to 16000 and expand the InfiniBand fabric to 16000 ports.
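The scale of that expansion can be estimated from the figures quoted in this article. The sketch below assumes that phase two keeps using eight-GPU nodes, which the article does not state:

```python
import math

PHASE_ONE_GPUS = 6080
PHASE_TWO_GPUS = 16000
GPUS_PER_NODE = 8  # assumption: phase two keeps eight-GPU nodes

# How much larger the GPU count becomes.
growth = PHASE_TWO_GPUS / PHASE_ONE_GPUS

# Number of eight-GPU nodes needed to host the phase-two GPU count.
nodes_needed = math.ceil(PHASE_TWO_GPUS / GPUS_PER_NODE)

print(f"{growth:.2f}x")  # 2.63x
print(nodes_needed)      # 2000
```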

Moreover, researchers plan to increase the storage delivery bandwidth to 16 TB/s and to reach exabyte-scale storage capacity.

All in all, the Meta researchers state that phase two of the project will enable more accurate AI models and improve user experiences. With this supercomputer, they hope to develop next-generation AI infrastructure and design foundational technologies that advance the broader AI community.

All images used courtesy of Meta