AI Hardware Built from a Software-first Perspective: Groq’s Flexible Silicon Architecture

Groq, a semiconductor startup with software roots, has developed a new processing unit with a unique architecture that offers inference solutions for AI acceleration.

News December 03, 2019 by Majeed Ahmad

Semiconductor industry startups are usually founded by hardware engineers who develop a silicon architecture and then figure out how to map software for that specific hardware.

Here is the tale of a chip startup, founded in the age of artificial intelligence (AI), that has software in its DNA.

Approaching AI Hardware from the Software Angle

Groq was founded in 2016 by a group of software engineers who wanted to solve AI problems from the software side. When they approached the issue without any preconceptions of what an AI architecture may need to look like, they were able to create an architecture that can be mapped to different AI models.

The company is focused on the inference market for data centers and autonomous vehicles, and its first product is a PCIe plug-in card for which Groq designed the ASIC and AI accelerator and developed the software stack.

Part of this hardware is what they've called a TSP or tensor streaming processor. Last month, Groq announced that their TSP architecture is capable of a quadrillion (1,000,000,000,000,000) operations per second.
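To put a quadrillion operations per second in perspective, the short Python sketch below converts that peak rate into a theoretical ceiling on inferences per second. The per-inference operation count is a hypothetical figure chosen for illustration, not a number from Groq, and the calculation assumes perfect utilization, which real workloads rarely reach.

```python
# One quadrillion operations per second (1 PetaOp/s), as quoted above.
PETA_OPS_PER_SECOND = 10**15


def max_inferences_per_second(ops_per_second: int, ops_per_inference: int) -> float:
    """Theoretical ceiling, assuming every operation does useful work."""
    return ops_per_second / ops_per_inference


# Hypothetical workload: 8 billion operations per inference (an assumption).
OPS_PER_INFERENCE = 8 * 10**9

ceiling = max_inferences_per_second(PETA_OPS_PER_SECOND, OPS_PER_INFERENCE)
print(f"ceiling: {ceiling:,.0f} inferences per second")  # ceiling: 125,000 inferences per second
```

Actual throughput depends on memory movement, batch size, and how well the compiler keeps the compute units busy, so this number is an upper bound rather than a prediction.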

Groq’s Tensor Streaming Processor (TSP) shown on the PCIe board currently offered by the Mountain View-based company. Image used courtesy of Groq.

According to Groq, the TSP architecture capitalizes on increased compute efficiency, allowing greater flexibility than current GPUs and CPUs as well as a smaller silicon footprint.

A Unique Silicon Architecture for AI Semiconductor Devices

Besides its software roots, according to chief operating officer Adrian Mendes, what’s also different about Groq is its silicon architecture. The core chip design of Groq’s AI semiconductor device is very unlike the pipelined process commonly used in multi-core GPUs or FPGAs.

From early on, development started with the compiler, so designers could see what different machine learning (ML) models look like and optimize what comes out of them. From there, they could develop a piece of hardware on a highly flexible architecture.

Groq claims that this silicon architecture has three distinct advantages:

Flexibility in AI models
Future-proofing for upcoming AI models via software-based optimization
More information at compile time about a program's demands

With a highly flexible AI architecture, designers don’t have to tailor the hardware to specific neural networks such as ResNet-50 or long short-term memory (LSTM). Instead, they can employ an architecture that is generic and extensible enough to accommodate new models created by the research community. Subsequently, the PetaOp-capable architecture can be optimized for those models without any change to the hardware.

A representation of Groq's software-defined hardware concept for improved compute efficiency. Image from Groq's whitepaper "Tensor Streaming Architecture Delivers Unmatched Performance for Compute-Intensive Workloads"

In other words, it’s a piece of hardware that can accommodate AI models that we haven’t even seen yet, and the optimization can be done in the software stack. Besides flexibility, the AI chip offers high inference throughput and very low latency for different AI models.

The third important feature is that the chip is deterministic down to the cycle count. As Mendes explained, what that means is that when a machine learning engineer takes a program and pushes it through a compiler, he or she will immediately know how long that program is going to run.

So, engineers can understand what their power consumption will be, decide whether they want to optimize for latency or throughput, and see how to change the design for each of these parameters. And they can do this in the time it takes to compile (which is not very long).

Now compare this to an approach where engineers have to run a program on hardware a thousand times to see what the latency is. That sums up the benefit of chip determinism.
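The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not Groq's tooling: the cycle count, the clock frequency, and both function names are assumptions made up for the example. With a cycle count known at compile time, latency is simple arithmetic; without one, it has to be estimated by timing repeated runs.

```python
import statistics
import time
from typing import Callable, Tuple


def latency_from_cycles(cycle_count: int, clock_hz: float) -> float:
    """Latency in seconds when the cycle count is known ahead of time."""
    return cycle_count / clock_hz


def measured_latency(run: Callable[[], object], trials: int = 1000) -> Tuple[float, float]:
    """Empirical alternative: time `run` repeatedly, return (mean, stdev) in seconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical values, for illustration only.
CYCLES = 250_000   # assumed cycle count reported by a compiler
CLOCK_HZ = 1.0e9   # assumed 1 GHz clock

predicted = latency_from_cycles(CYCLES, CLOCK_HZ)
print(f"predicted latency: {predicted * 1e6:.1f} us")

# Stand-in workload on a general-purpose machine; results vary run to run.
mean, spread = measured_latency(lambda: sum(range(10_000)), trials=1000)
print(f"measured latency: {mean * 1e6:.1f} us +/- {spread * 1e6:.1f} us")
```

The first figure is exact under the stated assumptions and available before anything runs. The second is a statistical estimate that changes between runs and only exists after the hardware has been exercised.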

A Google TPU Lineage

If the term "tensor" sounds familiar in the AI hardware context, it may be because Google introduced the tensor processing unit (TPU) as a concept in 2016. This ASIC (application-specific integrated circuit) is designed for AI, allowing resource-hungry AI processing to be done on the cloud.

Google's TPUs have set several milestones for AI acceleration. In 2018, for example, Google showed off their third-generation TPUs by having an AI program call real-world restaurants and hair salons to make appointments on behalf of a user—without the person on the other end of the line ever being able to tell they were speaking to a machine. This project was dubbed Google Duplex.

Groq has benefited from Google's work in a rather direct way: its co-founder and CEO, Jonathan Ross, helped develop Google's TPUs as part of their architecture team.

Jonathan Ross, CEO and co-founder of Groq. Image used courtesy of Groq

Ross took part in Google's research initiatives in developing TPUs, including a 2017 study on the use of TPUs in datacenter applications that focused on the development of architectures for convolutional neural networks (CNNs).

This background has proved foundational for Ross's work with Groq as the company has made strides in silicon architectures for AI acceleration hardware.

What is your experience with AI acceleration? Have you utilized Google's TPUs on the cloud? Share your thoughts on this developing technology in the comments below.