TinyML Summit: Enhancing NPUs With Digital In-memory Computing

May 01, 2023 by Jake Hertz

At the TinyML Summit 2023, STMicroelectronics made a presentation on the benefits of in-memory computing, particularly for neural processing units.

TinyML adapts machine learning (ML) models to run on small, resource-constrained microcontrollers. Because these devices must run at low power for long periods of time, they require extremely efficient hardware and optimized software. Each year the TinyML Foundation hosts the TinyML Summit, a gathering of industry leaders to discuss the current state and future of the field.

Recently, the sessions from TinyML Summit 2023 were made available for public viewing. One of these sessions came from STMicroelectronics, where Danilo Pau, Technical Director at ST, gave a talk titled "Enhancing neural processing units (NPUs) with digital in-memory computing."


ST's dataflow NPU template. Screenshot courtesy of STMicroelectronics


In this article, we’ll talk about the merits of the in-memory computing architecture and the key benefits of this computing style for NPUs, according to ST.


The TinyML Tradeoff: Complex Computation vs. Power

Some of the most well-known ML models run in massive data centers with vast computing and memory resources. Bringing that same ML functionality to a small, battery-powered edge device presents a significant design challenge.

The tradeoff in TinyML hardware lies in delivering the computational power that ML models need while keeping power consumption to a minimum. As Pau explains in his talk, “On the one hand, we would like to achieve very high computational power, but on the other side we would like to consume nearly zero power and manufacture the chip as cheaply as possible.”


The energy cost of data movement. Image courtesy of Shi et al.


ML computation consumes large amounts of energy on conventional computing architectures because of the volume of data involved. With potentially billions of parameters in an ML model, running it requires moving immense amounts of data between memory and the processing core.

This data movement is a fundamental limitation of classical von Neumann architectures and results in inefficiencies for ML computation.
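
To make the cost of data movement concrete, here is a minimal Python sketch of a first-order energy model. The function name, the workload size, and the per-operation and per-byte energy values are all placeholders chosen for illustration. They are not measurements from ST's talk or from Shi et al.

```python
def inference_energy_pj(num_macs, bytes_moved, pj_per_mac, pj_per_byte):
    """First-order estimate: returns (compute energy, data-movement energy) in pJ."""
    compute = num_macs * pj_per_mac
    movement = bytes_moved * pj_per_byte
    return compute, movement

# Hypothetical layer: 1 million multiply-accumulates (MACs), each fetching
# one 4-byte weight from memory. The energy costs below are illustrative
# placeholders, not measured values.
compute_pj, movement_pj = inference_energy_pj(
    num_macs=1_000_000,
    bytes_moved=4_000_000,
    pj_per_mac=1.0,
    pj_per_byte=25.0,
)
movement_share = movement_pj / (compute_pj + movement_pj)
print(f"data movement share of energy: {movement_share:.0%}")
```

Under these assumed costs, nearly all of the energy goes to moving data rather than computing on it. The real split depends on the memory technology and the process node, but the structure of the estimate stays the same.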


In-memory Computing Means Less Data Movement—And Power

To sidestep the fundamental limitations of the von Neumann architecture, STMicroelectronics is turning to in-memory computing. In-memory computing changes the paradigm from one where data storage and computation are spatially separated to one where they happen in the same place.

Pau explains, “The transfer of data in and out memory to any hardware accelerator is a very careful point of design because, while the hardware accelerator can offer very high computational power, it can ultimately be limited by the efficiency of the transaction with the memory. We need to beat the memory wall.” 

By computing in memory, a design avoids much of the energy cost of data movement, which makes for more efficient TinyML hardware.


In-memory compute removes the von Neumann bottleneck. Image courtesy of Coluccio et al.


Beyond in-memory computing, STMicroelectronics is also lowering computational precision to increase energy efficiency. This is the idea behind quantization in TinyML: by lowering data precision, say from 32-bit floating point (FP) to 8-bit integer (INT), ML models store and move a quarter as many bits per value. This lowers power consumption, usually without a large impact on model accuracy.
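
As a rough illustration of the FP-to-INT step, the Python sketch below applies symmetric per-tensor quantization to a handful of weights. This is one common textbook scheme, not necessarily the one ST uses, and the function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]

# Hypothetical example weights, chosen only for illustration.
weights = [0.5, -1.27, 0.0, 0.9]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight now fits in one byte instead of four, and the rounding error per weight is bounded by half the scale factor.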

On this, Pau said, “Imagine that we could design binary neural networks, within 1-bit weights and 1-bit activation. We’d unlock systems that are very efficient, low power, low complex, and have high parallelism—that would be fantastic.”
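
To see why 1-bit weights and activations are so hardware-friendly, consider a minimal Python sketch of a binary dot product. It uses the standard XNOR-and-popcount formulation found in the binary neural network literature. The encoding (+1 as bit 1, -1 as bit 0), the function names, and the sample vectors are assumptions made for illustration, not details from ST's design.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, w_word, n):
    """Dot product of two n-element +1/-1 vectors stored as packed bits."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ w_word) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches

# Hypothetical 4-element activation and weight vectors.
activations = [1, -1, 1, 1]
weights = [1, 1, -1, 1]
result = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
reference = sum(a * w for a, w in zip(activations, weights))
print(result, reference)
```

The multiply-accumulate collapses into bitwise logic and a bit count, which is the kind of operation that maps naturally onto many parallel memory cells.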


ST Presents Low-power NPU for TinyML

Working off these ideas, STMicroelectronics is developing a new, experimental low-power NPU for TinyML.

The chip is built around a 600 MHz Arm Cortex-M4 processing core and integrates eight digital in-memory computing (DIMC) SRAM tiles. The system also includes four CNN accelerators, tensor-wise direct memory access (DMA) engines that handle memory transfers, and 51 kB of shared SRAM.


Block diagram of ST’s in-memory compute NPU. Screenshot courtesy of STMicroelectronics


Built on a 40 nm node, the DIMC tiles perform binary in-memory computation, yielding a dramatic increase in the computational efficiency of binary layers. ST validated the system by running a real-time facial detection algorithm, on which the test chip ran with a latency of 3 ms. Importantly, the DIMC subsystem achieved a peak efficiency of 100 TOPS/W for binary computations. Overall, the system achieved an efficiency, measured in TOPS/W, that is 40x higher than that of traditional NPU implementations.
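
For a sense of scale, the short Python sketch below converts an efficiency figure in TOPS/W into energy per operation. The function name is hypothetical. The conversion itself is plain unit arithmetic: 1 TOPS/W equals 10^12 operations per joule.

```python
def energy_per_op_femtojoules(tops_per_watt):
    """Convert an efficiency in TOPS/W into energy per operation in fJ.

    1 TOPS/W = 1e12 operations per joule, and 1 J = 1e15 fJ.
    """
    ops_per_joule = tops_per_watt * 1e12
    return 1e15 / ops_per_joule

# Peak DIMC efficiency for binary computations, as quoted in the article.
dimc_peak_fj = energy_per_op_femtojoules(100)
print(f"{dimc_peak_fj:.1f} fJ per binary operation")
```

At the quoted peak of 100 TOPS/W, each binary operation costs on the order of 10 fJ. Since this is a peak figure for the DIMC subsystem alone, sustained whole-chip numbers would differ.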

With this, ST has demonstrated the benefits of both in-memory compute and precision reduction. By improving NPU power efficiency by 40x, ST’s work could have significant implications for the future of TinyML hardware.