TinyML Strikes Out to Improve Memory Performance and Ease FPGA Design

January 31, 2022 by Jake Hertz

As the world of machine learning (ML) grows and finds its use in more applications, researchers and companies are finding ways to leverage it to improve hardware performance and design.

It's no secret that ML is continually advancing in terms of technology, scope, and applications. Adding to ML's spread, a subset of ML called TinyML has seen a particular spike in interest over the past couple of years.



A high-level overview of TinyML. Image used courtesy of the TinyML Foundation


The past couple of months alone have seen significant advancements in both academia and industry.

This article will cover some of the interesting recent headlines and research in TinyML to understand the state and direction of the field today and how it is starting to gain momentum.


Edge Impulse Raises Funds for Future ML Efforts

The first in a series of TinyML news announcements came back in December when Edge Impulse announced a new round of funding worth $34M.

Edge Impulse can best be described as a platform for developing, analyzing, and deploying ML on embedded/IoT (Internet of Things) devices. The platform leverages the TensorFlow ecosystem and was originally designed for engineers without ML expertise, with the aim of democratizing ML and bringing it to the edge.

Edge Impulse offers engineers an end-to-end online platform to collect data, train models, test machine learning performance, and communicate with devices in the field. Since its inception, over 30k users have deployed 50k custom TinyML projects using Edge Impulse.

By securing another $34M in funding, Edge Impulse has underscored that TinyML is a growing field with high commercial potential. According to the company, it will use the new funding to expand its ecosystem, form new partnerships, and further build out its platform.

As companies push to make ML more accessible to engineers, researchers are finding new ways to bridge the gap between ML and hardware.


MIT Improves Memory in Constrained Devices Using TinyML

Shifting over to academia, researchers at MIT made headlines earlier in December with a new technique for improving TinyML performance.

A major challenge in developing TinyML applications is deploying extremely memory-intensive ML models to memory-constrained microcontroller units (MCUs) and embedded devices.

To address this challenge, MIT researchers designed a new inference technique and a new convolutional neural network (CNN) architecture, the results of which reduced peak memory usage by up to 8x while improving accuracy on computer-vision detection applications. 



Using a per-patch computation technique, the researchers significantly reduced peak memory usage. Image used courtesy of Lin et al


As described in their paper, the researchers developed the new technique after observing an imbalance in memory utilization during CNN inference: memory usage was significantly higher in the first handful of convolutional layers than in the rest of the network.

To counteract this imbalance, the new technique and architecture use a generic patch-by-patch inference scheduling that operates on only a small spatial region of the feature map at a time, cutting down peak memory.

By processing roughly 25% of a layer's feature map at a time, the researchers achieved significant reductions in peak memory usage.
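To make the idea concrete, here is a minimal NumPy sketch of patch-by-patch inference for two stacked convolutional layers. This is an illustration in the spirit of the technique, not the authors' code: the function names (`conv_valid`, `two_layer_patched`), the single-channel setup, and the patch size are all hypothetical. The point is that the patched version only ever materializes a patch-sized intermediate activation, rather than the full first-layer feature map.

```python
import numpy as np

def conv_valid(x, k):
    # Naive single-channel "valid" 2D convolution (no padding).
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i+kh, j:j+kw] * k).sum()
    return out

def two_layer_whole(x, k1, k2):
    # Conventional layer-by-layer execution: the full intermediate
    # feature map of layer 1 must be held in memory at once.
    return conv_valid(conv_valid(x, k1), k2)

def two_layer_patched(x, k1, k2, patch=8):
    # Patch-by-patch execution: each output patch is computed from an
    # input patch plus a halo covering the two layers' receptive-field
    # growth, so only a patch-sized intermediate is ever alive.
    halo = (k1.shape[0] - 1) + (k2.shape[0] - 1)
    H, W = x.shape[0] - halo, x.shape[1] - halo
    out = np.empty((H, W))
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            ph, pw = min(patch, H - i), min(patch, W - j)
            tile = x[i:i+ph+halo, j:j+pw+halo]   # input patch + halo
            out[i:i+ph, j:j+pw] = conv_valid(conv_valid(tile, k1), k2)
    return out
```

Both functions produce identical results; the trade-off (as the paper notes) is that the halo regions are recomputed for neighboring patches, exchanging some extra computation for a much lower peak memory footprint.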

When incorporated into MCUNetV2, the new technique set an accuracy record of ~72% on 1000-class ImageNet image classification while requiring only 465 kB of memory.

Altogether, the researchers hope that this innovation will improve performance on even more constrained devices in the future.


CFU Playground: TinyML for FPGAs

The final TinyML headline to be covered came in early January from a joint research paper including researchers from Google, Harvard, and Purdue University.

The paper introduced a full-stack open-source framework for developing FPGA-based (field-programmable gate array) TinyML accelerators for embedded systems. 

The new framework, called CFU Playground, integrates open-source software, register transfer level (RTL) generators, and FPGA tools for synthesis, place, and route. 

The goal of CFU Playground is to abstract away infrastructure details so that the user can focus on the essential design aspects: defining custom processor instructions, exploiting those instructions during execution, and measuring the results.



CFU Playground vs competing frameworks. Image used courtesy of Prakash et al


Overall, the hardware for CFU Playground is based around the Xilinx Artix-7 35T FPGA, a device with 33,000 logic cells, 90 DSP slices, and 50 36-Kbit block RAMs.

The user can take an existing TensorFlow Lite for Microcontrollers (TFLite Micro) model, perform inference on the Artix-7, obtain cycle counts per layer, and use this information to modify code, instructions, and operations to quickly improve the model's performance.
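As a toy illustration of the kind of custom function unit a user might prototype in this flow, here is a pure-Python software model of a SIMD multiply-accumulate instruction, the sort of operation that dominates quantized convolution inner loops. This is a hypothetical sketch, not CFU Playground's API: in practice the operation would be written as gateware and invoked from C, but the bit-level behavior modeled here is the same idea.

```python
def simd_macc(rs1: int, rs2: int, acc: int = 0) -> int:
    """Software model of a hypothetical SIMD multiply-accumulate CFU.

    Treats each 32-bit operand as four packed int8 lanes, multiplies
    lane-wise, and adds the four products to the accumulator.
    """
    for lane in range(4):
        a = (rs1 >> (8 * lane)) & 0xFF
        b = (rs2 >> (8 * lane)) & 0xFF
        # Sign-extend each int8 lane.
        a = a - 256 if a >= 128 else a
        b = b - 256 if b >= 128 else b
        acc += a * b
    return acc
```

A single such instruction replaces four loads, four multiplies, and four adds per invocation; profiling per-layer cycle counts, as the framework enables, is how a user would decide whether an operation like this is worth moving into the FPGA fabric.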


TinyML Picks Up Speed

Between academia and industry, there is clearly no shortage of innovation and investment in TinyML. 

As 2022 kicks off, we can expect more innovation as the year goes on. It will be interesting to watch where ML and TinyML continue to meet hardware and device design.