The Role of Last-Level Cache Implementation for SoC Developers

SoC developers face the challenge of meeting the growing memory demands of their designs. This article looks at how a fourth, or last-level, cache can provide a solution.

Industry Article May 12, 2020 by Kurt Shuler, Arteris

One of the weird paradoxes of our time is that microprocessors are faster than memories. It’s strange because microprocessors are complex, billion-transistor circuits while DRAMs are just… rows and rows of identical memory cells. Processors are expensive, but DRAMs are practically commodities sold by the pound.

One of our jobs as product designers is to engineer solutions around this problem, chiefly by using caches. We wrap our processors in first-, second-, and sometimes third-level caches to buffer the latency and bandwidth mismatch between the fast processor(s) and the comparatively slow DRAM(s). It’s a trick that’s worked for decades, and it still works now.


Figure 1: A last-level cache (also known as a system cache) reduces the number of accesses to off-chip memory, which reduces system latency and power consumption while increasing achievable bandwidth. It is often physically located prior to the memory controllers for off-chip DRAM or flash memory.

Machine learning (ML) is making this design more challenging. ML inference requires lots and lots of data, the same way that graphics convolutions or digital signal processing (DSP) tear through big, memory-resident data structures. New SoC designs for ML don’t just need fast processors – they need fast access to memory, too. Otherwise, all that CPU muscle goes to waste.

Options to Solve the Memory Challenge

Memory manufacturers have started developing all sorts of new DRAMs to bridge the gap. We’ve got high-bandwidth memory (HBM) with a theoretical data rate of about 1Gbit/sec per pin, and even HBM2 and HBM2E, which promise up to somewhere around 3Gbit/sec per pin. And there are even more workaround options, like stacked 3D SRAMs that use wireless inductive coupling to achieve triple the HBM2E bandwidth. But getting to that kind of speed requires tricky and expensive manufacturing techniques: multichip modules, vertically stacked silicon die, silicon interposers, through-silicon vias (TSVs), 256-bit and 1024-bit buses, and/or silicon micro-bumps.
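As a rough sanity check on figures like these, peak interface bandwidth is the per-pin data rate multiplied by the bus width. The sketch below treats the rates quoted above as per-pin rates and assumes a 1024-bit interface per stack. Both are assumptions made for illustration, and the function name is hypothetical; none of this is a vendor specification.

```python
def peak_bandwidth_gbytes(pin_rate_gbit: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin rate (Gbit/s) * bus width (bits) / 8."""
    return pin_rate_gbit * bus_width_bits / 8


# Illustrative inputs only: approximate per-pin rates, assumed 1024-bit bus.
for name, pin_rate in [("HBM", 1.0), ("HBM2E", 3.0)]:
    gbytes = peak_bandwidth_gbytes(pin_rate, 1024)
    print(f"{name}: about {gbytes:.0f} GB/s per stack")
```

The wide bus does most of the work here, which is why these memories depend on interposers and micro-bumps rather than conventional package pins.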

What about graphics memories, like GDDR6? They’re cheap and plentiful. Surely, they can do the job for ML workloads, too? Not so fast. As the name suggests, the GDDR6 interface is intended for graphics cards, and it’s already in its sixth generation. The good news is, graphics chips/cards are a high-volume market. The bad news is, it’s a short-lived market. Although it’s tempting to piggyback on the mainstream PC market for components and technology, that’s usually a recipe for disappointment. When the mainstream moves on, you’re left searching for outdated components from limited suppliers. An SoC designer needs components and interfaces that will be around for the life of the design, never mind the life of the product in the field.

Adding a Fourth (Last-Level) Cache

So, what’s the best memory solution? For hints, we can look at what other companies are doing. Tear-down analyses have shown that Apple, for one, solves the speed mismatch problem by adding another cache. If a big company with nearly infinite R&D resources designs around its SoC bottlenecks this way, it’s probably worth looking into.

The trick is not to put the cache near the processor. It’s counterintuitive, but it works. Most high-end embedded processors, like an Arm Cortex-A series, will have L1 and L2 caches for each CPU core. Sometimes, the processor complex has an L3 cache as well that’s shared among all the CPU cores. That all works fine. No adjustment is necessary.

Now, add a fourth cache – a last-level cache – on the global system bus, near the peripherals and the DRAM controller, instead of as part of the CPU complex. The last-level cache acts as a buffer between the high-speed Arm core(s) and the large but relatively slow main memory.

This configuration works because the DRAM controller never “sees” the new cache. It just handles memory read/write requests as normal. The same goes for the Arm processors. They operate normally. No extra cache coherence hardware or software is required. Like all good caches, this last-level cache is transparent to software.
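To make the idea of transparency concrete, here is a minimal software sketch, not a model of any real product. The class names, the write-through policy, the LRU replacement, and the per-address granularity are all assumptions chosen for brevity. The point is that the cache exposes the same read/write interface as the memory behind it, so the requester does not change, while the number of off-chip accesses drops.

```python
from collections import OrderedDict


class Dram:
    """Stand-in for off-chip memory that counts how often it is accessed."""

    def __init__(self):
        self.data = {}
        self.accesses = 0

    def read(self, addr):
        self.accesses += 1
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.accesses += 1
        self.data[addr] = value


class LastLevelCache:
    """Hypothetical write-through LRU cache with the same interface as Dram."""

    def __init__(self, backing, capacity_lines):
        self.backing = backing
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def read(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # hit: mark as most recently used
            return self.lines[addr]
        value = self.backing.read(addr)  # miss: go off-chip
        self._fill(addr, value)
        return value

    def write(self, addr, value):
        self.backing.write(addr, value)  # write-through keeps DRAM current
        self._fill(addr, value)

    def _fill(self, addr, value):
        self.lines[addr] = value
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used


def run_workload(memory, passes=10, working_set=64):
    """The requester: identical code whether or not a cache is present."""
    total = 0
    for _ in range(passes):
        for addr in range(working_set):
            total += memory.read(addr)
    return total


plain = Dram()
run_workload(plain)

cached_dram = Dram()
run_workload(LastLevelCache(cached_dram, capacity_lines=128))

print("DRAM accesses without cache:", plain.accesses)
print("DRAM accesses with cache:   ", cached_dram.accesses)
```

In this toy workload the working set fits in the cache, so only the first pass reaches DRAM. A real last-level cache works on cache lines and handles many requesters at once, but the interface argument is the same.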

The Pros and Cons of a Last-Level Cache

A last-level cache helps improve performance dramatically without resorting to exotic or expensive (or ill-defined) memory technologies. It makes the generic DRAMs you have work better. Specifically, it improves both latency and bandwidth, because the cache is on-chip and far faster than off-chip DRAM, and because it has a wider, faster connection to the CPU cluster. It’s a win-win-win.
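The latency benefit can be estimated with the usual average-access-time arithmetic. The sketch below is a back-of-the-envelope model under stated assumptions: every latency and hit rate in it is a hypothetical placeholder, not a measurement, and the function name is made up for this example.

```python
def avg_miss_penalty_ns(llc_hit_rate, llc_latency_ns, dram_latency_ns):
    """Average time to service a request that missed the CPU-side caches.

    Every such request pays the last-level cache lookup; only the
    fraction that also misses there goes on to off-chip DRAM.
    """
    return llc_latency_ns + (1.0 - llc_hit_rate) * dram_latency_ns


# Hypothetical placeholder numbers, for illustration only.
DRAM_NS = 100.0
LLC_NS = 20.0

without_llc = DRAM_NS
with_llc = avg_miss_penalty_ns(0.5, LLC_NS, DRAM_NS)

print(f"Without last-level cache: {without_llc:.0f} ns")
print(f"With last-level cache:    {with_llc:.0f} ns")
```

The same formula shows the break-even point: the cache only pays off when its hit rate exceeds the ratio of its own latency to the DRAM latency (20% with these placeholder numbers).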

What’s the downside? A cache takes up die area, of course. The area of the cache-control logic (managing tags, lookup tables, etc.) is negligible, but the cache RAM itself uses a measurable amount of space. On the other hand, a last-level cache saves on power consumption, because nothing consumes more energy than reading from or writing to external DRAM, especially when you’re hammering DRAM the way modern workloads do. Every transaction that stays on-chip saves a ton of power as well as time.

There’s also a security benefit. On-chip cache transactions can’t be snooped or subjected to side-channel attacks from signal probing or RF manipulation. Caches are tested, well-understood technology. That doesn’t mean they’re simple or trivial to implement – there’s a real art to designing a good cache – but at least you know you’re not beta-testing some vendor’s proprietary bleeding-edge memory interface.

To dive deeper into last-level cache implementation, you can read CodaCache: Helping to Break the Memory Wall, a technical paper that describes last-level cache implementation for SoC developers. CodaCache is designed as drop-in IP that can be included in an SoC design without resorting to weird, expensive, or evolving technologies.