Xilinx and Samsung Pioneer First Adaptable Computational Storage Device

With the demand for in-memory computation increasing, Xilinx and Samsung have set out to create a “one size fits all” solution.

News November 13, 2020 by Jake Hertz

From a hardware perspective, one of the largest roadblocks in AI/ML is data movement. These applications move huge amounts of data in and out of memory. At the same time, as transistors have continued to shrink, interconnect parasitics (wire resistance and capacitance) have become increasingly significant.

The result is that interconnect delay has become more significant than gate delay, and data movement energy has become a significant contributor to total chip energy consumption.

As such, the solution that many are turning to is in-memory computation: instead of moving data from memory to processing units, process the data where it's stored. In-memory computation sidesteps much of the interconnect delay and data movement energy, which can significantly improve AI/ML performance.

Now, Samsung and Xilinx have teamed up to create a new generation of versatile computational memory.

What is SmartSSD? Adaptable Computational Storage

Nearly two years ago, Samsung and Xilinx announced plans to partner in creating a unique type of computational storage that would allow developers to adapt and optimize for their specific applications. This week, their work came to fruition with the release of their newest product: the SmartSSD computational storage drive (CSD).

Samsung SmartSSD CSD. Image used courtesy of Samsung

This new product combines Samsung’s memory technologies with Xilinx's FPGA expertise to create an adaptable computational storage platform. The idea is that incorporating an FPGA into the device will allow developers to build unique, hardware-accelerated solutions in familiar high-level languages.

With the Vitis unified software platform, developers can build runtimes, libraries, APIs, and drivers into the system using languages such as C, C++, and OpenCL. Further, the Xilinx runtime environment allows developers to use hardware description languages (HDLs) to develop application-specific hardware and to reuse existing accelerator IP designed in HDL for ASICs or FPGAs.

Simplified architecture of the SmartSSD. Image used courtesy of Samsung

Scalability is just as important as versatility in this new solution. A single server can contain multiple SmartSSD drives, and each drive can run query acceleration in parallel. Because each drive processes its data locally, SmartSSD-based servers relieve PCI Express bottlenecks, producing near-linear performance scaling even on an over-subscribed host CPU.
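To see why filtering inside the drives can scale this way, consider a rough back-of-the-envelope model. The sketch below is not based on Samsung or Xilinx measurements: the function names, the `host_ingest_gb_s` limit, and every number in it are illustrative assumptions. It treats a scan as bounded either by the drives reading their shards in parallel or by how fast the host can take in whatever data crosses the PCIe links.

```python
def scan_time_s(dataset_gb, drives, per_drive_scan_gb_s,
                host_ingest_gb_s, selectivity, filter_in_drive):
    """Toy estimate of the time to scan a dataset spread evenly over `drives`.

    All parameters are hypothetical; this is a model, not a benchmark.
    """
    # Each drive reads its own shard, and all drives work in parallel.
    drive_time = (dataset_gb / drives) / per_drive_scan_gb_s

    # With in-drive filtering, only matching rows are sent to the host.
    # Without it, the whole dataset has to reach the host CPU.
    to_host_gb = dataset_gb * selectivity if filter_in_drive else dataset_gb
    host_time = to_host_gb / host_ingest_gb_s

    # The slower of the two stages sets the overall scan time.
    return max(drive_time, host_time)


# Assumed workload: 1 TB scanned, 1% of rows match, 3 GB/s per drive,
# and a host that can ingest and process 6 GB/s in total.
for n in (1, 2, 4, 8):
    host_side = scan_time_s(1000, n, 3.0, 6.0, 0.01, filter_in_drive=False)
    in_drive = scan_time_s(1000, n, 3.0, 6.0, 0.01, filter_in_drive=True)
    print(f"{n} drive(s): host-side {host_side:6.1f} s, in-drive {in_drive:6.1f} s")
```

In this toy model, host-side filtering stops improving once the host's ingest rate becomes the limit, while in-drive filtering keeps improving with drive count because only the matching rows ever reach the host.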

SmartSSD CSD's Speed and Functionality

Beyond the device's versatility, its specs are worth a look as well.

As a storage device, the SmartSSD CSD offers 3.8 TB of capacity and uses a PCIe Gen3 x4 host interface. Since PCIe 4.0 offers twice the per-lane bandwidth of PCIe 3.0, this won't be the fastest storage device on the market. However, the allure is not speed; it's functionality. If computation can occur where the data is stored, then transfer speed may not be the be-all and end-all.
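For a sense of scale, the raw link bandwidth behind that comparison can be worked out from the published PCIe transfer rates (8 GT/s per lane for PCIe 3.0, 16 GT/s for PCIe 4.0) and the 128b/130b line encoding both generations use. The short sketch below does that arithmetic; the function and variable names are my own, and the result ignores packet and protocol overhead, so real-world throughput will be somewhat lower.

```python
# Per-lane transfer rates in gigatransfers per second (one bit per transfer).
GT_PER_S = {"3.0": 8.0, "4.0": 16.0}

# PCIe 3.0 and 4.0 both send 130 bits on the wire for every 128 bits of data.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gb_s(generation: str, lanes: int) -> float:
    """Raw one-direction link bandwidth in GB/s, before protocol overhead."""
    data_bits_per_s = GT_PER_S[generation] * 1e9 * ENCODING_EFFICIENCY * lanes
    return data_bits_per_s / 8 / 1e9  # bits -> bytes -> gigabytes


gen3_x4 = link_bandwidth_gb_s("3.0", 4)
gen4_x4 = link_bandwidth_gb_s("4.0", 4)
print(f"PCIe 3.0 x4: {gen3_x4:.2f} GB/s")
print(f"PCIe 4.0 x4: {gen4_x4:.2f} GB/s")
```

This works out to roughly 3.94 GB/s for a Gen3 x4 link and 7.88 GB/s for a Gen4 x4 link, which is the ceiling on what the host can pull from a single drive regardless of how fast the flash behind it is.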

Comparison of the storage and data acceleration using SmartSSD CSDs vs. a traditional architecture. Image used courtesy of Samsung

Beyond this, the device offers dynamic power management and throttling and incorporates Samsung's V-NAND flash memory.

Forging In-Memory Computing Forward

The new SmartSSD CSD from Samsung and Xilinx may prove a useful prototype for in-memory computing devices to come. With versatility and scalability at its heart, the device may even be the “one size fits all” memory solution for data centers that Samsung and Xilinx describe it to be.

Do you have any experience with in-memory computing? Do you see this method as the future of AI/ML or does it still have a long way to go? Share your thoughts in the comments below.

Gorbag March 28, 2021

This sounds like, in effect, having multiple slave Xilinx-based servers with an SSD interface to the master server processor. That’s a pretty large chunk of memory for a so-called “PIM” architecture. When I think PIM, I think of operations that combine content addressability with processing, e.g. “All records representing adults with an even number of children should add a grandparent link to each child” would be an operation that would be almost instantaneous as each record runs its program concurrently.
