Forget CPUs! RaiderChip Has a Hardware-Based Generative AI Accelerator

The Spanish startup is pushing the limits of edge AI with a new accelerator reportedly trumping competitors in efficiency and token generation speed.

News January 31, 2025 by Jake Hertz

Spanish AI startup RaiderChip recently announced its new GenAI NPU, a fully hardware-based accelerator designed to optimize generative AI inference at the edge. Unlike hybrid solutions that rely on software execution, the GenAI NPU integrates all large language model (LLM) operations into hardware to eliminate processing latency and deliver deterministic, real-time performance. The company's website lacks much in the way of engineering-level technical details about the NPU, but it does offer information about its performance and features.

RaiderChip

RaiderChip seeks to overcome the memory bottleneck of AI acceleration. Image (modified) used courtesy of RaiderChip

GenAI NPU: Processing Complex LLMs at the Edge

RaiderChip’s GenAI NPU is a fully hardware-based accelerator designed to optimize generative AI inference at the edge. According to the company, this architecture eliminates the need for external processors by embedding all LLM operations within the hardware itself. While hybrid solutions tend to rely on software execution, the GenAI NPU reportedly achieves a 2.4× increase in token generation speed per available memory bandwidth by removing hardware-software communication latency.

The accelerator is also said to optimize memory efficiency by maximizing the number of tokens generated per gigabyte per second of available memory bandwidth. Unlike competing solutions that require high-bandwidth memory (HBM), the GenAI NPU achieves high performance using cost-effective DDR or LPDDR memory. This design reduces hardware cost, energy consumption, and overall system size without sacrificing computational power. Performance comparisons indicate a 37% efficiency gain over Intel’s Gaudi 2, a 28% improvement over Nvidia’s cloud GPUs, and a 25% advantage over Google’s TPU v5e.

Quantization shrinks neural networks

Quantization shrinks neural networks by decreasing the precision of weights, biases, and activations. Image used courtesy of Qualcomm

The architecture also supports both vanilla and quantized LLM models and executes floating-point computations, including FP32, FP16, BF16, and FP8. Further, the NPU also supports four-bit and five-bit quantization (Q4_K, Q5_K) to reduce memory requirements by up to 75% and increase inference speed by 276%.

RaiderChip calls its architecture “target-agnostic,” meaning it supports FPGA and ASIC implementations that allow configurable parameters for model size, inference speed, and power efficiency.

Built on the GenAI v1 Hardware Accelerator

RaiderChip built the GenAI NPU on the features of the preceding GenAI v1, a hardware accelerator designed for generative AI workloads on FPGA-based systems.

As an IP core, GenAI v1’s architecture is optimized for transformer-based models such as Meta’s Llama 2, Llama 3, and Microsoft’s Phi series. It executes models in full floating-point precision while also offering a quantized variant, GenAI v1-Q, which supports four-bit and five-bit quantization to reduce memory requirements by up to 75% and improve inference speed by 276%.

Comparing quality and size in LLMs

Comparing quality and size in LLMs. Image used courtesy of Microsoft

RaiderChip claims the GenAI v1 offloads more than 99.99% of inference computations to dedicated hardware, with only a lightweight software layer handling model support and updates. This software layer is under 200 KB and functions autonomously without external dependencies like PyTorch or Hugging Face Transformers. The architecture’s adaptability allows new models to be integrated with minimal effort, with support for fine-tuned versions of existing transformer-based LLMs requiring no additional modifications.

Hardware-First AI Acceleration

As generative AI expands into embedded and industrial applications, the demand for scalable, cost-effective acceleration continues to grow. Traditional cloud-based inference requires substantial energy and bandwidth, which hinders enterprises from deploying AI in environments where latency, security, and power are non-negotiable. By optimizing memory efficiency and eliminating reliance on high-bandwidth memory, RaiderChip hopes to reduce both infrastructure costs and energy demands while maintaining high-performance AI processing.