One of the most critical problems chip designers face today is the need to change RTL late in the design process, or even after the chip is in-system. Unfortunately, chip designers have no way of knowing whether they will have to do this until it is too late. Any changes at that point end up costing millions of dollars and delaying projects by months.
With embedded FPGA, this problem goes away. Chip designers can finally go into a project knowing they have the flexibility to change RTL at any time during the project, something that has never been possible before.
Because embedded FPGA is a new technology, we will first highlight how it differs from standard FPGAs, which have been around for decades. Basically, an embedded FPGA is an IP block that allows a complete FPGA to be incorporated into an SoC or any kind of integrated circuit. Just as RAM, SERDES, PLL, and processors transitioned from standalone chips to routine IP blocks, FPGA is now also an IP block.
An FPGA combines an array of programmable/reconfigurable logic blocks with a programmable interconnect fabric. In an FPGA chip, the outer rim of the chip consists of a combination of GPIO, SERDES, and specialized PHYs such as DDR3/4. In advanced FPGAs, the I/O ring is roughly 1/4 of the chip and the “fabric” is roughly 3/4 of the chip. In today’s FPGA chips the “fabric” itself is mostly interconnect: 20-25% of the fabric area is programmable logic and 75-80% is programmable interconnect.
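To put those proportions in whole-chip terms, the short calculation below multiplies the fabric share by the logic share of the fabric. It uses only the approximate figures quoted above; the constant and function names are illustrative.

```python
# Approximate area shares quoted in the text for advanced FPGA chips.
FABRIC_SHARE = 0.75                    # fabric: roughly 3/4 of the die
LOGIC_SHARE_OF_FABRIC = (0.20, 0.25)   # programmable logic: 20-25% of the fabric


def logic_share_of_chip(logic_share_of_fabric: float) -> float:
    """Fraction of the whole die occupied by programmable logic."""
    return FABRIC_SHARE * logic_share_of_fabric


low, high = (logic_share_of_chip(share) for share in LOGIC_SHARE_OF_FABRIC)
print(f"Programmable logic: about {low:.0%} to {high:.0%} of the chip")
```

Under these figures, programmable logic is only about 15% to 19% of a standalone FPGA chip; the rest is interconnect and the I/O ring.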
An embedded FPGA is an FPGA fabric without the surrounding ring of GPIO, SERDES, and PHYs. Instead, an embedded FPGA connects to the rest of the chip using standard digital signaling, enabling very wide, very fast on-chip interconnects.
Inside an Embedded FPGA: The Primitive Building Blocks
The programmable logic block in an FPGA is a Look Up Table (LUT) that can be programmed to implement any Boolean function of its inputs. LUTs typically have 4, 5, or 6 inputs and one or two outputs.
In the Flex Logix EFLX arrays, the LUT is a dual 4-input LUT that can be combined to form a 5-input LUT. The LUT outputs can optionally be stored in a flip-flop. LUTs are typically arranged in groups of four with carry logic to facilitate adders and shifters.
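A small behavioral model can make the LUT idea concrete. The sketch below treats a LUT as a truth table indexed by its inputs, then builds a 5-input function from two 4-input LUTs by selecting between them with the fifth input (a Shannon expansion). This is a generic textbook construction, not a description of the EFLX circuit; the class and function names are hypothetical.

```python
from itertools import product


class LUT:
    """A k-input look-up table: one configuration bit per input combination."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("need 2**num_inputs configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, num_inputs, func):
        """'Program' the LUT by tabulating func over every input combination."""
        bits = [int(bool(func(*inputs)))
                for inputs in product((0, 1), repeat=num_inputs)]
        return cls(num_inputs, bits)

    def __call__(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        # The inputs form an address into the configuration bits
        # (first input is the most significant address bit).
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]


def lut5_from_dual_lut4(func5):
    """Build a 5-input function from two 4-input LUTs plus a 2:1 select.

    Assumption: the fifth input chooses between the two 4-input LUTs.
    """
    lut_when_e0 = LUT.from_function(4, lambda a, b, c, d: func5(a, b, c, d, 0))
    lut_when_e1 = LUT.from_function(4, lambda a, b, c, d: func5(a, b, c, d, 1))

    def lut5(a, b, c, d, e):
        return lut_when_e1(a, b, c, d) if e else lut_when_e0(a, b, c, d)

    return lut5


def parity5(a, b, c, d, e):
    return a ^ b ^ c ^ d ^ e


lut5 = lut5_from_dual_lut4(parity5)
matches = all(lut5(*bits) == parity5(*bits)
              for bits in product((0, 1), repeat=5))
print("5-input parity reproduced by dual 4-input LUTs:", matches)
```

Each 4-input LUT needs 16 configuration bits, so the pair holds the 32 bits required for an arbitrary 5-input function.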
Another programmable logic block is a MAC (multiplier-accumulator) or DSP Accelerator block.
In the Flex Logix EFLX array, each MAC has a 22-bit pre-adder, a 22x22 multiplier, and a 48-bit post-adder/accumulator. MACs can be combined or cascaded to form fast DSP functions.
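The data flow through such a block (pre-add, multiply, accumulate) can be sketched in a few lines. The model below is a behavioral illustration only: signed two's-complement arithmetic, wrap-on-overflow, and the names `wrap_signed` and `Mac` are assumptions, not details taken from the EFLX design. The accumulator width is passed in as a parameter; 48 bits is used in the example as an assumed value wide enough to hold a 44-bit product (22 x 22) with headroom for accumulation.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


class Mac:
    """Behavioral model: acc += (a + b) * c, with fixed-width wrapping."""

    def __init__(self, operand_bits, acc_bits):
        self.operand_bits = operand_bits
        self.acc_bits = acc_bits
        self.acc = 0

    def step(self, a, b, c):
        pre_sum = wrap_signed(a + b, self.operand_bits)          # pre-adder
        product = pre_sum * wrap_signed(c, self.operand_bits)    # multiplier
        self.acc = wrap_signed(self.acc + product, self.acc_bits)  # accumulator
        return self.acc


# Example: accumulate (a + b) * c over three samples.
mac = Mac(operand_bits=22, acc_bits=48)
for a, b, c in [(1, 4, 7), (2, 5, 8), (3, 6, 9)]:
    mac.step(a, b, c)
print("accumulated result:", mac.acc)
```

Setting `b` to zero on every step turns the same loop into a plain dot product, which is the core operation of FIR filters and similar DSP functions.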
The programmable logic blocks are programmed by configuration bits that set the values of the LUTs, select whether the flip-flops are used or bypassed, enable or disable the carry logic, and so on. The configuration bits also program the operation of the MACs. Typically in an FPGA, the configuration bits are loaded from an external flash.
For embedded FPGA, it is the same, since almost all SoCs have an ARM/ARC/MIPS/etc. processor that is booted from external flash. The same flash is used to store the configuration bits for the embedded FPGA.
The programmable logic blocks receive inputs and send outputs to an interconnect network that allows connections to be programmably made from and to any logic blocks in the FPGA fabric. The interconnect fabric is also programmed by configuration bits. The interconnect fabric is typically the bulk of the FPGA fabric.
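One common way to think about programmable interconnect is as a set of multiplexers whose select lines are driven by configuration bits. The sketch below models a single routing mux in that style. It is a generic illustration and says nothing about how the EFLX interconnect is actually built; `RoutingMux` and `select_bits_needed` are hypothetical names.

```python
def select_bits_needed(num_sources):
    """Configuration bits needed for one mux choosing among num_sources."""
    return (num_sources - 1).bit_length()


class RoutingMux:
    """One programmable connection point: config bits pick the driving source."""

    def __init__(self, num_sources):
        self.num_sources = num_sources
        self.select = 0  # configuration value, loaded at programming time

    def configure(self, select):
        if not 0 <= select < self.num_sources:
            raise ValueError("select value out of range")
        self.select = select

    def __call__(self, sources):
        if len(sources) != self.num_sources:
            raise ValueError("wrong number of source signals")
        return sources[self.select]


# Example: a logic-block input that can be driven by any of 8 nearby wires.
mux = RoutingMux(num_sources=8)
mux.configure(5)                      # "program" the connection
wires = [0, 0, 0, 0, 0, 1, 0, 0]      # current values on the candidate wires
print("bits per mux:", select_bits_needed(8), "routed value:", mux(wires))
```

Because every logic-block input and routing track needs selection hardware of this kind, it is easy to see why interconnect dominates the fabric area.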
A major differentiator for embedded FPGAs is the design of the interconnect fabric. The best interconnect uses less area and fewer metal layers while providing high utilization of the resources.
Unlike an FPGA chip, there are no PHYs/SERDES/PLLs in an embedded FPGA. There is a ring of “I/O,” but it is really simple digital interconnects to the rest of the chip. An embedded FPGA will have hundreds to thousands of interconnects that can run at full speed within the chip. This increase in I/O width and bandwidth is a huge advantage of embedding FPGA in a chip.
Inside an Embedded FPGA: Building Any Size and Configuration of Array
One complexity is that customers want a wide range of sizes and configurations of embedded FPGAs, and everyone wants the IP block proven in silicon before using it in their chip.
For example, in 16nm, one customer might want only a few hundred LUTs of programmable logic for fast reconfigurable control logic running at ~1GHz, while another customer in the same process may want 50K-100K LUTs for a datacenter processor accelerator. How can these customers be satisfied with the least amount of design investment and time-to-market?
Flex Logix uses a tileable building block approach. First, 4 EFLX IP cores are designed using the above approach. Each IP core is a stand-alone FPGA, but they can also be arrayed to offer EFLX arrays, about 75 in total, from 100 LUTs to 122.5K LUTs, with any mix of logic/DSP.
Each EFLX IP core has an extra top-layer of interconnect which allows one core to connect automatically to surrounding neighbors to make a large array up to NxN.
EFLX-100 cores can be arrayed up to 5x5, or 3,000 LUTs (there are actually 120 LUTs in an EFLX-100).
The EFLX-2.5K takes over at 2,500 LUTs and can be arrayed up to 122.5K LUTs.
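The array sizes follow directly from the tile counts, as the sketch below shows. The per-core LUT counts and the 5x5 limit come from the text; the 7x7 limit for the EFLX-2.5K is an inference from 122.5K / 2.5K = 49 cores, and allowing any rectangular rows x columns shape is an assumption. All names are illustrative.

```python
CORES = {
    # luts_per_core from the text; max_side for EFLX-2.5K is inferred (49 = 7 x 7).
    "EFLX-100": {"luts_per_core": 120, "max_side": 5},
    "EFLX-2.5K": {"luts_per_core": 2500, "max_side": 7},
}


def array_options(core):
    """List (rows, cols, total_luts) for every rectangular array of one core."""
    spec = CORES[core]
    sides = range(1, spec["max_side"] + 1)
    return [(rows, cols, rows * cols * spec["luts_per_core"])
            for rows in sides for cols in sides]


def smallest_array(core, luts_needed):
    """Pick the array with the fewest LUTs that still meets the requirement."""
    candidates = [opt for opt in array_options(core) if opt[2] >= luts_needed]
    if not candidates:
        raise ValueError("requirement exceeds the largest array for this core")
    return min(candidates, key=lambda opt: opt[2])


for core in CORES:
    options = array_options(core)
    largest = max(opt[2] for opt in options)
    print(f"{core}: {len(options)} shapes, up to {largest:,} LUTs")

rows, cols, luts = smallest_array("EFLX-2.5K", 50_000)
print(f"50K LUT request -> {rows}x{cols} array, {luts:,} LUTs")
```

Under these assumptions the two core sizes yield 25 and 49 shapes respectively, which is in line with the "about 75" arrays mentioned above.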
An array can be all-logic, all-DSP, or any mix of the two types of cores.
It is also possible to embed large amounts of RAM in the embedded array. Flex Logix uses standard RAM compilers to generate any kind of RAM that the customer requests (single port or dual port; ECC, parity, or none; as much as wanted) and positions the RAM between the cores. The RAM is part of a single EFLX array.
Using the above approach allows a few IP cores to generate an almost limitless variety of embedded FPGA arrays to suit any customer requirement.
Inside an Embedded FPGA: Proving the Building Blocks in Silicon
Flex Logix builds validation chips to prove out the IP cores in silicon. Below is an example in TSMC 40ULP.
In that process, there is a wide range of VT (threshold voltage) mask combinations that customers use, and Flex Logix designed the EFLX array to be compatible with all the possible combinations. So the validation chip has five arrays: one large array (4x4) in the most requested VT combination and four 2x2 arrays in the other four VT combinations.
The EFLX arrays in 40nm can operate up to 300MHz, but GPIO is only reliable to ~150MHz. The validation chip therefore has an on-chip PLL to generate very fast, precise clocks for testing performance, and SRAM so that banks of “test vectors” can be loaded, run at full speed, and have their results written to another bank. This gives a “tester on a chip” so that full-speed operation can be verified above GPIO speeds. There are also monitors for temperature and voltage to ensure tests are done at the targeted, worst-case conditions.
Embedded FPGA will change the way chips and SoCs are designed in the future. Designers no longer need to be locked into a project and forced to spend millions of dollars to change RTL if they need to. Companies also no longer have to risk missing their schedules when RTL needs to be updated. With embedded FPGA, the chip design process just got a whole lot easier and a whole lot cheaper.