Simple Parametrized FFT Engine


Category: Arithmetic Core

Created: February 05, 2014

Updated: January 27, 2020

Language: VHDL

Other project properties

Development Status: Beta

WishBone compliant: No

WishBone version: n/a

License: BSD


This project implements a simple parametrized FFT engine.
The user may define the FFT length (fftlen, which must be a power of 2) and the format of the numbers used.
To change the number format, the user must change the definition of the icpx_number (internal complex number) type in the icpx_pkg.vhd file.
It is also necessary to adjust the conversion functions defined in the same file.
The user must also modify the butterfly.vhd file so that the "butterfly" entity performs its calculations on the user-defined type.
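
As an illustration of what such a number format and its conversion functions involve, here is a small behavioral model in Python. It is only a sketch and is not taken from icpx_pkg.vhd: the function names, the Q1.(width-1) scaling and the saturation on overflow are all assumptions made for this example.

```python
# Behavioral model only; the real type and conversions live in icpx_pkg.vhd.
ICPX_WIDTH = 16  # bits per real/imaginary part, as in the resource figures


def to_fixed(value, width=ICPX_WIDTH):
    """Quantize a float in [-1, 1) to a signed integer (assumed Q1.(width-1) scaling)."""
    scale = 1 << (width - 1)
    raw = int(round(value * scale))  # Python's round() rounds ties to even
    # Saturate instead of wrapping (an assumption, not taken from the core).
    return max(-scale, min(scale - 1, raw))


def to_float(raw, width=ICPX_WIDTH):
    """Inverse of to_fixed, up to the quantization error."""
    return raw / (1 << (width - 1))


def cplx_to_icpx(z, width=ICPX_WIDTH):
    """Hypothetical helper: complex float -> (re, im) pair of signed integers."""
    return (to_fixed(z.real, width), to_fixed(z.imag, width))


def icpx_to_cplx(pair, width=ICPX_WIDTH):
    """Hypothetical helper: (re, im) pair of signed integers -> complex float."""
    re, im = pair
    return complex(to_float(re, width), to_float(im, width))
```

Changing ICPX_WIDTH in this model corresponds to changing the width of the real and imaginary parts; in the VHDL sources the type, the conversion functions and the butterfly have to be changed together.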

There are two implementations available.

In the first one (in the single_unit directory), all calculations are performed by a single "butterfly block".
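
To show what "all calculations through a single butterfly" means, the following Python sketch runs an iterative radix-2 FFT in which every operation goes through one butterfly function. It assumes a decimation-in-time ordering with bit-reversed input; whether the core orders its data this way is not stated here, and the function names are made up for the example.

```python
import cmath


def butterfly(a, b, w):
    """One radix-2 butterfly: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_single_unit(x):
    """Radix-2 FFT that reuses one butterfly; returns (spectrum, butterfly call count)."""
    n = len(x)
    log2_n = n.bit_length() - 1
    assert n == 1 << log2_n, "fftlen must be a power of 2"
    data = [x[bit_reverse(i, log2_n)] for i in range(n)]
    calls = 0
    for stage in range(log2_n):          # log2(fftlen) stages
        half = 1 << stage
        for start in range(0, n, 2 * half):
            for k in range(half):        # fftlen/2 butterflies per stage in total
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))
                i, j = start + k, start + k + half
                data[i], data[j] = butterfly(data[i], data[j], w)
                calls += 1
    return data, calls
```

The call counter makes the cost visible: a transform of length fftlen needs (fftlen/2)*log2(fftlen) butterfly operations, and in this model all of them are executed one after another by the same function.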

In the second one (in the multiple_units directory), there is one "butterfly block" for each stage of the
radix-2 implementation, so the calculation is performed much faster.
Additionally, this implementation can calculate the FFT on a stream of data with
overlapping blocks: a new FFT is calculated each time the next block of fftlen/2 input samples has been read.
This implementation can also apply a window function to limit spectral leakage.
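
The overlapping-block scheme can be sketched in Python as follows. This is only a model of the data flow described above: the helper names are hypothetical, and the Hann window is an assumption chosen for the example (the window used by the core is not specified here).

```python
import math


def hann(n):
    """Periodic Hann window of length n (an example window, not the core's)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def overlapped_frames(samples, fftlen, window=None):
    """Yield one windowed frame of length fftlen per fftlen/2 new samples."""
    hop = fftlen // 2
    if window is None:
        window = [1.0] * fftlen  # rectangular window: samples pass unchanged
    for start in range(0, len(samples) - fftlen + 1, hop):
        frame = samples[start:start + fftlen]
        yield [s * w for s, w in zip(frame, window)]
```

Each frame shares its first half with the second half of the previous frame, so every input sample (except at the very beginning and end of the stream) contributes to two consecutive transforms.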

Both implementations rely on a known latency of the "butterfly" block.
In the first implementation, it is defined by the constant BTFLY_LATENCY in the fft_engine.vhd file.
In the second implementation, it is defined by the constant MULT_LATENCY in the fft_engine.vhd file (this constant is
additionally used as the latency of the multiplier that multiplies the data by the window function).

It is assumed that the butterfly block (and the multiplier in the second implementation) works in pipeline mode,
so new data may be delivered to the input on every clock pulse and the results are output after the known latency.
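
The pipeline assumption can be modelled in a few lines of Python. The class below is a hypothetical stand-in for any block with a fixed latency (the butterfly or the window multiplier); its `latency` argument plays the role of a constant such as BTFLY_LATENCY and must be at least 1 in this model.

```python
from collections import deque


class PipelinedBlock:
    """Model of a fully pipelined block with a fixed latency in clock cycles."""

    def __init__(self, func, latency):
        self.func = func
        # One slot per pipeline register; None marks "no valid data yet".
        self.stages = deque([None] * latency, maxlen=latency)

    def clock(self, operand):
        """Advance one clock: accept a new operand and return the result
        of the operand that entered `latency` clocks earlier."""
        out = self.stages[0]                     # oldest entry leaves the pipeline
        self.stages.append(self.func(operand))   # newest entry enters it
        return out
```

With a latency of 3, the first three calls to clock() return None, and from then on one result appears per clock. The surrounding control logic only has to delay its write addresses by the same number of clocks, which is why the latency must be known in advance.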

To speed up processing, the single_unit implementation uses two DPRAMs for data.
This allows both inputs of the "butterfly block" to be read simultaneously and both results to be written simultaneously.
When the engine completes its calculations, the results may be read from the output while new data is written to the input (it is guaranteed that different DPRAMs are used for input and output).

The multiple_units implementation uses DPRAMs in the "read before write" configuration to allow simultaneous operation of all
stages of the radix-2 FFT.
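
The "read before write" behaviour can be illustrated with a minimal Python model. It only shows the memory semantics (a read and a write to the same address in the same clock return the old content); the class name is hypothetical, and how the core actually generates its addresses is not covered here.

```python
class ReadFirstDpram:
    """Dual-port RAM model in which the read port sees the old content
    when the same address is written in the same clock."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def clock(self, rd_addr, wr_addr=None, wr_data=None):
        """One clock edge: sample the read port first, then commit the write."""
        rd_data = self.mem[rd_addr]       # old value, even if wr_addr == rd_addr
        if wr_addr is not None:
            self.mem[wr_addr] = wr_data
        return rd_data
```

In this model a stage can fetch a stored sample and replace it with a new one in a single clock, without losing the old value.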

The design is prepared for simulation with ghdl.

The script "test_fft.m" may be run in Octave (and probably also in Matlab) to check that the core works correctly.
It configures the core for the selected FFT length, generates the test data, compiles and runs the simulation, and displays a comparison between the results calculated with a floating-point FFT and the results calculated by the simulated core.
The implementation of the "butterfly block" is not optimal (e.g. it lacks proper rounding), so there may be small differences between those values.
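
The effect of missing rounding can be seen in a small fixed-point example. The Python sketch below compares a multiplication whose result is simply truncated with one that is rounded to nearest. It assumes 15 fractional bits and truncation by an arithmetic right shift; the exact arithmetic used in butterfly.vhd may differ.

```python
def mul_truncate(a, b, frac_bits=15):
    """Fixed-point product; the arithmetic right shift drops the low bits
    (rounds toward minus infinity)."""
    return (a * b) >> frac_bits


def mul_round(a, b, frac_bits=15):
    """Fixed-point product; adding half an LSB before the shift rounds to nearest."""
    return (a * b + (1 << (frac_bits - 1))) >> frac_bits
```

For example, 3 multiplied by 16384 (0.5 in this format) is exactly 1.5 LSB: truncation yields 1 while rounding yields 2. Truncation errors always point in the same direction, so they accumulate as a small bias over the stages of the transform, which is one plausible source of the small differences mentioned above.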

For simulation you need the following free software packages:
ghdl
octave

You may also install gtkwave to view internal signals (the simulation can generate a .ghw file for gtkwave).

The code is synthesizable. It has been successfully synthesized with the ISE toolkit from Xilinx.

For an FFT length of 256 (LOG2_FFT_LEN=8) and complex numbers with 16-bit real and imaginary parts (ICPX_WIDTH=16), the resource consumption is as follows:

For the single_unit implementation:
For chip xc3s500e:

Resource                              Used  Available  Utilization
Number of BUFGMUXs                       1         24           4%
Number of MULT18X18SIOs                  4         20          20%
Number of RAMB16s                        2         20          10%
Number of Slices                       825       4656          17%

For chip xc6slx45:

Resource                              Used  Available  Utilization
Number of RAMB16BWERs                    2        116           1%
Number of DSP48A1s                       4         58           6%
Number of Slice Registers               58     54,576           1%
Number of Slice LUTs                   475     27,288           1%
Number of occupied Slices              188      6,822           2%
Number of MUXCYs used                   92     13,644           1%
Number of LUT Flip Flop pairs used     477

For the multiple_units implementation:

For chip xc6slx45:

Resource                              Used  Available  Utilization
Number of RAMB16BWERs                   12        116          10%
Number of RAMB8BWERs                     2        232           1%
Number of DSP48A1s                      40         58          68%
Number of Slice Registers            1,258     54,576           2%
Number of occupied Slices              683      6,822          10%
Number of MUXCYs used                  764     13,644           5%
Number of LUT Flip Flop pairs used   1,838

All my sources in this project are published under the BSD license. You can use them and modify them; however, you should keep the
information about the original author.

I don't know whether my IP core infringes any patents. If you want to use it for commercial purposes, you should check this yourself.
I also don't know whether my IP core works correctly in all possible conditions. I provide it "AS IS", without any warranty.
You use it at your own risk!