FT816 Floating Point Accelerator

Details

Category: Arithmetic Core

Created: December 09, 2014

Updated: January 27, 2020

Language: Verilog

Other project properties

Development Status: Alpha

WishBone compliant: No

WishBone version: n/a

License: LGPL

Description

07/06/2019 - Updated the square root core to allow restarting the calculation any time load is active.

06/14/2019 - Updates have been made to improve the accuracy of the cores. The normalizer needed an extra bit for results generated by the FMA. Please be wary about use. The author has limited testing resources.

06/11/2019 - Added a fused multiply-add (FMA) unit, also commonly called a MAC, with a latency of 17. Testing shows that the output is sometimes off by one in the LSB versus the results generated by a program on the workstation. This could be due to the way the test data was generated. There may also be a pipeline skew issue with the core: it is supposed to be able to accept a new set of operands every clock cycle, but it works better if the latency of the previous operation expires first.
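One thing that may be worth checking when comparing FMA output against workstation results, offered only as a possibility, is whether the reference program rounds once (fused) or twice (multiply, then add). The Python sketch below works in double precision and uses exact rational arithmetic to model the fused result. The function names are hypothetical and are not part of this project.

```python
from fractions import Fraction

def fused_reference(a: float, b: float, c: float) -> float:
    """a*b + c computed exactly, then rounded once (models a fused operation)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single, correctly rounded conversion to double

def unfused_reference(a: float, b: float, c: float) -> float:
    """a*b rounded to double, then + c rounded again (two roundings)."""
    return a * b + c

# Operands chosen so that the product needs more bits than a double holds.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(fused_reference(a, b, c))    # keeps the small residue of the product
print(unfused_reference(a, b, c))  # residue is lost when a*b is rounded
```

If the test vectors were produced with the unfused form, some mismatch against a true FMA would be expected even when the core is correct.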

06/09/2019 - Added an adder/subtractor with a latency of 10. It should be capable of a much higher clock frequency than the current adder/subtractor, which has a latency of two.

06/06/2019 - Reciprocal estimate function added. The estimate is accurate to about eight bits. It uses a piecewise linear approximation: values are read from a 1024-entry lookup table and then interpolated. Comparing the reciprocal results generated by the workstation (fpRes_tv.txt) with the simulation results (fpRes_tvo.txt) is a bit tricky, because for some reason the order of the output is a little scrambled. Also added reciprocal square root and sigmoid function estimates.
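The idea of a table lookup followed by linear interpolation can be sketched in software as follows. This is an idealized Python model, not a description of the core: the table contents, the restriction of the input to the range [1, 2), and the name recip_estimate are all assumptions, and because it uses full-precision table entries its accuracy does not reflect the roughly eight bits quoted above.

```python
TABLE_SIZE = 1024

# Assumed table: reciprocal sampled at 1024 evenly spaced points in [1, 2),
# plus one extra end point so the last segment can be interpolated.
TABLE = [1.0 / (1.0 + i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]

def recip_estimate(m: float) -> float:
    """Estimate 1/m for a significand m in [1, 2) by piecewise linear interpolation."""
    if not 1.0 <= m < 2.0:
        raise ValueError("m must be in [1, 2)")
    pos = (m - 1.0) * TABLE_SIZE   # position within the table
    i = int(pos)                   # segment index (top bits of the fraction)
    frac = pos - i                 # distance into the segment (remaining bits)
    return TABLE[i] + (TABLE[i + 1] - TABLE[i]) * frac

print(recip_estimate(1.5))
print(recip_estimate(1.2345))
```

In hardware the table entries and the interpolation multiplier would be narrow fixed-point values, which is where most of the estimate's error would come from.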

10/10/2018 - Goldschmidt divider added
The floating-point divider may now use a Goldschmidt divider. This is the divider used in some modern microprocessors and it converges in very few clock cycles (around six clocks).

fpDiv – when using the Goldschmidt divider, the divider result is sometimes off by 1 in the least significant bit. It may be too high but is usually too low if it’s off. This is due to the fact that only four extra bits are being calculated in the divider. Increasing the number of extra bits calculated would help to obtain a match to the workstation results. The Goldschmidt divider may not be suitable for an FPGA implementation. It uses a fair number of resources in an FPGA in part due to the need for two wide single cycle multiply operations.
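For readers unfamiliar with the method, the Goldschmidt iteration can be sketched in Python as below. This is a textbook form using ordinary double-precision arithmetic, with the divisor assumed to be pre-scaled into [0.5, 1). It is not the Verilog in this project, and the function name and the default iteration count are assumptions made for illustration.

```python
def goldschmidt_divide(n: float, d: float, iterations: int = 6) -> float:
    """Approximate n / d for a divisor d already scaled into [0.5, 1)."""
    if not 0.5 <= d < 1.0:
        raise ValueError("d must be scaled into [0.5, 1)")
    for _ in range(iterations):
        f = 2.0 - d   # correction factor that pushes d toward 1
        n = n * f     # the same factor is applied to the numerator,
        d = d * f     # so the ratio n/d is unchanged
    return n          # once d is close to 1, n is close to the quotient

print(goldschmidt_divide(0.3, 0.75))
print(goldschmidt_divide(1.0, 0.5))
```

Each pass needs two multiplications that do not depend on each other, which corresponds to the two wide multipliers mentioned above. The distance of d from 1 is squared on every pass, so the number of correct bits roughly doubles each time.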

02/09/2018 - A fix was made to the position of the sticky bit.
02/05/2018 - Some more testing has been done
About 8,000 random single precision test values have been fed into fpMul, fpAddsub, and fpDiv, and the output checked against the output produced by a desktop workstation.
The results match in many cases, but not all.
- Underflow output isn't the same, so results for very small numbers might be off.
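When checking core output against workstation output, it can help to measure how far apart two results are in units of the least significant bit rather than only testing for equality. The Python sketch below does this for single precision bit patterns. The helper names are hypothetical, and it does not assume anything about the format of the test vector files.

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the IEEE 754 single precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def ordered(bits: int) -> int:
    """Map a sign-magnitude bit pattern onto a monotonically ordered integer."""
    magnitude = bits & 0x7FFFFFFF
    return -magnitude if bits & 0x80000000 else magnitude

def ulp_distance(a_bits: int, b_bits: int) -> int:
    """Number of representable single precision steps between two patterns."""
    return abs(ordered(a_bits) - ordered(b_bits))

# 1.0 versus the next value above 1.0
print(ulp_distance(0x3F800000, 0x3F800001))
# identical values
print(ulp_distance(float_to_bits(0.5), float_to_bits(0.5)))
```

A distance of 1 indicates a rounding difference in the last bit, while large distances point to a more serious mismatch. NaN and infinity patterns would need to be filtered out before using a helper like this.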

12/09/2016 - Some rudimentary testing has been done on the fp units at 128 bit and 80 bit precision. They correctly calculate the following:
10.0 + 10.0 = 20.
10.0 * 10.0 = 100.
300.0 / 25.0 = 12.
1.0 + 1.0 = 2
1.0 + 0.0 = 1
1.0 - 1.0/65536 = 0.9999847412109375

7/10/2016 - This project is now a bit of a misnomer because it includes cores for IEEE compatible operations as well as the original FT816 core. Rather than start another project I just decided to lump the cores together in this one. FT816Float.v is the original unit which shouldn't require any other modules to use.
Also added the missing redor64 function for the floating point unit.

3/24/2016 - Added FloatToInt and IntToFloat cores with single cycle latency.

The FT816 floating point accelerator consists of two ninety-six bit floating point accumulators between which floating point or fixed point operations occur. Basic operations include ADD, SUB, MUL, DIV, FIX2FLT, FLT2FIX, SWAP, NEG and ABS. The floating point accumulators operate as a memory mapped device placed by default between $FEA200 and $FEA2FF. The accelerator communicates through a byte wide data port and a twenty-four bit address port. It was intended primarily for use with smaller byte oriented CPUs, such as the 65xx and 68xx series, in order to provide them with some floating point capability.