RFC 1951 - DEFLATE Data Compression Algorithm

RFC 1951 - DEFLATE Data Compression Algorithm

Details

Category: Uncategorized

Created: June 14, 2005

Updated: January 27, 2020

Language: VHDL

Other project properties

Development Status: Alpha

Additional info: Design done

WishBone compliant: Yes

WishBone version: n/a

License: GPL

Description

A VHDL implementation of the open DEFLATE data compression algorithm. The DEFLATE standard is specified in RFC 1951 and was jointly developed by Jean-loup Gailly and Mark Adler. More information about the DEFLATE algorithm is available on the zlib library home page www.zlib.org

The full text of the deflate specification and a brief explanation are available on :
http://www.gzip.org/zlibrfc-deflate.html

At the core the algorithm uses LZ77 compression, a veriy nice explanation for which is available on the zlib website and I have quoted below

LZ77 compression

LZ77 compression works by finding sequences of data that are repeated. The term ``sliding window'' is used; all it really means is that at any given point in the data, there is a record of what characters went before. A 32K sliding window means that the compressor (and decompressor) have a record of what the last 32768 (32 * 1024) characters were. When the next sequence of characters to be compressed is identical to one that can be found within the sliding window, the sequence of characters is replaced by two numbers: a distance, representing how far back into the window the sequence starts, and a length, representing the number of characters for which the sequence is identical.

I realize this is a lot easier to see than to just be told. Let's look at some highly compressible data:

Blah blah blah blah blah!

Our datastream starts by receiving the following characters: `B,' `l,' `a,' `h,' ` ,' and `b.' However, look at the next five characters:

vvvvv
Blah blah blah blah blah!
^^^^^

There is an exact match for those five characters in the characters that have already gone into the datastream, and it starts exactly five characters behind the point where we are now. This being the case, we can output special characters to the stream that represent a number for length, and a number for distance.

The data so far:

Blah blah b

The compressed form of the data so far:

Blah b[D=5,L=5]

The compression can still be increased, though to take full advantage of it requires a bit of cleverness on the part of the compressor. Look at the two strings that we decided were identical. Compare the character that follows each of them. In both cases, it's `l' -- so we can make the length 6, and not just five. But if we continue checking, we find the next characters, and the next characters, and the next characters, are still identical -- even if the so-called ``previous'' string is overlapping the string we're trying to represent in the compressed data!

It turns out that the 18 characters that start at the second character are identical to the 18 characters that start at the seventh character. It's true that when we're decompressing, and read the length, distance pair that describes this relationship, we don't know what all those 18 characters will be yet -- but if we put in place the ones that we know, we will know more, which will allow us to put down more... or, knowing that any length-and-distance pair where length > distance is going to be repeating (distance) characters again and again, we can set up the decompressor to do just that.

It turns out our highly compressible data can be compressed down to just this:

Blah b[D=5, L=18]!

Then distance and length pairs have to be encoded into huffman codes that follow all the rules of classic huffman trees but with two extra rules

1. In the variation used by the Deflate standard, there are two additional rules: elements that have shorter codes are placed to the left of those with longer codes.

2. Among elements with codes of the same length, those that come first in the element set are placed to the left.

A more detailed explanation is available at the website below:

http://www.zlib.netfeldspar.html

Features

Fully compliant to the deflate compression specifications

Status

07/11/05
Design completed.

28/3/06
Using the DJB2 hash algorithm, algorithm currently uses 112 slices and outputs a 32 hash key.

Now working on search and match using the hash algorithm

24/05/06
Improved DJB2 algorithm implemented, some changes made to synchronisie the modules better, algorithm now uses only one slice and is implemented with a 4 byte buffer.

29/05/06
Implemented the LZ77 algorithm but haventuploaded it as it needs to be integrated with the huffman tree algorithm and that is still being coded.