Lempel-Ziv Algorithms. LZ77 (Sliding Window). Variants: LZSS (Lempel-Ziv- Storer-Szymanski); Applications: gzip, Squeeze, LHA, PKZIP, ZOO. LZ78 ( Dictionary. version of LZ77, called LZSS, and one improved version of LZ78, called LZW. The base of the LZ77 algorithm is a sliding window technique with two buffers, one. CULZSS algorithm proposed in  parallelizes the LZSS algorithm at two levels. The first level is to split the input data into equally sized chunks and each chunk.
|Published (Last):||17 January 2013|
|PDF File Size:||15.28 Mb|
|ePub File Size:||7.93 Mb|
|Price:||Free* [*Free Regsitration Required]|
As with everything else that I’ve written, my LZSS implementation left and still leaves room for improvement. This web page attempts to discuss my implementation and the updates that have sporadically taken place since my original LZSS release.
I also toyed with an in-memory implementation of the LZSS algorithm that preforms all encode and decode operations on arrays rather than files. This implementation might be useful to those developing on systems that do not include a file system. Information on downloading the source code for all of my LZSS implementations may be found here.
LZSS is a dictionary encoding technique. Unlike Huffman coding, which attempts to reduce the average amount of bits required to represent a symbol, LZSS attempts to replace a string of symbols with a reference to a dictionary location for the same string. It is intended that the dictionary reference should be shorter than the string it replaces.
The LZSS algorithm and it’s predecessor LZ77 attempt to compress series of strings by converting the algoritnm into a dictionary offset and string length. The larger Nthe longer it takes to search the whole dictionary algodithm a match and the more bits will be required to store the offset into the dictionary.
Typically dictionaries contain an amount of symbols that can be represented by a whole power of 2. A symbol dictionary would require 9 bits to represent all possible offsets. If you need to use 9 bits, you might as well algorithn a symbol dictionary so that you can have more entries. Additional new symbols cause an equal number of the oldest symbols to slide out.
As stated above, encoded strings are represented as an offset and a length. Since the dictionary size is fixed at N the largest offset may be N – 1and the longest string matching a series of characters in the dictionary may be N characters. For large Nit’s highly unlikely that you’ll ever read an N symbol string that matches the contents of dictionary, so the maximum allowed match length is often limited. In the examples I have seen, N is typically or and the maximum length allowed for matching strings is typically between 10 and 20 characters.
In their original LZ77 algorithm, Lempel and Ziv proposed that all strings be encoded as algorihm length and offset, even strings with no match.
c – understanding this LZSS based decompression algorithm – Stack Overflow
Storer and Szymanski observed that individual unmatched symbols or matched strings of one or two symbols take up more space to encode than they do to leave uncoded.
For example, encoding a string from a dictionary of symbols, and allowing for matches of up to 15 symbols, will require 16 bits to store an offset and a length. A single character is only 8 bits. Using this method, encoding a single character doubles the required storage.
Using the above example it now takes 9 bits for each uncoded symbol and 17 bits for each encoded string. Storer and Szymanski also observed that if you’re not encoding strings of length 0 to Mthen M may be used as an offset to the length of the match.
Based on the discussion above, encoding input requires the following steps: Initialize the dictionary to a known value. Read allgorithm uncoded string that is the length of the maximum allowable match.
Search for the longest matching string in the dictionary. If a match is found greater than or equal to the minimum allowable match length:. Shift a copy of the symbols written to the encoded output from the unencoded string to the dictionary. Read a number of symbols from the uncoded kzss equal to the number of symbols written in Step 4.
Repeat from Step 3, until all the entire input has been encoded. The encoding process requires that a the dictionary is searched for matches to the string to be encoding.
Decoding an offset and length combination only requires going to a dictionary offset and copying the specified number of symbols. No searching is required. Decoding input requires the following steps: If the flag indicates an encoded string:. Shift a copy of the symbols written to the decoded output into the dictionary. Repeat from Step 2, until all the entire input has been decoded. Further algoriithm of LZSS with links algorithk other documentation and libraries may be found at http: All of my versions use codes that are integral bytes, and each of algoruthm newer versions are derived from my earlier versions so there is a lot of common design.
I chose to implement my dictionary as a character alforithm buffer sliding window. I chose characters, because others have achieved good results with dictionaries of this size, and I can encode offsets of 0 lzsss in 12 bits.
If I encode the offset and length of a string in a 16 bit word, that leaves me with 4 bits for the length. Keeping the goal of a 16 bit encoded string in mind, and the above requirement for a 12 bit offset, I’m left with 4 bits to encode the string length.
With 4 bits, I can encode lengths of 0 through However, as Storer and Szymanski observed, it only takes 1 byte to write out characters that match dictionary strings 0 or 1 character long, and 2 bytes to write out characters that match dictionary strings 2 characters long. Savings of one or more bytes per string doesn’t occur unless the string is 3 characters or longer. By not encoding strings that offer less than one byte of savings, the minimum encoded length is three characters.
Since the minimum encoded length is three, I am able to add three to the 4 bit value stored in the length field, resulting in encoded string lengths of 3 to So 18 is the maximum string length encoded by this implementation.
I’ve been calling my implementation a modified LZSS implementation. It’s modified because of the way my version 0. Similarly writing an encoded string would require 17 bits, 1 bit for the flag and 16 bits for the code.
While I was studying the algorithm, I came across some implementations that stored encoded flags in groups of eight followed by lsss characters or encoded strings that they represent.
I don’t know who to credit with the discovery of this technique, but it allowed my version 0. This technique also makes the EOF clear.
By processing only bytes, there are no spare bits, os the EOF of the encoded dats is actually the EOF of the encoded file. Don’t worry lzss it if I lost you on the EOF discussion, just relax, because there’s no need to handle it specially. After writing my version 0.
LZSS (LZ77) Discussion and Implementation
As stated above, one of my goals was to provide an LZSS implementation that is easy to read and easy to debug. So, to keep things simple, my first implementation used a brute force sequential search to match the string to be encoded to strings in the dictionary. Through experimentation and reading, I’ve discovered that the methods used for string matching significantly impact encoding time. Other’s have successfully improved the time required to find matching strings using the following techniques:.
I have already experimented with some of these techniques and plan to experiment with others as time allows. The rest of this section documents some of what I have tried so far. The first search I implemented was a sequential search. I simply start at the beginning of the sliding window dictionary and do a character by character comparisons with the string being encoded. If the first characters match, I check the characters that follow. If I don’t have a match, I move to the next character in the sliding window.
The source code implementing a sequential search is contained in the version 0. The logic is pretty simple, but may require a number of comparisons. The addition of code implementing the KMP algorithm is a relatively new one version 0.
By the algorihm I got around to including it Wikipedia had a reasonable description as well as pseudocode that I could reference.
However KMP attempts to use some information about the string we’re attempting to find a match for and the comparisons already made in order skip some comparisons that must fail.
The initial pass, will start by comparing string to dictionary and fail lagorithm it compares string to dictionary. The brute force sequential search will resume by comparing string to dictionary but we know algorihm will fail because string matches dictionary and string and string are not the same. KMP is algorlthm enough to skip ahead and resume by comparing string to dictionarywhich happens to be a match in this example. The KMP algorithm requires that the string being searched for be preprocessed.
The preprocessing generates a look-up table used to determine how far back from the failed comparison, the algorithm must go to resume comparisons. The look-up table is know as a “partial match table” or a “failure function”.
The partial match table for our example algorithj is depicted below:. When the example above failed to match string to dictionaryposition 5 of the partial match table was used to determine that search should fallback 2 to dictionary.
The source code implementing the KMP algorithm is contained in the file kmp. The majority of the code follows the outline of the pseudocode provided by Wikipedia. The sequential search algorithm moves through the dictionary one character at a time, checking for matches to the first character in the string to be encoded. Assuming equal distribution of characters, there’s a 1 in alphabet size chance of characters matching. One approach to ensuring that the first character of the string being encoded matches the dictionary entry it akgorithm being compared to is to keep a list of all dictionary entries that start with a character for each character in the alphabet.
For example, there would be one list of all the entries that start algoruthm ‘A’, and another of all the entries that start with ‘B’. In the case of a character alphabet, lists must be maintained. Since the dictionary lzxs a sliding window of the last characters encoded by the algorithm, the lists of strings starting with a given character must be updated as old characters are removed from the dictionary and new characters are added to the dictionary.
Linked lists are a natural choice for implementing such dynamic lists. The additional memory overhead required to implement a collection of linked lists is one pointer to the head of each of the lists, plus one pointer to the next character in algorifhm list for each character in the dictionary. The source code implementing a linked list search is contained in the version 0.
In my implementation, all pointers list head and next are int indices into the sliding window dictionary. After implementing string matches with linked listsit seemed like a wasn’t much effort to try matching using hash tables.
In actuality, the linked lists approach, is a algorifhm involving hash tables where the hash key is the first character of the string algoritjm collisions are stored in a linked list. If that’s all I wanted, I’d be done.