Quantitative Comparative Study of the Performance of Lossless Compression Methods Based on a Text Data Model
1. Introduction
Although processing and storage capacities are increasing according to Moore's law, the production of data is outstripping the growth of hardware and transmission capacity [1]. It is therefore important to make information compact before transmitting or archiving it, which requires compression techniques that reduce the size of the data to be transmitted or stored, and hence a careful choice of the best compression method. When compression discards information deemed non-essential, it is referred to as lossy compression; when the original data can be recovered exactly after reconstruction, the compression is lossless. The information to be compressed can be audio, text or images; here we consider text data and lossless compression. In [1], the author compares the RLE, LZW, arithmetic and Huffman techniques to determine which is the fastest and most efficient, using the following evaluation criteria: compression ratio, compression time, and the size and complexity of the text files. He was able to rank the techniques by compression performance, starting with the best, but could not evaluate a single compression ratio or compression time, because each technique compressed the selected text files differently depending on their word count. According to the work in [2], based on the RLE, LZW and Huffman methods, RLE gave the best compression ratio for text data; for HTML files, however, RLE is second only to LZW. RLE, Delta, Huffman and LZW were also compared in [3]: LZW gave the best result, with a compression ratio of 4:1 and a 76.9% saving in memory space, while Delta encoding outperformed RLE and Huffman.
The comparative study of lossless compression methods is a priority in a context where the amount of data to be stored or transmitted is constantly increasing. Given this growth, it is essential to determine which compression method offers the best compromise between reducing data size and preserving data integrity.
The problem addressed in this study lies in identifying the relative performance of the different lossless compression methods available, and in understanding how these techniques behave on different types of data with respect to chosen evaluation criteria. For the purposes of this paper, we have selected specific text data models to achieve this objective.
In this paper, we carry out a comparative study of the most widely used and relevant lossless compression techniques. In the next section, we present the state-of-the-art in lossless compression methods; then in Section 3, we propose a text dataset and determine the performance models of the lossless compression methods selected. Sections 4 and 5 are devoted to results and discussions respectively. The final section provides a conclusion and outlook.
2. State-of-the-Art Lossless Compression Methods
In the digital domain, several compression methods are available for compacting information: lossy and lossless [3]. Lossy techniques achieve a higher compression ratio than lossless techniques [4], whereas lossless compression techniques reduce file size while allowing the original data to be recovered exactly. Some research has focused on comparing lossless compression codings. For example, [5] states that arithmetic coding has a better compression ratio than Huffman coding. In [6], a comparison of the Shannon-Fano, Huffman, RLE and Tunstall algorithms applied to sentences shows that the Huffman and Shannon-Fano compression ratios are the highest, while the RLE compression ratio is the lowest. Nagamani N. P. et al. used compression ratio and space saving to compare Huffman, Shannon-Fano and LZW coding; their results show that the Shannon-Fano technique is more efficient than the others when the text size is less than 50 Kb [7]. [8] follows suit, but compares LZW, RLE, Huffman, Shannon-Fano, arithmetic coding, LZ77 and LZ78; there, LZW is the most efficient (81.31% space saving). In [9], Huffman and arithmetic coding were compared, and the arithmetic method was found to be the most reliable, but not as fast as the Huffman method. [3] compared RLE, Huffman, LZW and Delta coding, evaluating them in terms of compression ratio, space saving, compression factor and elapsed time; the conclusion is that LZW has the best compression ratio (4:1) and space saving (76.9%), but is complex and takes time to compress, while Delta encoding is better than RLE and Huffman. Rajput A. A. et al., using the same evaluation criteria, compared the RLE, LZW, Shannon-Fano and Huffman techniques and found that Shannon-Fano coding was better than RLE and that Huffman coding was slower than LZW [10]. In [11], LZW is compared with adaptive Huffman coding on English and Arabic texts: adaptive Huffman outperforms LZW in compression time, while the opposite holds for decompression time and bit rate. In [12], the evaluation of arithmetic and adaptive Huffman coding indicates that the arithmetic method is preferable, given the greater number of bits required by adaptive coding. Alif M. et al. compared LZW and Huffman in terms of space saving and compression time, and found LZW to be superior to Huffman [13]. What we note, however, is that this research does not focus on the characteristics of the data and does not identify tipping points that would allow a finer comparison.
There are several reversible compression methods, such as: RLE coding, BWT, Huffman, arithmetic coding, LZW, Tunstall coding, Shannon-Fano, Golomb coding, Prediction by Partial Matching (PPM), Delta encoding, Context Tree Weighting (CTW), Move-to-Front, and so on. But in the context of our work, we will only use seven of these methods, which we consider to be the most relevant and the most widely used in the literature for compressing text data.
In this work, we quantitatively compare the performance of the selected lossless compression methods by means of performance quantities such as entropy, average code length and compression ratio, using a text data model.
2.1. RLE Encoding Principle
The RLE (Run-Length Encoding) algorithm is a simple and effective compression technique for compressing data with repetitive sequences.
It runs through the data sequence, character by character, from left to right.
It then identifies sequences of consecutive characters that are identical. When a repetitive sequence is detected, the algorithm counts the number of repetitions.
The repeated sequence is represented by the character followed by the number of repetitions. This representation is then added to the already compressed part.
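As an illustration of these steps, here is a minimal Python sketch of the principle (the (character, count) pair representation and the function names are ours, chosen for readability):

```python
def rle_encode(text):
    """Run-Length Encoding: scan left to right and emit (character, run length) pairs."""
    if not text:
        return []
    encoded = []
    current, count = text[0], 1
    for ch in text[1:]:
        if ch == current:
            count += 1                      # extend the current run
        else:
            encoded.append((current, count))
            current, count = ch, 1          # start a new run
    encoded.append((current, count))
    return encoded


def rle_decode(pairs):
    """Rebuild the original string from the (character, run length) pairs."""
    return "".join(ch * count for ch, count in pairs)


print(rle_encode("aaaaaaaa"))            # [('a', 8)]
print(rle_decode(rle_encode("aaabbc")))  # 'aaabbc'
```

As the example suggests, RLE is only worthwhile when the runs are long; on data whose runs have length 1, the counts make the output larger than the input, which is exactly what the negative compression ratios of Section 3.2.1 reflect.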
2.2. Principle of Coding RLE + BWT
2.2.1. Principle of the Burrows-Wheeler Transform (BWT)
This technique does not compress by itself. It rearranges the characters of a string in order to expose redundancy and thereby make the data easier to compress with certain compression techniques [14]. The steps can be distinguished as follows:
All the rotations in the string under consideration are generated by iteratively moving the last character to the first position. This produces a rotation matrix.
The next step is to sort these rotations lexicographically to obtain a sorted matrix.
The expected result is given by the last column of the sorted matrix.
2.2.2. The Combination of RLE and BWT
The RLE algorithm is applied to the BWT result to obtain compression.
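A minimal Python sketch of this combination (the sentinel character "$" and the function names are illustrative; a practical implementation would also keep the information needed to invert the transform):

```python
from itertools import groupby

def bwt(text, eos="$"):
    """Burrows-Wheeler Transform: generate all rotations, sort them, keep the last column."""
    s = text + eos                                           # sentinel marks the end of the string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def bwt_rle(text):
    """Apply RLE to the BWT output, as described above."""
    transformed = bwt(text)
    return [(ch, len(list(run))) for ch, run in groupby(transformed)]

print(bwt("banana"))       # 'annb$aa' -> identical characters are grouped together
print(bwt_rle("banana"))   # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```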
2.3. Principle of Shannon-Fano Coding
This is a lossless coding method for compressing data, devised by Claude Shannon and Robert Fano in the 1940s. The basic principle is:
Construct a table of symbol occurrence frequencies sorted in descending order.
Divide this table into two parts. Each must have a sum of frequencies equal (or practically equal) to that of the other.
Assign the binary digit 0 to the lower half, with the upper half taking the value 1.
Repeat operations 2 and 3 on both halves, until each part of the table contains only a single symbol.
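A small Python sketch of this recursive splitting, assuming the near-equal split is chosen by minimising the difference between the two halves (the bit convention follows step 3 above: the lower half takes 0, the upper half 1):

```python
def shannon_fano(freqs):
    """Shannon-Fano coding: sort by decreasing frequency, split into two near-equal halves, recurse."""
    codes = {}

    def split(symbols, prefix):
        if len(symbols) == 1:
            codes[symbols[0][0]] = prefix or "0"
            return
        total = sum(f for _, f in symbols)
        best_cut, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(symbols)):
            running += symbols[i - 1][1]
            diff = abs(2 * running - total)      # |sum of upper part - sum of lower part|
            if diff < best_diff:
                best_cut, best_diff = i, diff
        split(symbols[:best_cut], prefix + "1")  # upper half takes 1
        split(symbols[best_cut:], prefix + "0")  # lower half takes 0

    ordered = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    split(ordered, "")
    return codes

print(shannon_fano({"a": 4, "b": 3, "c": 2, "d": 1}))
# {'a': '1', 'b': '01', 'c': '001', 'd': '000'}
```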
2.4. Principle of Huffman Coding
The Huffman compression method reduces to a minimum the number of bits used to code a fragment of information. The principle can be described in several stages:
The messages form the leaves of a tree, each leaf carrying a weight equal to the probability P of occurrence of the corresponding message.
Join the 2 nodes with the lowest weights into a parent node to which we attach a weight equal to the sum of these 2 weights.
Repeat step 2 until a single tree root is obtained.
Assign codes 0 and 1 to the root’s direct descendant nodes.
Continue downwards by assigning codes to all the nodes, each pair of descendants receiving codes L0 and L1 where L designates the code associated with the parent.
Example of Huffman encoding [15]
Consider T = <ab bd ac ab be bd ac>. T is compressed using Huffman coding; the letter frequencies are: freq (a) = 4, freq (b) = 5, freq (c) = 2, freq (d) = 2, freq (e) = 1 and freq (space) = 6.
Figure 1 shows the Huffman coding tree.
code (a) = 011, code (b) = 01, code (c) = 11111, code (d) = 0111, code (e) = 01111, code (space) = 0. So, the encoding of T will be 011010-0101110-011111110-011010-01011110-0101110-011111110.
Figure 1. Huffman coding tree.
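A compact Python sketch of the construction above, applied to the string T of the example. The exact bit patterns depend on how ties are broken when merging nodes, so they may differ from Figure 1, but the code lengths (and hence the average length) are those guaranteed by Huffman coding:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Huffman coding: repeatedly merge the two lowest-weight nodes; each merge prepends 0/1."""
    freqs = Counter(text)
    # heap entries: (weight, tie-breaker, {symbol: code suffix})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                   # degenerate single-symbol source
        return {sym: "0" for sym in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

T = "ab bd ac ab be bd ac"
codes = huffman_codes(T)
avg = sum(len(codes[s]) * f for s, f in Counter(T).items()) / len(T)
print(codes)
print(avg)    # 2.4 bits/symbol for the frequencies of this example
```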
2.5. Arithmetic Coding Principle
Arithmetic coding is a compression technique used to represent a sequence of symbols by a single real number within a defined interval. The various coding stages are [16]:
Let X be a sequence of m symbols with X = x1x2....xm which takes its values from a source S = {s1, s2, ...,sn},
Calculate the probability associated with each symbol of the source S, denoted
pi = P(S = si),
Associate with each symbol sk a sub-interval [Lsk, Hsk[ proportional to its probability, with Hsk − Lsk = pk,
Initialise the lower limit (L) of the working interval to the value 0 and the upper limit (H) to the value 1,
As long as a symbol remains in the sequence to be coded:
width = H − L,
L = L + width * Lsk,
H = L + width * Hsk (both bounds are computed from the previous value of L, i.e. the two updates are performed simultaneously),
In the end, any value in the range [L, H[ uniquely represents the input sequence.
In general, the lower limit (L) of the interval is chosen as the representative of the sequence, or the mean value (L + H)/2 can be chosen. The value chosen to represent the sequence to be coded is called code(X). Remember that code(X) is represented with a number of bits equal to ⌈− log2(P(X))⌉ + 1.
Example of Arithmetic Coding [17]
Table 1 and Figure 2 illustrate an example of arithmetic coding: coding the word “COMCA” (Langdon, 1984).
Table 1. Example of statistics relating to the symbols in a message.
Symbol | Frequency | Probability | Interval
C | 2 | 0.4 | [0 - 0.4]
O | 1 | 0.2 | [0.4 - 0.6]
M | 1 | 0.2 | [0.6 - 0.8]
A | 1 | 0.2 | [0.8 - 1]
TOTAL | 5 | 1 |
Figure 2. Example of arithmetic coding.
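A minimal Python sketch of the interval-narrowing loop applied to the “COMCA” example of Table 1 (real arithmetic coders work with integer arithmetic and renormalisation to avoid floating-point precision loss; this version only illustrates the principle):

```python
def arithmetic_bounds(sequence, probabilities):
    """Narrow the working interval [L, H[ for each symbol of the sequence."""
    # sub-intervals [Ls, Hs[ proportional to the probabilities, in the order of Table 1
    intervals, low = {}, 0.0
    for sym, p in probabilities.items():
        intervals[sym] = (low, low + p)
        low += p

    L, H = 0.0, 1.0
    for sym in sequence:
        width = H - L
        Ls, Hs = intervals[sym]
        L, H = L + width * Ls, L + width * Hs   # both bounds use the previous value of L
    return L, H

L, H = arithmetic_bounds("COMCA", {"C": 0.4, "O": 0.2, "M": 0.2, "A": 0.2})
print(L, H)   # approx. 0.21312 and 0.2144: any value in this interval identifies "COMCA"
```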
2.6. Principle of Tunstall Coding
Tunstall coding uses fixed-length codes to represent whole blocks of symbols in a single operation. The principle is as follows [6]:
n-bit code for a source S of an alphabet of size N;
The number of code words is 2^n;
The entry with the highest probability is taken and concatenated with each letter of the alphabet (it is replaced by the new entries);
The size of the code increases from N to N + (N − 1);
Pr (new entry) = Π Pr (letters);
Repeat the same procedure;
Each time the size of the code increases by N − 1;
After K iterations, the code size is N + K (N − 1);
Stop condition: N + K (N − 1) ≤ 2^n.
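A small Python sketch of this construction on an illustrative three-letter source (the probabilities and function names are ours; the dictionary grows exactly as described above, by N − 1 entries per expansion, until the 2^n codewords would be exceeded):

```python
import heapq

def tunstall(probs, nbits):
    """Tunstall coding: grow a dictionary of source strings, all mapped to fixed n-bit codewords."""
    max_words = 2 ** nbits
    heap = [(-p, sym) for sym, p in probs.items()]   # max-heap via negated probabilities
    heapq.heapify(heap)
    n_letters = len(probs)
    # each expansion removes one entry and adds N, i.e. the dictionary grows by N - 1
    while len(heap) + (n_letters - 1) <= max_words:
        neg_p, word = heapq.heappop(heap)            # entry with the highest probability
        for sym, p in probs.items():                 # concatenate it with every letter
            heapq.heappush(heap, (neg_p * p, word + sym))
    words = sorted(word for _, word in heap)
    return {word: format(i, f"0{nbits}b") for i, word in enumerate(words)}

print(tunstall({"a": 0.7, "b": 0.2, "c": 0.1}, nbits=3))
# 7 dictionary entries ('aaa', 'aab', 'aac', 'ab', 'ac', 'b', 'c'), each coded on 3 bits
```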
2.7. Principle of Lempel Ziv Welch (LZW) Coding [18]
LZW coding is a dictionary-based compression technique [1]. It proceeds as follows:
The dictionary is initialised with the single characters of the source alphabet;
The input is scanned for the longest string already present in the dictionary, and the code (index) of this string is output instead of the string itself;
This string, extended by the next input character, is then added to the dictionary as a new entry, so that frequently recurring strings are progressively replaced by single short codes.
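A minimal Python sketch of the encoder, assuming 8-bit characters for the initial dictionary (the decoder rebuilds the same dictionary symmetrically and is omitted here):

```python
def lzw_encode(text):
    """LZW: extend a dictionary with every new string met; output the index of each longest match."""
    dictionary = {chr(i): i for i in range(256)}   # initial single-character entries
    next_code = 256
    current, output = "", []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                    # keep extending the current match
        else:
            output.append(dictionary[current])     # emit the code of the longest match
            dictionary[candidate] = next_code      # register the new string
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("abababab"))   # [97, 98, 256, 258, 98]: repeated substrings become single codes
```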
Lossless compression techniques are used in a number of areas where reducing data volume without degrading information is crucial since this reduction in the amount of data increases storage capacity and, above all, the capacity of the communication channel [19]. These methods offer a variety of approaches to achieving this objective.
3. Performance Evaluation of Selected Lossless Compression Methods
Evaluating the performance of lossless compression techniques is an essential step in choosing the most suitable and effective technique for a given problem. This requires evaluation criteria such as the compression ratio, the average code length and the entropy.
3.1. Our Model of Data
In order to gain a better understanding of the effectiveness of the compression techniques selected, we propose the following text data models for the various tests:
Generalisation leads us to the data below:
Data 0G: aaa…a (the character “a” repeated n times);
Data 1G: arar…ar (“a” and “r” alternating);
Data 2G: abcabc…abc (the pattern “abc” repeated cyclically);
Data 3G: abcdabcd…abcd (the pattern “abcd” repeated cyclically).
In each case the string contains n characters in total, so that, with 8-bit characters, the initial volume is 8n bits.
Let us now calculate, for each of the generalized data sets, the relevant quantities before and after applying the selected compression techniques. Consider the following parameters: the initial data volume (IV), the compressed data volume (CV), the compression ratio (CR), the average code length (L) and the source entropy (H). These parameters are calculated according to the following equations:
Compression ratio: CR = 1 − CV/IV (1)
Average code length: L = Σ pi li, where pi is the probability of symbol i and li the length of its code word (2)
Source entropy, which defines the theoretical limit of lossless compression: H = − Σ pi log2(pi) (3)
The probabilities of the different characters in each of the data sets are:
Data 0G: P(a) = 1;
Data 1G: P(a) = 1/2; P(r) = 1/2;
Data 2G: P(a) = 1/3; P(b) = 1/3; P(c) = 1/3;
Data 3G: P(a) = 1/4; P(b) = 1/4; P(c) = 1/4; P(d) = 1/4.
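The following Python sketch shows how equations (1)-(3) can be evaluated for a given data set and code; the example reproduces the Data 1G case with a 1-bit code per symbol (the Huffman/Shannon-Fano situation of Section 3.2), for which CR = 7/8, L = 1 bit/symbol and H = 1 bit/symbol:

```python
import math
from collections import Counter

def metrics(text, codes):
    """Compression ratio (1), average code length (2) and source entropy (3) for a given code."""
    n = len(text)
    freqs = Counter(text)
    iv = 8 * n                                              # initial volume: 8 bits per character
    cv = sum(len(codes[s]) * f for s, f in freqs.items())   # compressed volume in bits
    cr = 1 - cv / iv                                        # equation (1)
    L = cv / n                                              # equation (2), in bits/symbol
    H = -sum((f / n) * math.log2(f / n) for f in freqs.values())   # equation (3)
    return cr, L, H

print(metrics("ar" * 8, {"a": "0", "r": "1"}))   # (0.875, 1.0, 1.0) -> CR = 7/8, L = 1, H = 1
```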
3.2. Application of Compression Methods to Data
3.2.1. Case of RLE
Data 0G: IV = 8n; CV = 8 + log2(n); CR = 1 − (8 + log2(n))/(8n)
L = 1 × n = n bits/symbol
H = −1 × log2(1) = 0 bit/symbol
Data 1G: IV = 8n; CV = n + 8n = 9n; CR = −1/8
L = 0.5n + 0.5n = n bits/symbol
H = 1 bit/symbol
Data 2G: IV = 8n; CV = n + 8n = 9n; CR = −1/8
L = n bits/symbol
H = 1.583 bits/symbol
Data 3G: IV = 8n; CV = n + 8n = 9n; CR = −1/8
L = n bits/symbol
H = 2 bits/symbol
3.2.2. Case of RLE + BWT
Data 0G: IV = 8n; CV = 8 + log2(n); CR = 1 − (8 + log2(n))/(8n)
L = n bits/symbol
H = 0 bit/symbol
Data 1G: IV = 8n; CV = 8 + log2(n); CR = 1 − (8 + log2(n))/(8n)
L = n bits/symbol
H = 1 bit/symbol
Data 2G: IV = 8n; CV = 8 + log2(n); CR = 1 − (8 + log2(n))/(8n)
L = n bits/symbol
H = 1.583 bits/symbol
Data 3G: IV = 8n; CV = 8 + log2(n); CR = 1 − (8 + log2(n))/(8n)
L = n bits/symbol
H = 2 bits/symbol
3.2.3. Case of Shannon-Fano
Data 0G: IV = 8n; CV = n; CR = 7/8
L = 1 bit/symbol
H = 0 bit/symbol
Data 1G: IV = 8n; CV = 2n; CR = 3/4
L = 1 bit/symbol
H = 1 bit/symbol
Data 2G: IV = 8n; CV = 5n; CR = 3/8
L = 1.65 bits/symbol
H = 1.583 bits/symbol
Data 3G: IV = 8n; CV = 8n; CR = 0
L = 2 bits/symbol
H = 2 bits/symbol
3.2.4. Case of Huffman
Data 0G: IV = 8n; CV = n; CR = 7/8
L = 1 bit/symbol
H = 0 bit/symbol
Data 1G: IV = 8n; CV = n; CR = 7/8
L = 1 bit/symbol
H = 1 bit/symbol
Data 2G: IV = 8n; CV = 5n/3; CR = 19/24
L = 1.65 bits/symbol
H = 1.583 bits/symbol
Data 3G: IV = 8n; CV = 4n; CR = 1/2
L = 2 bits/symbol
H = 2 bits/symbol
3.2.5. Case of Arithmetic Coding
Data 0G: IV = 8n; CV = 2 bits; CR = 1 − 1/4n
L = 2 bits/symbol
H = 0 bit/symbol
Data 1G: IV = 8n; CV = 9 bits; CR = 1 − 9/8n
L = 4.5 bits/symbol
H = 1 bit/symbol
Data 2G: IV = 8n; CV = 21 bits; CR = 1 − 21/8n
L = 7 bits/symbol
H = 1.583 bits/symbol
Data 3G: IV = 8n; CV = 21 bits; CR = 1 − 21/8n
L = 5.25 bits/symbol
H = 2 bits/symbol
3.2.6. Case of Tunstall Coding
Data 0G: IV = 8n; CV = 2 bits; CR = 1 − 1/4n
L = 2 bits/symbol
H = 0 bit/symbol
Data 1G: IV = 8n; CV = 8 bits; CR = 1 − 1/n
L = 4 bits/symbol
H = 1 bit/symbol
Data 2G: IV = 8n; CV = 21 bits; CR = 1 − 21/8n
L = 7 bits/symbol
H = 1.583 bits/symbol
Data 3G: IV = 8n; CV = 64 bits; CR = 1 − 8/n
L = 16 bits/symbol
H = 2 bits/symbol
3.2.7. Case of LZW
Data 0G: IV = 8n; CV = 2 + log2(n) bits; CR = 1 − (2 + log2(n))/(8n)
L = 27 bits/symbol
H = −1 × log2(1) = 0 bit/symbol
Data 1G: IV = 8n; CV = 3 + log2(n) bits; CR = 1 − (3 + log2(n))/(8n)
L = 18 bits/symbol
H = 1 bit/symbol
Data 2G: IV = 8n; CV = 4 + log2(n) bits; CR = 1 − (4 + log2(n))/(8n)
L = 18 bits/symbol
H = 1 bit/symbol
Data 3G: IV = 8n; CV = 6 + log2(n) bits; CR = 1 − (6 + log2(n))/(8n)
L = 20.25 bits/symbol
H = 2 bits/symbol
4. Results
4.1. Summary Table
The quantities L and H are expressed in bits/symbol; the compression ratio (CR), which may also be expressed as a percentage, is reported in Table 2 for the constructed data.
Table 2. Summary of the formulas of H, L and CR for the constructed data.

DATA | Quantity | RLE | BWT + RLE | Shannon-Fano | Huffman | Arithmetic coding | Tunstall coding | LZW
Data 0G | CR | 1 − (8 + log2(n))/(8n) | 1 − (8 + log2(n))/(8n) | 7/8 | 7/8 | 1 − 1/4n | 1 − 1/4n | 1 − (2 + log2(n))/(8n)
Data 0G | L | n | n | 1 | 1 | 2 | 2 | 27
Data 0G | H | 0 | 0 | 0 | 0 | 0 | 0 | 0
Data 1G | CR | −1/8 | 1 − (8 + log2(n))/(8n) | 3/4 | 7/8 | 1 − 9/8n | 1 − 1/n | 1 − (3 + log2(n))/(8n)
Data 1G | L | n | n | 1 | 1 | 4.5 | 4 | 18
Data 1G | H | 1 | 1 | 1 | 1 | 1 | 1 | 1
Data 2G | CR | −1/8 | 1 − (8 + log2(n))/(8n) | 3/8 | 19/24 | 1 − 21/8n | 1 − 21/8n | 1 − (4 + log2(n))/(8n)
Data 2G | L | n | n | 1.65 | 1.65 | 7 | 7 | 18
Data 2G | H | 1.583 | 1.583 | 1.583 | 1.583 | 1.583 | 1.583 | 1
Data 3G | CR | −1/8 | 1 − (8 + log2(n))/(8n) | 0 | 1/2 | 1 − 21/8n | 1 − 8/n | 1 − (6 + log2(n))/(8n)
Data 3G | L | n | n | 2 | 2 | 5.25 | 16 | 20.25
Data 3G | H | 2 | 2 | 2 | 2 | 2 | 2 | 2
4.2. Construction of Compression Ratio Curves as a Function of n (Number of Character Repetitions) and Sign Tables for These Curves
Table 3 shows the signs of the pairwise differences between the compression-ratio functions of Table 2 for Data 0G. The tipping points (3×10⁻⁴, 1, 1.8622, 2, 11.5270) are obtained by equating these functions two by two and solving for n. For Data 0G we thus have, for example:
(RLE) − (BWT + RLE) = 0 on [0, 100];
(RLE) − (Shannon-Fano) < 0 on [0, 11.5270] and (RLE) − (Shannon-Fano) > 0 on [11.5270, 100];
and so on until the end of the table.
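As an illustration of how such a tipping point can be located numerically (this is only one possible way of solving the pairwise equations; the curve name and the bisection helper are ours), the value 11.5270 of Table 3 is the solution of CR_BWT+RLE(n) = CR_Huffman for Data 0G:

```python
import math

def cr_bwt_rle(n):
    # CR of BWT + RLE on Data 0G, from Table 2
    return 1 - (8 + math.log2(n)) / (8 * n)

def tipping_point(f, g, lo, hi, tol=1e-4):
    """Bisection on f(n) - g(n), assuming a single sign change on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(lo) - g(lo)) * (f(mid) - g(mid)) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

print(tipping_point(cr_bwt_rle, lambda n: 7 / 8, 1, 100))   # approx. 11.527, as in Table 3
```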
Figure 3 presents a graphical comparison of the compression ratios (CR) of the selected methods on Data 0G, whose signs are summarised in Table 3.
Figure 3. Compression Ratio (CR) for Data 0G as a function of parameter n.
Table 3. Signs of Data 0G.

Name of curves \ n | 0 | 3×10⁻⁴ | 1 | 1.8622 | 2 | 11.5270 | 100
RLE-BWT + RLE |  |  |  |  |  |  | 
RLE-Shan |  |  | − |  |  | 0 | +
RLE-Huf |  |  | − |  |  | 0 | +
RLE-Arm |  | + |  | 0 |  | − | 
RLE-Tuns |  | + |  | 0 |  | − | 
Shan-Arm |  |  | + |  | 0 |  | 
Shan-Tuns |  |  | + |  | 0 |  | 
Shan-LZW | + | 0 |  |  | − |  | 
Tuns-LZW |  | − | 0 |  | + |  | 
Arm-Tuns |  |  |  |  |  |  | 
Shan-Huf |  |  |  |  |  |  | 
RLE-LZW |  |  |  | − |  |  | 
Huf-Arm |  |  | + |  | 0 | − | 
Huf-Tuns |  |  | + |  | 0 | − | 
Huf-LZW | + | 0 |  |  | − |  | 
BWT + RLE-Shan |  |  | − |  |  | 0 | +
BWT + RLE-Huf |  |  | − |  |  | 0 | +
BWT + RLE-Arm |  | + |  | 0 |  | − | 
BWT + RLE-Tuns |  | + |  | 0 |  | − | 
BWT + RLE-LZW |  |  |  | − |  |  | 
Arm-LZW |  | − | 0 |  |  |  | 
Table 4 is read in the same way as Table 3, but for Data 1G. The tipping points are (0.86, 0.88, 1, 2, 4, 4.5, 5.18, 5.4, 8, 9, 11.5, 32, 64). For example:
(RLE) − (BWT + RLE) > 0 on [0, 0.86] and (RLE) − (BWT + RLE) < 0 on [0.86, 100];
(RLE) − (Shannon-Fano) < 0 on [0, 100].
As in Figure 3, Figure 4 presents a graphical comparison of the compression ratios (CR) of the selected methods on Data 1G, whose signs are summarised in Table 4.
Figure 4. Compression ratio (CR) for Data 1G as a function of parameter n.
Table 4. Signs of Data 1G.

Name of curves \ n | −∞ | 0 | 0.86 | 0.88 | 1 | 2 | 4 | 4.5 | 5.18 | 5.4 | 8 | 9 | 11.5 | 32 | 64 | 100
RLE-BWT + RLE |  | + | 0 |  |  |  |  |  | − |  |  |  |  |  |  | 
RLE-Shan |  |  |  |  |  |  |  |  | − |  |  |  |  |  |  | 
RLE-Huf |  |  |  |  |  |  |  |  | − |  |  |  |  |  |  | 
RLE-Arm |  |  | + |  | 0 |  |  |  |  |  | − |  |  |  |  | 
RLE-Tuns |  |  |  | 0 |  |  |  |  |  | − |  |  |  |  |  | 
RLE-LZW |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
BWT + RLE-Shan |  |  |  |  | − |  |  |  | 0 |  |  |  | + |  |  | 
BWT + RLE-Huf |  |  |  |  |  |  | − |  |  |  |  |  | 0 |  | + | 
BWT + RLE-Arm |  |  | + |  |  | 0 |  |  |  |  | − |  |  |  |  | 
BWT + RLE-Tuns |  |  | + |  | 0 |  |  |  |  |  | − |  |  |  |  | 
BWT + RLE-LZW |  |  |  |  |  |  |  |  | − |  |  |  |  |  |  | 
Shan-Huf |  |  |  |  |  |  |  |  | − |  |  |  |  |  |  | 
Shan-Arm |  |  |  | + |  |  |  | 0 |  |  |  | − |  |  |  | 
Shan-Tuns |  |  | + |  |  |  | 0 |  |  |  |  |  |  |  |  | 
Shan-LZW |  | + |  |  |  | 0 |  |  |  |  |  |  |  |  |  | 
Huf-Arm |  |  |  |  |  |  | + |  |  |  |  | 0 |  | − |  | 
Huf-Tuns |  |  |  |  |  | + |  |  |  |  | 0 |  | − |  |  | 
Huf-LZW |  |  |  |  | + |  |  |  |  | 0 |  | − |  |  |  | 
Arm-Tuns |  |  |  |  |  |  |  |  | − |  |  |  |  |  |  | 
Arm-LZW |  |  |  |  |  |  |  | + |  |  |  |  |  |  | 0 | −
Tuns-LZW |  |  |  |  |  |  | + |  |  |  |  |  |  | 0 | − | 
Table 5 is read in the same way as Table 3, but for Data 2G. The tipping points are (0.69, 0.86, 1.76, 2.3, 3.47, 4, 4.2, 6.4, 100). For example:
(RLE) − (BWT + RLE) > 0 on [0, 0.86] and (RLE) − (BWT + RLE) < 0 on [0.86, 100];
(RLE) − (Shannon-Fano) < 0 on [0, 100].
Figure 5 presents a graphical comparison of the compression ratios (CR) of the selected methods on Data 2G, whose signs are summarised in Table 5.
Figure 5. Compression Ratio (CR) for Data 2G as a function of parameter n.
Table 5. Signs of Data 2G.

Name of curves \ n | −∞ | 0 | 0.69 | 0.86 | 1.76 | 2.3 | 3.47 | 4 | 4.2 | 6.4 | 100
RLE-BWT + RLE |  |  | + | 0 |  |  | − |  |  | − | 
RLE-Shan |  |  |  |  |  | − |  |  |  | − | 
RLE-Huf |  |  |  |  |  | − |  |  |  | − | 
RLE-Arm |  |  |  | + |  | 0 |  |  |  | − | 
RLE-Tuns |  |  |  | + |  | 0 |  |  |  | − | 
RLE-LZW |  |  |  |  |  | − |  |  |  | − | 
BWT + RLE-Shan |  |  | − |  | 0 |  |  | + |  |  | 
BWT + RLE-Huf |  |  |  |  |  |  |  |  | 0 | + | 
BWT + RLE-Arm |  |  |  |  |  |  |  |  |  | − | +
BWT + RLE-Tuns |  |  |  |  |  |  |  |  |  | − | +
BWT + RLE-LZW |  |  |  |  |  |  |  |  |  | − | 
Shan-Huf |  |  |  |  |  |  |  |  |  | − | 
Shan-Arm |  |  |  |  | + |  |  |  | 0 | − | 
Shan-Tuns |  |  |  |  | + |  |  |  | 0 | − | 
Shan-LZW |  | + |  |  |  |  |  |  |  | − | 
Huf-Arm |  |  |  |  | + |  |  |  | 0 | − | 
Huf-Tuns |  |  |  |  | + |  |  |  | 0 | − | 
Huf-LZW |  |  |  | + |  | 0 |  |  | − | − | 
Arm-Tuns |  |  |  |  |  |  |  |  |  |  | 
Arm-LZW |  |  |  |  |  |  |  |  |  | 0 | +
Tuns-LZW |  |  |  |  | + |  |  | 0 |  | − | 
Table 6 is read in the same way as Table 3, but for Data 3G. The tipping points are (0.57, 0.68, 0.86, 1, 1.68, 2.3, 2.33, 2.6, 5.25, 7.1, 8, 16, 100). For example:
(RLE) − (BWT + RLE) > 0 on [0, 0.86] and (RLE) − (BWT + RLE) < 0 on [0.86, 100];
(RLE) − (Shannon-Fano) < 0 on [0, 100].
Figure 6 presents a graphical comparison of the compression ratios (CR) of the selected methods on Data 3G, whose signs are summarised in Table 6.
Figure 6. Compression ratio (CR) for Data 3G as a function of parameter n.
Table 6. Signs of Data 3G.

Name of curves \ n | 0 | 0.57 | 0.68 | 0.86 | 1 | 1.68 | 2.3 | 2.33 | 2.6 | 5.25 | 7.1 | 8 | 16 | 100
RLE-BWT + RLE |  | + |  | 0 |  |  |  |  | − |  |  |  |  | 
RLE-Shan |  |  |  |  |  |  |  | − |  |  |  |  |  | 
RLE-Huf |  |  |  |  |  |  |  | − |  |  |  |  |  | 
RLE-Arm |  |  |  |  | + |  |  | 0 |  |  |  | − |  | 
RLE-Tuns |  |  |  |  |  | + |  |  |  |  | 0 | − |  | 
RLE-LZW | + | 0 |  |  |  |  |  | − |  |  |  |  |  | 
BWT + RLE-Shan |  |  | − |  | 0 |  |  |  |  | + |  |  |  | 
BWT + RLE-Huf |  |  |  |  |  |  | 0 |  |  |  | + |  |  | 
BWT + RLE-Arm |  |  |  |  |  |  |  | + |  |  |  |  |  | −
BWT + RLE-Tuns |  |  |  |  |  |  |  | + |  |  |  |  |  | −
BWT + RLE-LZW |  |  |  |  |  |  |  | − |  |  |  | − |  | 
Shan-Huf |  |  |  |  |  |  |  |  |  |  |  |  |  | 
Shan-Arm |  |  |  |  | + |  |  |  | 0 |  | − |  |  | 
Shan-Tuns |  |  |  |  |  |  |  |  |  | + |  | − |  | 
Shan-LZW | + |  |  |  |  |  |  |  |  |  |  | − |  | 
Huf-Arm |  |  |  |  |  |  |  | + |  |  |  | − |  | 
Huf-Tuns |  |  |  |  |  |  |  |  |  |  | + |  | − | 
Huf-LZW |  |  |  | + |  |  |  |  |  |  |  | − |  | 
Arm-Tuns |  |  |  |  |  |  |  |  |  |  |  |  |  | 
Arm-LZW |  |  |  |  |  |  |  |  |  |  |  |  | 0 | +
Tuns-LZW |  |  |  |  |  |  |  |  |  |  |  |  | 0 | +
Applying the selected compression techniques to our selected text data provides relevant results for analysis.
5. Discussion
For Data 0G:
- The pairs (RLE, BWT + RLE), (Shannon-Fano, Huffman) and (arithmetic coding, Tunstall) each offer identical performance.
- Above n = 11.5270, Shannon-Fano performs less well than all the other selected techniques; below the values 3×10⁻⁴, 2 and 11.5270, however, Shannon-Fano is better than the other techniques.
- Apart from Shannon-Fano, and both before and after 11.5270, RLE is not as good as arithmetic and LZW coding.
- From n = 1 onwards, arithmetic coding is better than LZW, and it is the best-performing technique for this data set.
For Data 1G:
- RLE is a poor encoder here, with a negative CR, and is inferior to the other selected compression techniques.
- In the same order of increasing performance, Shannon-Fano comes after RLE. Above n = 5.18, Shannon-Fano coding is worse than the others; before this value it is better than BWT + RLE, as well as better than arithmetic coding before 4.5, Tunstall before 4 and LZW before 2.
- Next comes Huffman coding, which performs less well than BWT + RLE above 11.5, than arithmetic coding above 9, than Tunstall above 8 and than LZW above 5.4.
- LZW, Tunstall, arithmetic coding and BWT + RLE have similar performances, but LZW gives the best result.
For Data 2G:
- RLE is again the least efficient technique from n = 2.3 upwards, with a negative CR. Before this value it is better than arithmetic coding, and where it outperforms techniques such as BWT + RLE or LZW it does so only with an insignificant CR.
- After RLE, Shannon-Fano shows poor performance compared with Huffman, arithmetic coding, BWT + RLE and LZW from n = 4.2 onwards.
- Huffman performs less well than arithmetic coding, BWT + RLE and LZW above 4.2; it is, however, better than arithmetic coding before this value, than BWT + RLE before 6.4 and than LZW before 3.47.
- LZW gives the best result for this data set just after 3.47, compared with Huffman and the others.
For Data 3G:
- RLE again shows the lowest performance, with a negative CR beyond n = 7.1. Before this value, however, it performs better than Tunstall, than arithmetic coding before 2.33, than BWT + RLE before 0.86 and than LZW before 0.57.
- Shannon-Fano performs less well than the others after 8; before this value, it is better than Tunstall, than arithmetic coding before 2.3, than BWT + RLE before 1 and than LZW before 0.68.
- Still in order of increasing performance, Huffman comes after Shannon-Fano. It performs less well than LZW, arithmetic coding and Tunstall after n = 16; before this value, Huffman performs better than Tunstall, than arithmetic coding before 5.25, than BWT + RLE before 2.3 and than LZW before 1.68.
- LZW is the best technique here, just ahead of BWT + RLE and arithmetic coding respectively.
6. Conclusions and Perspectives
In this paper, we have carried out a comparative study of the performance of selected lossless compression techniques, namely RLE, arithmetic coding, Huffman, BWT + RLE, Shannon-Fano, Tunstall and LZW, and proposed text data models in order to achieve our goal. Our evaluations, based on compression ratio, average code length and entropy, have enabled us to develop the underlying analytical models. We note that decompression headers have not been considered in the models developed, which could reduce the performance of certain methods.
The results obtained indicate that it makes sense to consider tipping points when comparing lossless compression techniques applied to text data in order to be strictly accurate. Apart from the 0G data, RLE is inefficient. On the other hand, arithmetic, Tunstall, LZW and BWT + RLE coding give very good results beyond their tipping points. LZW coding performed very well on most of our data.
For future work, we will address the following issues:
The optimization of existing compression techniques for improved performance and reduced complexity can be explored;
The adaptation of compression methods to specific types of data may be the subject of further research.