
Theory of Data Compression
Contents

Introduction and Background

Source Modeling

Entropy Rate of a Source

Shannon Lossless Source Coding Theorem

RateDistortion Theory

The Gap Between Theory and Practice

FAQs (Frequently Asked Questions)

References
I. Introduction and Background
Claude E. Shannon

In his 1948 paper,
``A
Mathematical Theory of Communication,''
Claude E. Shannon formulated the theory of data compression.
Shannon established that there
is a fundamental limit to
lossless data compression. This
limit, called the entropy rate, is denoted by
H. The exact value of H depends on the information source
 more specifically, the statistical nature of
the source. It is possible to compress the source, in a lossless manner, with
compression rate
close to H. It is mathematically impossible to do better than H.
Shannon also developed the theory of
lossy data compression.
This is better known as ratedistortion theory.
In lossy data compression, the decompressed data does not have to be
exactly the same as the original data. Instead, some amount
of distortion, D, is tolerated.
Shannon showed that, for a given source (with all its statistical
properties known) and a given distortion measure,
there is a function, R(D), called the ratedistortion function.
The theory says that if D is the tolerable amount of distortion, then
R(D) is the best possible compression rate.
When the compression is lossless (i.e., no distortion or D=0), the
best possible compression rate is R(0)=H (for a finite alphabet
source). In other words, the best
possible lossless compression rate is the entropy rate. In this sense,
ratedistortion theory is a generalization of lossless data compression theory,
where we went from no distortion (D=0) to some distortion (D>0).
Lossless data compression theory and ratedistortion theory
are known collectively as source coding theory.
Source coding theory sets fundamental limits on the performance of all
data compression algorithms. The theory, in itself, does not
specify exactly how to design and implement these algorithms.
It does, however, provide some hints and guidelines on how to
achieve optimal performance.
In the following sections, we will described how Shannon modeled
an information source in terms of a random process, we will present
Shannon lossless source coding theorem, and we will discuss Shannon
ratedistortion theory. A background in probability
theory is recommended.
II. Source Modeling
Available at Amazon.com


Imagine that you go to the library. This library has a large selection
of books  say there are 100 million books in this library. Each book
in this library is very thick  say, for example, that each book has
100 million characters (or letters) in them. When you get to this library,
you will, in some random manner, select a book to check out. This chosen book
is the information source to be compressed. The compressed book
is then stored on your zip disk to take home, or transmitted directly
over the internet into your home, or whatever the case may be.
Mathematically, the book you select is denoted by
where
represents the whole book, represents the first character in
the book, represents the second character, and so on.
Even though in reality the length of the book is finite, mathematically we
assume that it has infinite length. The reasoning is that the book is so
long we can just imagine that it goes on forever. Furthermore, the
mathematics turn out to be surprisingly simpler if we assume an infinite
length book.
To simplify things a little, let us assume that all the characters in all
the books are either a lowercase letter (`a' through `z') or a SPACE
(e. e. cummings style of writing shall we say). The
source alphabet,
,
is defined to be the set of all 27 possible values of the characters:
Now put yourself in the shoes of the engineer who designs the compression
algorithm. She does not know in advance which book you will select. All
she knows is that you will be selecting a book from this library.
From her perspective, the characters in the book
are
random variables
which take values on the alphabet .
The whole book,
is just an infinite sequence of random variables  that is
is a
random process.
There are several ways in which this engineer can model the statistical
properties of the book.
 ZeroOrder Model:
Each character is statistically independent of all other characters
and the 27 possible values in the alphabet
are equally likely to occur. If this model is accurate, then a typical
opening of a book would look like this
(all of these examples came directly from
Shannon's 1948 paper):
xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd
This does not look like the writing of an intelligent being.
In fact, it resembles the writing of a ``monkey sitting at a typewriter.''
 FirstOrder Model:
We know that in the English language some letters occur more frequently
than others. For example, the letters `a' and `e' are more common than
`q' and `z'. Thus, in this model, the character are still
independent of one another, but the probability distribution of the
characters are according to the
firstorder
statistical distribution of English text.
A typical text for this model looks like this:
ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva nah brl
 SecondOrder Model:
The previous two models assumed statistical independence from one
character to the next. This does not accurately reflect the nature
of the English language. For exa#ple, some letters in thi# sent#nce
are missi#g. However, we are still able to figure out what those letters
should have been by looking at the context. This implies that there
are some dependency between the characters. Naturally, characters which
are in close proximity are more dependent than those that
are far from each other.
In this model, the present character depends on the previous
character but it
is conditionally independent of all previous characters
.
According to this model, the probability distribution of the
character varies according to what the previous
character is. For example, the letter `u'
rarely occurs (probability=0.022). However, given that the
previous character is `q', the probability of a `u' in the present
character is much higher (probability=0.995). For a complete
description, see the
secondorder
statistical distribution of English text.
A typical text for this model would look like this:
on ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at teasonare fuso tizin andy tobe seace ctisbe
 ThirdOrder Model:
This is an extension of the previous model. Here, the present
character depends on the previous two characters
but it
is conditionally independent of all previous characters before those:
. In this model, the distribution
of varies according to what are.
See the
thirdorder
statistical distribution of English text.
A typical text for this model would look like this:
in no ist lat whey cratict froure birs grocid pondenome of demonstures of the reptagin is regoactiona of cre
The resemblance to ordinary English text increases quite noticeably at each
of the above steps.
 General Model:
In this model, the book is an arbitrary
stationary random process.
The statistical properties of this model are too complex to be deemed
practical. This model is interesting only from a theoretical point of view.
Model A above is a special case of Model B.
Model B is a special case of Model C.
Model C is a special case of Model D.
Model D is a special case of Model E.
III. Entropy Rate of a Source
Shannon in 1948


The entropy rate of a source is a number which depends only on the statistical
nature of the source. If the source has a simple model, then this
number can be easily calculated. Here, we consider an arbitrary source:
while paying special attention to the case where
is English text.
 ZeroOrder Model:
The characters are statistically independent of each other
and every letter of the alphabet,,
are equally likely to occur. Let be
the size of the alphabet. In this case, the entropy rate is given by
For English text, the alphabet size is m=27.
Thus, if this had been an accurate model for English text, then
the entropy rate would have been
H=log_{2} 27=4.75 bits/character.
 FirstOrder Model:
The characters are statistically independent. Let be
the size of the alphabet and let be the probability of
the ith letter in the alphabet. The entropy rate is
Using the
firstorder
distribution,
the entropy rate of English text would have been 4.07 bits/character had
this been the correct model.
 SecondOrder Model:
Let be the conditional probability that the present character
is the jth letter in the alphabet given that the previous character is
the ith letter. The entropy rate is
Using the
secondorder distribution,
the entropy rate of English text would have been 3.36 bits/character had
this been the correct model.
 ThirdOrder Model:
Let be the conditional probability that the present character
is the kth letter in the alphabet given that the previous character is
the jth letter and the one before that is the
ith letter. The entropy rate is
Using the
thirdorder distribution,
the entropy rate of English text would have been 2.77 bits/character had
this been the correct model.
 General Model:
Let represents the first characters. The
entropy rate in the general case is given by
where the sum is over all possible values of .
It is virtually impossible to calculate the entropy rate according to the
above equation. Using a prediction method, Shannon has been able to estimate
that the entropy rate of the 27letter English text is 2.3 bits/character
(see
Shannon:Collected Papers).
We emphasize that there is only one entropy rate for a given source.
All of the above definitions for the entropy rate are consistent with one
another.
IV. Shannon Lossless Source Coding Theorem
Excellent Textbook


Shannon lossless source coding theorem is based on the concept of block
coding. To illustrate this concept, we introduce a special information
source in which the alphabet consists of only two letters:
Here, the letters `a' and `b' are equally likely to occur. However, given
that `a' occurred in the previous character, the probability that `a'
occurs again in the present character is 0.9. Similarly, given
that `b' occurred in the previous character, the probability that `b'
occurs again in the present character is 0.9.
This is known as a binary symmetric Markov source.
An nth order block code is just a mapping which assigns to
each block of n consecutive characters a sequence of bits of
varying length.
The following examples illustrate this concept.
 FirstOrder Block Code: Each character is
mapped to a single bit.


Codeword 
a 
0.5 
0 
b 
0.5 
1 
R=1 bit/character 
An example:
Note that 24 bits are used to represent 24 characters  an average
of 1 bit/character.
 SecondOrder Block Code: Pairs
of characters are mapped to either one, two, or three bits.


Codeword 
aa 
0.45 
0 
bb 
0.45 
10 
ab 
0.05 
110 
ba 
0.05 
111 
R=0.825 bits/character 
An example:
Note that 20 bits are used to represent 24 characters  an average
of 0.83 bits/character.
 ThirdOrder Block Code: Triplets
of characters are mapped to bit sequence of lengths one through six.


Codeword 
aaa 
0.405 
0 
bbb 
0.405 
10 
aab 
0.045 
1100 
abb 
0.045 
1101 
bba 
0.045 
1110 
baa 
0.045 
11110 
aba 
0.005 
111110 
bab 
0.005 
111111 
R=0.68 bits/character 
An example:
Note that 17 bits are used to represent 24 characters  an average
of 0.71 bits/character.
Note that:
 The rates shown in the tables are calculated from
where is the length of the codeword for block
.
 The higher the order, the lower the rate (better compression).
 The codes used in the above examples are
Huffman codes.
 We are only interested in lossless
data compression code. That is, given the code table and given
the compressed data, we should be able to rederive the original data.
All of the examples given above are lossless.
Theorem: Let
be the rate of an optimal nth order lossless
data compression code (in bits/character). Then
Since both upper and lower bounds of approach the
entropy rate, H, as n goes to infinity, we have
Thus, the theorem established that the entropy rate is the rate of
an optimal lossless data compression code. The limit exists as long
as the source is stationary.
V. RateDistortion Theory
Available at Amazon.com

In lossy data compression, the decompressed data need not be exactly the
same as the original data. Often, it suffices to have a reasonably close
approximation. A distortion measure
is a mathematical entity which specifies exactly how close the
approximation is. Generally, it is a function which assigns to any two
letters and in the alphabet
a nonnegative number denoted as
Here, is the original data, is the approximation,
and is the amount of distortion between
and .
The most common distortion measures are the
Hamming distortion measure:
and the squarederror distortion measure
(which is only applicable when is a set of numbers):
Ratedistortion theory says that for a given source and a given
distortion measure, there exists a function, R(D), called
the ratedistortion function. The typical shape of R(D)
looks like this:
If the source samples are independent of one another, the
ratedistortion function can be obtained by solving the
constrained minimization problem:
subject to the constraints that and
where is the distortion between the ith and
jth letter in the alphabet.
Using Blahut's algorithm, it may be
possible to numerically calculate the ratedistortion function.
Ratedistortion theory is based on the concept of block coding
(similar to above). A lossy block code is known as a
vector quantizer (VQ).
The block length n of the code is known as the VQ dimension.
Theorem: Assuming that the source is stationary and the
source samples are independent. For every
and for every , there exists a VQ of
(sufficiently large) dimension n with distortion no greater than
and rate no more than
Further more, there does not exist
a VQ with distortion and rate less than .
In essence, the theorem states that R(D) is the ratedistortion
performance of an optimal VQ.
The above theorem can be generalized to the case where the source
is stationary and ergodic.
VI. The Gap Between Theory and Practice
 The theory holds when the block length n approaches infinity.
In realtime compression,
the compression algorithm must wait for n consecutive
source samples before it can begin the compression.
When n is large, this wait (or delay) may be too long.
For example, in realtime speech compression, the speech signal is
sampled at 8000 samples/second. If n is say 4000, then
the compression delay is half a second. In a twoway conversation,
this long delay may not be desirable.
 The theory does not take into consideration the complexities
associated with the compression and decompression operations.
Typically, as the block length n increases, the
complexities also increase. Often, the rate of increase is
exponential in n.
 The theory assumes that the statistical properties of the
source is known. In practice, this information may not be
available.
 The theory assumes that there is no error in the compressed bit
stream. In practice, due to noise in the communication channel or
imperfection in the storage medium, there are errors in the compressed
bit stream.
All of these problems have been successfully solved by
researchers in the field.
VII. FAQs (Frequently Asked Questions)
 What is the difference between lossless and lossy compression?
 What is the difference between compression rate and compression ratio?
 What is the difference between ``data compression theory'' and
``source coding theory''?
 What does ``stationary'' mean?
 What does ``ergodic'' mean?
 What is the difference between lossless and lossy compression?
In lossless data compression, the compressedthendecompressed data is an
exact replication of the original data. On the other hand, in lossy
data compression, the decompressed data may be different from the original
data. Typically, there is some distortion between the original and reproduced
signal.
The popular WinZip program is an example of lossless compression.
JPEG is an example of lossy compression.
 What is the difference between compression rate and compression ratio?
Historically, there are two main types of applications of data
compression: transmission and storage. An example of the former
is speech compression for realtime transmission over digital cellular
networks. An example of the latter is file compression (e.g. Drivespace).
The term ``compression rate'' comes from the transmission camp, while
``compression ratio'' comes from the storage camp.
Compression rate is the rate of the compressed data (which we imagined to
be transmitted in ``realtime''). Typically, it is in units of bits/sample,
bits/character, bits/pixels, or bits/second. Compression ratio is the
ratio of the size or rate of the original data to the size or rate of the
compressed data. For example, if a grayscale image is originally
represented by 8 bits/pixel (bpp) and it is compressed to 2 bpp, we say
that the compression ratio is 4to1. Sometimes, it is said that the
compression ratio is 75%.
Compression rate is an absolute term, while compression ratio is a
relative term.
We note that there are current applications which can be considered
as both transmission and storage. For example, the above photograph
of Shannon is stored in JPEG format. This not only saves storage space
on the local disk, it also speeds up the delivery of the image over the
internet.
 What is the difference between ``data compression theory'' and
``source coding theory''?
There is no difference. They both mean the same thing. The
term ``coding'' is a general
term which could mean either ``data compression'' or ``error control coding''.
 What does ``stationary'' mean?
Mathematically, a random process is stationary (or strictsense stationary)
if for every positive integers and the vectors:
and
have the same probability distribution.
We can think of ``stationarity'' in terms of our library example. Suppose
that we look at the first letter of all 100 million books and see how often
the first letter is `a', how often it is `b', and so on. We will then get
a probability distribution of the first letter of the (random) book.
If we repeat this experiment for the fifth and 105th letters, we will get two
other distributions. If all three distributions are the same, we are
inclined to believe that the book process is stationary. To be sure, we
should look at the joint distribution of the first and second letters
and compare it with the joint distribution of the 101st and 102nd letters.
If these two joint distributions are roughly the same, we will be more
convinced that the book process is stationary.
In reality, the book process can not be stationary because the first
character can not be a SPACE.
 What does ``ergodic'' mean?
The strict mathematical definition of ergodicity is too complex to
explain. However, Shannon offered the following intuitive explanation:
``In an ergodic process every sequence produced by the process is the same
in statistical properties. Thus the letter frequencies, (the pairwise)
frequencies, etc., obtained from particular sequences, will, as the lengths of
the sequences increase, approach definite limits independent of the
particular sequence. Actually this is not true of every sequence but the
set for which it is false has probability zero. Roughly the ergodic property
means statistical homogeneity.''
VIII. References
 C. E. Shannon, A Mathematical Theory of Communication (free pdf version)
 C. E. Shannon and W. Weaver,
The Mathematical Theory of Communication. ($13 paper version)
 C. E. Shannon, N. J. Sloane, and A. D. Wyner,
Claude Elwood Shannon: Collected Papers.
 C. E. Shannon,
``Prediction and Entropy of Printed English," available in
Shannon: Collected Papers.
 C. E. Shannon,
``Coding Theorems for a Discrete Source with a Fidelity Criterion," available in
Shannon: Collected Papers.
 T. M. Cover and J. A. Thomas,
Elements of Information Theory.
 R. Gallager,
Information Theory and Reliable Communication.
 A. Gersho and R. M. Gray,
Vector Quantization and Signal Compression.
 K. Sayood,
Introduction to Data Compression.
 M. Nelson and J.L. Gailly,
The Data Compression Book.
 A. LeonGarcia,
Probability and Random Processes for Electrical Engineering.
(Student Solution Manual).
 A. Papoulis,
Probability, Random Variables, and Stochastic Processes.
 M. R. Schroeder,
Computer Speech: Recognition, Compression, Synthesis.
 G. Held and T. R. Marshall,
Data and Image Compression: Tools and Techniques.
 D. Hankerson, P. D. Johnson, and G. A. Harris,
Introduction to Information Theory and Data Compression .
Note: To anyone interested in information theory, I can highly
recommend two books: Shannon's
Collected Papers
and
Cover and Thomas'
Elements of Information Theory.
Collected Papers
includes a 1987 interview of Shannon by OMNI Magazine, almost
all of his papers on information theory and cryptography, his celebrated
Master's Theses on A Symbolic Analysis of Relay and Switching Circuits,
his Ph.D. dissertation on An Algebra for Theoretical Genetics, his
paper on the Scientific Aspects of Juggling, his paper
on A Mind Reading Machine, and much more.
Elements of Information Theory
is now the standard textbook on information theory. I have been using this
textbook in my graduatelevel information theory course for the past seven
years. I have had nothing but positive feedback from students who have
studied this book.
