
Similar eBooks: eBooks related to Data Compression for Real Programmers 
Biographies in Data Compression
Handbook of Data Compression, 5th Edition
This volume extends the 4th edition of "Data Compression: The Complete Reference". It features a different chapter structure, much new material, and many small improvements. The following new topics were added:
• The topic of compression benchmarks has been added to the Introduction.
• The paragraphs titled "How to Hide Data" in the Introduction show how data compression can be utilized to quickly and efficiently hide data in plain sight in our computers.
• Several paragraphs on compression curiosities have also been added to the Introduction.
• The new Section 1.1.2 shows why irreversible compression may be useful in certain situations.
• Chapters 2 through 4 discuss the all-important topic of variable-length codes. These chapters discuss basic, advanced, and robust variable-length codes. Many types of VL codes are known; they are used by many compression algorithms, have different properties, and are based on different principles. The most important types of VL codes are prefix codes and codes that include their own length.
• Section 2.9 on phased-in codes was wrong and has been completely rewritten. An example of the start-step-stop code (2, 2, ∞) has been added to Section 3.2.
• Section 3.5 is a description of two interesting variable-length codes dubbed recursive bottom-up coding (RBUC) and binary adaptive sequential coding (BASC). These codes represent compromises between the standard binary (beta) code and the Elias gamma codes.
• Section 3.28 discusses the original method of interpolative coding whereby dynamic variable-length codes are assigned to a strictly monotonically increasing sequence of integers.
• Section 5.8 is devoted to the compression of PK (packed) fonts. These are older bitmap fonts that were developed as part of the huge TeX project. The compression algorithm is not especially efficient, but it provides a rare example of run-length encoding (RLE) without the use of Huffman codes.
• Section 5.13 is about the Hutter prize for text compression.
• PAQ (Section 5.15) is an open-source, high-performance compression algorithm and free software that features sophisticated prediction combined with adaptive arithmetic encoding. This free algorithm is especially interesting because of the great interest it has generated and because of the many versions, subversions, and derivatives that have been spun off it.
• Section 6.3.2 discusses LZR, a variant of the basic LZ77 method, where the lengths of both the search and lookahead buffers are unbounded.
• Section 6.4.1 is a description of LZB, an extension of LZSS. It is the result of evaluating and comparing several data structures and variable-length codes with an eye to improving the performance of LZSS.
• SLH, the topic of Section 6.4.2, is another variant of LZSS. It is a two-pass algorithm where the first pass employs a hash table to locate the best match and to count frequencies, and the second pass encodes the offsets and the raw symbols with Huffman codes prepared from the frequencies counted by the first pass.
• Most LZ algorithms were developed during the 1980s, but LZPP, the topic of Section 6.5, is an exception. LZPP is a modern, sophisticated algorithm that extends LZSS in several directions and has been inspired by research done and experience gained by many workers in the 1990s. LZPP identifies several sources of redundancy in the various quantities generated and manipulated by LZSS and exploits these sources to obtain better overall compression.
• Section 6.14.1 is devoted to LZT, an extension of UNIX compress/LZC. The major innovation of LZT is the way it handles a full dictionary.
• LZJ (Section 6.17) is an interesting LZ variant. It stores in its dictionary, which can be viewed either as a multiway tree or as a forest, every phrase found in the input. If a phrase is found n times in the input, only one copy is stored in the dictionary. Such behavior tends to fill the dictionary up very quickly, so LZJ limits the length of phrases to a preset parameter h.
• The interesting, original concept of antidictionary is the topic of Section 6.31. A dictionary-based encoder maintains a list of bits and pieces of the data and employs this list to compress the data. An antidictionary method, on the other hand, maintains a list of strings that do not appear in the data. This generates negative knowledge that allows the encoder to predict with certainty the values of many bits and thus to drop those bits from the output, thereby achieving compression.
• The important term "pixel" is discussed in Section 7.1, where the reader will discover that a pixel is not a small square, as is commonly assumed, but a mathematical point.
• Section 7.10.8 discusses the new HD Photo (also known as JPEG XR) compression method for continuous-tone still images.
• ALPC (adaptive linear prediction and classification) is a lossless image compression algorithm described in Section 7.12. ALPC is based on a linear predictor whose coefficients are computed for each pixel individually in a way that can be mimicked by the decoder.
• Grayscale Two-Dimensional Lempel-Ziv Encoding (GS-2D-LZ, Section 7.18) is an innovative dictionary-based method for the lossless compression of grayscale images.
• Section 7.19 has been partially rewritten.
• Section 7.40 is devoted to spatial prediction, a combination of JPEG and fractal-based image compression.
• A short historical overview of video compression is provided in Section 9.4.
• The all-important H.264/AVC video compression standard has been extended to allow for a compressed stream that supports temporal, spatial, and quality scalable video coding, while retaining a base layer that is still backward compatible with the original H.264/AVC. This extension is the topic of Section 9.10.
• The complex and promising VC-1 video codec is the topic of the new, long Section 9.11.
• The new Section 11.6.4 treats the topic of syllable-based compression, an approach to compression where the basic data symbols are syllables, a syntactic form between characters and words.
• The commercial compression software known as StuffIt has been around since 1987. The methods and algorithms it employs are proprietary, but some information exists in various patents. The new Section 11.16 is an attempt to describe what is publicly known about this software and how it works.
• There is a short appendix that presents and explains the basic concepts and terms of information theory.
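The bullet on Section 3.5 above refers to the standard binary (beta) code and the Elias gamma code. As a rough illustration of what such a variable-length prefix code looks like, the following Python sketch encodes a positive integer n as its beta code preceded by as many zeros as the beta code has bits after the leading 1. The function names are illustrative assumptions and are not taken from the book; RBUC and BASC themselves are not shown.

```python
from typing import List


def beta(n: int) -> str:
    """Standard binary (beta) code of a positive integer, no leading zeros."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return bin(n)[2:]


def elias_gamma_encode(n: int) -> str:
    """Elias gamma code: (len(beta(n)) - 1) zeros followed by beta(n)."""
    b = beta(n)
    return "0" * (len(b) - 1) + b


def elias_gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of Elias gamma codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        # Count the leading zeros; they give the number of bits after the leading 1.
        zeros = 0
        while i < len(bits) and bits[i] == "0":
            zeros += 1
            i += 1
        if i + zeros + 1 > len(bits):
            raise ValueError("truncated codeword")
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


if __name__ == "__main__":
    stream = "".join(elias_gamma_encode(n) for n in (1, 5, 2))
    print(stream, elias_gamma_decode(stream))
```

Because each codeword announces its own length through the run of zeros, no codeword is a prefix of another, so the decoder can split the stream without separators.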
A Concise Introduction to Data Compression
It is virtually certain that a reader of this book is both a computer user and an Internet user, and thus the owner of digital data. More and more people all over the world generate, use, own, and enjoy digital data. Digital data is created (by a word processor, a digital camera, a scanner, an audio A/D converter, or other devices), it is edited on a computer, stored (either temporarily, in memory, less temporarily, on a disk, or permanently, on an optical medium), transmitted between computers (on the Internet or in a local-area network), and output (printed, watched, or played, depending on its type).
These steps often apply mathematical methods to modify the representation of the original digital data because of three factors: time/space limitations, reliability (data robustness), and security (data privacy). These are discussed in some detail here:
The first factor is time/space limitations. It takes time to transfer even a single byte either inside the computer (between the processor and memory) or outside it over a communications channel. It also takes space to store data, and digital images, video, and audio files tend to be large. Time, as we know, is money. Space, either in memory or on our disks, doesn't come free either. More space, in terms of bigger disks and memories, is becoming available all the time, but it remains finite. Thus, decreasing the size of data files saves time, space, and money, three important resources. The process of reducing the size of a data file is popularly referred to as data compression, although its formal name is source coding (coding done at the source of the data, before it is stored or transmitted).
In addition to being a useful concept, the idea of saving space and time by compression is ingrained in us humans, as illustrated by (1) the rapid development of nanotechnology and (2) the quotation at the end of this Preface.
The second factor is reliability. We often experience noisy telephone conversations (with both cell and landline telephones) because of electrical interference. In general, any type of data, digital or analog, sent over any kind of communications channel may become corrupted as a result of channel noise. When the bits of a data file are sent over a computer bus, a telephone line, a dedicated communications line, or a satellite connection, errors may creep in and corrupt bits. Watching a high-resolution color image or a long video, we may not be able to tell when a few pixels have wrong colors, but other types of data require absolute reliability. Examples are an executable computer program, a legal text document, a medical X-ray image, and genetic information. Change one bit in the executable code of a program, and the program will not run, or worse, it may run and do the wrong thing. Change or omit one word in a contract and it may reverse its meaning. Reliability is therefore important and is achieved by means of error-control codes. The formal name of this mathematical discipline is channel coding, because these codes are employed when information is transmitted on a communications channel.
The third factor that affects the storage and transmission of data is security. Generally, we do not want our data transmissions to be intercepted, copied, and read on their way. Even data saved on a disk may be sensitive and should be hidden from prying eyes. This is why digital data can be encrypted with modern, strong encryption algorithms that depend on long, randomly selected keys. Anyone who doesn't possess the key and wants access to the data may have to resort to a long, tedious process of either trying to break the encryption (by analyzing patterns found in the encrypted file) or trying every possible key. Encryption is especially important for diplomatic communications, messages that deal with money, or data sent by members of secret organizations. A close relative of data encryption is the field of data hiding (steganography). A data file A (a payload) that consists of bits may be hidden in a larger data file B (a cover) by taking advantage of "holes" in B that are the result of redundancies in the way data is represented in B.
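One common way to picture such "holes" is the least significant bit of each byte in a cover file, which can often change without a noticeable effect. The text does not name a specific technique, so the least-significant-bit approach below is an assumption chosen for illustration, and the function names hide and reveal are hypothetical. A minimal Python sketch:

```python
from typing import List


def hide(cover: bytes, payload_bits: List[int]) -> bytes:
    """Store each payload bit in the least significant bit of one cover byte."""
    if len(payload_bits) > len(cover):
        raise ValueError("cover is too small for the payload")
    out = bytearray(cover)
    for i, bit in enumerate(payload_bits):
        # Clear the lowest bit of the cover byte, then set it to the payload bit.
        out[i] = (out[i] & 0xFE) | (bit & 1)
    return bytes(out)


def reveal(stego: bytes, count: int) -> List[int]:
    """Read back the first `count` hidden bits."""
    return [byte & 1 for byte in stego[:count]]


if __name__ == "__main__":
    cover = bytes([200, 201, 202, 203])   # stand-in for a few pixel values
    stego = hide(cover, [1, 0, 1, 1])
    print(list(stego), reveal(stego, 4))
```

Each cover byte changes by at most 1, which is why this kind of redundancy can carry a payload without obviously altering the cover.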
Overview and goals
This book is devoted to the first of these factors, namely data compression. It explains why data can be compressed, it outlines the principles of the various approaches to compressing data, and it describes several compression algorithms, some of which are general, while others are designed for a specific type of data.
The goal of the book is to introduce the reader to the chief approaches, methods, and techniques that are currently employed to compress data. The main aim is to start with a clear overview of the principles behind this field, to complement this view with several examples of important compression algorithms, and to present this material to the reader in a coherent manner.
Organization and features
The book is organized in two parts: basic concepts and advanced techniques. The first part consists of the first three chapters. They discuss the basic approaches to data compression and describe a few popular techniques and methods that are commonly used to compress data. Chapter 1 introduces the reader to the important concepts of variable-length codes, prefix codes, statistical distributions, run-length encoding, dictionary compression, transforms, and quantization. Chapter 2 is devoted to the important Huffman algorithm and codes, and Chapter 3 describes some of the many dictionary-based compression methods.
The second part of this book is concerned with advanced techniques. The original and unusual technique of arithmetic coding is the topic of Chapter 4. Chapter 5 is devoted to image compression. It starts with the chief approaches to the compression of images, explains orthogonal transforms, and discusses the JPEG algorithm, perhaps the best example of the use of these transforms. The second part of this chapter is concerned with subband transforms and presents the WSQ method for fingerprint compression as an example of the application of these sophisticated transforms. Chapter 6 is devoted to the compression of audio data and in particular to the technique of linear prediction. Finally, other approaches to compression, such as the Burrows-Wheeler method, symbol ranking, and SCSU and BOCU-1, are given their due in Chapter 7.
The many exercises sprinkled throughout the text serve two purposes: they illuminate subtle points that may seem insignificant to readers and encourage readers to test their knowledge by performing computations and obtaining numerical results.
Other aids to learning are a prelude at the beginning of each chapter and various intermezzi where interesting topics, related to the main theme, are examined. In addition, a short summary and self-assessment exercises follow each chapter. The glossary at the end of the book is comprehensive, and the index is detailed, to allow a reader to easily locate all the points in the text where a given topic, subject, or term appears.
Other features that liven up the text are puzzles (with answers at the end of the book) and various boxes with quotations or with biographical information on relevant persons.
Target audience
This book was written with undergraduate students in mind as the chief readership. In general, however, it is aimed at those who have a basic knowledge of computer science; who know something about programming and data structures; who feel comfortable with terms such as bit, mega, ASCII, file, I/O, and binary search; and who want to know how data is compressed. The necessary mathematical background is minimal and is limited to logarithms, matrices, polynomials, calculus, and the concept of probability. This book is not intended as a guide to software implementors and has few programs.
The book's web site, with an errata list, BibTeX information, and auxiliary material, is part of the author's web site, located at http://www.ecs.csun.edu/~dsalomon/. Any errors found, comments, and suggestions should be directed to dsalomon@csun.edu.
Acknowledgments
I would like to thank Giovanni Motta and John Motil for their help and encouragement. Giovanni also contributed to the text and pointed out numerous errors.
In addition, my editors at Springer Verlag, Wayne Wheeler and Catherine Brett, deserve much praise. They went over the manuscript, made numerous suggestions and improvements, and contributed much to the final appearance of the book.
Lakeside, California, David Salomon
August 2007
To see a World in a Grain of Sand
And a Heaven in a Wild Flower,
Hold Infinity in the palm of your hand
And Eternity in an hour.
William Blake, Auguries of Innocence
Variable-Length Codes for Data Compression
Most data compression methods that are based on variable-length codes employ the Huffman or Golomb codes. However, there are a large number of less-known codes that have useful properties, such as those containing certain bit patterns or those that are robust, and these can be useful. This book brings this large set of codes to the attention of workers in the field and to students of computer science.
David Salomon's clear style of writing and presentation, which has been familiar to readers for many years now, allows easy access to this topic. This comprehensive text offers readers a detailed, reader-friendly description of the variable-length codes used in the field of data compression. Readers are only required to have a general familiarity with computer methods and essentially an understanding of the representation of data in bits and files.
Topics and Features:
• Discusses codes in depth, not the compression algorithms, which are readily available in many books
• Includes detailed illustrations, providing readers with a deeper and broader understanding of the topic
• Provides a supplementary author-maintained website, with errata and auxiliary material, at http://www.davidsalomon.name/VLCadvertis/VLC.html
• Easily understood and used by computer science majors, requiring only a minimum of mathematics
• Can easily be used as a main or auxiliary textbook for courses on algebraic codes or data compression and protection
• An ideal companion volume to David Salomon's fourth edition of Data Compression: The Complete Reference
Computer scientists, electrical engineers and students majoring in computer science or electrical engineering will find this volume a valuable resource, as will those readers in various physical sciences and mathematics.
Data Compression: The Complete Reference
I was pleasantly surprised when in December 2002 a message arrived from the editor asking me to produce the third edition of the book and proposing a deadline of late April 2003. I was hoping for a third edition mainly because the field of data compression has made great strides since the publication of the second edition
Fundamental Data Compression


