Data Compression

Zusammenfassung

Every photo you send, every song you stream, every web page you load arrives smaller than it really is. Data compression — the art of removing redundancy so information fits in less space — grew directly out of Claude Shannon’s 1948 information theory, which set the hard limit on how far you can squeeze. A 1951 graduate student named David Huffman found an optimal code as a way to skip a final exam. Two Israeli researchers, Lempel and Ziv, made compression universal in the late 1970s. And in the 1990s, lossy schemes that throw information away on purpose — JPEG for images, MP3 for audio — quietly built the internet’s media layer. Compression is the invisible infrastructure that made the digital world affordable.

The Limit: Shannon and Entropy

Data compression is older than computers — Morse code already gave the common letter E a single dot and the rare Q a long sequence — but it became a science in 1948, when Claude Shannon published A Mathematical Theory of Communication (see Claude Shannon and Information Theory). Shannon defined entropy: the minimum average number of bits needed to encode a message from a given source. Entropy is the floor. No lossless scheme can beat it, and Shannon’s source coding theorem proved it.

This split the whole field in two. Lossless compression removes only redundancy; the original is recovered bit-for-bit (text, executables, ZIP files). Lossy compression also discards detail the recipient won’t miss — and can go far below the entropy of the raw signal because it changes the message (photos, music, video). Everything that follows is a story of getting closer to Shannon’s floor, or deciding how far past it you can cheat.

Huffman’s Exam Trick (1952)

In 1951, David Huffman was a graduate student at MIT in an information-theory course taught by Robert Fano. Fano offered the class a deal: solve a stated coding problem in a term paper and skip the final exam. The problem — find the provably most efficient way to assign variable-length binary codes to symbols — had defeated Fano and Shannon themselves, who had a good-but-suboptimal method (Shannon–Fano coding).

Huffman, about to give up and study for the exam, hit on the answer: build the code tree from the bottom up, repeatedly merging the two least-frequent symbols. The result, published in 1952, was Huffman coding — provably optimal among methods that assign a whole number of bits per symbol. Frequent symbols get short codes, rare ones get long codes, and no code is a prefix of another so the stream decodes unambiguously. Seventy years later it is still everywhere: it is the final entropy-coding stage inside JPEG, MP3, and the DEFLATE format behind ZIP and PNG.

Lempel–Ziv: Compression Without a Codebook (1977–1984)

Huffman coding needs to know symbol frequencies in advance. The breakthrough that made compression universal — working on any data without prior knowledge — came from Abraham Lempel and Jacob Ziv at the Technion in Israel. Their LZ77 algorithm (1977) and LZ78 (1978) replace repeated strings with back-references to earlier occurrences: instead of storing “the " ten times, store it once and point back. The dictionary builds itself from the data as it streams past.

The LZ family became the backbone of general-purpose compression:

LZW (Lempel–Ziv–Welch, 1984) — Terry Welch’s refinement of LZ78, used in the Unix compress utility, in GIF images, and in TIFF. Its patent, held by Unisys, set off a famous controversy in the 1990s when Unisys began demanding royalties for GIF, directly motivating the royalty-free PNG format.
DEFLATE — Phil Katz combined LZ77 with Huffman coding for his PKZIP tool. This hybrid became the engine of .zip, gzip, PNG, and HTTP compression — arguably the single most-used compression algorithm in history.

LZ77 and LZ78 won Lempel and Ziv the IEEE Richard W. Hamming Medal and a place among the most-cited results in the field.

Throwing Information Away: JPEG and MP3

Text must survive compression intact. Images and sound do not — the eye and ear are forgiving, and that slack is enormous. Lossy compression exploits the limits of human perception.

JPEG (Joint Photographic Experts Group), standardized in 1992 as ISO/IEC 10918, compresses photographs using the discrete cosine transform (DCT). It splits an image into 8×8 blocks, converts each to frequency components, and aggressively discards the high-frequency detail the eye barely registers — then entropy-codes the rest with Huffman. A 10:1 size reduction with no visible loss is routine; JPEG became, and remains, the most widely used image format in the world.

MP3 (formally MPEG-1 Audio Layer III) did the same for sound. Developed largely at Germany’s Fraunhofer Institute under Karlheinz Brandenburg and standardized by the ISO MPEG committee in the early 1990s, MP3 uses a psychoacoustic model: it computes which sounds are masked — drowned out by louder nearby tones, or below the threshold of hearing — and simply does not encode them. A song shrinks roughly tenfold. That ratio is what made music small enough to download over 1990s modems, which made file-sharing and eventually the iPod and streaming possible. MP3 didn’t just compress audio; it detonated the music industry.

The same DCT-and-prediction lineage scaled up to video — MPEG, H.264, and successors — where exploiting redundancy between frames (most of the picture doesn’t change frame to frame) is what makes streaming video physically possible over consumer bandwidth.

⚠️ Dead End: Arithmetic Coding’s Patent Thicket

Arithmetic coding, developed in the late 1970s, is theoretically superior to Huffman coding: it can assign fractional bits per symbol and so squeeze closer to the Shannon entropy floor, especially for skewed distributions. By rights it should have replaced Huffman everywhere. It largely did not — for decades.

The reason was not technical but legal. IBM and others held a dense thicket of patents on arithmetic coding through the 1980s and 1990s. Standards bodies, terrified of royalty claims, kept defaulting to patent-free Huffman coding; the original JPEG standard specified an arithmetic-coding mode that almost no software ever implemented because of the patent risk. Arithmetic coding’s near-optimal compression sat mostly unused in mainstream tools until the core patents expired. Its successor, the range/ANS family (asymmetric numeral systems, 2009, deliberately placed in the public domain by Jarosław Duda), finally delivered arithmetic-coding efficiency without the legal hazard and now powers Zstandard, modern JPEG XL, and video codecs. The lesson is a recurring one in computing history: the best algorithm does not win if its license terms scare everyone away.