Bioinformatics and the Human Genome

Summary

In 1965 a chemist named Margaret Dayhoff did something no one had thought to do: she treated protein sequences as data, collected them into an atlas, and built mathematical tools to compare them. She was, in effect, the first bioinformatician, years before the word existed. Nearly four decades later, in April 2003, the Human Genome Project finished reading the roughly three billion letters of human DNA after thirteen years and $2.7 billion, and it could not have produced a single result without computers. Bioinformatics is the field where biology became an information science: where genomes are files, evolution is an alignment problem, and a protein’s shape is something you predict rather than crystallize.

Margaret Dayhoff: Biology as Data

Margaret Oakley Dayhoff (1925–1983) is often called the mother of bioinformatics, and the title is earned. At the National Biomedical Research Foundation she compiled the first Atlas of Protein Sequence and Structure (1965), a printed database of the few dozen protein sequences then known — the ancestor of every sequence database since. To make sequences comparable, she invented the PAM (Point Accepted Mutation) substitution matrices (published 1978), which score how likely one amino acid is to mutate into another over evolutionary time. Those matrices, and their later cousins (BLOSUM, 1992), are still the scoring engine inside sequence-search tools.
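The idea behind a PAM-style score can be sketched in a few lines. This is a toy illustration, not Dayhoff's data: the two-letter alphabet, the one-step substitution probabilities, and the background frequencies below are all hypothetical, and the function names are invented for this example. It assumes the usual construction: a one-step substitution probability matrix is multiplied by itself n times to model n units of evolutionary distance, and each score is ten times the base-10 logarithm of the resulting probability divided by the background frequency of the target residue.

```python
import math

def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

def log_odds_scores(alphabet, step_probs, background, n):
    """Score each pair (a, b) as 10 * log10(P_n(a -> b) / background(b))."""
    probs = mat_pow(step_probs, n)
    return {
        (a, b): 10 * math.log10(probs[i][j] / background[j])
        for i, a in enumerate(alphabet)
        for j, b in enumerate(alphabet)
    }

# Hypothetical toy model: two residues, 1% chance of substitution per step.
ALPHABET = ["A", "B"]
STEP_PROBS = [[0.99, 0.01],
              [0.01, 0.99]]
BACKGROUND = [0.5, 0.5]

near = log_odds_scores(ALPHABET, STEP_PROBS, BACKGROUND, 1)
far = log_odds_scores(ALPHABET, STEP_PROBS, BACKGROUND, 250)
```

At a short distance, identities score positive and substitutions strongly negative. At a long distance every score drifts toward zero, because after many steps the residue you end up with says little about the one you started from.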

The field even got its name from theoretical biology, not computing: the Dutch biologist Paulien Hogeweg, with Ben Hesper, coined “bioinformatics” in 1970 to mean “the study of information processes in biotic systems” — the idea that life itself is a computational phenomenon.

The Alignment Problem

The central computational problem of molecular biology is sequence alignment: given two strings of DNA or protein, line them up to reveal their shared ancestry, accounting for insertions, deletions, and substitutions. It is, at heart, a string-edit-distance problem solved by dynamic programming.

The Needleman–Wunsch algorithm (1970) gave the first rigorous method for global alignment of two sequences.
The Smith–Waterman algorithm (1981) adapted it for local alignment — finding the best-matching subregions, which is what biologists usually need.
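Both algorithms fill the same kind of dynamic-programming table and differ mainly in how they treat negative scores. The sketch below computes only the optimal score, not the alignment itself, and assumes a simple scoring scheme (fixed match and mismatch scores and a linear gap penalty) chosen for illustration; real tools use substitution matrices and more elaborate gap costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap     # gap in b
            left = score[i][j - 1] + gap   # gap in a
            score[i][j] = max(diag, up, left)
    return score[n][m]

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Optimal local alignment score: best-scoring pair of subregions."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Clamping at zero lets an alignment restart anywhere.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best

print(needleman_wunsch("AC", "AGC"))               # 1: two matches, one gap
print(smith_waterman("TTTACGTTT", "GGGACGGGG"))    # 6: the shared "ACG"
```

The nested loops over both sequences are where the quadratic cost comes from: every cell of an (n+1) by (m+1) table is visited once.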

Both are exact but slow: each takes $O(nm)$ time for sequences of lengths $n$ and $m$, which became a crisis as databases exploded. The answer was BLAST (Basic Local Alignment Search Tool, 1990), a heuristic that trades guaranteed optimality for speed by first finding short exact “seed” matches and extending them. BLAST made it possible to take a newly sequenced gene and search the entire known database in seconds; it is one of the most-cited methods in all of science and remains a daily tool for working biologists.
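The seed-and-extend idea can be sketched as follows. This is a simplified illustration of the heuristic, not BLAST itself: it uses exact k-mer seeds, ungapped extension, and a drop-off threshold that stops extending once the score falls too far below its best value. The function names and the parameter values (`k`, `x_drop`, the match and mismatch scores) are assumptions made for this example.

```python
from collections import defaultdict

def build_index(subject, k):
    """Map every length-k substring of subject to its start positions."""
    index = defaultdict(list)
    for s in range(len(subject) - k + 1):
        index[subject[s:s + k]].append(s)
    return index

def _extend(query, subject, q, s, step, start_score, match, mismatch, x_drop):
    """Walk from (q, s) in direction step; return (best score, columns kept)."""
    score = best = start_score
    kept = walked = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        walked += 1
        if score > best:
            best, kept = score, walked
        elif best - score >= x_drop:
            break                      # score dropped too far: give up
        q += step
        s += step
    return best, kept

def best_hit(query, subject, k=4, match=1, mismatch=-1, x_drop=3):
    """Return (score, query_start, query_end, subject_start) or None."""
    index = build_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            # Start from the seed's score, then extend right and left.
            score, right = _extend(query, subject, q + k, s + k, 1,
                                   k * match, match, mismatch, x_drop)
            score, left = _extend(query, subject, q - 1, s - 1, -1,
                                  score, match, mismatch, x_drop)
            hits.add((score, q - left, q + k + right, s - left))
    return max(hits) if hits else None

hit = best_hit("TTGATTACATT", "CCCCCGATTACAGGGGG")
```

Here `hit` is `(7, 2, 9, 5)`: the seven-base region `GATTACA` at query positions 2 to 9, matching the subject from position 5. Only positions that share an exact k-mer are ever examined, which is where the speed comes from, and also why a weak similarity with no exact seed can be missed.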

Reading a Genome

DNA sequencing, determining the order of the A, C, G, T bases, was made practical by Frederick Sanger’s chain-termination method (1977), for which he won his second Nobel Prize. But Sanger sequencing reads only a few hundred bases at a time, while a human chromosome is tens to hundreds of millions of bases long. Bridging that gap is a pure computer-science problem: shotgun sequencing shreds the genome into millions of random fragments, reads each one, and then reassembles the original by finding overlaps. The result is a giant, error-tolerant jigsaw puzzle that only an algorithm (a “genome assembler”) can solve.
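A toy version of overlap-based assembly makes the jigsaw idea concrete. The sketch below uses a greedy strategy: repeatedly merge the two reads with the longest suffix-to-prefix overlap. It assumes error-free reads from a single strand and a genome without long repeats, which real data never satisfies; production assemblers use overlap or de Bruijn graphs and tolerate errors. The function names, the `min_len` parameter, and the example reads are made up for illustration.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by largest overlap; return remaining contigs."""
    reads = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break                        # nothing overlaps: stop merging
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads

# Three overlapping fragments of the made-up sequence ATGCGTACGTTAGC.
contigs = greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"])
```

With these reads, `contigs` is `["ATGCGTACGTTAGC"]`. The all-pairs comparison in the inner loops is far too slow for millions of reads, and repeats longer than a read make the overlaps ambiguous, which is why assembly is a hard algorithmic problem rather than a bookkeeping exercise.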

The Human Genome Project

The Human Genome Project (HGP) formally began on October 1, 1990, as an international public consortium led by the U.S. NIH and Department of Energy and the UK’s Wellcome Trust Sanger Centre, with the goal of sequencing all ~3 billion base pairs of human DNA.

In 1998 it acquired a rival. Craig Venter and the Perkin-Elmer Corporation founded Celera Genomics, betting that a more aggressive whole-genome shotgun strategy — sequencing the entire genome at once and reassembling it computationally on a large server farm — could beat the methodical public effort. The competition was bitter but productive, and it was ultimately settled as a tie: on June 26, 2000, Venter and the consortium’s Francis Collins jointly announced a “working draft” at a White House event. Papers from both groups appeared in February 2001, and the finished sequence was declared complete on April 14, 2003 — two years ahead of schedule and, at $2.7 billion (FY1991 dollars), under budget.

The results upended expectations. Humans turned out to have only about 20,000 protein-coding genes — far fewer than the 100,000 once guessed, and not many more than a worm — and protein-coding DNA makes up under 2% of the genome. The “junk” between the genes turned out to be full of regulatory machinery, a discovery that reshaped molecular biology.

Finishing the Job, Twenty Years Later

The 2003 “complete” genome still had gaps — roughly 8% of the sequence, mostly in repetitive regions that the era’s technology and algorithms could not assemble. Only in 2022 did the Telomere-to-Telomere (T2T) Consortium, using long-read sequencing and modern assemblers, publish the first truly gapless human genome.

What Bioinformatics Became

The genome was not an end but a starting gun. Reading DNA cheaply (the cost of sequencing a genome fell from billions of dollars to roughly a thousand within two decades — far faster than Moore’s Law) created a flood of data and an entire computational discipline to handle it:

Genomics and variant calling — comparing your genome to a reference to find the differences that matter for disease.
Phylogenetics — reconstructing evolutionary trees, used in real time to trace virus lineages during the COVID-19 pandemic.
Structural biology and protein folding — culminating in DeepMind’s AlphaFold, which in 2020 effectively solved the 50-year-old problem of predicting a protein’s 3D shape from its sequence, a triumph of deep learning over a problem that had defied physics-based simulation.
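To make the first item concrete, here is a minimal sketch of variant calling by pileup: stack the reads that have already been aligned to the reference, count the bases seen at each position, and report positions where a clear majority disagrees with the reference. It handles only single-base substitutions, and the function name, the thresholds `min_depth` and `min_fraction`, and the example data are assumptions for illustration; real variant callers use statistical models of sequencing error and handle insertions and deletions.

```python
from collections import Counter

def call_variants(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Return (position, reference_base, observed_base) for each call.

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the read's first base (no gaps).
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue                     # too few reads to trust a call
        base, count = counts.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            variants.append((pos, reference[pos], base))
    return variants

# Made-up example: every read carries a T where the reference has an A.
reference = "ACGTACGTAC"
reads = [(0, "ACGTTCG"), (1, "CGTTCGT"), (2, "GTTCGTA"), (3, "TTCGTAC")]
variants = call_variants(reference, reads)
```

For this input `variants` is `[(4, "A", "T")]`. The depth threshold is what separates a real difference from a one-off sequencing error: a single read that disagrees with the reference is not enough to make a call.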

Biology has become, irreversibly, a data science — one where the information-theoretic view of the genome as a code is not a metaphor but the working model.

⚠️ Dead End: The “One Gene, One Function” Genome

The HGP was sold, in part, on a tidy promise: read the genome, find the genes, and you read the “book of life” — a near-deterministic blueprint in which each gene maps to a trait. That model largely collapsed on contact with the data. The surprisingly small gene count, the dominance of non-coding regulatory DNA, alternative splicing (one gene making many proteins), epigenetics, and the messy statistics of complex traits all showed that the genome is not a blueprint but a parts list whose meaning depends on context. The expected flood of simple “gene for X” cures was slow to arrive. The lasting payoff was different and arguably bigger: not a deciphered book, but the infrastructure — databases, algorithms, and cheap sequencing — that turned biology into a computational science.