The Database Revolution: From Filing Cabinets to the Relational Model

Summary

This article traces the history of databases: from the chaos of flat files and hierarchical systems, through Edgar Codd’s 1970 paper that laid the theoretical foundation of modern data management, to SQL, the Oracle Corporation that beat IBM to market with IBM’s own ideas, and the eventual challenge from the NoSQL movement. It is a story of a single mathematical insight that reorganized an industry, a company that came to dominate its market by moving faster than the idea’s inventor, and a recurring tension between consistency and scale that has never been fully resolved.

The World Before Relations: Hierarchies and Chaos

In the 1950s and 1960s, storing data on a computer meant storing it in files — sequential records on magnetic tape or disk, read from beginning to end. To find a customer’s order, you read every record until you found the right one. Updating a phone number meant finding every place that number appeared and changing it manually. Data was trapped in the structure of the files that contained it.

The first attempt at a solution was the hierarchical database. IBM’s Information Management System (IMS), developed in 1966 for NASA’s Apollo program to track the millions of parts in a Saturn V rocket, organized data as a tree: a root record (say, a customer) had child records (orders), which had their own children (line items). Navigating the tree was fast; querying across it — “find all customers who ordered product X in the last three months” — required custom programs that traversed the structure explicitly. Every new question required a programmer.
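The difference between navigating a hierarchy and querying across it can be sketched in a few lines. This is only an illustration of the tree shape, using nested Python dictionaries; the record layout and the function names `orders_for` and `customers_who_ordered` are hypothetical and say nothing about how IMS itself was programmed.

```python
# Hypothetical hierarchical store: customer -> orders -> line items.
# It mimics the tree shape only; it is not IMS's actual interface.
customers = {
    "C1": {"name": "Acme", "orders": [
        {"order_id": 1, "items": [{"product": "X", "qty": 2},
                                  {"product": "Y", "qty": 1}]},
    ]},
    "C2": {"name": "Globex", "orders": [
        {"order_id": 2, "items": [{"product": "Y", "qty": 5}]},
        {"order_id": 3, "items": [{"product": "X", "qty": 1}]},
    ]},
    "C3": {"name": "Initech", "orders": []},
}


def orders_for(customer_id):
    """Navigating down from a known root record is a direct lookup."""
    return [order["order_id"] for order in customers[customer_id]["orders"]]


def customers_who_ordered(product):
    """A cross-cutting question needs a hand-written walk over the whole tree."""
    found = []
    for record in customers.values():
        if any(item["product"] == product
               for order in record["orders"]
               for item in order["items"]):
            found.append(record["name"])
    return found
```

Each new cross-cutting question (by product, by date, by quantity) would need another traversal function of this kind, which is the sense in which every new question required a programmer.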

The network model (CODASYL, 1969) allowed records to have multiple parents, not just one, solving some cross-cutting query problems but introducing new complexity. The data model was a graph. Programs navigating it had to manage pointers and traversal paths by hand. The code that stored and retrieved data was inextricably tangled with the code that used it.

The fundamental problem was physical data dependence: the way data was stored on disk determined how it could be accessed, and any change to the storage structure — adding a new index, reorganizing a file — could break every program that touched that data.

Edgar Codd and the Relational Model

Edgar F. Codd was a British mathematician working at IBM’s San Jose Research Laboratory in the late 1960s. He had a doctorate in computer science from the University of Michigan and a programmer’s frustration with the clumsiness of existing database systems. In June 1970, he published an eleven-page paper in the Communications of the ACM that changed the field permanently.

The paper was titled “A Relational Model of Data for Large Shared Data Banks.”

Codd’s central idea was to separate the logical structure of data from its physical storage. In his model, data was organized as relations — tables of rows and columns, with each row representing a single fact and each column a single attribute. A customer table had rows of customers; an orders table had rows of orders; the relationship between them was represented not by a pointer or a tree path but by a shared value — a key — that appeared in both tables.

This was not merely a reorganization. It was a mathematical foundation. Codd drew on set theory and predicate logic to define operations on relations: select (filter rows), project (choose columns), and join (combine two tables on a shared key). In follow-up work he showed that a small set of such operations can express any query that can be stated in a logic-based language over the same relations, a property he called relational completeness. The resulting framework, relational algebra, meant that queries were not programs. They were logical expressions, and the system could work out an efficient way to execute them.
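The three operations named above can be sketched directly. This is a simplified illustration, not Codd’s formal definitions: relations are modelled as Python lists of dictionaries rather than mathematical sets of tuples, and the sample tables and the names `select`, `project`, and `join` are chosen here for the example.

```python
# Relations modelled as lists of dicts (a simplification: true relations
# are sets of tuples, with no ordering and no duplicate rows).
customers = [
    {"id": 1, "name": "Acme"},
    {"id": 2, "name": "Globex"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 250},
    {"order_id": 11, "customer_id": 2, "total": 40},
    {"order_id": 12, "customer_id": 1, "total": 90},
]


def select(relation, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Keep only the named columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


def join(left, right, left_key, right_key):
    """Pair up rows from two relations whose key values match."""
    return [{**l, **r} for l in left for r in right
            if l[left_key] == r[right_key]]


# "Names of customers with at least one order over 50",
# written as a composition of the three operations.
big_spenders = project(
    select(join(customers, orders, "id", "customer_id"),
           lambda row: row["total"] > 50),
    ["name"],
)
```

The query is a single expression built from composable operations. Nothing in it says in what order to scan the tables, which is what leaves a real system free to choose an execution strategy.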

Data Independence: The Central Promise

Codd distinguished two types of data independence that his model provided:

  • Physical independence: the way data is stored on disk can change — new indexes added, files reorganized, storage engines replaced — without any change to the programs that query it.
  • Logical independence: the logical structure of the data can be extended — new tables added, new columns introduced — without breaking existing queries.

These properties, which sound technical, have enormous economic consequences. Before the relational model, every database change required corresponding changes to every program that used the data. After it, programs could be written against a stable logical interface, insulated from the physical reality beneath it. This is why relational databases became the default infrastructure for every application from banking to e-commerce: they absorb change.
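Both kinds of independence can be observed in a small experiment. The sketch below uses SQLite through Python’s standard `sqlite3` module as a stand-in for any relational system; the table, index, and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])

# The "application" is this one query, written once and never edited.
QUERY = "SELECT name FROM customers WHERE id = ?"

before = conn.execute(QUERY, (2,)).fetchall()

# Physical change: add an index. Storage and access paths change;
# the query text does not.
conn.execute("CREATE INDEX idx_customers_name ON customers (name)")

# Logical change: extend the table with a new column.
conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")

after = conn.execute(QUERY, (2,)).fetchall()
```

The unchanged query returns the same rows before and after both changes. Note that it names its columns explicitly; a query written as `SELECT *` would have changed shape when the new column appeared, so logical independence depends partly on how queries are written.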

Codd went further. In 1985, he published twelve rules (numbered with a foundational Rule 0, so thirteen in all) in a two-part Computerworld article — a standard any system had to satisfy to be considered truly relational, which, he pointedly noted, most systems claiming the label did not meet. (His 1972 work had already laid the groundwork, defining relational completeness and database normalization.)

IBM’s Reluctance and the Birth of SQL

IBM recognized the significance of Codd’s paper. In 1973, IBM Research launched System R — a project to build a prototype relational database and demonstrate that the model was practical at scale. Two researchers on the project, Donald Chamberlin and Raymond Boyce, designed a query language for it: originally called SEQUEL (Structured English QUEry Language), later renamed SQL (Structured Query Language) for trademark reasons.

SQL expressed relational algebra in syntax resembling English:

SELECT customer_name, order_total
FROM customers
JOIN orders ON customers.id = orders.customer_id
WHERE orders.date > '1979-01-01'
ORDER BY order_total DESC;

The language was not mathematically pure — Codd had reservations about several of its design choices — but it was learnable by non-mathematicians, which was the point. Business analysts could write SQL queries without programming skills. System R demonstrated, by the late 1970s, that relational databases were not just theoretically elegant but practically fast.
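The query quoted in this section can be run unchanged against a small in-memory database. The sketch below uses SQLite via Python’s standard `sqlite3` module; the table definitions and sample rows are assumptions made up for the example, since the article shows only the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and data, just enough for the query to run.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (customer_id INTEGER, order_total REAL, date TEXT);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES
        (1, 120.0, '1979-03-15'),
        (2, 300.0, '1979-06-01'),
        (1,  75.0, '1978-11-30');
""")

# The query from the article. ISO-formatted date strings compare
# correctly as plain text, so the WHERE clause works as written.
rows = conn.execute("""
    SELECT customer_name, order_total
    FROM customers
    JOIN orders ON customers.id = orders.customer_id
    WHERE orders.date > '1979-01-01'
    ORDER BY order_total DESC
""").fetchall()
```

With this sample data, `rows` holds the two 1979 orders, largest first; the 1978 order is filtered out.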

IBM, however, was in no hurry to ship a product. The company earned enormous revenues from IMS, its hierarchical database system used by the majority of large enterprises. Shipping a relational database that was genuinely better would cannibalize that revenue. IBM published detailed technical papers about System R, but its first commercial relational product, SQL/DS, did not arrive until 1981, and its flagship DB2 followed in 1983.

Someone was reading those papers.

Larry Ellison and the Oracle Gambit

Larry Ellison was a programmer in the San Francisco Bay Area who had read Codd’s 1970 paper and, when IBM published its System R research results in 1976, read those too. He recognized that IBM had proved the relational model worked — and that IBM, for institutional reasons, was not going to productize it quickly.

In 1977, Ellison co-founded Software Development Laboratories (renamed Relational Software, Inc. in 1979 and later Oracle Corporation) with Bob Miner and Ed Oates. They set out to build a commercial relational database based on the System R papers. Their first product, shipped in 1979 for the CIA (customer number one), was named Oracle Version 2. There was no Version 1; the version number was a marketing decision to imply maturity.

Oracle beat IBM’s DB2 to market by four years.

The company’s early survival depended on aggressive salesmanship and a willingness to claim capabilities the product did not yet fully have. Ellison’s Oracle was faster to market, faster to adopt new hardware, and faster to court customers than IBM’s more cautious organization. By the time DB2 shipped, Oracle had established itself as the default choice for new enterprise systems.

The relational database market of the 1980s became a multi-competitor race: Oracle, Sybase (1984), Informix (1985), and Ingres (an academic project at Berkeley led by Michael Stonebraker, commercialized in 1980) all fought for enterprise customers. IBM competed with both DB2 and a smaller product, SQL/DS. The wars were won largely on performance benchmarks, sales force size, and the ability to port to the emerging Unix workstation market.

Jim Gray and the Science of Transactions

Alongside the database product wars, a quieter research program was establishing the theoretical foundations of reliability. Jim Gray at IBM (and later Tandem Computers) spent the 1970s and 1980s working out what it meant for a database to be correct in the presence of failures — hardware crashes, power outages, concurrent updates from thousands of simultaneous users.

His answer was the transaction — a sequence of operations that must either complete entirely or not at all — and the ACID properties that any transaction system had to guarantee:

  • Atomicity: a transaction either fully commits or fully rolls back; no partial results.
  • Consistency: a transaction transforms the database from one valid state to another.
  • Isolation: concurrent transactions behave as if they ran sequentially.
  • Durability: once committed, a transaction’s effects survive any subsequent failure.
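Atomicity and consistency are the easiest of the four to observe directly. The following is a minimal sketch using SQLite through Python’s standard `sqlite3` module; the `accounts` table and the `transfer` and `balances` helpers are hypothetical, and a real banking system would involve far more than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint encodes a validity rule: no negative balances.
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds atomically. Returns True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failing debit must undo a completed step.
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds: 70 / 80
failed = transfer(conn, "alice", "bob", 500)  # debit would go negative
```

In the second call the credit to `bob` has already been applied when the debit violates the constraint. The rollback undoes it, so the balances stay at 70 and 80: no partial result survives.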

These four properties, formalized by Gray and his collaborators (the acronym ACID itself was coined by Theo Härder and Andreas Reuter in 1983), became the contract on which all financial computing depended. Every bank transfer, every stock trade, every insurance claim update rests on ACID semantics. Gray received the Turing Award in 1998. He disappeared at sea on a solo sailing trip in 2007 and was never found.

Dead End: The Object-Oriented Database

In the late 1980s and early 1990s, the rise of object-oriented programming created an apparent mismatch. Programs were now organized around objects: bundles of data and the methods that operated on them, with inheritance hierarchies and complex relationships. Relational databases stored flat tables of rows and columns. Translating between the two (an object in memory and its representation in a table) required tedious, error-prone code. Critics called it the “object-relational impedance mismatch.”

The proposed solution was the object-oriented database management system (OODBMS). Systems like GemStone (1987), ObjectStore, and Versant stored objects directly, as they existed in memory, without translation to rows and columns. For applications with highly complex, interconnected data — CAD systems, telecommunications networks — they offered genuine advantages.

The Ecosystem Problem

Object-oriented databases failed for reasons that had little to do with their technical merits. Relational databases had SQL — a standard query language understood by an entire profession of database administrators, report writers, and analysts. Object databases had proprietary query languages, no standardization, and no path for existing tools and skills to transfer. When organizations evaluated the trade-off — somewhat better performance for complex object graphs, versus complete loss of all existing database tooling and expertise — almost all chose to stay relational and write the translation code. The OODBMS market peaked in the late 1990s and largely vanished.

The impedance mismatch was eventually solved not by replacing the database but by automating the translation layer: Object-Relational Mappers (ORMs) like Hibernate (Java, 2001) and ActiveRecord (Ruby on Rails, 2004) generated the SQL automatically, hiding the relational model from application developers while retaining all the benefits of a relational database beneath the surface.
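The core trick of an ORM, generating SQL from a class definition, can be sketched in a few lines. This is a toy illustration and not the API of Hibernate or ActiveRecord; the `Customer` class and the helper functions `create_table`, `save`, and `find` are invented for the example, and real ORMs also handle relationships, types, caching, and migrations.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Customer:
    id: int
    name: str
    city: str


# Table and column names are interpolated into SQL below. That is safe
# here only because they come from the class definition, never from users.

def create_table(conn, cls):
    cols = ", ".join(f.name for f in fields(cls))
    conn.execute(f"CREATE TABLE {cls.__name__.lower()} ({cols})")


def save(conn, obj):
    """Generate an INSERT from the object's fields."""
    names = [f.name for f in fields(obj)]
    placeholders = ", ".join("?" for _ in names)
    sql = (f"INSERT INTO {type(obj).__name__.lower()} "
           f"({', '.join(names)}) VALUES ({placeholders})")
    conn.execute(sql, [getattr(obj, n) for n in names])


def find(conn, cls, **criteria):
    """Generate a SELECT and turn each row back into an object."""
    names = [f.name for f in fields(cls)]
    sql = f"SELECT {', '.join(names)} FROM {cls.__name__.lower()}"
    if criteria:
        sql += " WHERE " + " AND ".join(f"{k} = ?" for k in criteria)
    return [cls(*row) for row in conn.execute(sql, list(criteria.values()))]


conn = sqlite3.connect(":memory:")
create_table(conn, Customer)
save(conn, Customer(1, "Acme", "Oslo"))
save(conn, Customer(2, "Globex", "Lima"))
in_lima = find(conn, Customer, city="Lima")
```

The application code works only with `Customer` objects; the SQL exists, but it is generated rather than written by hand.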

The NoSQL Challenge

By the mid-2000s, the companies building the web’s largest systems had encountered a different limit of the relational model: scale.

A single relational database, even a fast one, traditionally runs on one server. When that server is full, the standard solution is a bigger server (vertical scaling). But there are physical limits to how large a single machine can be, and companies like Google, Amazon, and Facebook had reached them: they needed to store not gigabytes but petabytes, accessed by millions of simultaneous users.

The relational model’s ACID properties — particularly the isolation guarantee — required coordination between all parts of the system before committing any transaction. At the scale of thousands of servers, that coordination became a bottleneck that no amount of engineering could fully eliminate.

In 2006, Google published a paper describing Bigtable, a distributed storage system for structured data that deliberately abandoned the relational model in favor of a simpler structure, a sparse sorted map from keys to values, that could be partitioned across thousands of machines. In 2007, Amazon published a paper on Dynamo, a key-value store that deliberately gave up strong consistency in exchange for availability. The underlying trade-off is that a distributed system can offer strong consistency (every read sees the most recent write), high availability (every request gets a response), or tolerance of network partitions, but not all three simultaneously. This is the CAP theorem, conjectured by Eric Brewer in 2000 and proved by Seth Gilbert and Nancy Lynch in 2002.
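Why a key-value structure partitions so easily can be shown with a toy store. The sketch below uses simple modulo hashing to pick a node. That is a deliberately simpler technique than either real system uses (Bigtable splits sorted row ranges into tablets, and Dynamo uses consistent hashing), and the class `PartitionedStore` and its methods are hypothetical.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store whose keys are spread over in-process 'nodes'."""

    def __init__(self, node_count):
        # Each dict stands in for a separate machine.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash of the key alone decides where it lives, so any
        # client can locate a key without consulting the other nodes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)


store = PartitionedStore(node_count=4)
for i in range(100):
    store.put(f"user:{i}", {"visits": i})
```

Every operation touches exactly one node, so adding machines adds capacity. What the design gives up is anything that spans keys: a join or a multi-key transaction would need the cross-node coordination that these systems were built to avoid.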

The response was a wave of new systems, collectively branded NoSQL (Not Only SQL):

  • MongoDB (2009): document storage; JSON-like records instead of rows
  • Cassandra (developed at Facebook, open-sourced 2008): wide-column store optimized for write-heavy workloads across many nodes
  • Redis (2009): in-memory key-value store for caching and session data
  • HBase (2008): open-source Bigtable implementation

NoSQL is Not Anti-SQL

The “NoSQL” label was always misleading. These systems did not reject SQL as a language — they rejected the relational model and, specifically, the ACID transaction model that required tight coordination. Most NoSQL systems offered “eventual consistency”: given enough time without updates, all nodes would converge on the same value. For a social media feed, this was acceptable — seeing a post a few milliseconds late was harmless. For a bank balance, it was not. The choice of database became an architectural decision about which properties the application actually required, not a blanket preference for one technology.
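Eventual consistency can be illustrated with two replicas that accept writes independently and reconcile later. This sketch assumes a last-write-wins rule based on a caller-supplied version number, which is one simple reconciliation scheme chosen for the example; production systems use more careful mechanisms (the Dynamo paper, for instance, describes vector clocks). The `Replica` class is hypothetical.

```python
class Replica:
    """One copy of the data; accepts writes without contacting its peers."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Last-write-wins: keep the value with the highest version seen.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Background reconciliation: replay the other replica's entries."""
        for key, (version, value) in other.data.items():
            self.write(key, value, version)


a, b = Replica(), Replica()

a.write("post:1", "hello", version=1)
stale_read = b.read("post:1")  # b has not heard about the write yet

# Two updates land on different replicas before they have synchronized.
a.write("post:1", "hello, world", version=2)
b.write("post:1", "hello there", version=3)
diverged = a.read("post:1") != b.read("post:1")

# Once updates stop and the replicas exchange state, they agree.
b.sync_from(a)
a.sync_from(b)
converged = a.read("post:1") == b.read("post:1")
```

Between the writes and the synchronization, the two replicas give different answers to the same read. That window is harmless for a social feed and unacceptable for a bank balance, which is the architectural choice described above.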

Relational databases adapted. PostgreSQL — the open-source descendant of Stonebraker’s Ingres, whose full history is part of The Open Source Revolution — added JSON column types and document-style queries. Oracle and Microsoft SQL Server added horizontal partitioning. The distinction between “relational” and “NoSQL” blurred as each camp absorbed the other’s best ideas.

Legacy: The Language That Outlasted Everything

SQL, designed in 1974, standardized by ANSI in 1986, is today one of the most widely used programming languages in the world — ahead of Python, JavaScript, and Java by many measures of deployment, if not by developer surveys. Every data analyst, every data scientist, every backend developer works with it. It runs in embedded systems, in cloud data warehouses, in spreadsheet applications.

Codd did not live to see NoSQL; he died in 2003. His twelve rules were never fully implemented by any commercial system, and he spent his later years publicly complaining that products calling themselves “relational” were not. He was, in this as in most things, technically correct and commercially irrelevant.

The relational database he invented processes every ATM withdrawal, every airline reservation, every tax filing, every hospital record in the developed world. The infrastructure is invisible precisely because it works.

For the storage and memory architecture that databases run on, see The Von Neumann Architecture. For the open-source ecosystem that produced PostgreSQL and MySQL, see The Open Source Revolution.

