The Rise of Artificial Intelligence: Dreams, Winters, and the Deep Learning Revolution

Summary

This article traces the history of Artificial Intelligence from Alan Turing’s 1950 thought experiment, through the founding optimism of the Dartmouth Conference, two prolonged “AI Winters” caused by unfulfilled promises, the brief triumph of expert systems, and the statistical revolution that eventually culminated in deep learning and the large language model era. It is a history of extraordinary ambition, repeated disappointment, and a final vindication that arrived decades later than its founders expected — and in a form most of them did not anticipate.

The Question That Started Everything

In 1950, Alan Turing published a paper in the journal Mind titled “Computing Machinery and Intelligence.” It opened with a question that has driven an entire field ever since: “Can machines think?”

Turing immediately sidestepped the philosophical quagmire of defining “thinking” and replaced it with an operational test. He proposed what he called the Imitation Game: a human interrogator communicates, via text, with both a human and a machine. If the interrogator cannot reliably determine which is which, the machine has passed. The test — universally known today as the Turing Test — shifted the debate from metaphysics to measurement. The question was no longer whether a machine truly thinks, but whether it behaves indistinguishably from something that does.

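Turing predicted that by about the year 2000, machines would play the Imitation Game well enough that an average interrogator would have no more than a 70 percent chance of making the right identification after five minutes of questioning. He was, as we will see, both right and wrong in ways he could not have imagined. His fuller story is told in Alan Turing and the Enigma.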

The Dartmouth Conference: Naming a Field

In the summer of 1956, a small group of researchers gathered at Dartmouth College in New Hampshire for what would become the founding moment of a discipline. The conference was organized by John McCarthy (a young mathematician at Dartmouth), Marvin Minsky (Harvard), Nathaniel Rochester (IBM), and Claude Shannon (Bell Labs). McCarthy had chosen the name for his funding proposal: Artificial Intelligence.

The choice of name was deliberate and consequential. “Artificial Intelligence” was more provocative than the alternatives (“cybernetics,” “complex information processing,” “automata theory”), and McCarthy intended it to be. He wanted a name whose ambition matched the field’s goals.

The Dartmouth proposal stated the conference’s founding assumption with breathtaking confidence:

“Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

This was not a hypothesis. It was an axiom. The question the conference participants planned to answer over that summer was not whether machines could be made intelligent, but how to do it — soon.

Herbert Simon and Allen Newell arrived at Dartmouth having already built something remarkable: the Logic Theorist, a program that could prove mathematical theorems from Whitehead and Russell’s Principia Mathematica. It would eventually prove 38 of the first 52 theorems in the book’s second chapter. Simon, characteristically understated, declared that they had “solved the venerable mind-body problem.” In 1957 he predicted that within ten years, a computer would be chess champion of the world and would prove a major new mathematical theorem.

He was off by about thirty years on the chess prediction. The mathematical theorem took longer still.

The Golden Age: Confidence and Early Victories

The decade following Dartmouth was characterized by genuine progress and extravagant optimism in roughly equal measure.

John McCarthy made two enduring contributions. He designed Lisp (1958) — the programming language that would become the medium of AI research for decades, celebrated for its ability to manipulate symbolic expressions and its support for recursive functions. (The relationship between Lisp and the specialized hardware built to run it is explored in The Lisp Machine Era.) He also developed the concept of time-sharing — the idea that multiple users could share a single computer simultaneously — which shaped how computing resources were allocated throughout the 1960s.

Marvin Minsky at MIT pursued the dream of general intelligence with characteristic boldness. His early work on neural networks was promising; his later conviction that symbolic reasoning was the path to machine intelligence would shape — and constrain — the field for two decades.

Early programs produced results that seemed miraculous to contemporaries:

ELIZA (1966, Joseph Weizenbaum, MIT): A program that simulated a psychotherapist by pattern-matching user sentences and reflecting them back as questions. Users who knew it was a program often reported feeling genuinely understood. Weizenbaum was disturbed rather than gratified — he had intended ELIZA as a demonstration of the superficiality of such interactions, not as a genuine achievement.
SHRDLU (1970, Terry Winograd, MIT): A program that could answer questions and follow instructions about a simulated world of colored blocks, using natural language. Within its narrow domain, it was startlingly capable.
DENDRAL (1965, Edward Feigenbaum, Stanford): The first expert system, capable of identifying organic molecular structures from mass spectrometry data — outperforming many trained chemists within its domain.

Each success was real. Each was also narrowly bounded — impressive precisely because the domain was controlled and the real world was kept out.
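The pattern-matching-and-reflection idea attributed to ELIZA above can be sketched in a few lines. This is only an illustrative toy: the three rules, the pronoun table, and the function names `reflect` and `respond` are assumptions made for this example, and Weizenbaum's original script was considerably richer.

```python
import re

# Hypothetical, heavily simplified rules; not Weizenbaum's original script.
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}

RULES = [
    (re.compile(r"\bi feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]


def reflect(fragment):
    """Swap first-person words for second-person ones."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(sentence):
    """Return the first matching rule's template, or a stock fallback."""
    cleaned = sentence.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."


print(respond("I feel ignored by my boss."))
# prints: Why do you feel ignored by your boss?
```

Nothing in the sketch models meaning. Any input that matches no rule falls through to the stock reply, which is the superficiality Weizenbaum meant to expose.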

The First AI Winter: Reality Intervenes

The optimism could not survive contact with harder problems.

In 1966, the ALPAC report — commissioned by the U.S. government to assess machine translation research — concluded that after a decade of work and millions of dollars, automatic translation was nowhere near practical. Funding for machine translation was cut drastically. It was the first major indication that the problems AI researchers had assumed were nearly solved were, in fact, enormously harder than they appeared.

In 1969, Minsky and his MIT colleague Seymour Papert published “Perceptrons” — a mathematically rigorous analysis of single-layer neural networks (the dominant connectionist model of the era). They proved, correctly, that single-layer perceptrons could not learn certain classes of functions, including the simple XOR operation. Their book was widely interpreted — and Minsky made little effort to discourage this interpretation — as demonstrating that neural networks in general were a dead end.

A Correct Proof, Misapplied

The mathematics of Perceptrons was sound. The conclusion drawn from it — that neural networks could not solve interesting problems — was not. Minsky and Papert had analyzed single-layer networks. Multi-layer networks (with “hidden” layers between input and output) were not subject to the same limitations. But the proof’s authority, and Minsky’s stature, effectively ended funding for neural network research for nearly fifteen years. The lesson is not that Minsky was wrong — he was right about what he proved — but that the field drew broader conclusions than the mathematics warranted.
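The distinction between single-layer and multi-layer networks can be made concrete with XOR. A single threshold unit computing step(w1*x1 + w2*x2 + b) would need b <= 0, w1 + b > 0, w2 + b > 0, and w1 + w2 + b <= 0 all at once. Adding the two middle inequalities gives w1 + w2 + b > -b >= 0, which contradicts the last one. The sketch below is an illustration, not a proof: it searches a finite grid of weights for a single unit, then shows a two-layer network that does compute XOR. The weights and function names are hand-picked assumptions for this example.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(x):
    """Threshold activation: fire only for strictly positive input."""
    return 1 if x > 0 else 0


def single_layer(w1, w2, b, x1, x2):
    """One threshold unit applied directly to the inputs."""
    return step(w1 * x1 + w2 * x2 + b)


def single_layer_solves_xor(grid):
    """Try every weight combination on a finite grid (illustrative only)."""
    for w1, w2, b in product(grid, repeat=3):
        if all(single_layer(w1, w2, b, *x) == y for x, y in XOR.items()):
            return True
    return False


def two_layer(x1, x2):
    """Hidden units compute OR and AND; the output fires for OR but not AND."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)


grid = [i / 2 for i in range(-6, 7)]
print(single_layer_solves_xor(grid))          # prints: False
print([two_layer(*x) for x in sorted(XOR)])   # prints: [0, 1, 1, 0]
```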

In 1973, the British government commissioned Sir James Lighthill to assess the state of AI research. His report was devastating: after twenty years of promises, AI had produced no results of practical significance outside of the toy domains where it had always worked. The combinatorial explosion — the exponential growth in computational requirements as problems grew larger — had proven insurmountable for every technique then available.

The Lighthill Report triggered severe funding cuts in the UK. The U.S. Department of Defense, having invested heavily in AI for military applications, similarly pulled back. The first AI Winter had arrived.

Expert Systems: A Second Spring

The late 1970s brought a narrower but more commercially viable approach: instead of trying to build general intelligence, researchers built expert systems — programs that encoded the specific knowledge of human domain experts as explicit rules.
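To make "knowledge as explicit rules" concrete, here is a minimal forward-chaining sketch. The rules and fact names are invented for illustration and are not taken from any real expert system. Production systems of the era held thousands of rules and used far more sophisticated matching.

```python
# Hypothetical rules for illustration only; not drawn from any real system.
# Each rule is (set of required facts, fact to conclude).
RULES = [
    ({"cpu_installed", "memory_installed"}, "board_complete"),
    ({"board_complete", "power_supply_ok"}, "system_bootable"),
    ({"system_bootable", "disk_attached"}, "ready_to_ship"),
]


def forward_chain(facts, rules=RULES):
    """Repeatedly fire any rule whose conditions are all known facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


order = {"cpu_installed", "memory_installed", "power_supply_ok", "disk_attached"}
print("ready_to_ship" in forward_chain(order))  # prints: True
```

The sketch also hints at the approach's limits: a fact the rule authors never anticipated simply matches nothing, and the system has no way to reason about it.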

The archetype was XCON (originally R1), developed at Carnegie Mellon for Digital Equipment Corporation starting in 1978. XCON encoded the expertise needed to configure VAX computer systems, a process so complex that human engineers regularly made errors costing DEC significant money. By 1986, XCON had processed 80,000 orders and was saving DEC an estimated $40 million annually. It is widely regarded as the first commercially successful expert system.

The expert system wave triggered enormous excitement. Edward Feigenbaum at Stanford — one of its leading proponents — predicted that expert systems would transform every industry. In Japan, the government launched the Fifth Generation Computer Project in 1982: a ten-year, $400 million program to build AI computers using Prolog and logic programming that would overtake American computing capability. The U.S. and UK, alarmed, launched their own crash programs in response.

The Knowledge Acquisition Bottleneck

Expert systems required experts to articulate their knowledge as explicit rules — a process called knowledge acquisition. In practice, this was far harder than it sounded. Human experts often could not explain how they made decisions; their expertise was partly tacit, embedded in pattern recognition that resisted verbalization. Building an expert system for a new domain required months of painstaking interviews and iteration with domain specialists. This “knowledge acquisition bottleneck” imposed a hard ceiling on how quickly expert systems could be built and how broadly they could be deployed.

The Second AI Winter

By the late 1980s, the expert system boom was collapsing under its own weight. The knowledge acquisition bottleneck had proven intractable in practice. Expert systems were brittle: they performed well within their coded rules and failed catastrophically at anything outside them. They could not learn from new cases, adapt to changing domains, or handle the ambiguity and incompleteness of real-world information.

The Fifth Generation Project, for all its ambition and funding, produced no breakthroughs. Prolog and logic programming, the project’s theoretical foundation, scaled no better than earlier symbolic approaches. When the project ended in 1992, it had generated research but nothing approaching commercial relevance.

The AI hardware market — companies like Symbolics and LMI that had built specialized Lisp machines for AI research — collapsed as general-purpose workstations from Sun and the rising IBM PC-compatible world overtook their price-performance ratios. The story of this collapse is told in The Lisp Machine Era.

A second, deeper AI Winter had arrived. The term “artificial intelligence” itself became a liability in research proposals — too associated with broken promises. Researchers began describing their work with neutral terms: “machine learning,” “knowledge-based systems,” “computational linguistics.”

The Statistical Turn

The recovery, when it came, was built on a philosophical shift as much as a technical one.

The symbolic AI tradition had sought to make machines intelligent by giving them explicit knowledge and logical reasoning rules. The alternative — statistical machine learning — made no attempt to encode knowledge explicitly. Instead, it exposed algorithms to large quantities of data and let them discover statistical regularities.

Bayesian networks, support vector machines, and decision trees began outperforming rule-based systems on practical tasks: spam filtering, optical character recognition, speech recognition. The wins were unglamorous compared to the dreams of the 1950s — nobody was claiming these systems “understood” anything — but they were consistent and commercially valuable.

IBM’s Deep Blue defeating world chess champion Garry Kasparov in 1997 was a watershed moment, though not in the way AI researchers had originally imagined. Deep Blue won not through insight or learning but through massive computational search, evaluating 200 million positions per second. Simon’s prediction from 1957 was finally fulfilled, forty years after he made it and thirty years behind schedule, by a method no one at Dartmouth had envisioned.

Dead End: The Limits of Symbolic AI

The decades-long dominance of symbolic AI — and its eventual displacement — constitutes one of computing’s most instructive dead ends.

The symbolic approach rested on a foundational assumption: that intelligence could be captured as explicit symbols and logical rules. This assumption was plausible, elegant, and ultimately insufficient.

Cyc, begun in 1984 by Douglas Lenat at MCC (Microelectronics and Computer Technology Corporation), was the most ambitious attempt to test it directly. Lenat’s thesis was that AI systems failed because they lacked common sense — the vast background knowledge that humans take for granted. His solution: build a system containing all of it. Cyc would encode every fact of human common knowledge, explicitly, as logical assertions.

After four decades of continuous work and over $100 million in investment, Cyc contains more than 25 million assertions. It remains unable to perform tasks that a child manages effortlessly. The project demonstrated, more convincingly than any theoretical argument could, that human knowledge is not primarily stored as explicit propositions — and that trying to encode it as such hits a combinatorial wall that no amount of additional rules can surmount.

The deeper problem was what AI researchers came to call the frame problem: in a dynamic world, how does a reasoning system know what has not changed when something happens? Formal logic, which proved so powerful in mathematics, offered no natural answer to this question. The real world’s context-dependence defeated every symbolic approach that tried to engage it directly.

Geoffrey Hinton and the Neural Network Revival

While symbolic AI dominated funding and prestige through the 1970s and 80s, a small group of researchers continued working on neural networks, the connectionist approach that Minsky and Papert’s Perceptrons had delegitimized.

Geoffrey Hinton, a British cognitive scientist, had been convinced since the 1970s that the brain’s architecture — massively parallel, weighted connections between neurons, modified by experience — was the right model for machine intelligence. In 1986, Hinton, together with David Rumelhart and Ronald Williams, published a landmark paper demonstrating backpropagation as an efficient method for training multi-layer neural networks. The algorithm had been independently discovered earlier by others, but this paper made it accessible and demonstrated its power on practical problems.

Backpropagation worked by measuring the error at the output of a network and propagating it backward through the layers, adjusting each weight slightly to reduce the error. Run across millions of examples, this process could teach a network to recognize patterns that no human programmer had explicitly encoded.
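The description above can be sketched for a tiny network with two inputs, two hidden units, and one output. Everything specific here is an assumption chosen to keep the example short: sigmoid activations, a squared-error loss, no bias terms, and the function names. It is a minimal illustration of the idea, not the formulation used in the 1986 paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Run the network; params = (2x2 hidden weights, 2 output weights)."""
    w_hidden, w_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def loss(params, x, target):
    """Squared error at the output."""
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2


def backprop(params, x, target):
    """Return the gradient of the loss with respect to every weight."""
    w_hidden, w_out = params
    hidden, output = forward(params, x)
    # Error at the output, scaled by the slope of the output sigmoid.
    delta_out = (output - target) * output * (1.0 - output)
    grad_out = [delta_out * h for h in hidden]
    # Propagate that error backward through the output weights.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


def sgd_step(params, x, target, lr=0.1):
    """Adjust each weight slightly against its gradient."""
    w_hidden, w_out = params
    grad_hidden, grad_out = backprop(params, x, target)
    new_hidden = [[w - lr * g for w, g in zip(row, grad_row)]
                  for row, grad_row in zip(w_hidden, grad_hidden)]
    new_out = [w - lr * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out
```

Calling `sgd_step` repeatedly over many examples is the training loop. Each hidden-layer gradient includes a factor `h * (1.0 - h)`, which is never larger than 0.25, so stacking many such layers multiplies many small factors together.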

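The technique worked well for shallow networks. For deeper networks with many layers, training often broke down: the error signal is multiplied by another factor at every layer it passes back through, so gradients tended to vanish or explode before reaching the early layers, making training unstable.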

Yann LeCun, working at Bell Labs, found a partial answer. His Convolutional Neural Network (ConvNet, 1989) exploited the spatial structure of images: rather than connecting every neuron to every input pixel, each neuron connected only to a small local region of the input. This drastically reduced parameters, preserved spatial relationships, and proved highly effective for image recognition. Bell Labs deployed LeCun’s network for reading handwritten digits on checks; by the mid-1990s it was processing 10-20% of all checks written in the United States.
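A rough sketch of why local connectivity cuts the parameter count so sharply. The sizes used here (a 28x28 input, a 5x5 kernel) are illustrative assumptions, not figures from LeCun's network. The sketch also assumes the kernel's weights are shared across all positions, a standard property of convolutional layers that the paragraph above does not spell out.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output sees only a local patch.

    Computed without flipping the kernel (cross-correlation), the convention
    most deep learning libraries use under the name "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def dense_weight_count(in_h, in_w, out_h, out_w):
    """Weights needed if every output unit connects to every input pixel."""
    return in_h * in_w * out_h * out_w


def conv_weight_count(kh, kw):
    """Weights in one shared kernel, regardless of image size."""
    return kh * kw


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
print(conv2d_valid(image, kernel))         # prints: [[-4, -4], [-4, -4]]
print(dense_weight_count(28, 28, 24, 24))  # prints: 451584
print(conv_weight_count(5, 5))             # prints: 25
```

Under these assumptions, a fully connected layer producing a 24x24 output from a 28x28 image needs 451,584 weights, while a single shared 5x5 kernel producing the same output size needs 25.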

Despite these successes, neural networks remained a minority interest through the 1990s and early 2000s. Support vector machines and other statistical methods often matched or exceeded them on benchmark tasks, with less computational cost and more theoretical grounding.

The ImageNet Moment: 2012

The transformation came from data and compute as much as from algorithms.

In 2009, Fei-Fei Li at Stanford released ImageNet: a dataset of over 14 million labeled images, painstakingly assembled over several years. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) asked competing algorithms to correctly classify images into 1,000 categories.

In 2012, a team from the University of Toronto — Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton — entered the competition with a deep convolutional neural network they called AlexNet, trained on two NVIDIA graphics cards. AlexNet achieved a top-5 error rate of 15.3% — compared to 26.2% for the second-place team. The margin was not an incremental improvement; it was a discontinuity.

The field had found its watershed. Within two years, deep neural networks dominated every major computer vision benchmark. Within three, they had spread to speech recognition, natural language processing, and game-playing. In 2016, AlphaGo (DeepMind) defeated Lee Sedol, one of the world’s strongest Go players, at a game that experts had predicted would remain beyond machine reach for decades, because its branching factor made exhaustive search impossible and because strong play seemed to require human intuition.

The lesson of AlphaGo’s victory, and of deep learning generally, was one that the founding generation of AI researchers had explicitly rejected: machines did not need symbolic knowledge or logical rules. They needed data, computation, and the right architecture.

The Transformer and the Language Revolution

The final act — so far — began in 2017 when a team at Google published a paper titled “Attention Is All You Need.” Its authors — Ashish Vaswani, Noam Shazeer, Niki Parmar, and others — introduced the Transformer architecture: a neural network design built around a mechanism called self-attention, which allowed every element of a sequence to directly attend to every other element, regardless of distance.
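The core of self-attention can be sketched as scaled dot-product attention. To stay short, this version makes a simplifying assumption: the queries, keys, and values are the input vectors themselves, whereas a real Transformer layer first applies learned projection matrices and uses multiple attention heads. The function names are chosen for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """x is a list of token vectors; returns (outputs, attention weights).

    Simplification: queries, keys, and values are all x itself.
    """
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Score this token against every token in the sequence, near or far.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in x]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, x)) for j in range(d)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` says how much one token draws on every other token. Distance in the sequence plays no role in the computation, which is the property the paragraph above describes.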

Transformers proved extraordinarily effective for language tasks. Models trained on vast amounts of text learned statistical patterns at every scale — from spelling to sentence structure to factual associations to reasoning style. GPT (OpenAI, 2018), BERT (Google, 2018), and their successors grew exponentially in scale. By 2020, GPT-3 (175 billion parameters) produced text that was, in many contexts, indistinguishable from human writing.

Whether these systems “understand” language — or anything else — in any meaningful sense remains one of the most contested questions in contemporary science. What is undisputed is that Alan Turing’s 1950 prediction has been operationally fulfilled: in controlled tests, a significant fraction of human evaluators cannot reliably distinguish GPT-4’s responses from a human’s. The Imitation Game, seventy years later, is effectively won.

The question Turing carefully avoided — whether passing the test constitutes genuine intelligence — remains, as he surely knew it would, entirely open.

For the programming languages and infrastructure that enabled this revolution, see The Evolution of Language and The Open Source Revolution.