Geoffrey Hinton and Deep Learning

Summary

Geoffrey Hinton spent thirty years arguing, against the prevailing consensus of his field, that neural networks trained by backpropagation were the right path to machine intelligence. He was ignored, underfunded, and periodically dismissed during the years when support vector machines and symbolic methods dominated AI research. Then, in 2012, his lab’s AlexNet demolished the ImageNet competition by a margin that no one could explain away. He joined Google in 2013, watched the technology he had championed reshape the world, and left in 2023 to speak freely about the dangers of what he had built. He shared the 2024 Nobel Prize in Physics with John Hopfield. He remains, by his own account, genuinely scared.

Cambridge, Edinburgh, and an Unfashionable Idea

Geoffrey Everest Hinton was born on December 6, 1947, in Wimbledon, England. His great-great-grandfather was George Boole, the logician whose algebra underlies all digital computation — a lineage Hinton knows and does not make too much of. He studied experimental psychology at King’s College, Cambridge, graduating in 1970, and spent several years searching for a direction before beginning a PhD at the University of Edinburgh, which he completed in 1978 under Christopher Longuet-Higgins.

Edinburgh’s AI department in the 1970s was a formative environment — specifically a formative environment for being wrong together. The dominant approach to AI was symbolic: logic, explicit rules, handcrafted knowledge representations. Hinton arrived already convinced that the brain was the right model for intelligence, that its massively parallel, weighted, experience-modifiable architecture was not an incidental implementation detail but the essential feature to replicate. This conviction was not fashionable.

The Minsky-Papert critique of 1969, embodied in the book Perceptrons, had demonstrated that single-layer neural networks could not compute simple functions like XOR, and it effectively killed funding for connectionist research. The critique was technically accurate and strategically catastrophic: it faulted single-layer networks for limitations that multi-layer networks did not share, but the funding damage was done. Through most of the 1970s, working on neural networks meant working outside the mainstream of AI research.

Hinton accepted this. He was not primarily an engineer looking for the best method; he was a cognitive scientist with a hypothesis about the nature of mind. The hypothesis was that intelligent behavior emerged from the interaction of many simple processing units with learned connection weights — and that understanding this system was both a scientific goal (understanding cognition) and an engineering goal (building useful intelligence). The dual motivation kept him working when the engineering results were insufficient to justify the effort.

Backpropagation: Making Deep Networks Trainable

Hinton moved to North America in the early 1980s, spending time at Carnegie Mellon and UC San Diego as part of the Parallel Distributed Processing (PDP) group with David Rumelhart and James McClelland. The collaboration produced, in 1986, the paper that would define the first phase of Hinton’s legacy: “Learning Representations by Back-propagating Errors,” published in Nature with Rumelhart and Ronald Williams.

Backpropagation was not new. Paul Werbos had described it in his 1974 PhD thesis, and the chain rule of calculus it depended on was centuries old. What the 1986 paper accomplished was clarity and demonstration: it showed, in accessible terms, how gradient descent could be applied to multi-layer networks by propagating the error signal backward through the layers, and it demonstrated on practical problems that multi-layer networks trained this way could learn useful internal representations.
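The backward propagation of error described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the 1986 paper's code: the network size (two inputs, three sigmoid hidden units, one sigmoid output), the squared-error loss, the learning rate, and every function name here are assumptions chosen for brevity. XOR is used as the task because it is the function a single-layer network cannot compute.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    # Hypothetical initialization: small uniform random weights, zero biases.
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    # Hidden activations, then a single sigmoid output.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"])
    return h, y

def loss(p, data):
    return sum(0.5 * (forward(p, x)[1] - t) ** 2 for x, t in data)

def backward(p, x, t):
    """Gradients of 0.5 * (y - t)^2 for one example, via the chain rule."""
    h, y = forward(p, x)
    # Error signal at the output unit's pre-activation.
    delta_out = (y - t) * y * (1.0 - y)
    # Error signal sent backward through W2 to each hidden unit.
    delta_h = [delta_out * w * hj * (1.0 - hj) for w, hj in zip(p["W2"], h)]
    return {
        "W1": [[d * xi for xi in x] for d in delta_h],
        "b1": delta_h,
        "W2": [delta_out * hj for hj in h],
        "b2": delta_out,
    }

def train(p, data, lr=0.5, epochs=2000):
    # Plain per-example gradient descent.
    for _ in range(epochs):
        for x, t in data:
            g = backward(p, x, t)
            for j in range(len(p["W2"])):
                for i in range(len(x)):
                    p["W1"][j][i] -= lr * g["W1"][j][i]
                p["b1"][j] -= lr * g["b1"][j]
                p["W2"][j] -= lr * g["W2"][j]
            p["b2"] -= lr * g["b2"]
    return p

XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

if __name__ == "__main__":
    params = train(init_params(2, 3), XOR)
    for x, t in XOR:
        print(x, t, round(forward(params, x)[1], 3))
```

Whether this particular run reaches a good XOR solution depends on the random seed and learning rate, so no expected output is shown. The part that carries over to larger networks is `backward`: each layer's error signal is computed from the layer above it.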

The key example was the family tree problem: a network given family relationships as input learned, without being told, to represent people in terms of nationality and generation — compact, reusable features. This was the point Hinton most wanted to make. Backpropagation was not just a training algorithm; it was a feature-learning algorithm. Given data and a learning signal, deep networks could discover the representations appropriate to the data.

Info

The technical challenge that backpropagation could not yet solve was depth. Propagating gradient signals through many layers caused them to either vanish (become too small to update weights meaningfully) or explode (grow uncontrollably, destabilizing training). These vanishing and exploding gradient problems kept truly deep networks impractical through the 1990s. The remedies arrived piecemeal over the following two decades: better weight initialization schemes, rectified linear unit (ReLU) activations, and batch normalization, alongside regularizers such as dropout. Together they turned a theoretical technique into a practical one.
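A rough numerical picture of why depth was the obstacle: in a chain of layers, the backpropagated gradient is a product of per-layer factors, each roughly the weight times the activation's derivative. The sketch below makes a strong simplifying assumption (one unit per layer, with the same weight and pre-activation in every layer), and the function names are hypothetical. It is meant only to show how quickly such a product shrinks or grows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chain_gradient(depth, weight, pre_activation, act_grad):
    """Magnitude of d(output)/d(input) for a chain of `depth` one-unit layers
    that all share the same weight and pre-activation (a toy simplification)."""
    factor = abs(weight) * act_grad(pre_activation)
    return factor ** depth

if __name__ == "__main__":
    # Sigmoid at its steepest point, weight 1: each layer multiplies by 0.25.
    print(chain_gradient(10, 1.0, 0.0, sigmoid_grad))  # 0.25 ** 10, about 9.5e-07
    # Large weights push the product the other way: each layer multiplies by 2.
    print(chain_gradient(10, 8.0, 0.0, sigmoid_grad))  # 2.0 ** 10 = 1024.0
    # An active ReLU unit passes the gradient through unchanged.
    print(chain_gradient(10, 1.0, 1.0, relu_grad))     # 1.0
```

Real networks have many units per layer and weights that vary, so the actual behavior is governed by products of Jacobian matrices rather than scalars, but the exponential dependence on depth is the same.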

The 1986 paper revived neural network research, but only partially. Through the late 1980s and 1990s, support vector machines, decision trees, and Bayesian methods consistently matched or beat neural networks on standard benchmarks with less computation and better theoretical guarantees. The ML mainstream’s conclusion was reasonable: neural networks were interesting but not yet the best tool. Hinton’s conclusion was different: the tools were right, but the hardware and data were not yet there.

Boltzmann Machines and the Energy Approach

Inspired by John Hopfield’s 1982 paper describing associative memory as energy minimization, Hinton and Terrence Sejnowski developed the Boltzmann machine in 1985 — a stochastic generalization of the Hopfield network with hidden units that could learn probability distributions over data.

The Boltzmann machine was theoretically elegant and practically infeasible. Training required running the network to equilibrium twice per weight update, which was computationally prohibitive even for small networks. Hinton spent the next twenty years searching for tractable approximations.

The answer came in stages. Restricted Boltzmann Machines (RBMs) — Boltzmann machines with connections only between visible and hidden layers, not within layers — could be trained with a fast approximation called contrastive divergence. And stacked RBMs, each learning features of the previous layer’s representation, could initialize deep networks in a way that made subsequent fine-tuning with backpropagation feasible.
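A minimal sketch of one contrastive divergence step (CD-1) for a binary RBM follows, in plain Python. The class name, learning rate, initialization, and the choice to sample the reconstruction rather than use its probabilities are all assumptions for illustration; practical implementations vary in these details and work on mini-batches.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class RBM:
    """Toy binary RBM: weights connect visible to hidden units only."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.1) for _ in range(n_visible)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(sum(w * vi for w, vi in zip(row, v)) + cj)
                for row, cj in zip(self.W, self.c)]

    def visible_probs(self, h):
        return [sigmoid(sum(self.W[j][i] * h[j] for j in range(len(h))) + self.b[i])
                for i in range(len(self.b))]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: hidden probabilities given the data vector.
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        # One Gibbs step instead of running to equilibrium: reconstruct, re-infer.
        v1 = self.sample(self.visible_probs(h0))
        ph1 = self.hidden_probs(v1)
        # Move toward data statistics, away from reconstruction statistics.
        for j in range(len(self.c)):
            for i in range(len(self.b)):
                self.W[j][i] += lr * (ph0[j] * v0[i] - ph1[j] * v1[i])
            self.c[j] += lr * (ph0[j] - ph1[j])
        for i in range(len(self.b)):
            self.b[i] += lr * (v0[i] - v1[i])

if __name__ == "__main__":
    rbm = RBM(n_visible=4, n_hidden=2)
    for _ in range(100):
        rbm.cd1_update([1.0, 1.0, 0.0, 0.0])
    print(rbm.hidden_probs([1.0, 1.0, 0.0, 0.0]))
```

Because there are no connections within a layer, all hidden units can be updated at once given the visible units and vice versa, which is what makes the single reconstruction step cheap.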

In 2006, Hinton, Simon Osindero, and Yee-Whye Teh published “A Fast Learning Algorithm for Deep Belief Nets” in Neural Computation. The paper showed that a deep network could be initialized by greedily training one RBM layer at a time, producing a good starting point for gradient descent. Benchmark results were competitive with the best methods of the day.

This paper is often cited as the beginning of the deep learning era. The term “deep learning” itself — networks with many layers — was popularized partly to distinguish this work from the “shallow” neural networks that had failed to impress the ML community in the 1990s. The rebranding was not cynical; it described a genuine difference in what the networks were doing.

Toronto and the Unfashionable Group

Hinton had joined the University of Toronto in 1987, where he would remain — with interruptions for sabbaticals at Bell Labs, UCL, and Google — as the center of a research group that consistently produced more influence than its size would suggest. Toronto was not MIT or Stanford; it was a good university in a city not particularly associated with technology. This, too, may have been an advantage: less pressure to chase benchmarks, more freedom to follow research directions that the field had not yet validated.

Two of the PhD students who would prove most consequential were Alex Krizhevsky and Ilya Sutskever. Krizhevsky had written his master’s thesis on convolutional neural networks for object recognition; Sutskever had studied sequential models and would go on to co-found OpenAI. Together with Hinton, they entered the 2012 ImageNet Large Scale Visual Recognition Challenge.

AlexNet: The Discontinuity

The 2012 ImageNet results were not an improvement over the previous year’s state of the art. They were a rupture. The dataset and competition that made this moment possible, built by Fei-Fei Li over five years, are covered in ImageNet and the Deep Learning Revolution.

The competition measured top-5 error on a dataset of 1.2 million images across 1,000 categories: a prediction counts as correct if the true label appears among the model’s five highest-ranked guesses. The second-place system achieved a top-5 error of 26.2%. AlexNet achieved 15.3%. That is a gap of 10.9 percentage points on a benchmark that had been improving by one or two points per year. The winning architecture was a deep convolutional neural network with eight learned layers and about 60 million parameters, trained on two NVIDIA GTX 580 GPUs using ReLU activations, dropout regularization, and data augmentation.
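For readers unfamiliar with the metric, top-5 error can be computed as below. This is a plain-Python sketch with a hypothetical function name and made-up scores, not the competition's evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k
    highest-scoring classes. `scores` holds one list of class scores
    per example; `labels` holds the true class index for each."""
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

if __name__ == "__main__":
    # Two examples over six classes (illustrative numbers only).
    scores = [
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    ]
    labels = [1, 0]
    print(top_k_error(scores, labels, k=5))  # 0.0: both labels are in the top five
    print(top_k_error(scores, labels, k=1))  # 0.5: only the second is ranked first
```

Top-5 scoring is forgiving by design: with 1,000 fine-grained categories, an image can plausibly match several labels, so a model gets credit if the right one appears anywhere in its top five guesses.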

Every technique in AlexNet had a story. Convolutional networks had been developed by Yann LeCun at Bell Labs in the late 1980s (see Yann LeCun and Convolutional Networks). ReLU activations, explored by Nair and Hinton in 2010 and by Glorot, Bordes, and Bengio in 2011, mitigated the vanishing gradient problem more effectively than sigmoid or tanh. Dropout, developed by Hinton and his students, randomly disabled neurons during training, acting as a powerful regularizer. GPU training, which Krizhevsky had made work with a custom CUDA implementation, made the computation feasible in days rather than years.
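Two of these ingredients are simple enough to sketch directly. The code below shows ReLU and a dropout function in plain Python. Note the assumption: this is the "inverted" dropout variant common in later libraries, which rescales surviving units during training. The original formulation instead left training activations unscaled and scaled outputs at test time. Function names and defaults are illustrative.

```python
import random

def relu(z):
    """Rectified linear unit: pass positive values through, zero out the rest."""
    return [max(0.0, v) for v in z]

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and divide survivors by (1 - p_drop) so the expected activation
    is unchanged. At test time the activations pass through untouched."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

if __name__ == "__main__":
    hidden = relu([-1.5, 0.0, 2.0, 0.7])
    print(hidden)                           # [0.0, 0.0, 2.0, 0.7]
    print(dropout(hidden, training=False))  # unchanged at test time
    print(dropout(hidden, p_drop=0.5))      # a random subset zeroed, the rest doubled
```

Because a different random subset of units is silenced on every training step, no unit can rely on the presence of any particular other unit, which is the regularizing effect the paragraph above describes.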

Warning

The AlexNet result ended a debate that had lasted twenty years. Before 2012, one could reasonably argue that deep learning was one approach among many, interesting but not clearly superior. After 2012, this position was no longer tenable. Within two years, deep neural networks dominated every major benchmark in computer vision. Within three, the techniques had been applied to speech recognition, machine translation, drug discovery, and game playing. The AlexNet paper has been cited over 100,000 times.

Google, DNNresearch, and the Acquisition

Within weeks of the 2012 competition, Google, Microsoft, DeepMind, and Baidu were all pursuing Hinton’s team. Hinton, Krizhevsky, and Sutskever formed a company called DNNresearch with the specific intention of auctioning it — letting competing bids determine its value. Google won, paying approximately $44 million in early 2013 for a company whose primary assets were three people and the knowledge in their heads.

At Google, Hinton joined the Google Brain team, maintaining his Toronto appointment and dividing his time between academic research and applied work. He contributed to the development of Google’s neural network infrastructure, to improvements in Google Search and Google Translate, and to the research programs that would eventually produce systems far more capable than anything that existed in 2013.

He remained at Google for a decade. The trajectory of that decade — from small deep learning research group to the engine of the company’s core products to the foundation of large language models — was, in a sense, the vindication of everything he had argued for since Edinburgh. It was also the accumulation of the danger he would spend his post-Google years trying to communicate.

Departure and Warning

On May 1, 2023, Hinton announced that he was leaving Google, not to join a competitor or start a company, but to speak freely about risks he had not felt able to discuss as an employee. The announcement was accompanied by an interview in the New York Times that attracted global attention.

The concerns he articulated were precise. He believed that AI systems were approaching, or would soon approach, forms of intelligence that exceeded human capability in important domains. He was alarmed by the prospect of such systems being developed and deployed before anyone understood how to ensure they would pursue goals aligned with human welfare. He expressed particular concern about systems that learned from vast amounts of human-generated text acquiring not just human knowledge but human tendencies toward manipulation, deception, and self-interest — with capabilities that would make these tendencies far more dangerous than in any human.

He was specific about his uncertainty: he did not know how to make AI safe. He had no clear policy recommendations. He acknowledged the standard scientist’s consolation — if he hadn’t done it, someone else would — and said he found it insufficient.

Warning

“I console myself with the normal thing scientists say when they realize they’ve helped make something dangerous: ‘If I hadn’t done it, somebody else would have.’ But that’s not much consolation.” — Geoffrey Hinton, New York Times, May 2023. Hinton’s warning carried weight precisely because he had no obvious incentive to issue it. He was not selling a safety startup. He was not competing with Google. He was a 75-year-old retired academic expressing genuine alarm about the consequences of his life’s work.

The departure and subsequent statements reshaped the public discourse on AI risk. Yoshua Bengio had moved toward AI safety advocacy from within academia. Sam Altman talked publicly about the dangers of AGI while building toward it. But Hinton occupied a different rhetorical position: he was the person who had done more than almost anyone to create the risk he was warning about, and he was saying he didn’t know how to fix it.

Nobel Prize and the Reckoning

The 2024 Nobel Prize in Physics arrived while Hinton was continuing to warn about the technology it honored. The Royal Swedish Academy of Sciences cited him and Hopfield for “foundational discoveries and inventions that enable machine learning with artificial neural networks,” a recognition that explicitly connected their work to the physics tools (statistical mechanics, energy functions) that had made it productive.

Hinton received the prize with characteristic directness. He noted the irony of being honored for work whose consequences he was spending his retirement trying to mitigate. He made clear that the Nobel was not a reason to be less alarmed.

His story is, among other things, a story about the relationship between conviction and consequences. Thirty years of unfashionable belief, vindicated dramatically and completely. And then the vindication producing something that the believer finds terrifying. The scientist who was right about everything, unsure what to do about what being right produced.