The Transformer: "Attention Is All You Need"

Summary

In 2017, eight researchers at Google Brain and Google Research published a short paper titled “Attention Is All You Need” that discarded the dominant architecture of sequence modeling, the recurrent neural network, and replaced it with a mechanism called self-attention. The result, the Transformer, became the foundation of nearly every major AI system that followed: BERT, GPT, ChatGPT, DALL-E, AlphaFold2, and GitHub Copilot all trace their lineage to a single paper from a group of engineers who mostly left Google within five years. It is arguably the most influential AI paper since the backpropagation papers of the 1980s, and one of the most influential in the history of computing.

The Problem with Recurrent Networks

To understand why the Transformer was revolutionary, it helps to understand what it replaced.

By 2017, the dominant architecture for processing sequences — sentences, time series, audio — was the recurrent neural network (RNN) and its more capable variant, the Long Short-Term Memory (LSTM) network, developed by Hochreiter and Schmidhuber in 1997. RNNs processed sequences one element at a time, left to right, maintaining a “hidden state” that encoded everything the network had seen so far. The hidden state was passed to the next step, creating a kind of rolling memory.

This sequential architecture had fundamental problems. First, it was slow to train: because each step depended on the previous step’s output, the computation could not be parallelized across time steps. Training on long sequences was bottlenecked by this serial dependency, so the full power of a modern GPU went unused. Second, it was bad at long-range dependencies: information from early in a sequence could be diluted or lost by the time it was needed hundreds of tokens later, and during training the gradient linking the two points shrank as it was propagated back through many steps, a phenomenon called the vanishing gradient problem. LSTMs alleviated this somewhat through gating mechanisms, but the problem persisted. For translating long sentences or reasoning over extended passages, RNNs struggled.
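Both problems can be seen in a toy sketch. The scalar RNN below is purely illustrative: the function name `rnn_forward`, the weights `w_x` and `w_h`, and the tanh update are assumptions chosen for brevity, not a model from the paper.

```python
import math

def rnn_forward(inputs, w_x, w_h, h0=0.0):
    """Toy scalar RNN: returns the hidden state after each step."""
    h = h0
    states = []
    for x in inputs:
        # Step t needs the h produced by step t-1, so this loop
        # cannot be spread across parallel workers.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Only the first input is non-zero; its trace fades from the state
# a little more with every later step.
states = rnn_forward([1.0, 0.0, 0.0, 0.0], w_x=1.0, w_h=0.5)
```

With these hypothetical weights the hidden state shrinks at every step after the first, which is the "rolling memory" losing its grip on early input.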

Attention mechanisms had been used as additions to RNNs — allowing a decoder to “look back” at encoder states — since Bahdanau et al.’s influential 2014 paper on neural machine translation. But the attention was always supplementary, a spotlight on top of a sequential base. What Ashish Vaswani and his colleagues proposed was radical: throw out the recurrence entirely. Make attention the only mechanism.

The Team and the Paper

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin wrote “Attention Is All You Need,” presented at NeurIPS in December 2017. The team was assembled mostly within Google Brain and Google Research, many of them working on Google Translate’s neural translation system. They had a practical problem to solve — make translation faster and better — and a theoretical hunch that attention alone might be sufficient.

The paper’s title was a deliberate provocation. “Attention is all you need” announced that the complexity of LSTMs, the engineering effort invested in managing hidden states, the years of tricks to make RNNs tractable — all of it was unnecessary scaffolding around the thing that actually worked. It was a confident claim that turned out to be correct in ways even the authors did not fully anticipate.

Self-Attention: Every Token Sees Every Other Token

The core insight of the Transformer is self-attention: each position in a sequence can directly attend to every other position simultaneously, with the strength of attention determined by learned weights.

In concrete terms: when processing the word “bank” in the sentence “The bank by the river was steep,” self-attention lets the model examine “river,” “steep,” and “the” all at once in forming its representation of “bank.” A left-to-right RNN would build its representation of “bank” from the words that preceded it, with no access to “river,” which comes later.

Self-attention works by transforming each input token into three vectors, a Query, a Key, and a Value, through learned linear projections. The attention score between two tokens is the dot product of one token’s Query vector with the other’s Key vector, divided by the square root of the key dimension. Each token’s scores against all tokens are then passed through a softmax function to produce weights that sum to one. These weights are used to form a weighted sum of the Value vectors. The result is a new representation of each token that incorporates information from all other tokens, weighted by relevance.
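The steps above can be sketched in plain Python. This is a minimal illustration rather than an efficient implementation: matrices are lists of rows, the helper names are hypothetical, and identity matrices stand in for the learned projections.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """x holds one row per token; returns (new representations, weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))  # square root of the key dimension
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]
    return matmul(weights, v), weights

# Three toy tokens of dimension 2; identity matrices replace the
# learned projections (an assumption made to keep the numbers readable).
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to one, and each row of `out` is the corresponding mixture of Value vectors.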

Multi-head attention runs this process multiple times in parallel, with different learned projections for each “head,” allowing the model to simultaneously attend to different aspects of the relationships — syntax, semantics, coreference — in separate subspaces. The outputs of all heads are concatenated and projected back to the model’s dimension.
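A small sketch of the idea, under simplifying assumptions: each head gets its own (query, key, value) projection triple, the head outputs are concatenated per token, and a final matrix `w_o` projects back to the model dimension. The selector matrices and the identity `w_o` below are hypothetical stand-ins for learned weights.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    scale = math.sqrt(len(k[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """heads: one (w_q, w_k, w_v) triple per head."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    concat = []
    for i in range(len(x)):
        row = []
        for head_out in head_outputs:
            row.extend(head_out[i])  # concatenate the heads for token i
        concat.append(row)
    return matmul(concat, w_o)  # project back to the model dimension

# Two heads over a 4-dimensional model: head 0 looks only at the first
# two dimensions, head 1 only at the last two.
sel_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
sel_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
w_o = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(x, [(sel_a, sel_a, sel_a), (sel_b, sel_b, sel_b)], w_o)
```

Because each head works in its own subspace, the two halves of every output row are computed independently of each other.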

Because all positions attend to all other positions in a single operation, and because the operation is a matrix multiplication, the computation is fully parallelizable across the entire sequence. This is what allowed Transformers to exploit the massive parallel processing capability of modern GPUs in a way RNNs never could.

Positional Encoding and the Encoder-Decoder Structure

Self-attention has no inherent sense of order: it treats its input as a bag of tokens, each attending to all the others, with no notion of which came first. To inject sequence order, the paper introduced positional encoding: sinusoidal functions of different frequencies added to each token’s embedding, giving each position a unique signature that the model could use to infer order. Many later models replaced the fixed sinusoidal encodings with learned positional embeddings, but the principle of encoding position as part of the representation remained.
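The sinusoidal scheme is compact enough to write out. The sketch below follows the paper's formula, in which even dimensions use a sine and odd dimensions a cosine of the position divided by 10000 raised to (2i / d_model); the function names are illustrative.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a table with one row of d_model values per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency.
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(embeddings, table):
    """Add the encoding for position i to the embedding of token i."""
    return [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, table)]

pe = positional_encoding(num_positions=50, d_model=4)
```

Position 0 always encodes to alternating zeros and ones, and every later position receives a different row, which is the "unique signature" the model reads order from.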

The original Transformer followed the encoder-decoder structure standard for machine translation. The encoder processed the input sequence (the source language sentence) into a rich set of contextual representations through a stack of self-attention and feedforward layers. The decoder generated the output sequence (the translated sentence) one token at a time, attending both to previously generated tokens (via masked self-attention) and to the encoder’s output (via cross-attention).
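Masked self-attention is what keeps the decoder from looking at tokens it has not generated yet. A minimal sketch, assuming raw scores are already computed and that the mask simply sets future positions to negative infinity before the softmax:

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of query position i for key position j."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may look at positions 0..i only.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal raw scores, each position spreads its attention evenly
# over itself and everything before it.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
weights = causal_attention_weights(uniform)
```

Cross-attention uses the same weighted-sum mechanism without the mask, with Queries from the decoder and Keys and Values from the encoder's output.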

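Each encoder layer consisted of a multi-head self-attention sublayer followed by a position-wise feedforward network. Each decoder layer used masked self-attention and inserted a third sublayer between the two: cross-attention over the encoder’s output. Every sublayer was wrapped in a residual connection followed by layer normalization. This modular structure made the architecture easy to scale by stacking more layers.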
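The wrapping around each sublayer follows one pattern in the original paper: add the sublayer's output to its input, then normalize. The sketch below shows that pattern only. The learned gain and bias of layer normalization are omitted, and `toy_feed_forward` is a hypothetical stand-in (a bare ReLU) for the real two-layer feedforward network.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def apply_sublayer(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for a list of token vectors."""
    out = sublayer(x)
    return [
        layer_norm([a + b for a, b in zip(token_in, token_out)])
        for token_in, token_out in zip(x, out)
    ]

def toy_feed_forward(x):
    # Position-wise: each token is transformed independently of the others.
    return [[max(0.0, v) for v in token] for token in x]

y = apply_sublayer([[1.0, -2.0, 3.0, 0.0]], toy_feed_forward)
```

Because the input is added back before normalization, a layer can fall back to passing its input through nearly unchanged, which is part of why stacking many layers remains trainable.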

Why It Worked: Parallelism and Long-Range Dependencies

The Transformer’s practical advantages became clear immediately. On the WMT 2014 English-to-German translation benchmark, the Transformer achieved 28.4 BLEU — beating the previous best by over 2 points. More importantly, it achieved this using significantly less training time than comparable RNN architectures, because the parallel computation allowed the same number of gradient updates in a fraction of the wall-clock time.

The long-range dependency problem was greatly eased. Because every token attends to every other token directly, rather than information traveling through hundreds of sequential steps, distance imposes no information bottleneck. The attention weight between token 1 and token 500 in a long document is computed by exactly the same mechanism as between token 1 and token 2. The price is that the number of attention scores grows with the square of the sequence length.

The Transformer also avoided the vanishing gradient problem that plagued deep RNNs: gradients flow directly through attention connections and residual paths, making deep stacking feasible in a way that had required special engineering tricks in RNN architectures.

BERT, GPT, and the Architecture That Took Over Everything

The Transformer’s publication in late 2017 triggered an explosion. Within a year, two landmark systems demonstrated its potential:

BERT (Bidirectional Encoder Representations from Transformers), published by Jacob Devlin and colleagues at Google in October 2018, used only the Transformer encoder in a novel training setup: predicting masked tokens in a sentence (masked language modeling) and predicting whether two sentences were adjacent (next sentence prediction). Trained on BooksCorpus and the entire English Wikipedia, BERT could be fine-tuned on downstream tasks with just a small labeled dataset and achieved state of the art on eleven natural language processing benchmarks simultaneously. BERT established the pre-train then fine-tune paradigm that would define the next era of NLP.

GPT (Generative Pre-trained Transformer), published by Alec Radford, Ilya Sutskever, and colleagues at OpenAI also in 2018, used only the Transformer decoder in an autoregressive training setup: predict the next token given all previous tokens. Where BERT was bidirectional, GPT was unidirectional — generating text left to right. The generative paradigm pointed toward a different set of capabilities. GPT-2 (2019, 1.5B parameters) and GPT-3 (2020, 175B parameters) scaled this approach with results that astonished the field, and ultimately led to ChatGPT in November 2022.
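The difference between the two training objectives is easiest to see in the examples each one builds from the same sentence. The sketch below is illustrative only: the function names are hypothetical, and the masked positions are passed in explicitly, whereas BERT selects them at random.

```python
def next_token_examples(tokens):
    """GPT-style objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, positions, mask="[MASK]"):
    """BERT-style objective: hide chosen positions, predict them from both sides."""
    hidden = set(positions)
    corrupted = [mask if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return corrupted, targets

sentence = ["the", "bank", "by", "the", "river"]
pairs = next_token_examples(sentence)
corrupted, targets = masked_example(sentence, positions=[1])
```

In the GPT-style pairs, the model predicting “bank” sees only “the”. In the BERT-style example, the model filling in the masked slot also sees “river” to its right.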

The Transformer architecture subsequently spread beyond language. The Vision Transformer (ViT), published by Google in 2020, split images into patches and fed them as token sequences to a Transformer encoder, achieving competitive results on image classification without convolutional layers — threatening the dominance of the architectures that had defined computer vision since AlexNet in 2012. AlphaFold2 (DeepMind, 2021), which effectively solved the 50-year-old protein structure prediction problem, used Transformer-based attention mechanisms as its central architectural component. DALL-E, Stable Diffusion’s text encoder, and GitHub Copilot all built directly on Transformer foundations.
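The patch-splitting step that turns an image into a token sequence can be sketched briefly. This is a simplification under stated assumptions: the image is a single-channel grid stored as a list of rows, and the function name is hypothetical. A real Vision Transformer also projects each flattened patch linearly and adds position information before the encoder sees it.

```python
def image_to_patches(image, patch):
    """Split a 2D grid into flattened patch-by-patch tiles, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(tile)
    return patches

# A 4x4 "image" whose pixel values are just their row-major index.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch=2)
```

Each flattened tile then plays the role that a word embedding plays in a sentence.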

The Irony: Google Built the Architecture That Challenged Google

The Transformer’s story has a distinctive institutional irony. The paper was written by Google employees, using Google’s research resources, to improve Google’s products. But several of the eight original authors left Google within a few years to found or join competing AI companies:

Aidan Gomez co-founded Cohere, an enterprise AI company. Llion Jones co-founded Sakana AI. Jakob Uszkoreit co-founded Inceptive, working on RNA biology. Noam Shazeer co-founded Character.AI, which by 2023 was one of the most-used AI consumer products in the world. The broader Google Transformer alumni, including key contributors to BERT and subsequent systems, dispersed across Anthropic, DeepMind, and numerous startups.

Google itself, having created the architecture underpinning the AI revolution, found itself playing catch-up in the commercial deployment of large language models as OpenAI — using Google’s own invention — launched ChatGPT and captured the public imagination.

Dead End: The Pre-Transformer Paradigm

The architectures the Transformer displaced — LSTMs, GRUs, attention-augmented RNNs — did not fail suddenly but were abandoned with striking speed once the Transformer’s advantages became clear. The massive engineering investment the field had made in making RNNs tractable — gradient clipping, careful initialization, complex gating mechanisms, beam search with coverage mechanisms — became largely irrelevant.

What the Transformer era left behind was a fundamental assumption: that sequential, step-by-step processing was the right inductive bias for sequential data. The Transformer showed that global, parallel attention could learn sequential structure without being structurally sequential. It also raised deeper questions: if attention mechanisms could replace convolutions for images and recurrence for sequences, what architectural assumptions were truly necessary versus merely habitual?

The most honest answer is that the Transformer succeeded in part because it made training easy to scale — not because anyone proved it was the theoretically optimal architecture. The architecture’s dominance may itself eventually be displaced; structured state space models (SSMs) and other approaches have shown competitive performance with improved computational efficiency. But the Transformer’s place in history — as the architecture that enabled the transition from narrow AI tools to general-purpose language and vision systems — is secure.