Machine Translation

Summary

Machine translation (MT), the automatic conversion of text from one human language to another, is among the oldest goals of computing. It was attempted before AI even had a name, and it has been the site of some of the field’s most spectacular failures and most dramatic successes. Its history is a sequence of paradigm shifts: from the Cold War optimism of rule-based systems, which collapsed under the ambiguity of human language; through the long reign of statistical translation learned from large parallel corpora; to the neural revolution that, in about two years (2014–2016), cut translation errors by more than half on several language pairs and made the output of tools like Google Translate fluent for the first time. MT is also where the Transformer, the architecture behind essentially all modern large language models, was born. This article traces the field, complementing the broader NLP revolution.

The Cold War Dream and the ALPAC Winter

Machine translation predates almost everything else in AI. A 1949 memorandum by Warren Weaver of the Rockefeller Foundation framed translation as a kind of decoding problem (could one treat Russian as “English in code”?) and launched serious research. In January 1954, the Georgetown–IBM experiment publicly translated more than 60 Russian sentences into English using six grammar rules and a 250-word vocabulary. The demonstration generated enormous press coverage and predictions that the translation problem would be solved within three to five years.

It was not. The early systems were rule-based: dictionaries plus hand-written grammatical and syntactic rules. They foundered on the deep ambiguity of language — the same word has many meanings, syntax is genuinely ambiguous, and translation requires world knowledge. The era’s emblematic (likely apocryphal but illustrative) failure had a system render “the spirit is willing but the flesh is weak” into Russian and back as something like “the vodka is good but the meat is rotten.” After more than a decade of funding and slow progress, the U.S. government’s ALPAC report in 1966 concluded that MT was slower, less accurate, and more expensive than human translation, and recommended cutting funding. The report triggered a long MT winter, choking research for roughly two decades.

The Statistical Era: Learning from Parallel Text

MT’s revival came from the same source as speech recognition’s: IBM Research in the late 1980s and early 1990s, where a group including Peter Brown and the Della Pietra brothers (and again influenced by Frederick Jelinek’s data-driven philosophy) reframed translation as a statistical problem. The insight: rather than encode linguistic rules, learn translation from large collections of human-translated documents — parallel corpora such as the bilingual proceedings of the Canadian Parliament (Hansard) and later the European Parliament.

The IBM Models treated translation probabilistically: given a sentence in the source language, find the most probable target sentence. By Bayes’ rule, that probability factors into a translation model (how likely the source words are, given a candidate target sentence) and a language model (how fluent the candidate target sentence is). The approach matured in the 2000s into phrase-based statistical machine translation (SMT), which translated chunks of words rather than single words and handled local reordering. The open-source Moses toolkit democratized SMT research. Google Translate, launched in 2006, used statistical methods and scaled them across dozens of languages using Google’s vast web-crawled bilingual data.
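The decomposition into a translation model and a language model can be sketched in a few lines. This is a minimal illustration, not the IBM Models themselves: it assumes the standard noisy-channel form, in which the chosen translation maximizes P(source | target) × P(target), and it scores a fixed list of candidates instead of searching. The German source sentence "das Haus ist klein", the candidate translations, and every probability below are invented for the example.

```python
import math

# Toy probabilities, invented purely for illustration (not from any real corpus).
# Translation model: P(source | candidate) for the source "das Haus ist klein".
TRANSLATION_MODEL = {
    "the house is small": 0.20,
    "the home is little": 0.25,
    "small the house is": 0.30,
}

# Language model: P(candidate), i.e. how fluent the English sentence is.
LANGUAGE_MODEL = {
    "the house is small": 0.020,
    "the home is little": 0.004,
    "small the house is": 0.0005,
}


def noisy_channel_score(candidate: str) -> float:
    """Return log P(source | candidate) + log P(candidate)."""
    return math.log(TRANSLATION_MODEL[candidate]) + math.log(LANGUAGE_MODEL[candidate])


def decode(candidates):
    """Pick the candidate with the highest combined score (the argmax)."""
    return max(candidates, key=noisy_channel_score)


best = decode(list(TRANSLATION_MODEL))
print(best)  # prints "the house is small"
```

Note that the word-salad candidate has the highest translation-model score, since every source word is accounted for, but the language model penalizes it heavily. That division of labor is the point of the decomposition.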

Statistical MT was a genuine, useful advance — good enough to get the gist of a foreign web page — but it had characteristic flaws. Because it stitched together local phrases, output was often disfluent, grammatically broken, and prone to losing long-range structure and agreement. “Translationese” was instantly recognizable.

The Neural Revolution

Between roughly 2014 and 2016, Neural Machine Translation (NMT) transformed the field faster than almost any prior shift in AI. Instead of separate models for phrases and fluency, a single neural network learned to map an entire source sentence to an entire target sentence. The key architecture was the sequence-to-sequence (seq2seq) model with an encoder–decoder structure, built from recurrent networks (often LSTMs), introduced in 2014 by teams including Sutskever, Vinyals, and Le at Google and Cho and Bengio’s group in Montreal.

The decisive innovation was the attention mechanism (Bahdanau, Cho, Bengio, 2014–2015), which let the decoder “look back” at relevant parts of the source sentence for each word it produced. This removed the bottleneck of cramming a whole sentence into one fixed-length vector. In 2016, Google switched Google Translate to a neural system (GNMT) and reported error reductions of roughly 55–85% on several language pairs relative to its earlier phrase-based system, a leap users immediately noticed in dramatically more fluent output.
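The “look back” step can be sketched as a weighted average over the encoder’s per-word states. The sketch below is an assumption-laden simplification: Bahdanau-style attention scores each source position with a small learned feed-forward network, whereas this example swaps in a plain dot product to stay short and dependency-free. The vectors and the function names (`attend`, `softmax`, `dot`) are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    weights: how strongly the decoder attends to each source position.
    context: the weighted average of the encoder states.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional encoder states for a three-word source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 4.0]  # made-up state that most resembles the second word

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

Because a fresh context vector is computed at every output step, the decoder is no longer limited to a single fixed summary of the source sentence.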

Then, in 2017, researchers pursuing better translation published “Attention Is All You Need,” which discarded recurrence entirely and built a model purely from attention: the Transformer. It was created to improve machine translation, and it went on to become the foundation of essentially all modern large language models. In one of the field’s great ironies, the quest to translate languages produced the architecture that now powers general-purpose AI systems. Today translation is increasingly handled by large multilingual and multimodal models that translate as one capability among many, with quality on common language pairs approaching human parity for many everyday texts.

Dead End: Interlingua and the “Perfect Pivot Language”

A recurring dream in rule-based MT was the interlingua approach. Building a separate translation system for every ordered pair of languages requires N(N−1) systems for N languages, which grows as the square of N. The alternative was to translate each language into a single, universal, language-neutral representation of meaning (an “interlingua”) and then generate any target language from it. With N languages you would need only 2N components, one analyzer and one generator per language, instead of roughly N² translators. It was elegant, and it captured a genuine philosophical aspiration: to extract pure meaning from the accidents of any particular language.
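The scaling argument is easy to check with a little arithmetic. The sketch below counts one translator per ordered language pair, N(N−1), which grows like N², and compares it with one analyzer plus one generator per language, 2N. The function names are hypothetical and exist only for this illustration.

```python
def pairwise_systems(n_languages: int) -> int:
    """One directed translator per ordered language pair: N * (N - 1)."""
    return n_languages * (n_languages - 1)


def interlingua_components(n_languages: int) -> int:
    """One analyzer into the interlingua and one generator out of it per language: 2 * N."""
    return 2 * n_languages


for n in (3, 10, 100):
    print(n, pairwise_systems(n), interlingua_components(n))
# prints:
# 3 6 6
# 10 90 20
# 100 9900 200
```

The two designs break even at three languages. Beyond that the pairwise count pulls away quickly, which is why the interlingua looked so attractive on paper.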

In practice, interlingua systems (and the related “transfer” architectures) largely failed for the hardest of reasons: no one could define a complete, language-neutral representation of meaning. Human languages carve up the world differently — distinctions one language forces (grammatical gender, evidentiality, levels of politeness, tense systems) another lacks entirely — so any fixed interlingua either lost information or became impossibly baroque. Decades of effort produced working systems only in narrow domains (like weather bulletins).

The irony is that neural MT achieved something like the interlingua dream, but by accident and without anyone designing it. Google’s multilingual NMT (2016) trained a single network on many language pairs at once and found that it could perform “zero-shot” translation, that is, translation between language pairs it had never been explicitly trained on. This suggested that the network had learned an internal, shared semantic representation on its own. The lesson echoes the rest of AI’s history: the meaning representation that engineers could not hand-design, a neural network learned implicitly from data. The clean, human-specified interlingua was a dead end; the messy, learned, uninterpretable one inside a neural network was not.
