The Chomsky Hierarchy: Grammar, Languages, and the Spine of Every Parser
Zusammenfassung
In 1956 and 1959, Noam Chomsky published two papers on the formal structure of human language that were quickly recognized by computer scientists as the theoretical foundation of programming language design. Chomsky was not trying to design compilers; he was trying to understand how children acquire grammar. But the hierarchy of formal grammars he described — from the most powerful (unrestricted) to the most constrained (regular) — mapped precisely onto the computational mechanisms available for language recognition. Every programming language parser, every regular expression engine, every XML validator, and every protocol that communicates between computers operates within the framework Chomsky described. His work is among the most consequential theoretical contributions ever borrowed from one field by another.
Noam Chomsky and the Linguistics Problem
Noam Chomsky was a graduate student in linguistics at the University of Pennsylvania when he began working on the mathematical foundations of grammar in the early 1950s. His central concern was a problem in theoretical linguistics: how is it possible for a human child to learn a language from a finite set of examples and then use it to produce and understand an infinite set of sentences never heard before?
The answer, Chomsky argued, had to involve generative grammar: a finite set of rules that could produce (generate) the infinite set of sentences that constituted a language. These rules were not a list of phrases to memorize; they were a structural description of how sentences were built — what could follow what, how constituents could be nested, how transformations related active and passive forms.
His 1956 paper “Three Models for the Description of Language,” published in the IRE Transactions on Information Theory, classified grammars by their generative power. His 1959 paper “On Certain Formal Properties of Grammars” formalized the classification into what became known as the Chomsky Hierarchy.
The Four Levels
The hierarchy organized grammars by the form of their production rules — the rules that defined how symbols could be rewritten or expanded. More powerful grammars allowed more general forms of rules; more constrained grammars allowed less, but could be recognized by simpler computational machinery.
Type 0: Unrestricted Grammars (recognized by Turing machines). Production rules could rewrite any string of symbols into any other string. These grammars could generate any language that any computer program could generate — their expressive power was equivalent to computation itself. The price was that determining membership in an unrestricted language was, in general, undecidable.
Type 1: Context-Sensitive Grammars (recognized by linear-bounded automata). Production rules were constrained: the right side of a rule could be no shorter than the left side. These grammars could handle languages where the validity of a construction depended on surrounding context — hence “context-sensitive.” Natural human languages were generally believed to be context-sensitive. Most programming languages were not, being simpler.
Type 2: Context-Free Grammars (CFG) (recognized by pushdown automata). Production rules rewrote a single nonterminal symbol into a string of terminals and nonterminals, without reference to context. The key property: a nonterminal could be expanded the same way regardless of where it appeared. Context-free grammars were sufficient to describe most programming language syntax. The pushdown automaton that recognizes them — a finite-state machine augmented with a stack — was efficiently implementable.
Type 3: Regular Grammars (recognized by finite automata). The most constrained level. Production rules could only produce a single terminal symbol, optionally followed by a single nonterminal. These grammars generated regular languages — the languages describable by regular expressions. Finite automata, which could be efficiently implemented as lookup tables, recognized them. Regular grammars were too limited for nested structures (like matched parentheses) but were fast enough for lexical analysis (tokenizing source code into keywords, identifiers, and operators).
The Stack Is the Key
The distinction between regular and context-free grammars maps directly onto a computational difference. Regular languages can be recognized without memory beyond the current state — a finite automaton needs only to know “what state am I in?” Context-free languages require a stack: to recognize that a closing parenthesis matches the correct opening parenthesis, nested arbitrarily deep, the parser must remember all the opening parentheses it has seen, in order. Regular expressions cannot match nested structures; context-free grammars can. This is why grep cannot parse HTML (HTML has nested structure; grep uses regular expressions) and why XML parsers must use a different mechanism.
The Impact on Computer Science
Chomsky published his hierarchy as linguistics theory. John Backus and Peter Naur, independently developing a formal notation for the ALGOL 60 language specification, arrived at Backus-Naur Form (BNF) — a notation for context-free grammars — in 1959–1960. When computer scientists recognized that BNF was equivalent to Chomsky’s context-free grammars, the theoretical and practical traditions connected.
The implications were immediate. If programming language syntax was context-free, the theory of pushdown automata provided a complete account of what parsers needed to do and what computational resources they required. The parsing algorithms that followed — LL parsers (top-down, reading left to right), LR parsers (bottom-up, reading left to right) — were grounded in Chomsky’s theoretical framework.
Don Knuth’s 1965 paper “On the Translation of Languages from Left to Right” introduced LR parsing, proving that it could handle a strict superset of LL languages and establishing the theoretical basis for the YACC (“Yet Another Compiler Compiler”) tool developed by Stephen Johnson at Bell Labs in 1975. YACC took a context-free grammar specification and generated a parser — mechanically, from the theory. Its successors (Bison, ANTLR, PEG parsers) are standard tools in language implementation today.
The regular expression — the practical language for specifying Type 3 grammars — entered programming through Ken Thompson’s 1968 implementation in QED and later grep. Thompson’s construction algorithm (converting a regular expression to a nondeterministic finite automaton, then to a deterministic finite automaton) is the foundation of every regular expression engine used today. The grep utility’s name is a shorthand for the QED editor command for globally applying a regular expression search.
What Falls Outside Context-Free Grammars
Not everything in programming language syntax fits within context-free grammars. Several common requirements exceed Type 2:
Variable declarations before use: the requirement that a variable must be declared before it is used in most programming languages cannot be captured in a context-free grammar. The grammar can describe the syntactic form of a variable reference; it cannot check that the name was previously declared. This is a context-sensitive property — it depends on what appeared earlier in the program.
Type checking: whether an expression’s type is compatible with the context in which it appears requires context that context-free grammars cannot represent.
C’s parsing ambiguity: the C language is famously not context-free because the parser cannot determine whether (A)(B) is a type cast or a function call without knowing whether A is a type name — which requires the symbol table maintained during parsing.
These context-sensitive aspects of real languages are handled outside the pure grammar, in a separate semantic analysis phase that runs after the context-free parser has built a parse tree. The separation between syntactic analysis (context-free) and semantic analysis (context-sensitive) is a direct consequence of the Chomsky Hierarchy’s structure.
The Hierarchy Beyond Compilers
The Chomsky Hierarchy’s influence extends beyond programming language parsing:
Network protocols: the syntax of HTTP headers, email addresses (RFC 5321), and URL structures are specified using regular grammars or context-free grammars. The choice of grammar level determines what kind of parser is appropriate.
DNA and bioinformatics: biologists use formal grammars (particularly stochastic context-free grammars) to model RNA secondary structure, where the folding of a molecule involves nested complementary base pairs — a context-free structure.
Natural language processing: computational linguistics borrowed back from computer science. Large language models (transformer-based neural networks) implicitly learn distributions over natural language text rather than using explicit grammars — a fundamentally different approach that bypasses the hierarchy. Whether the representations learned by transformers correspond to the levels of the Chomsky Hierarchy is an active research question.
📚 Sources
- Chomsky, Noam: “Three Models for the Description of Language” — IRE Transactions on Information Theory, Vol. 2, No. 3 (1956)
- Chomsky, Noam: “On Certain Formal Properties of Grammars” — Information and Control, Vol. 2, No. 2 (1959)
- Hopcroft, John E.; Motwani, Rajeev & Ullman, Jeffrey D.: Introduction to Automata Theory, Languages, and Computation, 3rd ed. (2006), Addison-Wesley
- Knuth, Donald E.: “On the Translation of Languages from Left to Right” — Information and Control, Vol. 8, No. 6 (1965)
- Thompson, Ken: “Regular Expression Search Algorithm” — Communications of the ACM, Vol. 11, No. 6 (1968)
- Aho, Alfred V. & Ullman, Jeffrey D.: The Theory of Parsing, Translation, and Compiling (1972), Prentice Hall