Zum Inhalt springen

The History of Parsing: From BNF to ANTLR

Zusammenfassung

Parsing — the process of analyzing a string of symbols according to a formal grammar — is the invisible machinery that transforms source code into something a computer can execute. Every compiler, interpreter, data format reader, and protocol implementation contains a parser. The history of parsing is the history of computer scientists discovering, repeatedly, that the problem of recognizing formal languages was harder than expected, more theoretically interesting than it appeared, and more practically important than any other component of language processing. From John Backus’s informal notation for ALGOL in 1959 through Terence Parr’s ANTLR in the 1990s, parser technology evolved from hand-coded state machines into sophisticated generator tools that produced parsers automatically from grammar descriptions.

The Pre-BNF Era: Hand-Written Parsers

The first compilers were written before formal parsing theory existed. John Backus’s FORTRAN compiler (1957) — the first optimizing compiler for a high-level language — used hand-coded recognition logic that reflected Backus’s intuition about the language’s structure rather than a systematic theory. Each language construct was recognized by ad hoc code that the compiler team wrote specifically for it.

The limitation was maintenance. Modifying the language meant modifying the parser by hand, with no systematic guide to whether the changes were consistent or complete. A grammar change that looked simple — adding a new form of loop — could require touching a dozen places in the parser where related constructs were recognized, missing some, and introducing subtle bugs.

The first compiler compilers — programs that could generate parsers from grammar descriptions — emerged around 1960. Tony Brooker and Derrick Morris at Manchester built the first system to use the name “compiler-compiler” (presented in 1960, used to generate compilers for the Atlas computer), among the earliest demonstrations that compiler construction could be partially automated. The theoretical foundation for doing so was the Chomsky Hierarchy, which Chomsky was publishing at the same time — but the connection between Chomsky’s formal linguistics and compiler construction was not yet clear.

Backus-Naur Form: Grammar as Notation

John Backus and Peter Naur independently developed the notation for describing context-free grammars that became the lingua franca of programming language specification. Backus described the ALGOL 58 language using a notation he called “metalinguistic formulas” in 1959; Naur revised and applied it to ALGOL 60 in 1960, and the result was published in the Communications of the ACM.

Backus-Naur Form (BNF) described grammar rules as:

<nonterminal> ::= <alternative1> | <alternative2>
<expression>  ::= <term> | <expression> "+" <term>
<term>        ::= <factor> | <term> "*" <factor>
<factor>      ::= <number> | "(" <expression> ")"

Each rule defined how a nonterminal could be expanded. The double colon-equals (::=) meant “can be replaced by.” The vertical bar (|) meant “or.” The entire grammar was a finite set of such rules, from which the infinite set of valid programs could be derived.

BNF was both a specification tool and a design tool. Writing a grammar in BNF forced language designers to be precise about syntax in ways that English descriptions did not. Ambiguities — places where a sentence could be derived in more than one way — became visible in the formal notation. The ALGOL 60 report’s grammar was the first formal language specification using what would become the standard notation; every language specification since has used BNF or one of its descendants.

Niklaus Wirth later extended BNF to EBNF (Extended BNF) — in his 1977 Communications of the ACM note “What can we do about the unnecessary diversity of notation for syntactic definitions?” — adding repetition ({...}) and optionality ([...]) operators, reducing the number of rules needed to express common patterns. (Knuth’s contribution to the notation was separate: he proposed renaming Backus’s “Backus Normal Form” to “Backus-Naur Form” in 1964.) ISO standardized EBNF in 1996.

LL Parsers: Top-Down Recognition

The first parsing algorithms worked top-down: beginning with the grammar’s start symbol and repeatedly expanding nonterminals according to grammar rules, trying to produce the input string. This approach was intuitive but fragile.

Recursive descent parsing — implementing each grammar rule as a procedure that called other procedures for its subparts — was simple to write by hand and naturally produced readable code. A grammar rule like <statement> ::= if <expression> then <statement> became a procedure parseStatement() that called parseExpression() and parseStatement() recursively. The approach worked well for languages with no ambiguity and sufficient lookahead — knowledge of the next few input tokens — to choose between alternatives.

The systematic version, LL(k) parsing (Left-to-right scan, Leftmost derivation, k-token lookahead), was analyzed theoretically in the mid-1960s. An LL(k) parser required that, at every point in parsing, looking at the next k tokens would uniquely determine which grammar rule to apply. Grammars that satisfied this constraint could be parsed deterministically in linear time.

The limitation was grammatical: left-recursive rules (rules where a nonterminal’s first symbol was itself: <expr> ::= <expr> + <term>) could not be handled by LL parsers — they caused infinite recursion. Left recursion had to be eliminated by grammar transformation, which was mechanical but changed the grammar’s structure in ways that made the parse tree less directly meaningful.

LR Parsers: Bottom-Up Power

Donald Knuth’s 1965 paper “On the Translation of Languages from Left to Right” introduced the LR(k) parsing framework: Left-to-right scan, Rightmost derivation in reverse, k-token lookahead. LR parsers worked bottom-up: they read input tokens and reduced sequences of tokens to nonterminals when a grammar rule matched, building the parse tree from leaves to root.

The power of LR over LL was substantial. LR parsers could handle all LL languages plus a large class of grammars that LL could not. Left recursion, which was fatal for LL, was natural for LR. Most programming languages that were awkward to describe in LL-parseable grammars were naturally LR-parseable.

The practical problem was implementation complexity. Computing the LR parse tables — the automaton states representing partial parses and the actions to take on each input — was theoretically mechanical but computationally intensive. Knuth’s original LR(k) required tables that grew exponentially with k. The breakthrough came from two simplifications:

SLR (Simple LR, 1969) by Frank DeRemer reduced table size by using a coarser analysis of which actions were valid in which states.

LALR (Lookahead LR) by Frank DeRemer (1969) further reduced tables by merging states that differed only in their lookahead sets; DeRemer and Tom Pennello later (1982) gave an efficient algorithm for computing LALR lookahead sets. LALR(1) — LALR with 1-token lookahead — was sufficient for most practical programming languages and produced tables small enough to compile and distribute efficiently.

LALR(1) became the standard for automatic parser generation.

YACC and the Parser Generator Era

Stephen Johnson at Bell Labs implemented YACC (Yet Another Compiler Compiler) in 1975 as an LALR(1) parser generator. The programmer wrote a grammar specification in BNF with embedded C actions; YACC produced a C source file implementing the parser.

expr : expr '+' term  { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor { $$ = $1 * $3; }
     | factor
     ;

YACC was distributed with Unix and became the standard tool for parser construction through the 1980s and 1990s. Its companion, lex (later flex), was a lexer generator — it took regular expressions and produced a lexical analyzer (tokenizer) that fed tokens to the YACC-generated parser. The lex/YACC pipeline became the standard compiler front-end construction kit.

GNU Bison (1988) was the FSF’s YACC-compatible replacement, adding additional grammar classes (GLR — Generalized LR — for ambiguous grammars) and becoming the standard tool on Linux systems. ANTLR (Another Tool for Language Recognition), developed by Terence Parr beginning in 1989, took a different approach: LL(k) parsing with sophisticated lookahead analysis that could handle a larger class of grammars than classical LL(1). ANTLR 4 (2013) used Adaptive LL() (ALL()) parsing — a dynamic algorithm that computed lookahead during parsing rather than statically, eliminating most grammar restrictions while maintaining linear time for most real inputs. ANTLR 4 became the parser generator of choice for many complex language tools, including Eclipse IDE’s language tooling and numerous DSL and query-language implementations. (Some major compilers went the opposite way: Clang/LLVM’s C and C++ front-end uses a hand-written recursive-descent parser rather than a generator, for finer control over error recovery.)

PEG Grammars and Packrat Parsing

Parsing Expression Grammars (PEGs), introduced by Bryan Ford in 2004, offered a fundamentally different model. Where CFG rules defined sets of strings (any string that could be derived was valid), PEG rules defined ordered choice: alternatives were tried left to right, and the first match was taken. This made PEGs deterministic by definition — no ambiguity was possible — but it also meant that PEGs could describe non-context-free languages.

Ford’s packrat parsing algorithm implemented PEGs in guaranteed linear time through memoization: each subparsing result was cached so that repeated attempts to parse the same position with the same rule returned the cached result rather than re-examining the input.

PEGs became popular for language tools where the programmer wanted the grammar to exactly specify parsing behavior without the ambiguity concerns of LALR or the restrictions of LL. The Rust programming language’s macro system used a PEG-like grammar; many modern language implementations used PEG-based tools (pest, nom, Parsec) rather than traditional LALR generators.

Dead End: Operator Precedence Parsers

Operator precedence parsing, developed by Niklaus Wirth and Helmut Weber in 1966 and refined by Vaughan Pratt’s Pratt parser (1973), handled expression parsing with operator precedence more naturally than either LL or LR approaches. Pratt parsers associated parsing functions directly with token types rather than grammar rules, making them easy to write, modify, and extend with new operators at runtime.

The Pratt parser was technically elegant and practically underused for decades — academic parser theory focused on the generality of LR methods. It was rediscovered in the 2000s when language designers working on interpreted languages (JavaScript engines, Lua, scripting languages) found that Pratt parsers were easier to implement correctly and extend dynamically than YACC-generated parsers. The V8 JavaScript engine and many modern language interpreters use Pratt-derived expression parsing. Its delayed mainstream adoption is a case study in how theoretically elegant tools can be displaced by tools that arrived first with institutional support.


📚 Sources