[plt-scheme] Re: from hell to paradise

From: Prabhakar Ragde (plragde at uwaterloo.ca)
Date: Wed Feb 18 11:12:00 EST 2009

Hendrik Boom wrote:

>> This week it's "Why the hell did I have to learn the theory of LR 
>> parsing with explicit stacks and state transitions instead of 
>> mutually-recursive functions?" --PR
> It's a completely different algorithm, with quite different 
> capabilities.  I think, if you're going to write compilers, you should 
> know both, so you'll be able to use the one that suits the language.

See, you're a victim of the conspiracy, too. They are not completely 
different algorithms. There is a thread of continuity between them 
explored in a series of papers, the central one being Leermakers, 
Augusteijn, and Kruseman Aretz (TCS 1992).

Here it is in a nutshell. If we take a context-free grammar G and create 
a function A() for each nonterminal A whose job it is to consume (from 
an input list provided as a parameter to the function A) an initial 
sequence of terminals derivable from A, it is easy to write what we call 
a naive recursive-descent parser. It can even handle ambiguous grammars 
if we have it produce a list of input suffixes (that is, make it 
nondeterministic). But it takes exponential time on some grammars, and 
gets into infinite loops with left-recursive productions (e.g. A -> Ab).
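Such a naive nondeterministic parser can be sketched in a few lines. This is my illustration, not code from the post; the grammar (S -> aSb | ab) and the name consume are chosen for the example:

```python
# Naive nondeterministic recursive descent: consume(A, s) returns every
# suffix of s left over after consuming an initial sequence of terminals
# derivable from A. The grammar here is a hypothetical example.

GRAMMAR = {
    "S": [["a", "S", "b"], ["a", "b"]],
}

def consume(symbol, s):
    """Return the list of input suffixes remaining after symbol."""
    if symbol not in GRAMMAR:                 # terminal: must match the head
        return [s[1:]] if s[:1] == symbol else []
    suffixes = []
    for rhs in GRAMMAR[symbol]:               # try every production
        rests = [s]
        for sym in rhs:                       # thread suffixes through the rhs
            rests = [r2 for r in rests for r2 in consume(sym, r)]
        suffixes.extend(rests)
    return suffixes

# A full parse leaves the empty suffix among the results.
print("" in consume("S", "aabb"))   # True
print("" in consume("S", "aab"))    # False
```

Note that a left-recursive rule such as S -> S a would make consume call itself on the same input forever, which is exactly the infinite-loop problem described above.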

So we transform G to ensure there are at most two nonterminals on the 
right-hand side of every rule. There are many related ways to do this 
(cf Chomsky normal form) but we choose one that introduces items as 
nonterminals. An item is a grammar rule with a dot in the right-hand 
side to indicate partial progress. The rules in the transformed grammar 
have an item on the left side and on the right side, the nonterminal 
after the dot followed by the item with the dot moved over one position.
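The item transformation can be mechanized directly. The sketch below is mine (the representation of an item as a (lhs, rhs, dot-position) triple is an arbitrary choice); it emits, for every item with the dot not at the end, the rule rewriting that item to the symbol after the dot followed by the advanced item:

```python
# Item transformation sketch: each rule A -> X1...Xn yields items
# (A -> X1 ... dot ... Xn), and each non-final item rewrites to
# [symbol after the dot, item with the dot advanced one position],
# giving at most two symbols on every right-hand side.

def items_of(lhs, rhs):
    """All dotted items for one rule, dot positions 0..len(rhs)."""
    return [(lhs, tuple(rhs), d) for d in range(len(rhs) + 1)]

def transformed_rules(grammar):
    rules = []
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            for (l, r, d) in items_of(lhs, rhs):
                if d < len(r):
                    rules.append(((l, r, d), [r[d], (l, r, d + 1)]))
    return rules

for item, rhs in transformed_rules({"S": [["a", "S", "b"]]}):
    print(item, "->", rhs)
```

For the single rule S -> aSb this produces three transformed rules, one per dot position short of the end, each with exactly two symbols on the right.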

If we apply recursive descent to the transformed grammar (with a 
function for each item), it still has the same problems. But if we 
memoize the computation, it's not hard to show that it takes O(n^3) 
time. [Aside: up to this point should be core material for any reputable 
CS curriculum, but it is not.] We can cut this down by introducing 
lookahead (first and follow sets), giving LL(k).
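The memoization step is mechanical once the parser is keyed on input positions rather than suffixes. A sketch, again with an illustrative grammar and names of my own choosing:

```python
# Memoized nondeterministic recognition: caching (symbol, position) ->
# set of end positions bounds the work polynomially in the input length.
from functools import lru_cache

GRAMMAR = {"S": [["a", "S", "b"], ["a", "b"]]}

def recognize(start, text):
    @lru_cache(maxsize=None)
    def consume(symbol, pos):
        if symbol not in GRAMMAR:             # terminal
            ok = pos < len(text) and text[pos] == symbol
            return frozenset({pos + 1}) if ok else frozenset()
        ends = set()
        for rhs in GRAMMAR[symbol]:
            positions = {pos}
            for sym in rhs:                   # thread positions through the rhs
                positions = {q for p in positions for q in consume(sym, p)}
            ends |= positions
        return frozenset(ends)
    return len(text) in consume(start, 0)

print(recognize("S", "aaabbb"))   # True
print(recognize("S", "aab"))      # False
```

With the grammar already transformed so right-hand sides have at most two symbols, the bookkeeping per cached entry is constant-bounded, which is where the O(n^3) bound comes from.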

But we still have the problem with left recursion, which tends to be 
natural in grammars for programming languages (and, I am told, for 
natural language grammars as well). There are techniques for avoiding 
the infinite loops, but the resulting parse trees "slant the wrong way". 
We can handle that by transforming the parse tree, but that involves 
operations on trees, not to mention building the parse tree in the first 
place. (Such things are hard for students instructed using Java.)

So we transform the grammar again to eliminate left recursion. There are 
many related ways to do this (cf Greibach normal form), but we choose 
one that involves left corners of the symbol after the dot (a left 
corner being any initial symbol generated by rewriting the leftmost 
symbol zero or more times). Then we apply recursive descent to the newly 
transformed grammar.
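Computing left corners is itself a small fixed-point calculation. A sketch under the definition above (a symbol is its own left corner, via zero rewrites); the function name and grammar are illustrative:

```python
# Left corners of a symbol: the symbol itself, plus anything reachable by
# repeatedly rewriting the leftmost symbol of some production.

def left_corners(grammar, symbol):
    corners, frontier = set(), {symbol}
    while frontier:
        sym = frontier.pop()
        if sym in corners:
            continue
        corners.add(sym)                       # zero-or-more rewrites
        for rhs in grammar.get(sym, []):
            if rhs:
                frontier.add(rhs[0])           # one more leftmost rewrite
    return corners

G = {"S": [["S", "a"], ["a"]]}                 # left-recursive example
print(left_corners(G, "S"))                    # {'S', 'a'}
```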

Expressed in terms of the original grammar, this involves not only a 
function for each item, but a function for each (left corner, item) 
pair. This is nondeterministic recursive ascent parsing, so named 
because the recursive functions "ascend" from a terminal which is a left 
corner through successive undone leftmost rewrites.

To make this more general, we use the idea from NFA->DFA conversion, and 
replace functions on items by functions on sets of items (called 
states). Where before the natural successor of an item I with respect to 
a nonterminal A just after a dot was the item with the dot moved over 
the A, now the successor of a state is the collection of such plus the 
introduction of new items with A on the left and some right-hand side 
with a dot all the way to the left. This is the LR(0) automaton, and 
what I've described is its transition or "goto" function.
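The state construction just described (closure plus goto) is short enough to sketch. This is my rendering of the standard LR(0) construction, with a tiny left-recursive grammar as the example:

```python
# LR(0) states: a state is a set of items; closure adds, for each
# nonterminal A just after a dot, every item with A on the left and the
# dot all the way to the left; goto moves the dot over one symbol.

GRAMMAR = {"S": [["S", "a"], ["a"]]}   # left-recursive: S -> S a | a

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for alt in GRAMMAR[rhs[dot]]:
                    new = (rhs[dot], tuple(alt), 0)   # dot at far left
                    if new not in items:
                        items.add(new)
                        changed = True
    return frozenset(items)

def goto(state, symbol):
    """The transition function: advance the dot over symbol, then close."""
    moved = {(l, r, d + 1)
             for (l, r, d) in state
             if d < len(r) and r[d] == symbol}
    return closure(moved)

start = closure({("S'", ("S",), 0)})   # augmented start item S' -> . S
print(sorted(start))
print(sorted(goto(start, "S")))
```

For this grammar the start state contains three items, and its goto on S contains two, one of which (S -> S . a, with S -> S a completable) is where a shift-reduce conflict would surface if the grammar invited one.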

If we apply recursive ascent to this, and work out the details, we get 
an LR(0) parser. (Pepper at TU-Berlin explains it entirely in terms of 
grammar transformations.) Shift-reduce and reduce-reduce conflicts are 
explained by being exactly the points in the function definitions where 
nondeterminism is possible. Adding lookahead as we did for LL(k) gives 
LR(k). Cutting down on the number of states and the complexity of the 
transition function by including various restrictions gives SLR or LALR.

These sets of mutually-recursive functions don't look much like the 
traditional LR(0) or LR(k) parser. To get there, we make the recursion 
stack explicit, and treat the input list as a stack. Then the core of 
the algorithm becomes an iterative loop, popping and pushing. Finally, 
we notice that we can combine the two stacks into one, processing the 
input sequentially, and we have the conventional presentation (which, 
unlike everything I have described up to this point, cannot handle 
nondeterminism caused by grammars that don't conform to the restrictions).
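The single-stack iterative shape can be seen even in a toy. The loop below is a hand-written shift-reduce recognizer for the tiny left-recursive grammar S -> S a | a, of my own construction; note that, as the paragraph says, it is deterministic by design and cannot explore alternatives the way the recursive formulations can:

```python
# A minimal hand-written shift-reduce loop: one combined stack, shift a
# token, then reduce greedily (S a -> S, or a -> S). Illustrative only.

def parse(text):
    stack = []
    for ch in text:
        if ch != "a":
            return False
        stack.append("a")                # shift
        if stack[-2:] == ["S", "a"]:     # reduce S a -> S
            stack[-2:] = ["S"]
        elif stack[-1:] == ["a"]:        # reduce a -> S
            stack[-1:] = ["S"]
    return stack == ["S"]

print(parse("aaa"))   # True
print(parse(""))      # False
```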

But all the intuition has been drained out of it, as the blood has been 
drained out of a slaughtered animal that ends up as shrink-wrapped 
pieces of meat in a supermarket. Yes, they're convenient, and we adults 
make the connection with living animals if we think about it enough, 
though usually we don't. The traditional LR(k) algorithm is efficient in 
time and space (to an extent not really needed in many applications any 
more, and certainly not needed in a pedagogical presentation), but 
students don't make the connection with more natural or simpler 
techniques. Instead, they either use tools (e.g. Lex/YACC) that they 
only dimly understand, flailing away at shift-reduce or reduce-reduce 
conflicts by randomly rewriting their grammars, or they learn the 
algorithm by rote but glean what intuition they can from one or two 
worked examples.

I went further, learning the theories of LL and LR parsing from 
researchers in formal languages at two universities, but until recently 
I also thought these were "completely different" algorithms that could 
only be expressed iteratively using mutation. To be fair, I learned said 
theories before the above connections were worked out; the first papers 
in this series (Roberts and Kruseman Aretz) appeared in 1988. Twenty 
years later, we no longer have an excuse. --PR

Posted on the users mailing list.