[plt-scheme] Re: from hell to paradise
Hendrik Boom wrote:
>> This week it's "Why the hell did I have to learn the theory of LR
>> parsing with explicit stacks and state transitions instead of
>> mutually-recursive functions?" --PR
>
> It's a completely different algorithm, with quite different
> capabilities. I think, if you're going to write compilers, you should
> know both, so you'll be able to use the one that suits the language.
See, you're a victim of the conspiracy, too. They are not completely
different algorithms. There is a thread of continuity between them
explored in a series of papers, the central one being Leermakers,
Augusteijn, and Kruseman Aretz (TCS 1992).
Here it is in a nutshell. If we take a context-free grammar G and create
a function A() for each nonterminal A whose job it is to consume (from
an input list provided as a parameter to the function A) an initial
sequence of terminals derivable from A, it is easy to write what we call
a naive recursive-descent parser. It can even handle ambiguous grammars
if we have it produce a list of input suffixes (that is, make it
nondeterministic). But it takes exponential time on some grammars, and
gets into infinite loops with left-recursive productions (e.g. A -> Ab).
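Here is a minimal sketch in Scheme of such a naive nondeterministic
parser, for a toy grammar of my own choosing, S -> a S a | a (the
function names are mine too):

  ;; Naive nondeterministic recursive descent for S -> a S a | a.
  ;; Each function consumes a prefix derivable from its nonterminal
  ;; and returns the list of ALL possible leftover suffixes.

  (define (eat t input)                 ; match one terminal, or fail
    (if (and (pair? input) (eq? (car input) t))
        (list (cdr input))
        '()))

  (define (parse-S input)
    (append
     ;; S -> a S a
     (apply append
            (map (lambda (rest1)
                   (apply append
                          (map (lambda (rest2) (eat 'a rest2))
                               (parse-S rest1))))
                 (eat 'a input)))
     ;; S -> a
     (eat 'a input)))

  ;; (parse-S '(a a a)) => (() (a a))

Acceptance means the empty suffix () shows up among the results. On
nastier grammars the repeated recursive calls blow up exponentially,
and with a left-recursive rule the function for S would call itself
on the very same input, looping forever.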
So we transform G to ensure there are at most two symbols on the
right-hand side of every rule. There are many related ways to do this
(cf. Chomsky normal form), but we choose one that introduces items as
nonterminals. An item is a grammar rule with a dot somewhere in its
right-hand side to indicate partial progress. Each rule in the
transformed grammar has an item on the left and, on the right, the
symbol just after the dot followed by the item with the dot moved
over one position.
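For example (with [ ] marking items), the single rule A -> B C turns
into

  [A -> .BC]  ->  B [A -> B.C]
  [A -> B.C]  ->  C [A -> BC.]
  [A -> BC.]  ->  (empty)

with A itself standing for its initial item [A -> .BC]; every
right-hand side now has at most two symbols.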
If we apply recursive descent to the transformed grammar (with a
function for each item), it still has the same problems. But if we
memoize the computation, it's not hard to show that it takes O(n^3)
time. [Aside: everything up to this point should be core material for
any reputable CS curriculum, but it is not.] We can cut this down by
introducing
lookahead (first and follow sets), giving LL(k).
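A sketch of the memoization step, assuming a recent PLT Scheme for
make-hasheq and friends (the wrapper is mine): since every suffix the
parser ever sees is a tail of the one original input list, we can key
the table on the suffix itself using eq?.

  ;; Memoize a parse function on its input suffix.  All suffixes are
  ;; tails of the one original input list, so an eq?-based table works.

  (define (memoize f)
    (let ((cache (make-hasheq)))
      (lambda (input)
        (hash-ref cache input
                  (lambda ()                 ; not yet computed
                    (let ((result (f input)))
                      (hash-set! cache input result)
                      result))))))

  ;; e.g. (set! parse-S (memoize parse-S))

After the set!, the recursive calls inside parse-S go through the
table too, since they refer to the top-level binding.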
But we still have the problem with left recursion, which tends to be
natural in grammars for programming languages (and, I am told, for
natural language grammars as well). There are techniques for avoiding
the infinite loops, but the resulting parse trees "slant the wrong way".
We can handle that by transforming the parse tree, but that involves
operations on trees, not to mention building the parse tree in the first
place. (Such things are hard for students instructed using Java.)
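For example, with E -> E - n | n, the left-slanting tree for
5 - 3 - 1 correctly groups it as (5 - 3) - 1. The usual
loop-avoiding rewrite, E -> n E' with E' -> - n E' | (empty),
accepts the same strings, but its tree slants to the right,
suggesting the reading 5 - (3 - 1) until we repair it.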
So we transform the grammar again to eliminate left recursion. There are
many related ways to do this (cf Greibach normal form), but we choose
one that involves left corners of the symbol after the dot (a left
corner being any initial symbol generated by rewriting the leftmost
symbol zero or more times). Then we apply recursive descent to the newly
transformed grammar.
Expressed in terms of the original grammar, this involves not only a
function for each item, but a function for each (left corner, item)
pair. This is nondeterministic recursive ascent parsing, so named
because the recursive functions "ascend" from a terminal which is a left
corner through successive undone leftmost rewrites.
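For example (a toy grammar of mine), with E -> E + n | T and
T -> T * n | n, the left corners of E are E itself (zero rewrites),
T (via E -> T), and n (via T -> n). On input beginning with n, the
ascent starts at the left corner n, recognizes T -> n, perhaps
extends it through T -> T * n, steps up through E -> T, and perhaps
extends through E -> E + n: undoing the leftmost rewrites from the
bottom up.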
To handle more grammars deterministically, we use the idea from
NFA->DFA conversion, and replace functions on items by functions on
sets of items (called states). Where before the natural successor of
an item with respect to a symbol A just after its dot was the item
with the dot moved over the A, now the successor of a state with
respect to A is the collection of all such moved items, closed under
adding, for each nonterminal B just after a dot, the items with B on
the left and the dot all the way to the left of the right-hand side.
This is the LR(0) automaton, and what I've described is its
transition or "goto" function.
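Here is a small sketch of the goto function in Scheme; the encoding
and the toy grammar are mine. An item is a list (lhs done todo),
where done and todo hold the symbols before and after the dot, and a
state is a list of items:

  (define grammar
    '((S (E))               ; S -> E
      (E (E + n) (n))))     ; E -> E + n | n

  (define (rules-for A)
    (cond ((assq A grammar) => cdr)
          (else '())))

  (define (nonterminal? X) (and (assq X grammar) #t))

  ;; closure: keep adding items B -> . rhs for each nonterminal B
  ;; found just after a dot.
  (define (closure items)
    (let loop ((todo items) (state '()))
      (cond ((null? todo) (reverse state))
            ((member (car todo) state) (loop (cdr todo) state))
            (else
             (let* ((item (car todo))
                    (after (caddr item))
                    (new (if (and (pair? after)
                                  (nonterminal? (car after)))
                             (map (lambda (rhs)
                                    (list (car after) '() rhs))
                                  (rules-for (car after)))
                             '())))
               (loop (append (cdr todo) new)
                     (cons item state)))))))

  ;; goto: move the dot over X wherever possible, then close.
  (define (goto state X)
    (closure
     (apply append
            (map (lambda (item)
                   (let ((after (caddr item)))
                     (if (and (pair? after) (eq? (car after) X))
                         (list (list (car item)
                                     (append (cadr item) (list X))
                                     (cdr after)))
                         '())))
                 state))))

  ;; (goto (closure '((S () (E)))) 'E)
  ;;   => ((S (E) ()) (E (E) (+ n)))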
If we apply recursive ascent to this, and work out the details, we get
an LR(0) parser. (Pepper at TU-Berlin explains it entirely in terms of
grammar transformations.) Shift-reduce and reduce-reduce conflicts
turn out to be exactly the points in the function definitions where
nondeterminism is possible. Adding lookahead as we did for LL(k) gives
LR(k). Cutting down on the number of states and the complexity of the
transition function by including various restrictions gives SLR or LALR.
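Worked out for that toy grammar (S -> E, E -> E + n | n), the five
LR(0) states become five mutually-recursive functions. In my encoding
each one returns (k lhs rest), meaning "pop k more frames, then goto
on lhs"; a reduction by A -> w returns k = |w| - 1, and in a grammar
this small only q0 ever performs a goto, so the other states just
decrement k and pass the result up:

  (define (pop k lhs rest) (list k lhs rest))

  (define (q0 input)       ; S -> .E    E -> .E+n    E -> .n
    (let loop ((r (if (and (pair? input) (eq? (car input) 'n))
                      (q1 (cdr input))        ; shift n
                      (error "expected n"))))
      (if (eq? (cadr r) 'E)
          (loop (q2 (caddr r)))               ; goto on E
          (caddr r))))                        ; lhs = S: done

  (define (q1 input)       ; E -> n.
    (pop 0 'E input))                         ; reduce E -> n

  (define (q2 input)       ; S -> E.    E -> E.+n
    (if (and (pair? input) (eq? (car input) '+))
        (let ((r (q3 (cdr input))))           ; shift +
          (pop (- (car r) 1) (cadr r) (caddr r)))
        (pop 0 'S input)))                    ; reduce S -> E

  (define (q3 input)       ; E -> E+.n
    (if (and (pair? input) (eq? (car input) 'n))
        (let ((r (q4 (cdr input))))           ; shift n
          (pop (- (car r) 1) (cadr r) (caddr r)))
        (error "expected n")))

  (define (q4 input)       ; E -> E+n.
    (pop 2 'E input))                         ; reduce E -> E+n

  ;; (q0 '(n + n + n)) => ()   -- accepted: no input left over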
These sets of mutually-recursive functions don't look much like the
traditional LR(0) or LR(k) parser. To get there, we make the recursion
stack explicit, and treat the input list as a stack. Then the core of
the algorithm becomes an iterative loop, popping and pushing. Finally,
we notice that we can combine the two stacks into one, processing the
input sequentially, and we have the conventional presentation (which,
unlike everything I have described up to this point, cannot handle
nondeterminism caused by grammars that don't conform to the restrictions).
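As a skeleton (my encoding: action and goto are passed in as
functions computing the usual tables), the core loop is just:

  ;; The conventional iterative core: one explicit stack of states,
  ;; input processed left to right.  `action` maps (state, lookahead)
  ;; to (shift s), (reduce lhs rhs-length), (accept), or #f.

  (define (lr-parse action goto-table input)
    (let loop ((stack '(0)) (input input))
      (let ((a (action (car stack)
                       (if (null? input) '$ (car input)))))
        (case (and a (car a))
          ((shift)  (loop (cons (cadr a) stack) (cdr input)))
          ((reduce) (let* ((lhs (cadr a))
                           (stack (list-tail stack (caddr a))))
                      (loop (cons (goto-table (car stack) lhs) stack)
                            input)))
          ((accept) 'accepted)
          (else     'rejected)))))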
But all the intuition has been drained out of it, as the blood has been
drained out of a slaughtered animal that ends up as shrink-wrapped
pieces of meat in a supermarket. Yes, they're convenient, and we adults
make the connection with living animals if we think about it enough,
though usually we don't. The traditional LR(k) algorithm is efficient in
time and space (to an extent not really needed in many applications any
more, and certainly not needed in a pedagogical presentation), but
students don't make the connection with more natural or simpler
techniques. Instead, they either use tools (e.g. Lex/YACC) that they
only dimly understand, flailing away at shift-reduce or reduce-reduce
conflicts by randomly rewriting their grammars, or they learn the
algorithm by rote but glean what intuition they can from one or two
worked examples.
I went further, learning the theories of LL and LR parsing from
researchers in formal languages at two universities, but until recently
I also thought these were "completely different" algorithms that could
only be expressed iteratively using mutation. To be fair, I learned said
theories before the above connections were worked out; the first papers
in this series (Roberts and Kruseman Aretz) appeared in 1988. Twenty
years later, we no longer have an excuse. --PR