-
Notifications
You must be signed in to change notification settings - Fork 66
Left Recursion
Traditional PEG does not support left-recursion in grammars. Pegged does support all sorts of left-recursion since v0.3.0, so you don't need to worry about it. The remainder of this page will explain what left-recursion is, and how Pegged detects and deals with it.
Suppose we wanted to match strings like n
, n+n
, n+n+n
, etc. with a grammar like
Left:
S <- E eoi
E <- E '+n' / 'n'
Because the first element of the rule E
is itself E
, a naive implementation would construct a parser function that would call itself as the first instruction with the same input, thereby recursing immediately without doing anything and without ever to return, until you run out of stack space and crash.
Just changing the order of options is not an option, because the first option would then always only match the first n
, and no attempts would be made to match the rest of the input.
A more subtle form of the same problem is that the recursive element is hidden behind other elements that can succeed without consuming input, like an optional:
E <- F? E '+n' / 'n'
Even though F
is tried first, F?
does nothing if F
doesn't match, and in that case execution proceeds to call E
on the same input and thereby falls into the same trap.
Hidden left-recursion can even be more obscured. Consider this:
E <- F E '+n' / 'n'
F <- A B C / D*
Because D*
matches the empty string, F
succeeds without consuming input, and hides left-recursion in E
. You can imagine that there is no end to how deep left-recursion can be hidden in your grammar.
We've seen that direct left-recursion happens when E
calls into itself on the same input. Indirect left-recursion happens when E
as its first step calls any number of other rules, that at one or more points call back into E
with the same input. For example:
E <- F '+n' / 'n'
F <- G H / J
J <- K / E L
This causes this left-recursive cycle: E <- F <- J <- E
on an input that doesn't match G H
and K
.
It can yet be more complicated. For inputs like nlm-n+(aaa)n
, this grammar has several interlocking left-recursive cycles:
Left:
S <- E eoi
E <- F 'n' / 'n'
F <- E '+' I* / G '-'
G <- H 'm' / E
H <- G 'l'
I <- '(' A+ ')'
A <- 'a'
E <- F <- E
G <- H <- G
E <- F <- G <- E
Meaning that there is more than one way to left-recurse at the same input.
Another example: two rules that are mutually left-recursive.
Left:
M <- L eoi
L <- P '.x' / 'x'
P <- P '(n)' / L
L <- P <- L
P <- P
The traditional approach to deal with left-recursion is at the grammar level, by cleverly splitting up and rearranging the rules (without changing the language). This puts a large burden on the grammar writer, changes the structure of the grammar and obscures intentions, changes the parse trees, and changes associativity. I won't go into details here, but this requires the grammar writer to be aware of all left-recursions, however hidden and indirect. And this process gets harder the more rules are involved, I doubt it is even possible to eliminate all left-recursions from multiple interlocking left-recursive cycles like the above.
Luckily, an alternative has been discovered: bounded recursion [1, 2]. It involves two critical concepts:
- A way to control recursion step by step.
- A condition for when to stop recursion.
Let's look at the second concept first.
Medeiros et al. [1] show is that when in left-recursion, the part of the input that can be matched increases with every recursion, but only up to a certain depth. Recursing beyond thiss depth matches a shorter part, before increasing again up to the maximum. So, recursion should stop when the length of the match stops to grow.
Writing things out by hand will clarify, using the recursive rule from before
E <- E '+n' / 'n'
Using superscript for recursion depth, E
n is the nth recursion, which looks like
E
n <- E
n-1 '+n' / 'n'
We can make E
recourse no more than n times if we define E
0 to fail. You then get the following progression:
Let's simplify things a bit, and assume that the recursive rule at the top of this page can be implemented as
ParseTree E(ParseTree p)
{
PareTree result = E(p);
/* ... */
return result;
}
This of course recurses with no end. Now, extend this with keeping track of the position in the input, and keeping a log or memoization table of the results of previous recursions:
ParseTree E(ParseTree p, size_t position)
{
static ParseTree[size_t] results; // Known results for successfully parsed positions
PareTree result = E(p, position);
results[position] = result;
/* ... */
if result.successful
position++;
return result;
}
[1] Sergio Medeiros, Fabio Mascarenhas, Roberto Ieruésalimschy, Left Recursion in Parsing Expression Grammars, Programming Languages, Volume 7554 of the series Lecture Notes in Computer Science pp 27-41, 2012. http://www.inf.puc-rio.br/~roberto/docs/sblp2012.pdf
[2] Nicolas Laurent, Kim Mens, Parsing Expression Grammars Made Practical, Proceedings of the 2015 ACM SIGPLAN International Conference on Software Language Engineering, pp 167-172. http://arxiv.org/pdf/1509.02439v1.pdf