Skip to content
Bastiaan Veelo edited this page Mar 1, 2017 · 25 revisions

Traditional PEG does not support left-recursion in grammars. Pegged does support all sorts of left-recursion since v0.3.0, so you don't need to worry about it. The remainder of this page will explain what left-recursion is, and how Pegged detects and deals with it.

Direct Left-Recursion

Suppose we wanted to match strings like n, n+n, n+n+n, etc. with a grammar like

      Left:
        S <- E eoi
        E <- E '+n' / 'n'

Because the first element of the rule E is itself E, a naive implementation would construct a parser function that would call itself as the first instruction with the same input, thereby recursing immediately without doing anything and without ever to return, until you run out of stack space and crash.

Just changing the order of options is not an option, because the first option would then always only match the first n, and no attempts would be made to match the rest of the input.

Hidden Left-Recursion

A more subtle form of the same problem is that the recursive element is hidden behind other elements that can succeed without consuming input, like an optional:

        E <- F? E '+n' / 'n'

Even though F is tried first, F? does nothing if F doesn't match, and in that case execution proceeds to call E on the same input and thereby falls into the same trap.

Hidden left-recursion can even be more obscured. Consider this:

        E <- F E '+n' / 'n'
        F <- A B C / D*

Because D* matches the empty string, F succeeds without consuming input, and hides left-recursion in E. You can imagine that there is no end to how deep left-recursion can be hidden in your grammar.

Indirect Left-Recursion

We've seen that direct left-recursion happens when E calls into itself on the same input. Indirect left-recursion happens when E as its first step calls any number of other rules, that at one or more points call back into E with the same input. For example:

        E <- F '+n' / 'n'
        F <- G H / J
        J <- K / E L

This causes this left-recursive cycle: E <- F <- J <- E on an input that doesn't match G H and K.

Interlocking cycles

It can yet be more complicated. For inputs like nlm-n+(aaa)n, this grammar has several interlocking left-recursive cycles:

      Left:
        S <- E eoi
        E <- F 'n' / 'n'
        F <- E '+' I* / G '-'
        G <- H 'm' / E
        H <- G 'l'
        I <- '(' A+ ')'
        A <- 'a'
  1. E <- F <- E
  2. G <- H <- G
  3. E <- F <- G <- E

Meaning that there is more than one way to left-recurse at the same input.

Another example: two rules that are mutually left-recursive.

      Left:
        M <- L eoi
        L <- P '.x' / 'x'
        P <- P '(n)' / L
  1. L <- P <- L
  2. P <- P

Bounded Left-Recursion

The traditional approach to deal with left-recursion is at the grammar level, by cleverly splitting up and rearranging the rules (without changing the language). This puts a large burden on the grammar writer, changes the structure of the grammar and obscures intentions, changes the parse trees, and changes associativity. I won't go into details here, but this requires the grammar writer to be aware of all left-recursions, however hidden and indirect. And this process gets harder the more rules are involved, I doubt it is even possible to eliminate all left-recursions from multiple interlocking left-recursive cycles like the above.

Luckily, an alternative has been discovered: bounded recursion [1, 2]. It involves two critical concepts:

  1. A way to control recursion step by step.
  2. A condition for when to stop recursion.

Let's look at the second concept first.

Stopping Condition

Medeiros et al. [1] show is that when in left-recursion, the part of the input that can be matched increases with every recursion, but only up to a certain depth. Recursing beyond this depth matches a shorter part, before increasing again up to the maximum. So, recursion should stop when the length of the match stops to grow.

Writing things out by hand will clarify, using the recursive rule from before

E <- E '+n' / 'n'

Let's indicate recursion depth between parentheses, where E(i) is the ith recursion, so

E(i) <- E(i-1) '+n' / 'n'

We can make E recourse no more than i times if we define E(0) to fail. You then get the following progression:

E(0) <- fail

E(1) <- E(0) '+n' / 'n' = 'n'

E(2) <- E(1) '+n' / 'n' = 'n' '+n' / 'n'

E(3) <- E(2) '+n' / 'n' = ('n' '+n' / 'n') '+n' / 'n'

Etcetera.

When applied to an input string of n+n, E(1) will only match the first n, E(2) will match the complete input n+n, but E(3) will again only match the first n. So in this case recursion should stop after E(2). You can continue this yourself and you'll see that on longer inputs like n+n+n an extra +n is matched for each recursion, until again only the first n can be matched when the optimum recursion depth is exceeded.

Recursion Control

Let's simplify things a bit, and assume that the recursive rule above can be implemented (pseudo code) as

ParseTree E(ParseTree p)
{
    PareTree result = E(p) ~ "+n" || "n";
    return result;
}

This of course recurses with no end. Now, instead of a recursive call loop, allow only one recursive call by means of global data and implement the loop as a controlled while loop:

ParseTree E(ParseTree p)
{
    static ParseTree prev = none;             // No recursion has happened.
    if (prev != none)                         // We are recursing. Instead of evaluating E anew,
        return prev;                          //     return the memoized result.
    ParseTree current = fail;                 // E(0).
    prev = current;                           // Stop on next recursive call. 
    while (true)                              // Controlled loop, try E with increasing recursion bound.
    {
        PareTree result = E(p) ~ "+n" || "n"; // This is E(n) <- E(n-1) "+n" / "n"
        if (result.length > current.length)   // The match length is growing, continue the loop.
        {
            prev = result;                    // Memoize the result, this is E(n-1) when we recurse.
            current = result;
        }
        else                                  // Optimum bound exceeded, current is the best match.
        {
            prev = none;                      // Clean up.
            return current;                   // Done.
        }
    }
}

References

[1] Sergio Medeiros, Fabio Mascarenhas, Roberto Ieruésalimschy, Left Recursion in Parsing Expression Grammars, Programming Languages, Volume 7554 of the series Lecture Notes in Computer Science pp 27-41, 2012. http://www.inf.puc-rio.br/~roberto/docs/sblp2012.pdf

[2] Nicolas Laurent, Kim Mens, Parsing Expression Grammars Made Practical, Proceedings of the 2015 ACM SIGPLAN International Conference on Software Language Engineering, pp 167-172. http://arxiv.org/pdf/1509.02439v1.pdf

Clone this wiki locally