
Traditional PEG does not support left-recursion in grammars. Pegged does support all sorts of left-recursion since v0.3.0, so you don't need to worry about it. The remainder of this page will explain what left-recursion is, and how Pegged detects and deals with it.

Direct Left-Recursion

Suppose we wanted to match strings like n, n+n, n+n+n, etc. with a grammar like

      Left:
        S <- E eoi
        E <- E '+n' / 'n'

Because the first element of the rule E is E itself, a naive implementation would construct a parser function whose first instruction calls itself with the same input, recursing immediately without consuming anything and without ever returning, until it runs out of stack space and crashes.

Simply reordering the options does not help either: the rule would then always match only the first n, and no attempt would be made to match the rest of the input.
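In Pegged, the grammar above works as written. Assuming the parser has been generated into a module named left (see the Caveat section below for how and why), a minimal check might look like this; the module name is illustrative:

import left;  // hypothetical module generated from the Left grammar

void main()
{
    assert(Left("n+n+n").successful);  // the whole input is matched
    assert(!Left("n+").successful);    // eoi is not reached after the first n
}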

Hidden Left-Recursion

A more subtle form of the same problem arises when the recursive element is hidden behind other elements that can succeed without consuming input, such as an optional:

        E <- F? E '+n' / 'n'

Even though F is tried first, F? does nothing if F doesn't match, and in that case execution proceeds to call E on the same input and thereby falls into the same trap.

Hidden left-recursion can be obscured even further. Consider this:

        E <- F E '+n' / 'n'
        F <- A B C / D*

Because D* matches the empty string, F succeeds without consuming input, and hides left-recursion in E. You can imagine that there is no end to how deep left-recursion can be hidden in your grammar.
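Detecting whether an element can succeed without consuming input (whether it is nullable) is what makes hidden left-recursion detectable mechanically. The following is a rough sketch of such a nullability test over a simplified expression tree; the types and names are illustrative and not Pegged's actual introspection code (predicates and other operators are omitted):

enum Op { literal, sequence, choice, zeroOrMore, optional, rule }

struct Expr
{
    Op op;
    string name;       // rule name, or the text of a literal
    Expr*[] children;
}

// Can e succeed on some input without consuming anything?
// ruleNullable holds the (pre-computed) answer for named rules.
bool nullable(const Expr* e, bool[string] ruleNullable)
{
    final switch (e.op)
    {
        case Op.literal:    return e.name.length == 0;  // only '' is nullable
        case Op.zeroOrMore: return true;                // D* matches the empty string
        case Op.optional:   return true;                // F? matches the empty string
        case Op.sequence:                               // A B C: every element must be nullable
            foreach (c; e.children)
                if (!nullable(c, ruleNullable)) return false;
            return true;
        case Op.choice:                                 // A / B: some alternative must be nullable
            foreach (c; e.children)
                if (nullable(c, ruleNullable)) return true;
            return false;
        case Op.rule:       return ruleNullable.get(e.name, false);
    }
}

With this, the leftmost reachable elements of a rule are not just its first element, but every element up to and including the first non-nullable one. In the examples above, F? and D* are nullable, which is exactly what puts E in the leftmost position of its own first alternative.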

Indirect Left-Recursion

We've seen that direct left-recursion happens when E calls into itself on the same input. Indirect left-recursion happens when E, as its first step, calls any number of other rules that at one or more points call back into E with the same input. For example:

        E <- F '+n' / 'n'
        F <- G H / J
        J <- K / E L

This produces the left-recursive cycle E <- F <- J <- E on any input that matches neither G H nor K. Indirect left-recursion can also be hidden, of course.
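Once the leftmost callees of every rule are known (taking nullability into account, as sketched above), finding indirect left-recursion is a cycle search in that call graph. A hedged sketch, with illustrative names:

// Edges from each rule to the rules it can invoke at its leftmost position,
// e.g. leftCallees["E"] = ["F"], leftCallees["F"] = ["G", "J"], leftCallees["J"] = ["K", "E"].
bool leftRecursive(string start, string[][string] leftCallees)
{
    bool[string] seen;
    bool dfs(string rule)
    {
        if (rule in seen) return false;  // already explored via another path
        seen[rule] = true;
        foreach (callee; leftCallees.get(rule, null))
        {
            if (callee == start) return true;  // back at the start: a cycle
            if (dfs(callee)) return true;
        }
        return false;
    }
    return dfs(start);
}

On the example above, leftRecursive("E", ...) finds the path E to F to J and back to E, and reports the cycle.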

Interlocking Cycles

Things can get more complicated still. On inputs like nlm-n+(aaa)n, this grammar exercises several interlocking left-recursive cycles:

      Left:
        S <- E eoi
        E <- F 'n' / 'n'
        F <- E '+' I* / G '-'
        G <- H 'm' / E
        H <- G 'l'
        I <- '(' A+ ')'
        A <- 'a'
  1. E <- F <- E
  2. G <- H <- G
  3. E <- F <- G <- E

This means that there is more than one way to left-recurse at the same input position.

Another example: two rules that are mutually left-recursive.

      Left:
        M <- L eoi
        L <- P '.x' / 'x'
        P <- P '(n)' / L
  1. L <- P <- L
  2. P <- P

Bounded Left-Recursion

The traditional approach to dealing with left-recursion is at the grammar level: by cleverly splitting up and rearranging the rules (without changing the language). This puts a large burden on the grammar writer, changes the structure of the grammar, obscures its intentions, changes the parse trees, and changes associativity. I won't go into details here, but it requires the grammar writer to be aware of all left-recursions, however hidden and indirect, and the process gets harder the more rules are involved. I doubt it is even possible to eliminate all left-recursion from multiple interlocking left-recursive cycles like the ones above.
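For the direct case the rewrite itself is simple enough. The rule E <- E '+n' / 'n' can, for example, be replaced by the repetition

        E <- 'n' ('+n')*

which matches the same language, but produces a flat parse tree instead of a left-leaning one; the left-associativity of the original is lost.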

Luckily, an alternative has been discovered: bounded recursion [1, 2]. It involves two critical concepts:

  1. A way to control recursion step by step.
  2. A condition for when to stop recursion.

Let's look at the second concept first.

Stopping Condition

Medeiros et al. [1] show that, when in left-recursion, the part of the input that can be matched grows with every increase in recursion depth, but only up to a certain depth. Recursing beyond that depth matches a shorter part again, before growing back up to the maximum. So recursion should stop as soon as the length of the match stops growing.

Writing things out by hand makes this clear. Take the recursive rule from before:

E <- E '+n' / 'n'

Let's indicate recursion depth between parentheses, where E(i) is the ith recursion, so

E(i) <- E(i-1) '+n' / 'n'

We can make E recurse no more than i times if we define E(0) to fail. You then get the following progression:

E(0) <- fail

E(1) <- E(0) '+n' / 'n' = 'n'

E(2) <- E(1) '+n' / 'n' = 'n' '+n' / 'n'

E(3) <- E(2) '+n' / 'n' = ('n' '+n' / 'n') '+n' / 'n'

Etcetera.

When applied to an input string of n+n, E(1) will only match the first n, E(2) will match the complete input n+n, but E(3) will again only match the first n. So in this case recursion should stop after E(2). You can continue this yourself and you'll see that on longer inputs like n+n+n an extra +n is matched for each recursion, until again only the first n can be matched when the optimum recursion depth is exceeded.
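This progression is easy to check programmatically. Here is a toy matcher for this single rule with an explicit recursion bound; it is illustrative only and not how Pegged implements this:

import std.algorithm.searching : startsWith;

// Length of the longest prefix of input matched by E(depth), where
// E(0) fails and E(i) <- E(i-1) '+n' / 'n'. Returns -1 on failure.
int matchE(string input, int depth)
{
    if (depth == 0) return -1;                   // E(0) <- fail
    int len = matchE(input, depth - 1);          // try E(depth-1) '+n' first
    if (len >= 0 && input[len .. $].startsWith("+n"))
        return len + 2;
    return input.startsWith("n") ? 1 : -1;       // fall back to 'n'
}

unittest
{
    assert(matchE("n+n", 1) == 1);  // E(1) matches only the first n
    assert(matchE("n+n", 2) == 3);  // E(2) matches all of n+n
    assert(matchE("n+n", 3) == 1);  // E(3) matches only the first n again
}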

Recursion Control

Let's simplify things a bit, and assume that the recursive rule above can be implemented (pseudo code) as

ParseTree E(ParseTree p)
{
    ParseTree result = E(p) ~ "+n" || "n";
    return result;
}

This, of course, recurses without end. Now, closely following [2], instead of letting the recursive call run free, we allow only one level of recursion by means of a check on global data, and implement the stepwise increase of the bound as a controlled while loop:

ParseTree E(ParseTree p)
{
    static ParseTree prev = none;             // No recursion has happened.
    if (prev != none)                         // We are recursing. Don't evaluate E anew
        return prev;                          //     return the memoized result instead.
    ParseTree current = fail;                 // E(0).
    prev = current;                           // Stop on next recursive call. 
    while (true)                              // Controlled loop, try E with increased
    {                                         //      recursion bound.
        ParseTree result = E(p) ~ "+n" || "n"; // This is E(n) <- E(n-1) "+n" / "n"
        if (result.length > current.length)   // The match length is growing, continue.
        {
            prev = result;                    // Memoize E(n-1) for when we recurse.
            current = result;
        }
        else                                  // Optimum bound exceeded, current is
        {                                     //       the best match.
            prev = none;                      // Clean up.
            return current;                   // Done.
        }
    }
}

You should be able to see how this implements the manual progression that we did above:

  1. When E is evaluated for the first time, E(0) (prev) is set to fail before we get to the while loop.
  2. In the while loop, E(1) (result) is produced. The recursive call to E returns E(0) immediately. So E(1) = "n".
  3. The match length of "n" is longer than the match length of fail which is 0, so E(1) is memoized and the loop continues.
  4. Round two in the loop produces E(2) (result). Again, the recursive call to E returns E(1) immediately. So E(2) = "n+n" || "n".
  5. E(2) matches the input "n+n", and its length is longer than that of "n", so E(2) is memoized and the loop continues.
  6. Round three in the loop produces E(3). E(2) is obtained immediately, so E(3) = ("n+n" || "n") "+n" || "n".
  7. Although E(3) matches the input with "n", its match is not longer than that of E(2). Therefore E(2) is returned, which ends the loop.

Note that the function E is not actually called recursively i times when it is evaluated with a recursion bound of i. The bound is increased step by step, while the result of the previous step is kept readily available. So complexity is linear.

Implementation

In Pegged the principle above is extended in two ways:

  1. prev is replaced by an associative array so that E can be evaluated at various positions in the input.
  2. Ordinary memoization of left-recursive rules is temporarily blocked, so they don't interfere with the recursive progression.

It is important to note that this controlled left-recursion is not implemented in every rule: only one rule in each left-recursive cycle needs it. The module pegged/introspection.d is used prior to parser construction to analyse the grammar, detect left-recursive cycles, identify the rules that will get a bounded left-recursion implementation, and identify the rules for which memoization needs to be blocked during left-recursion. All of this happens by default, without the user noticing.
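In the same pseudo code as before, the first extension amounts to replacing the single prev slot with a table indexed by the position at which E is being evaluated. This is a sketch of the idea, not Pegged's actual generated code:

ParseTree E(ParseTree p)
{
    static ParseTree[size_t] prev;            // One memo slot per input position.
    size_t pos = p.end;                       // The position where this call starts.
    if (auto cached = pos in prev)            // We are recursing at this position:
        return *cached;                       //     return the memoized result.
    ParseTree current = fail;                 // E(0) at this position.
    prev[pos] = current;
    while (true)
    {
        ParseTree result = E(p) ~ "+n" || "n";
        if (result.length > current.length)   // Still growing, continue.
        {
            prev[pos] = result;
            current = result;
        }
        else
        {
            prev.remove(pos);                 // Clean up this position only.
            return current;                   // Done.
        }
    }
}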

Caveat

Currently, memoization is not supported at compile time. For the same reason, compile-time parser construction is not supported if the grammar is left-recursive. A warning will be issued in that case, recommending that the parser be generated asModule, which is more efficient most of the time anyway.
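The generation step might look like this; the module and file names are illustrative:

// generate.d -- run this once, at run time, to write the parser module left.d.
import pegged.grammar;

void main()
{
    asModule("left", "left", `
Left:
    S <- E eoi
    E <- E '+n' / 'n'
`);
}

The program that does the parsing then simply imports the generated module, as in the example near the top of this page.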

References

[1] Sérgio Medeiros, Fabio Mascarenhas, Roberto Ierusalimschy, Left Recursion in Parsing Expression Grammars, Programming Languages (SBLP 2012), Lecture Notes in Computer Science, Volume 7554, pp. 27-41, 2012. http://www.inf.puc-rio.br/~roberto/docs/sblp2012.pdf

[2] Nicolas Laurent, Kim Mens, Parsing Expression Grammars Made Practical, Proceedings of the 2015 ACM SIGPLAN International Conference on Software Language Engineering, pp. 167-172, 2015. http://arxiv.org/pdf/1509.02439v1.pdf
