Grammar presets. #26
Grammar examples

I've just found this: http://smlweb.cpsc.ucalgary.ca/example-grammars/ by chasing links starting at the readme of this project. This contains a bunch of grammar examples for various grammar classes.
Yes, those would be good examples. They're already imported for test fixtures: https://github.com/mdaines/grammophone/blob/76612cb618f22fa7aaf9ec53f2ddefb6194f85b2/test/fixtures/example_grammars.js
Technique: Determinization via Unit Production Elimination

Here are two more that I think are interesting because they have practical relevance to e.g. parameter/argument lists that allow optional trailing commas:

This grammar attempts to capture interspersed lists where the separator can optionally occur at the end of the list. The following grammar can be disambiguated by SLR(1): on grammophone
The following grammar wraps the list from above in a separate production rule which makes it not LR(1): on grammophone
A parser generator that naively transforms an "interspersion" construct to one with a level of indirection (like in the second grammar) would generate a grammar that is not LR(1).
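The two grammars above are only reachable through the grammophone links, but the first idea is easy to sketch. The following is an illustrative grammar of my own (not necessarily the linked one), in grammophone-style notation, for a comma-separated list of x items with an optional trailing comma, written so that the trailing separator lives in the recursion's base case:

```
S -> L .
L -> x | x , | x , L .
```

This form is SLR(1): after an x the parser always shifts a comma, and only the next token decides between "another item follows" and "that comma was the trailing one".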
Grammar examples

Here's a gist I found that contains more: https://gist.github.com/lin/dc83bb38eb458ded3ff01aec4a327d54
Testcase: unnecessary LR(1) splits

This grammar demonstrates a source of inefficiency in the automaton of the standard LR(1) construction. Looking at the LR(1) automaton, I think that states 13 & 18 and states 17 & 12 could be merged and the resulting automaton would still resolve the ambiguity that LR(1) was meant to resolve. Only 6 & 9 and 14 & 19 are required to be distinct new states.

This grammar is given as an example in some slides that introduce langcc. However, I'm not sure if langcc eliminates this source of inefficiency. It appears that the act of eliminating this inefficiency is known as "minimizing" an LR automaton.

Edit: the author of the booze-tools project wrote a little something about this grammar: https://boozetools.readthedocs.io/en/latest/minimal.html
Testcase: NQLALR guard

Here are two other interesting grammars: When implementing an LALR parser using the DeRemer and Pennello relation-based algorithm, it is easy to accidentally implement something that is known as NQLALR (Not-Quite-LALR), and apparently, back when this was a hot research topic, many people made this mistake. According to some papers, NQLALR doesn't fit into the SLR -> LALR hierarchy, meaning there are grammars that are SLR and not NQLALR. This observation can be used to debug an LALR implementation.

The following papers present example grammars that can be used to do that: Simple Computation of LALR(1) Lookahead Sets, Manuel Bermudez

An LALR(1) grammar that is non-SLR(1) and non-NQLALR(1).

An SLR(1) grammar that is not NQLALR(1).
Testcase: LALR correctness

The following paper discusses an incorrect oversimplification that one can make while implementing their algorithm: Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello

The paper doesn't discuss this, but I think it is possible to look at the whole algorithm as a single dataflow problem. Their wording implies that two separate SCC runs are necessary. In any case, that grammar might also be valuable in verifying that an implementation is not incorrect in some obvious way. According to the paper, the following grammar should be LALR(1), and an incorrect implementation would report it as not being LALR(1).
Testcase: LALR(2)

The following paper discusses a grammar that is "almost" LALR(1). It has 9 inadequate states under LR(0). 8 of them can be resolved using LALR(1) and there's one left that needs LALR(2).

Practical Arbitrary Lookahead LR Parsing, Manuel E. Bermudez and Karl M. Schimpf
Metric: Includes Theorem

DeRemer & Pennello conjectured that if the "includes" relation of their LALR(1) construction algorithm contains a nontrivial SCC (... this misses the read condition, see the paper), then the grammar can't be LR(k) for any k. This was proven by A Short Proof of a Conjecture of DeRemer and Pennello.

In general, this property is undecidable. This is one of the few heuristics that are known for detecting whether a CFG is not LR(k) for any k.
Transformation: Recursion Scheme Fusion

The PhD thesis by Philippe Charles on LALR(k) parsing and automatic error recovery discusses the following three grammars extensively (Figure 3.1, Page 31). They all model a simple BNF grammar. He uses them to highlight the importance of LALR(k) where k is > 1.

... Example Grammar: LALR(2)...

(a) is LALR(2), (b) is a version of (a) that was converted to LALR(1), and (c) is a version of (a) that requires an explicit separator to be LALR(1) while maintaining the shape of (a).

Note: this comment is very similar to this comment #26 (comment), the transformation from (a) to (b) appears to be the same.
Metric: Read Theorem

In Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello, the authors prove that if there's a nontrivial SCC in the reads relation of their LALR(1) algorithm, then the grammar is not LR(k) for any k. They give 2 grammars as examples. One is ambiguous, the other is not. This is the unambiguous grammar that is not LR(k) for any k. (The other, ambiguous grammar is the one without the f at the end of the A production: on grammophone.)
Transformation: (Right-)CPS

The following grammar is presented in the following paper that was published alongside langcc (cf. page 5 bullet 4):
It is not LR(1), but can be made SLR(1) by the CPS transformation in that paper, which would transform it to the following grammar:
It can also be made SLR(1) by simply inlining S: on grammophone.
Metric: Context splitting only resolves R/R conflicts

In his dissertation on LR parsing (page 94, first bullet, third paragraph), Xin Chen claims that it is widely known that only reduce/reduce conflicts can be resolved by state splitting, and not shift/reduce conflicts. I think this is an interesting fact that wasn't obvious to me, so I wanted to capture it here. It can be observed in the first grammar in this issue and in the grammar mentioned in this comment. The LALR(1) parsing tables contain only reduce/reduce conflicts, so I guess that in an efficient implementation it would make sense to start with state splitting to resolve those reduce/reduce conflicts and not to consider shift/reduce conflicts at all.

Considering that claim, it is clear that the (a) grammar in #26 (comment) can't be determinized with state splitting alone. As the paper claims, either it needs to be transformed to (b), or more lookahead is needed.
Technique: Moving LR(k) to runtime

This LR(2) grammar contains a reduce/reduce conflict in its LR(1) automaton and is used as an example for doing LR(k) at runtime in the dissertation by Xin Chen (page 164).

Testcase: LR(1) to LR(2) inefficient splitting

It is also an example of a source of inefficiency (similar to #26 (comment)) that the canonical LR(1) construction has. LR(1) doesn't help with resolving any inadequate states (LR(0) has 3, LALR(1) resolves two, LR(2) resolves one), but the canonical LR(1) construction still splits one state. An efficient implementation of LR(k) should consider that and not split states that don't help with resolving inadequate states. (TODO: or maybe it needs to split this state in LR(1) so that LR(2) can be effective?)
Transformation: ?

This mentions a yacc-grammar-snippet grammar for a language that is LR(2), but the grammar is not. This is also discussed in Xin Chen's dissertation (here, on page 167). The first link in this comment contains some prose that discusses different ways of dealing with it, for example by rewriting it to an LR(1) grammar such as (b) or (c). After a little bit of cleaning up, we get the following grammar:
TODO: http://www.cs.man.ac.uk/~pjj/complang/howto.html

Consider (b):
Class: LR(closed)/LR(infinite)

Chris Clark suggested a grammar class that he calls LR(closed) here. An example grammar can be found on grammophone. Quote:
It is also mentioned in Xin Chen's dissertation here on page 170.

This is interesting in a different way. Here, an LR split removes a local ambiguity on a 'b', but since it doesn't remove the local ambiguity on 'a', that ambiguity is duplicated with the split. The number of local ambiguities can therefore increase when going from LALR(1) to LR(1) if one is not careful about how one counts them.

Determinization: Convert left recursion to right recursion -> Procedure cloning -> Right-CPS

We can clone A and B to get the following grammar: on grammophone (not LR(1))

We can convert the left recursion to right recursion: on grammophone (not LR(1))

We can then apply CPS to move the right context to the left, into the base case of the recursion: on grammophone (SLR(1))
@modulovalue I decided to enable discussions as an experiment, and created an example grammars section: https://github.com/mdaines/grammophone/discussions/categories/example-grammars

I appreciate you posting these grammars, but I was wondering if discussions might be a better format than a single issue. Let me know what you think; I'm not sure if I like GitHub discussions or not. I still plan to add a few example grammars to Grammophone when I get a chance.
It's not bothersome. Feel free to keep adding them here if you like. Also, if you end up publishing the examples you've collected somewhere else, I'm happy to link to that.
Grammar Example LR(k) for any k

Here's a simple formula for constructing a sample LR(k) grammar for k > 0 (source):
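The formula itself sits behind the link above; as a sketch of one well-known family with this flavor (my own illustration, not necessarily the source's construction), force a reduce decision whose distinguishing token only appears k tokens later:

```
S -> A a a b | B a a c .
A -> x .
B -> x .
```

After shifting x, the parser must decide between reducing to A and reducing to B, but the two alternatives only differ at the third lookahead token (b vs c), so this instance is LR(3) and not LR(2); adding or removing a's before the b/c moves k up or down.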
Metric: Ill-foundedness

Here is an example of an "ill-founded" grammar as defined by the booze-tools project. It seems to be similar to the "unrealizable nonterminals" metric that grammophone exposes. It would be interesting to know whether the "ill-founded" concept adds something new or whether unrealizable nonterminals and ill-foundedness are one and the same thing.
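The booze-tools example is behind the link; for comparison, here is a minimal grammar of my own with an unrealizable nonterminal in the grammophone sense (it may or may not coincide with booze-tools' notion of ill-foundedness):

```
S -> a | A .
A -> b A .
```

A never derives a finite terminal string, since every expansion of A reintroduces A, so any production mentioning A can never complete; that is what grammophone reports as an unrealizable nonterminal.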
Testcase: GLR hidden left recursion termination issues

A naive implementation of GLR would have trouble with certain grammars containing epsilon productions. This is one: on grammophone

This is known as hidden left recursion.
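The linked grammar isn't reproduced above; as an illustrative sketch of the phenomenon (my own, not necessarily the linked grammar), a nonterminal is hidden-left-recursive when the recursive occurrence is preceded only by nullable symbols:

```
S -> A S b | x .
A -> .
```

Since A derives the empty string, S effectively begins with S itself; a GLR implementation that doesn't handle the cycles introduced by such epsilon reductions can fail to terminate while expanding S.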
Tutorial: LR(1)

The LALRPOP project provides an excellent explanation of how to implement LR(1) through the lane tracing method.

First example: on grammophone & the explanation

This blogpost appears to be an extended version of that readme.
Testcase: LR(1) disambiguation related incompleteness

The IELR paper claims:
This is the grammar that they use on grammophone. Notice how grammophone identifies that there are 4 example sentences.
So, apparently, a strategy that uses associativity declarations for removing ambiguities could lead to a different language being recognized.
Transformation: Deduplication

A very simple LR(2) grammar on grammophone with some prose (source) that shows how to rewrite that grammar to LR(0) by removing duplicates. This grammar shows that it might make sense to have some form of duplicate detection that identifies identical production rules.
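The linked grammar isn't reproduced here, but the effect is easy to sketch with a hypothetical grammar of my own (not the one from the source): two nonterminals with identical bodies force a reduce/reduce decision that needs two tokens of lookahead, and merging the duplicates removes the decision entirely.

```
S -> A x y | B x z .
A -> a .
B -> a .
```

After an a, the parser has to choose between reducing to A and reducing to B, and only the second lookahead token (y vs z) settles it, so this is LR(2). Replacing the duplicates A and B with a single nonterminal C (S -> C x y | C x z . C -> a .) leaves nothing to decide at the reduction, and the result is LR(0).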
Grammar Examples

This repo contains A TON of random LR(k) grammars and a bunch of LR(1) grammars in bison syntax.
Class: LR-regular (LRR)

This is an example grammar that is meant to be LRR (LR-regular): on grammophone. LRR is a proper superclass of LR(k). That grammar comes from the following paper that introduces LRR: LR-regular grammars—an extension of LR(k) grammars, Karel Čulik II and Rina Cohen

There's a somewhat common source of ambiguities in LR parsers that many people point out, where infinite lookahead would be needed to remove all inadequate states, so those grammars can't be LR(k) for any k. I'm wondering whether LRR could help with that. I'm also wondering about the relationship between Chris Clark's LR(closed) grammars and LRR. It would be nice to find some minimal LRR grammars so that their automata can be inspected more easily.

I'm also wondering if LRR could be generalized to something like "LR-CF" where the lookahead sets themselves are not just regular, but descriptions of a context-free language. Or maybe at least visibly pushdown languages (https://github.com/ianh/owl), which support a limited form of recursion, which regular languages don't, and have desirable properties that context-free languages don't have, but regular languages have.

Leo optimizes Earley-style parsers to take linear time for LRR grammars. One parser that claims to be able to parse LRR grammars is Marpa: A practical general parser: the recognizer, Jeffrey Kegler.
Class: ECFGs

This paper by Kristensen and Madsen proposes to extend the LR-theory from CFGs to ECFGs. It argues that by doing that, many conflicts can be avoided. (See page 8 for reference):

star-operator

The following grammar:
can be desugared into a left-recursive version (on grammophone) or a right-recursive version (on grammophone), both of which are not LR(1). The LR(1) automaton of the left-recursive version contains an r/r conflict where both productions are epsilon productions in one inadequate state. The LR(1) automaton of the right-recursive version contains 2 of the same r/r conflicts, but with an additional shift conflict for each, in two inadequate states. If we canonicalize

Left-recursive definitions in bottom-up parsers are more efficient because they don't need O(n) stack space. However, something that is not clear to me is the question of which kinds of inadequate states can benefit from a right-recursive definition. Right-recursive definitions don't appear to be strictly worse (excluding performance). From a performance perspective, left-recursive definitions are superior. But left-recursive definitions can introduce inadequate states that right-recursive definitions can't introduce (#26 (comment)).

On page 9, Kristensen and Madsen show an alternative desugaring that is SLR(1). (There's a typo there, the last production in both grammars should be left-recursive: on grammophone) But even there, the left-recursive version is LR(0), and the right-recursive version isn't. So it might make sense to prefer the left-recursive definition there (ignoring any performance benefits) because the increased expressivity might propagate to fewer inadequate states in more complex grammars?

If we apply the CPS idea from langcc, we can make the right-recursive version LR(0): on grammophone

or-operator

Consider page 9 and page 33: A naive desugaring produces a non-LR(1) grammar: on grammophone. If we canonicalize, we get an LR(0) grammar on grammophone, but by doing that we would lose meaning (e.g., the semantic actions would have to be the same, so an automatic transformation is not practical here). If we apply the automatic transformation, we can keep meaning and have an LR(0) grammar: on grammophone

Kristensen and Madsen don't seem to provide a method for applying their transformation automatically to ECFGs, but 50 years later, langcc did that, although the author doesn't mention that in the paper.

I think the key question here is whether we could benefit from extending the definitions of nullsets/firstsets/(slr) followsets/(lalr) lookaheadsets to ECFGs and introducing new actions in LR automata to support + (and interspersed +), * (and interspersed *), ? and | explicitly as first-class operations without having to desugar them to a mutilated CFG. (Regular expression to NFA to DFA conversions already need to implicitly consider such extended definitions of nullsets/firstsets/followsets, so this shouldn't be anything new.)

Applying any such transformations makes it much harder to keep a sane parse tree or do incremental parsing and, in general, it just complicates things by a lot. I haven't seen a single project that implements what the paper says, i.e., extend the LR-theory to support ECFGs as a first-class specification. Maybe by doing that we can have all the performance benefits of left-recursive definitions, the hypothetical additional expressiveness of right-recursive definitions, support incremental parsing, and have sane parse trees without having to do anything really complicated.

(Note: it is not straightforward to restore ASTs from regular expressions, so I guess it won't be straightforward to restore parse trees from first-class ECFGs.
https://github.com/ianh/owl shows that the parse tree problem for regular expressions can be solved.)
Technique: Left vs Right Recursion

What follows is a minimal example of what that post says: If we want to parse a list of comma separated values with an optional trailing comma, we can:
Furthermore, we can:
Note: A right recursive definition is not enough if the optional trailing separator is not in the base case of the right recursive definition: (on grammophone => not LR(1)). Joe's CPS transformation achieves precisely that, that is, it moves the optional trailing separator into the base case. However, we can have an LR(1) left-recursive definition of that language: on grammophone, so right recursion is not strictly necessary. To conclude, a quote from the linked blogpost:
And to be complete, here's the example where a left recursive definition is superior:
Observation: followset differences between left and right recursion

Consider right-recursive on grammophone & left-recursive on grammophone

Notice how the

Note:
Transformation: (Left-)CPS

The CPS technique (#26 (comment)) introduces a transformation that is able to move the right remainder of a rule into a nonterminal to its left. However, it would also make sense to do the same with left remainders, and move them into nonterminals to the right. This is a comma separated list that accepts an optional trailing comma: on grammophone.

Left-CPS can only be applied to lists if they are left recursive, and Right-CPS only to right recursive lists. (Why? Because lists end on the left if they are left recursive and on the right if they are right recursive.)
Transformation: Pete Jinks

See: http://www.cs.man.ac.uk/~pjj/complang/heuristics.html

That page contains a bunch of equalities that can be used to rewrite grammars. I'm wondering if it would be possible to find more by, e.g., formalizing CFGs (ECFGs?) and their algebraic rules as an algebra and putting that through, e.g., the Knuth-Bendix completion procedure (and maybe we can even find a confluent term rewriting system)?

Some transformations are not obvious to me, especially the self referential ones. Can an axiomatic system be defined to make their correctness more obvious by deriving them from basic algebraic manipulations?

PJ 4.12
Example: CPS-eligibility criterion for *

Here's another example of a grammar (

Before: on grammophone (not LR(1))

After: on grammophone (SLR(1))

Maybe this observation could be translated into a CPS-eligibility criterion: let the grammar be
Transformation: Procedure Cloning

GFGs (#27) introduce the idea of applying a traditional program optimization technique known as procedure cloning to CFGs. Consider the following grammar (for this regular language:

We can make the grammar SLR(1) by using a left-recursive list: on grammophone.
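The list example itself is behind the links; as a generic illustration of the cloning idea (my own sketch, unrelated to the list grammar), a shared nonterminal can pick up follow tokens from one use site that poison another:

```
S -> X d | b X a | b c d .
X -> c .
```

Here FOLLOW(X) = {d, a}, so after "b c" an SLR(1) parser sees a shift of d (toward b c d) and a reduce of X -> c on d: a shift/reduce conflict. Cloning X per use site (S -> X1 d | b X2 a | b c d . X1 -> c . X2 -> c .) splits the follow sets, X2 is only ever followed by a, and the conflict disappears, making the cloned grammar SLR(1).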
Example: Need for LL(k) where k > 1.

Terence Parr and Russell Quong discuss the need for LL(k) & LR(k) where k > 1 in
The next example, example 2, discusses the issue from the following comment in this issue: #26 (comment). The other examples don't appear to add anything new to this issue. A lot of that article discusses how actions can reduce the expressivity of LR parsers to that of an LL parser, but that's a big can of worms that I'm not going to try to investigate in this issue.
Transformation: Extended CPS

The CPS paper appears to only consider basic EBNF operations such as * and + in its flattening scheme and CPS transformation. We can go further. One might need to have multiple base cases (such as when we want to parse lists with interspersed commas and optional trailing commas, see: #26 (comment)). The paper only considers those with one base case. We can extend CPS to flattening procedures that introduce multiple base cases. I'm going to call this Extended CPS (ECPS).

Here's an example of a basic ECPS application with multiple base cases: That simple grammar is a snippet from the Dart grammar that describes a reduced version of function expressions and record patterns. Notice how the version without CPS is not LR(1), but the version with CPS is LR(0).

The CPS paper describes a very basic form of detecting which nonterminals are eligible for CPS. It can only be applied to EBNF symbols that were flattened before. I think we can do better. A better CPS transformation should not have to depend on the flattening scheme and should be able to find CPS-eligible symbols by examining the structure of the CFG, for example by looking for cycles in the CFG (maybe it's enough to consider SCCs (via Tarjan) that consist of a single fundamental cycle (via Johnson) with a single entry point to determine CPS eligibility; if there are multiple entry points, procedure cloning could help). If the recursion tails on the right, it should be eligible for left-CPS, and if it tails on the left, it should be eligible for right-CPS.

(Also, see: jzimmerman/langcc#48 (comment) for the same idea but for optionals)
This link contains a tutorial from Cornell on how to make a Java grammar LALR(1). There might be something interesting there. TODO: implement the examples in grammophone
This link contains a tutorial on LL(k) from one of the only LL(k) parser generators (SLK). I'm not focusing on LL parsing, and most of the stuff here is covering LR-parsing, but that page is very detailed so it deserves to be mentioned. TODO: implement the examples in grammophone
Example: "star lookahead + 1"Consider: on grammophone
(From: A Methodology for Removing LALR(k) Conflicts, right after Figure 1) This is not LALR(k) because it needs an unbounded amount of lookahead to resolve whether to reduce B1 or B2. However, the language itself is regular. I don't know of any strategies for dealing with this. It would be interesting to find a way for detecting conflicts of such kind and dealing with them.
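The paper's grammar is only behind the link above; a grammar of the same general shape (my own sketch, not the one from the paper) looks like this:

```
S -> B1 T c | B2 T d .
B1 -> b .
B2 -> b .
T -> a T | .
```

After the initial b, the parser must reduce to either B1 or B2, but the distinguishing token (c vs d) can be arbitrarily many a's away, so no fixed k suffices; the language itself is just the regular set b a* (c | d).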
SLR(1) / LALR(1)

Here's a simple grammar that is LALR(1) but not SLR(1).
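One well-known grammar with this property (possibly different from the one originally posted) is the "pointer assignment" grammar from the Dragon Book:

```
S -> L = R | R .
L -> * R | id .
R -> L .
```

In the state reached after seeing an L, SLR(1) must choose between shifting = (toward S -> L = R) and reducing R -> L; because = is in FOLLOW(R), that is a shift/reduce conflict. The LALR(1) lookahead for that reduction is only the end marker, so LALR(1) has no conflict.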
This paper contains a simple example where an LALR(2) grammar has been manually converted to LALR(1): Suffix languages in LR parsing - Benjamin R. Seyfarth & Manuel E. Bermudez

LALR(2) from Figure 1:
LALR(1) from Figure 2:
Here's an example of an LL(2) grammar:
It comes from here: Lelek, an LL(k) Parser Generator: Solving the Rule Decision Problem. There's also more written about that example there.
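The original example is described at the link; a minimal stand-in (my own, not Lelek's) that needs exactly two tokens of lookahead:

```
S -> a b S | a c .
```

Both alternatives begin with a, so one token of lookahead cannot choose between them, but the two-token prefixes ab and ac are disjoint, so an LL(2) parser can.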
In Parsing Techniques - Second Edition, Fig. 9.46, an LAR(m) grammar is presented: on grammophone.
It needs regular lookahead to be made deterministic. One conflict is resolved by

This grammar is from Fig. 9.51 and it contains a right recursive version of the previous grammar.
This grammar needs its stack to be trimmed and it is properly LAR(1); the previous grammar does not need its stack to be trimmed, but it is still LAR(1). TODO: example for LAR(>1)
Here's an unambiguous non-deterministic grammar given as an example in Fig. 9.13 in Parsing Techniques - Second Edition.
That book states that no LR-based method or extension of it (... like LAR(m)) is able to parse it deterministically. However, I'm not entirely convinced of that, and that statement doesn't appear to come with a proof. A proof or a counterexample would be very nice.

Edit: there are some discussions here https://groups.google.com/g/comp.compilers/c/8b5EWs1pREM/m/Cc3AeS_pZDQJ that discuss this grammar.
Hello @mdaines,
(I hope you don't mind another issue)
I think it would be awesome if there were a way to load sample grammars (and perhaps make it easy to submit them via PRs?). I'm thinking of something like the "Load Sample" feature in edotor, but with a short description of why the grammar is considered to be interesting.
"Mysterious conflict"
Here's one grammar that I think is interesting (source):
on grammophone
It demonstrates a grammar that is LR(1), but not LALR(1): merging the LR(1) states introduces a reduce/reduce conflict. The conflict is known as a "mysterious conflict".
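The grammar behind the link may differ, but the standard textbook instance of this situation (from the Dragon Book) is:

```
S -> a A d | b B d | a B e | b A e .
A -> c .
B -> c .
```

Canonical LR(1) keeps the two states that reduce c apart: after "a c" the lookahead for A -> c is d and for B -> c is e, and after "b c" it is the other way around. LALR(1) merges the two states because they share a core, the lookahead sets then overlap, and a reduce/reduce conflict appears even though the grammar is unambiguous, which is what makes it look mysterious.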