Grammar presets. #26
Grammar examples

I've just found this: http://smlweb.cpsc.ucalgary.ca/example-grammars/ by chasing links starting at the readme of this project. This contains a bunch of grammar examples for various grammar classes.
Yes, those would be good examples. They're already imported for test fixtures: https://github.com/mdaines/grammophone/blob/76612cb618f22fa7aaf9ec53f2ddefb6194f85b2/test/fixtures/example_grammars.js
Technique: Determinization via Unit Production Elimination

Here are two more that I think are interesting because they have practical relevance to e.g. parameter/argument lists that allow optional trailing commas:

This grammar attempts to capture interspersed lists where the separator can optionally occur at the end of the list. The following grammar can be disambiguated by SLR(1): on grammophone
The following grammar wraps the list from above in a separate production rule which makes it not LR(1): on grammophone
A parser generator that naively transforms an "interspersion" construct to one with a level of indirection (like in the second grammar) would generate a grammar that is not LR(1).
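The two grammars above are only reachable through the grammophone links, but the first idea is easy to sketch. The following is an illustrative grammar of my own (not necessarily the linked one), in grammophone-style notation, for a comma-separated list of x items with an optional trailing comma, written so that the trailing separator lives in the recursion's base case:

```
S -> L .
L -> x | x , | x , L .
```

This form is SLR(1): after an x the parser always shifts a comma, and only the next token decides between "another item follows" and "that comma was the trailing one".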
Grammar examples

Here's a gist I found that contains more: https://gist.github.com/lin/dc83bb38eb458ded3ff01aec4a327d54
Testcase: unnecessary LR(1) splits

This grammar demonstrates a source of inefficiency in the automaton of the standard LR(1) construction. Looking at the LR(1) automaton, I think that states 13 & 18 and states 17 & 12 could be merged and the resulting automaton would still resolve the ambiguity that LR(1) was meant to resolve. Only 6 & 9 and 14 & 19 are required to be distinct new states.

This grammar is given as an example in some slides that introduce langcc. However, I'm not sure if langcc eliminates this source of inefficiency. It appears that the act of eliminating this inefficiency is known as "minimizing" an LR automaton.

Edit: the author of the booze-tools project wrote a little something about this grammar: https://boozetools.readthedocs.io/en/latest/minimal.html
Testcase: NQLALR guard

Here are two other interesting grammars: When implementing an LALR parser using the DeRemer and Pennello relation-based algorithm, it is easy to accidentally implement something that is known as NQLALR (Not-Quite-LALR), and apparently, back when this was a hot research topic, many people made this mistake. According to some papers, NQLALR doesn't fit into the SLR -> LALR hierarchy, meaning there are grammars that are SLR and not NQLALR. This observation can be used to debug an LALR implementation.

The following papers present example grammars that can be used to do that: Simple Computation of LALR(1) Lookahead Sets, Manuel Bermudez

An LALR(1) grammar that is non-SLR(1) and non-NQLALR(1).

An SLR(1) grammar that is not NQLALR(1).
Testcase: LALR correctness

The following paper discusses an incorrect oversimplification that one can make while implementing their algorithm: Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello

The paper doesn't discuss this, but I think it is possible to look at the whole algorithm as a single dataflow problem. Their wording implies that two separate SCC runs are necessary. In any case, that grammar might also be valuable in verifying that an implementation is not incorrect in some obvious way. According to the paper, the following grammar should be LALR(1), and an incorrect implementation would report it as not being LALR(1).
Testcase: LALR(2)

The following paper discusses a grammar that is "almost" LALR(1). It has 9 inadequate states under LR(0). 8 of them can be resolved using LALR(1) and there's one left that needs LALR(2).

Practical Arbitrary Lookahead LR Parsing, Manuel E. Bermudez and Karl M. Schimpf
Metric: Includes Theorem

DeRemer & Pennello conjectured that if the "includes" relation of their LALR(1) construction algorithm contains a nontrivial SCC (... this misses the read condition, see the paper), then the grammar can't be LR(k) for any k. This was proven by A Short Proof of a Conjecture of DeRemer and Pennello.

In general, this property is undecidable. This is one of the few heuristics that are known for detecting whether a CFG is not LR(k) for any k.
Transformation: Recursion Scheme Fusion

The PhD thesis by Philippe Charles on LALR(k) parsing and automatic error recovery discusses the following three grammars extensively (Figure 3.1, Page 31). They all model a simple BNF grammar. He uses them to highlight the importance of LALR(k) where k is > 1.

... Example Grammar: LALR(2)...

(a) is LALR(2), (b) is a version of (a) that was converted to LALR(1), and (c) is a version of (a) that requires an explicit separator to be LALR(1) while maintaining the shape of (a).

Note: this comment is very similar to this comment #26 (comment), the transformation from (a) to (b) appears to be the same.
Metric: Read Theorem

In Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello, the authors prove that if there's a nontrivial SCC in the reads relation of their LALR(1) algorithm, then the grammar is not LR(k) for any k. They give 2 grammars as examples. One is ambiguous, the other is not. This is the unambiguous grammar that is not LR(k) for any k. (The other, ambiguous grammar is the one without the f at the end of the A production: on grammophone.)
Transformation: (Right-)CPS

The following grammar is presented in the following paper that was published alongside langcc (cf. page 5 bullet 4):
It is not LR(1), but can be made SLR(1) by the CPS transformation in that paper, which would transform it to the following grammar:
It can also be made SLR(1) by simply inlining S: on grammophone.
Metric: Context splitting only resolves R/R conflicts

In his dissertation on LR parsing (page 94, first bullet, third paragraph), Xin Chen claims that it is widely known that only reduce/reduce conflicts can be resolved by state splitting, and not shift/reduce conflicts. I think this is an interesting fact that wasn't obvious to me, so I wanted to capture it here. It can be observed in the first grammar in this issue and in the grammar mentioned in this comment. The LALR(1) parsing tables contain only reduce/reduce conflicts, so I guess that in an efficient implementation it would make sense to start with state splitting to resolve those reduce/reduce conflicts and not to consider shift/reduce conflicts at all.

Considering that claim, it is clear that the (a) grammar in #26 (comment) can't be determinized with state splitting alone. As the paper claims, either it needs to be transformed to (b), or more lookahead is needed.
Technique: Moving LR(k) to runtime

This LR(2) grammar contains a reduce/reduce conflict in its LR(1) automaton and is used as an example for doing LR(k) at runtime in the dissertation by Xin Chen (page 164).

Testcase: LR(1) to LR(2) inefficient splitting

It is also an example of a source of inefficiency (similar to #26 (comment)) that the canonical LR(1) construction has. LR(1) doesn't help with resolving any inadequate states (LR(0) has 3, LALR(1) resolves two, LR(2) resolves one), but the canonical LR(1) construction still splits one state. An efficient implementation of LR(k) should consider that and not split states that don't help with resolving inadequate states. (TODO: or maybe it needs to split this state in LR(1) so that LR(2) can be effective?)
Transformation: ?

This mentions a yacc-grammar-snippet grammar for a language that is LR(2), but the grammar is not. This is also discussed in Xin Chen's dissertation (here, on page 167). The first link in this comment contains some prose that discusses different ways of dealing with it, for example by rewriting it to an LR(1) grammar such as (b) or (c). After a little bit of cleaning up, we get the following grammar:
TODO: http://www.cs.man.ac.uk/~pjj/complang/howto.html

Consider (b):
Class: LR(closed)/LR(infinite)

Chris Clark suggested a grammar class that he calls LR(closed) here. An example grammar can be found on grammophone. Quote:
It is also mentioned in Xin Chen's dissertation here on page 170.

This is interesting in a different way. Here, an LR split removes a local ambiguity on a 'b', but since it doesn't remove the local ambiguity on 'a', that ambiguity is duplicated with the split. The number of local ambiguities can therefore increase when going from LALR(1) to LR(1) if one is not careful about how one counts them.

Determinization: Convert left recursion to right recursion -> Procedure cloning -> Right-CPS

We can clone A and B to get the following grammar: on grammophone (not LR(1))

We can convert the left recursion to right recursion: on grammophone (not LR(1))

We can then apply CPS to move the right context to the left, into the base case of the recursion: on grammophone (SLR(1))
@modulovalue I decided to enable discussions as an experiment, and created an example grammars section: https://github.com/mdaines/grammophone/discussions/categories/example-grammars

I appreciate you posting these grammars, but I was wondering if discussions might be a better format than a single issue. Let me know what you think; I'm not sure if I like GitHub discussions or not. I still plan to add a few example grammars to Grammophone when I get a chance.
It's not bothersome. Feel free to keep adding them here if you like. Also, if you end up publishing the examples you've collected somewhere else, I'm happy to link to that.
Grammar Example LR(k) for any k

Here's a simple formula for constructing a sample LR(k) grammar for k > 0 (source):
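The formula itself sits behind the link above; as a sketch of one well-known family with this flavor (my own illustration, not necessarily the source's construction), force a reduce decision whose distinguishing token only appears k tokens later:

```
S -> A a a b | B a a c .
A -> x .
B -> x .
```

After shifting x, the parser must decide between reducing to A and reducing to B, but the two alternatives only differ at the third lookahead token (b vs c), so this instance is LR(3) and not LR(2); adding or removing a's before the b/c moves k up or down.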
Metric: Ill-foundedness

Here is an example of an "ill-founded" grammar as defined by the booze-tools project. It seems to be similar to the "unrealizable nonterminals" metric that grammophone exposes. It would be interesting to know whether the "ill-founded" concept adds something new or whether unrealizable nonterminals and ill-foundedness are one and the same thing.
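The booze-tools example is behind the link; for comparison, here is a minimal grammar of my own with an unrealizable nonterminal in the grammophone sense (it may or may not coincide with booze-tools' notion of ill-foundedness):

```
S -> a | A .
A -> b A .
```

A never derives a finite terminal string, since every expansion of A reintroduces A, so any production mentioning A can never complete; that is what grammophone reports as an unrealizable nonterminal.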
Testcase: GLR hidden left recursion termination issues

A naive implementation of GLR would have trouble with certain grammars containing epsilon productions. This is one: on grammophone

This is known as hidden left recursion.
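The linked grammar isn't reproduced above; as an illustrative sketch of the phenomenon (my own, not necessarily the linked grammar), a nonterminal is hidden-left-recursive when the recursive occurrence is preceded only by nullable symbols:

```
S -> A S b | x .
A -> .
```

Since A derives the empty string, S effectively begins with S itself; a GLR implementation that doesn't handle the cycles introduced by such epsilon reductions can fail to terminate while expanding S.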
Tutorial: LR(1)

The LALRPOP project provides an excellent explanation of how to implement LR(1) through the lane tracing method.

First example: on grammophone & the explanation

This blogpost appears to be an extended version of that readme.
Testcase: LR(1) disambiguation related incompleteness

The IELR paper claims:
This is the grammar that they use on grammophone. Notice how grammophone identifies that there are 4 example sentences.
So, apparently, a strategy that uses associativity declarations for removing ambiguities could lead to a different language being recognized.
Transformation: Deduplication

A very simple LR(2) grammar on grammophone with some prose (source) that shows how to rewrite that grammar to LR(0) by removing duplicates. This grammar shows that it might make sense to have some form of duplicate detection that identifies identical production rules.
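The linked grammar isn't reproduced here, but the effect is easy to sketch with a hypothetical grammar of my own (not the one from the source): two nonterminals with identical bodies force a reduce/reduce decision that needs two tokens of lookahead, and merging the duplicates removes the decision entirely.

```
S -> A x y | B x z .
A -> a .
B -> a .
```

After an a, the parser has to choose between reducing to A and reducing to B, and only the second lookahead token (y vs z) settles it, so this is LR(2). Replacing the duplicates A and B with a single nonterminal C (S -> C x y | C x z . C -> a .) leaves nothing to decide at the reduction, and the result is LR(0).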
Grammar Examples

This repo contains A TON of random LR(k) grammars and a bunch of LR(1) grammars in bison syntax.
Class: LR-regular (LRR)

This is an example grammar that is meant to be LRR (LR-regular): on grammophone. LRR is a proper superclass of LR(k). That grammar comes from the following paper that introduces LRR: LR-regular grammars—an extension of LR(k) grammars, Karel Čulik II and Rina Cohen

There's a somewhat common source of ambiguities in LR parsers that many people point out, where infinite lookahead would be needed to remove all inadequate states, so those grammars can't be LR(k) for any k. I'm wondering whether LRR could help with that. I'm also wondering about the relationship between Chris Clark's LR(closed) grammars and LRR. It would be nice to find some minimal LRR grammars so that their automata can be inspected more easily.

I'm also wondering if LRR could be generalized to something like "LR-CF" where the lookahead sets themselves are not just regular, but descriptions of a context-free language. Or maybe at least visibly pushdown languages (https://github.com/ianh/owl), which support a limited form of recursion, which regular languages don't, and have desirable properties that context-free languages don't have, but regular languages have.

Leo optimizes Earley-style parsers to take linear time for LRR grammars. One parser that claims to be able to parse LRR grammars is Marpa: A practical general parser: the recognizer, Jeffrey Kegler.
Class: ECFGs

This paper by Kristensen and Madsen proposes to extend the LR-theory from CFGs to ECFGs. It argues that by doing that, many conflicts can be avoided. (See page 8 for reference):

star-operator

The following grammar:
can be desugared into a left-recursive version (on grammophone) or a right-recursive version (on grammophone), both of which are not LR(1). The LR(1) automaton of the left-recursive version contains an r/r conflict where both productions are epsilon productions in one inadequate state. The LR(1) automaton of the right-recursive version contains 2 of the same r/r conflicts, but with an additional shift conflict for each, in two inadequate states. If we canonicalize

Left-recursive definitions in bottom-up parsers are more efficient because they don't need O(n) stack space. However, something that is not clear to me is the question of which kinds of inadequate states can benefit from a right-recursive definition. Right-recursive definitions don't appear to be strictly worse (excluding performance). From a performance perspective, left-recursive definitions are superior. But left-recursive definitions can introduce inadequate states that right-recursive definitions can't introduce (#26 (comment)).

On page 9, Kristensen and Madsen show an alternative desugaring that is SLR(1). (There's a typo there, the last production in both grammars should be left-recursive: on grammophone) But even there, the left-recursive version is LR(0), and the right-recursive version isn't. So it might make sense to prefer the left-recursive definition there (ignoring any performance benefits) because the increased expressivity might propagate to fewer inadequate states in more complex grammars?

If we apply the CPS idea from langcc, we can make the right-recursive version LR(0): on grammophone

or-operator

Consider page 9 and page 33: A naive desugaring produces a non-LR(1) grammar: on grammophone. If we canonicalize, we get an LR(0) grammar on grammophone, but by doing that we would lose meaning (e.g., the semantic actions would have to be the same, so an automatic transformation is not practical here). If we apply the automatic transformation, we can keep meaning and have an LR(0) grammar: on grammophone

Kristensen and Madsen don't seem to provide a method for applying their transformation automatically to ECFGs, but 50 years later, langcc did that, although the author doesn't mention that in the paper.

I think the key question here is whether we could benefit from extending the definitions of nullsets/firstsets/(slr) followsets/(lalr) lookaheadsets to ECFGs and introducing new actions in LR automata to support + (and interspersed +), * (and interspersed *), ? and | explicitly as first-class operations without having to desugar them to a mutilated CFG. (Regular expression to NFA to DFA conversions already need to implicitly consider such extended definitions of nullsets/firstsets/followsets, so this shouldn't be anything new.)

Applying any such transformations makes it much harder to keep a sane parse tree or do incremental parsing and, in general, it just complicates things by a lot. I haven't seen a single project that implements what the paper says, i.e., extend the LR-theory to support ECFGs as a first-class specification. Maybe by doing that we can have all the performance benefits of left-recursive definitions, the hypothetical additional expressiveness of right-recursive definitions, support incremental parsing, and have sane parse trees without having to do anything really complicated.

(Note: it is not straightforward to restore ASTs from regular expressions, so I guess it won't be straightforward to restore parse trees from first-class ECFGs.
https://github.com/ianh/owl shows that the parse tree problem for regular expressions can be solved.)
Technique: Left vs Right Recursion

What follows is a minimal example of what that post says: If we want to parse a list of comma separated values with an optional trailing comma, we can:
Furthermore, we can:
Note: A right recursive definition is not enough if the optional trailing separator is not in the base case of the right recursive definition: (on grammophone => not LR(1)). Joe's CPS transformation achieves precisely that, that is, it moves the optional trailing separator into the base case. However, we can have an LR(1) left-recursive definition of that language: on grammophone, so right recursion is not strictly necessary. To conclude, a quote from the linked blogpost:
And to be complete, here's the example where a left recursive definition is superior:
Observation: followset differences between left and right recursion

Consider right-recursive on grammophone & left-recursive on grammophone

Notice how the

Note:
Transformation: (Left-)CPS

The CPS technique (#26 (comment)) introduces a transformation that is able to move the right remainder of a rule into a nonterminal to its left. However, it would also make sense to do the same with left remainders, and move them into nonterminals to the right. This is a comma separated list that accepts an optional trailing comma: on grammophone.

Left-CPS can only be applied to lists if they are left recursive, and Right-CPS only to right recursive lists. (Why? Because lists end on the left if they are left recursive and on the right if they are right recursive.)
Transformation: Pete Jinks

See: http://www.cs.man.ac.uk/~pjj/complang/heuristics.html

That page contains a bunch of equalities that can be used to rewrite grammars. I'm wondering if it would be possible to find more by, e.g., formalizing CFGs (ECFGs?) and their algebraic rules as an algebra and putting that through, e.g., the Knuth-Bendix completion procedure (and maybe we can even find a confluent term rewriting system)?

Some transformations are not obvious to me, especially the self referential ones. Can an axiomatic system be defined to make their correctness more obvious by deriving them from basic algebraic manipulations?

PJ 4.12
Example: CPS-eligibility criterion for *

Here's another example of a grammar (

Before: on grammophone (not LR(1))

After: on grammophone (SLR(1))

Maybe this observation could be translated into a CPS-eligibility criterion: let the grammar be
Transformation: Procedure Cloning

GFGs (#27) introduce the idea of applying a traditional program optimization technique known as procedure cloning to CFGs. Consider the following grammar (for this regular language:

We can make the grammar SLR(1) by using a left-recursive list: on grammophone.
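The list example itself is behind the links; as a generic illustration of the cloning idea (my own sketch, unrelated to the list grammar), a shared nonterminal can pick up follow tokens from one use site that poison another:

```
S -> X d | b X a | b c d .
X -> c .
```

Here FOLLOW(X) = {d, a}, so after "b c" an SLR(1) parser sees a shift of d (toward b c d) and a reduce of X -> c on d: a shift/reduce conflict. Cloning X per use site (S -> X1 d | b X2 a | b c d . X1 -> c . X2 -> c .) splits the follow sets, X2 is only ever followed by a, and the conflict disappears, making the cloned grammar SLR(1).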
Example: Need for LL(k) where k > 1.

Terence Parr and Russell Quong discuss the need for LL(k) & LR(k) where k > 1 in
The next example, example 2, discusses the issue from the following comment in this issue: #26 (comment). The other examples don't appear to add anything new to this issue. A lot of that article discusses how actions can reduce the expressivity of LR parsers to that of an LL parser, but that's a big can of worms that I'm not going to try to investigate in this issue.
Transformation: Extended CPS

The CPS paper appears to only consider basic EBNF operations such as * and + in its flattening scheme and CPS transformation. We can go further. One might need to have multiple base cases (such as when we want to parse lists with interspersed commas and optional trailing commas, see: #26 (comment)). The paper only considers those with one base case. We can extend CPS to flattening procedures that introduce multiple base cases. I'm going to call this Extended CPS (ECPS).

Here's an example of a basic ECPS application with multiple base cases: That simple grammar is a snippet from the Dart grammar that describes a reduced version of function expressions and record patterns. Notice how the version without CPS is not LR(1), but the version with CPS is LR(0).

The CPS paper describes a very basic form of detecting which nonterminals are eligible for CPS. It can only be applied to EBNF symbols that were flattened before. I think we can do better. A better CPS transformation should not have to depend on the flattening scheme and should be able to find CPS-eligible symbols by examining the structure of the CFG, for example by looking for cycles in the CFG (maybe it's enough to consider SCCs (via Tarjan) that consist of a single fundamental cycle (via Johnson) with a single entry point to determine CPS eligibility; if there are multiple entry points, procedure cloning could help). If the recursion tails on the right, it should be eligible for left-CPS, and if it tails on the left, it should be eligible for right-CPS.

(Also, see: jzimmerman/langcc#48 (comment) for the same idea but for optionals)
This link contains a tutorial from Cornell on how to make a Java grammar LALR(1). There might be something interesting there. TODO: implement the examples in grammophone
This link contains a tutorial on LL(k) from one of the only LL(k) parser generators (SLK). I'm not focusing on LL parsing, and most of the stuff here is covering LR-parsing, but that page is very detailed so it deserves to be mentioned. TODO: implement the examples in grammophone
Example: "star lookahead + 1"Consider: on grammophone
(From: A Methodology for Removing LALR(k) Conflicts, right after Figure 1) This is not LALR(k) because it needs an unbounded amount of lookahead to resolve whether to reduce B1 or B2. However, the language itself is regular. I don't know of any strategies for dealing with this. It would be interesting to find a way for detecting conflicts of such kind and dealing with them.
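The paper's grammar is only behind the link above; a grammar of the same general shape (my own sketch, not the one from the paper) looks like this:

```
S -> B1 T c | B2 T d .
B1 -> b .
B2 -> b .
T -> a T | .
```

After the initial b, the parser must reduce to either B1 or B2, but the distinguishing token (c vs d) can be arbitrarily many a's away, so no fixed k suffices; the language itself is just the regular set b a* (c | d).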
SLR(1) / LALR(1)

Here's a simple grammar that is LALR(1) but not SLR(1).
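One well-known grammar with this property (possibly different from the one originally posted) is the "pointer assignment" grammar from the Dragon Book:

```
S -> L = R | R .
L -> * R | id .
R -> L .
```

In the state reached after seeing an L, SLR(1) must choose between shifting = (toward S -> L = R) and reducing R -> L; because = is in FOLLOW(R), that is a shift/reduce conflict. The LALR(1) lookahead for that reduction is only the end marker, so LALR(1) has no conflict.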
This paper contains a simple example where an LALR(2) grammar has been manually converted to LALR(1): Suffix languages in LR parsing - Benjamin R. Seyfarth & Manuel E. Bermudez

LALR(2) from Figure 1:
LALR(1) from Figure 2:
Here's an example of an LL(2) grammar:
It comes from here: Lelek, an LL(k) Parser Generator: Solving the Rule Decision Problem. There's also more written about that example there.
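The original example is described at the link; a minimal stand-in (my own, not Lelek's) that needs exactly two tokens of lookahead:

```
S -> a b S | a c .
```

Both alternatives begin with a, so one token of lookahead cannot choose between them, but the two-token prefixes ab and ac are disjoint, so an LL(2) parser can.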
In Parsing Techniques - Second Edition, Fig. 9.46, an LAR(m) grammar is presented: on grammophone.
It needs regular lookahead to be made deterministic. One conflict is resolved by

This grammar is from Fig. 9.51 and it contains a right recursive version of the previous grammar.
This grammar needs its stack to be trimmed and it is properly LAR(1); the previous grammar does not need its stack to be trimmed, but it is still LAR(1). TODO: example for LAR(>1)
Here's an unambiguous non-deterministic grammar given as an example in Fig. 9.13 in Parsing Techniques - Second Edition.
That book states that no LR-based method or extension of it (... like LAR(m)) is able to parse it deterministically. However, I'm not entirely convinced of that, and that statement doesn't appear to come with a proof. A proof or a counterexample would be very nice.

Edit: there are some discussions here https://groups.google.com/g/comp.compilers/c/8b5EWs1pREM/m/Cc3AeS_pZDQJ that discuss this grammar.
Hello @mdaines,
(I hope you don't mind another issue)
I think it would be awesome if there were a way to load sample grammars (and perhaps make it easy to submit them via PRs?). I'm thinking of something like the "Load Sample" feature in edotor, but with a short description of why the grammar is considered to be interesting.
"Mysterious conflict"
Here's one grammar that I think is interesting (source):
on grammophone
It demonstrates a grammar that is LR(1), but not LALR(1): merging the LR(1) states introduces a reduce/reduce conflict. The conflict is known as a "mysterious conflict".
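The grammar behind the link may differ, but the standard textbook instance of this situation (from the Dragon Book) is:

```
S -> a A d | b B d | a B e | b A e .
A -> c .
B -> c .
```

Canonical LR(1) keeps the two states that reduce c apart: after "a c" the lookahead for A -> c is d and for B -> c is e, and after "b c" it is the other way around. LALR(1) merges the two states because they share a core, the lookahead sets then overlap, and a reduce/reduce conflict appears even though the grammar is unambiguous, which is what makes it look mysterious.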