Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] "Template"/Macro/Generic rules #261

Open
CAD97 opened this issue Aug 24, 2018 · 6 comments
Open

[RFC] "Template"/Macro/Generic rules #261

CAD97 opened this issue Aug 24, 2018 · 6 comments

Comments

@CAD97
Copy link
Contributor

CAD97 commented Aug 24, 2018

Motivation

One of the advertised features of LALRPOP, macros/templates/generics are a useful tool for factoring out common parts of your grammar. The common example is CommaSeparated<production> to represent production ~ ("," ~ production)* ~ ","?. (This can also be written (production ~ ",")* ~ production?, but I prefer the former formulation.)

In this RFC I lay out how a design for generic rules might look in pest, and attempt to make a case for their implementation.

A proposal for standard casing

In the 2.0 version of pest, the standard casing sees builtin rules in SHOUT_CASE and user rules recommended to be in snake_case. This RFC proposes that generic rules could be TitleCase by convention, along with their arguments, to distinguish them from normal rules.

Guide-Level Explanation

(Shamelessly adapted from the LALRPOP book, which is licensed MIT/Apache as is LALRPOP itself)

When writing grammars we encounter repetitive constructs that might normally be copy-and-pasted. A common example is something like a "comma-separated list". If we want to parse a comma-separated list of expressions, it might look something like:

expressions = { expression ~ ("," ~ expression)* ~ ","? }

But what happens if later we want a comma-separated list of terms, or anything else? For this, pest offers generic rules. By using a generic rule, we can factor out this common functionality into one place.

expressions = { CommaSeparated<expression> }
terms = { CommaSeparated<term> }

CommaSeparated<Rule> = _{ Rule ~ ("," ~ Rule)* ~ ","? }

Because CommaSeparated is marked as a silent rule with a _, this means this is functionally equivalent to inlining its structure into both expressions and terms. If a generic rule is not silenced, it will be included in the output structure just like any other rule.

Implementation-Level Explanation

There are two ways to handle generic rules. In the first, we treat it as a template, and generate multiple parsing functions for each instantiation. In the second, we pass along the generics to the Rust code.

I will explain via walking through in pseudocode the following example (note that no rules are silent, unlike above):

expressions = { CommaSeparated<expression> }
terms = { CommaSeparated<term> }

CommaSeparated<Rule> = { Rule ~ ("," ~ Rule)* ~ ","? }

Template desugaring

For each unique rule that is passed into a generic rule, desugar to a new rule instantiated with the concrete rule(s) passed to the generic rule.

expressions = { CommaSeparated__expression }
terms = { CommaSeparated__term }

CommaSeparated__expression = { expression ~ ("," ~ expression)* ~ ","? }
CommaSeparated__term = { term ~ ("," ~ term)* ~ ","? }

The generated rules do not correspond to unique Rule enum variants in the output, however; all generated rules from the same generic rule map to the same Rule enum variant.

Generic implementation

In addition to the parser state, the generated function for parsing this rule takes an argument representing what rule is passed as its generic argument. It then calls said function for any time the generic argument is present in the definition.

fn CommaSeparated(
    state: Box<::pest::ParserState<Rule>>,
    Rule: impl Fn(Box<::pest::ParserState<Rule>>) -> ::pest::ParseResult<Box<::pest::ParserState<Rule>>>,
) -> ::pest::ParseResult<Box<::pest::ParserState<Rule>>> {
    Rule(state)?
        .many(0.., |state| Rule(state.literal(",")?))?
        .optional(|state| state.literal(","))
}

Grammar changes

The terminal rule is changed to accommodate generic rules:

-terminal = _{ _push | identifier | string | insensitive_string | range }
+terminal = _{ _push | (identifier ~ ("<" ~ CommaSeparated<terminal> ~ ">")?) | string | insensitive_string | range }

In the future, we may wish to relax this such that a generic rule can take a term or even expression instead. Conversely, we may wish to only accept one terminal instead of a list to begin with.

Prior Art

  • LALRPOP, as linked above

Unresolved Questions

  • Do we want to pass along the genericness to the generated Rust code, or "monomorphize" the rules ourselves?
  • How generic should the generics be allowed to be? Do we want to support the more generic Separated<Term, By> = { Term ~ (By ~ Term)* ~ By? }? Do we need to support it?
  • Does paramaterizing generics over term or expression instead of terminal give any more convenience to the user? Any generic rule can be expressed solely by taking a terminal by just defining a silent terminal to be the desired more complicated expression.
@CAD97
Copy link
Contributor Author

CAD97 commented Aug 24, 2018

CCing some of the linked project authors whose grammar looked at a glance that they might benefit from this (and are currently using the derive grammar):

@sunng87 (handlebars-rust), @jturner314 (py_literal), @wahn (rs_pbrt), @Keats (tera),

@CAD97
Copy link
Contributor Author

CAD97 commented Aug 24, 2018

I wrote this RFC because I'm currently wishing I had it in my personal project. If we want to write a large language grammar using pest, this re-usability factor is probably much more useful than even #197.

@dragostis
Copy link
Contributor

dragostis commented Aug 24, 2018

I like the idea and I think it could potentially be something to be included in 2.0 or 2.1. However, I would prefer a slightly different approach. How about every rule can take arguments like normal functions? rule would thus be equivalent to rule(). Everything else would stay the same.

The implementation would be a bit more demanding, since the AST would need to be changed to some degree. Monomorphization would also be needed in order to have good performance, probably implemented as an optimization step in meta.

@CAD97, would you be willing to take a jab at it once we put everything in order?

@CAD97
Copy link
Contributor Author

CAD97 commented Aug 24, 2018

Yep, I can work on an MVP implementation. I like the idea of making every rule a function that takes rule arguments.

If we translate this to generics at the Rust level, Rust will take care of the monomorphization pass for us.

This is definitely a 2.1 thing rather than a 2.0 blocker, though. Either formulation of <> or () is a clean extension to the existing 2.0 syntax.

@sunng87
Copy link
Contributor

sunng87 commented Aug 24, 2018

There were little comma-separated syntax in handlebars, but I think this can be a good addition to pest. Also I'm a fan of a more generic Separated<Term, By, EscapedBy> implementation.

Also how about a Quoted<Term, StartBy, EndBy, EscapedBy> ?

@Keats
Copy link
Contributor

Keats commented Aug 29, 2018

Sounds like a good idea to me!

Also I'm a fan of a more generic Separated<Term, By, EscapedBy> implementation.

+1 on that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants