Restoration of the pest3 work effort 🙌 #885

tomtau · 2023-07-14T13:06:32Z

tomtau
Jul 14, 2023
Maintainer

UPDATE (May 14, 2024): early pest3 prototype is here #1016

pest3 started as an effort to improve pest's language grammar and parser API, i.e., as per pest's focus, pest3 aimed to improve their accessibility, correctness, and performance. 💪 (Note that flexibility was not a primary goal.)

As @dragostis put it during our brainstorming call a few days ago, the idea was to have a language that is easier to use than the current pest 2's grammar and a better API that would leverage Rust's type system (unlike the existing Pairs API). Unfortunately, @dragostis burned out, and that effort did not lead to completion. 😢

A lot has changed since then, so this discussion aims to gather more feedback on what pest3 could be. I imagine pest3's future steps are the following:

💡 This discussion will run for a while to gather different ideas and current preferences;
🧪 A period of experimentation will follow and produce a more realistic idea for pest3's scope (to avoid the second-system syndrome, or in other words: "perfect is the enemy of good");
🚧 After that, the maintenance focus will be on implementation, documentation, tooling, and transition to pest3.
Given the uncertainty at this stage (from pest3's scope to people interested in helping out), this comes without a concrete timeline.

Anyway, in this discussion, I will post many separate threads, and label each thread with one potential grammar or API-breaking change. You can do the following actions:

⬆️ UPVOTE a thread if that breaking change interests you and you wish to see that in pest3. (This will help to sort potential preferences for the focus during experimentations.)
💬 COMMENT on a thread if you have more ideas for its proposed change or if you can help to implement it.
🧵 START A NEW THREAD if you find a missing feature among existing threads and wish to see it in pest3 (ideally, you can link existing GH issues).

tomtau · 2023-07-14T13:07:22Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: simplified handling of whitespaces/comments

Since atomic rules are cascading, it is not immediately obvious if two sequenced expressions a ~ b accept trivia—it wholly depends on whether or not the current rule inherits atomicity. The idea is to make it more explicit by being able to define the infix sequence operator ~ itself. As suggested in:
#333 (comment)

Operator	Trivia	Non-trivia
Sequence	`~`	`-`
Repeat zero or more times	`~*`	`*`
Repeat one or more times	`~+`	`+`
Repeat exactly n times	`~{n}`	`{n}`
Repeat minimum of n times	`~{n..}`	`{n..}`
Repeat maximum of n - 1 times	`~{..n}`	`{..n}`
Repeat maximum of n times	`~{..=n}`	`{..=n}`
Repeat between m and n - 1 times	`~{m..n}`	`{m..n}`
Repeat between m and n times	`~{m..=n}`	`{m..=n}`
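
To make the two columns concrete, here is a rough hand-written sketch in plain Rust (not pest3 code) of what the two sequence operators could mean. The names `lit`, `skip_trivia`, `seq_trivia` and `seq_strict` are made up for illustration, and trivia is assumed to be whitespace only.

```rust
/// Matches a literal at the start of `input`, returning the rest on success.
fn lit<'a>(input: &'a str, s: &str) -> Option<&'a str> {
    input.strip_prefix(s)
}

/// Skips trivia; assumed here to be whitespace only (no comments).
fn skip_trivia(input: &str) -> &str {
    input.trim_start()
}

/// `a ~ b` in the proposal: trivia is allowed between the two literals.
fn seq_trivia<'a>(input: &'a str, a: &str, b: &str) -> Option<&'a str> {
    let rest = lit(input, a)?;
    lit(skip_trivia(rest), b)
}

/// `a - b` in the proposal: the two literals must be adjacent.
fn seq_strict<'a>(input: &'a str, a: &str, b: &str) -> Option<&'a str> {
    let rest = lit(input, a)?;
    lit(rest, b)
}
```

With this reading, `seq_trivia("a   b", "a", "b")` succeeds while `seq_strict` on the same input fails, regardless of which rule the expression sits in.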

10 replies

elenakrittik Jul 16, 2023

Perhaps I'm misunderstanding something, but that grammar specifically has lots of [spaces] here and there to allow arbitrary whitespace between code, and backslashed is included in the main rule to allow code to be split at an arbitrary point by a backslash. If you need to, you can also make "\n" in backslashed optional to allow even such obscure syntax as:

echo \what
 \       the
              \ \\ duck

Please correct me if I missed some obvious point.

Toasterson Jul 16, 2023

Hmm maybe I am also not up to snuff anymore with my PEG. I'll have to look it over when I have some time. I simply remember it being extremely hard or impossible for me to do in my code. This here was the full parser I was making at the time https://github.com/OpenFlowLabs/ips/blob/master/libips/src/actions/manifest.pest
So maybe I blocked myself out of it? If you say, hey, this is very easily doable with the thing you posted, then I missed the explanations on how to do it. I remember more people asking the same thing in the Matrix channel later and nobody was able to answer. If people can now get the help to make this, then it's OK.

elenakrittik Jul 16, 2023

As far as I can see, you want to achieve this behaviour via WHITESPACE. I don't immediately see whether it's a good option, but it's definitely possible. Pest can parse pretty much anything (except CSGs, perhaps, but they're a whole other story).

Toasterson Jul 16, 2023

I don't want to say exactly how it should be achieved. But it should be as simple as the WHITESPACE feature.

tomtau May 17, 2024
Maintainer Author

@Toasterson any thoughts on this: #1016 (reply in thread) ?

tomtau · 2023-07-14T13:08:54Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: better reusability of expressions using macro/template/generic rules

This change is to be able to parametrize rules at definition time.
As suggested in #261 so that one can e.g. write separated(e, ",") instead of repeating some form of e ~ ("," ~ e)*.
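
For illustration only, this is roughly what such a parametrized rule does, written as an ordinary generic Rust function rather than grammar syntax. The parser shape (a function from remaining input to `Option<(value, rest)>`) and the `number` element parser are assumptions made for this sketch.

```rust
/// Parses `e ~ (sep ~ e)*`, where `e` is any parser function.
/// A parser takes the remaining input and returns (value, rest) on success.
fn separated<'a, T>(
    input: &'a str,
    e: impl Fn(&'a str) -> Option<(T, &'a str)>,
    sep: &str,
) -> Option<(Vec<T>, &'a str)> {
    let (first, mut rest) = e(input)?;
    let mut items = vec![first];
    // Keep going while a separator followed by another `e` matches.
    while let Some(after_sep) = rest.strip_prefix(sep) {
        match e(after_sep) {
            Some((item, next)) => {
                items.push(item);
                rest = next;
            }
            None => break, // trailing separator: leave it unconsumed
        }
    }
    Some((items, rest))
}

/// Example element parser: one or more ASCII digits.
fn number(input: &str) -> Option<(u32, &str)> {
    let end = input.find(|c: char| !c.is_ascii_digit()).unwrap_or(input.len());
    let value = input[..end].parse().ok()?;
    Some((value, &input[end..]))
}
```

The grammar-level `separated(e, ",")` would be the declarative counterpart: the repetition pattern is written once and reused with different `e` and separator arguments.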

0 replies

tomtau · 2023-07-14T13:10:45Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar and API Change: token parametrization

There was also an idea for parametrizing tokens that can be replaced at runtime.
As suggested in: #333 (comment) so that one can swap out e.g. delimiters [ and ] instead of the default { and } in grammars.
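
A minimal sketch of the runtime side of this idea, assuming the parser simply receives the delimiters as data. `Delimiters` and `delimited` are hypothetical names and not part of any pest API.

```rust
/// Delimiters that a grammar could expose as runtime parameters.
struct Delimiters {
    open: char,
    close: char,
}

/// Returns the text between a leading `open` and the first `close`.
fn delimited<'a>(input: &'a str, d: &Delimiters) -> Option<&'a str> {
    let inner = input.strip_prefix(d.open)?;
    let end = inner.find(d.close)?;
    Some(&inner[..end])
}
```

The same parsing code then accepts `{x}` or `[x]` depending on the `Delimiters` value passed in, without regenerating the parser.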

0 replies

tomtau · 2023-07-14T13:12:26Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: module/namespace system

pest can now define multiple grammar files:

pest/derive/examples/calc.rs

Line 5 in d9bfdde

#[grammar = "../examples/base.pest"]

but it is a simple "rule concat" mechanism and it does not work with pest_vm. The idea is to have modules more akin to Rust modules. This change removes the need for capitalization of built-in rules.
As suggested in #333 and #660 :

/// Modules can be created by importing other grammars and are immediately public.
use "cool.pest";
use "this.pest" as that;

/// pest has its own sub-modules.
any     = { pest::any }
stack   = { pest::stack::peek }
unicode = { pest::unicode::binary::punctuation }

4 replies

Jamalam360 Jul 14, 2023

I like the idea, it would be useful for the LSP as well. Not sure about the whole pest::unicode::binary::punctuation, it seems very verbose

NyalephTheCat Jul 14, 2023
Sponsor

Maybe it could be aliased?

elenakrittik Jul 14, 2023

> I like the idea, it would be useful for the LSP as well. Not sure about the whole pest::unicode::binary::punctuation, it seems very verbose

~~We can always flatten some namespaces or shorten rule names (e.g. pest::unicode::binary::punctuation -> pest::unicode::punct) if they are reported to be too long.~~

~~Alternatively, as pointed out above, we can add an alias system, but that would be overkill in my opinion.~~

Read below.

CAD97 Jul 14, 2023
Maintainer

So long as silent rules are a thing, aliasing is trivial (e.g. punct = _{ pest::unicode::punctuation })

tomtau · 2023-07-14T13:13:48Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: stack slicing, additional stack operations or alternatives

pest 2.X has PEEK[start..end]. So the idea is to extend it to other stack operations (POP/DROP) and perhaps make it more flexible.

A few issues regarding the stack were posted:

POP not follow the Atomic rule #721 — this change may not be needed if the simplified handling of whitespaces is implemented;
Is there any way to determine, at the level of grammar, that the order of expressions does not matter? #842 — PEEK_ANY or could one solve it with stack slicing?
Case-insensitive PUSH / PEEK / POP #541 — case insensitive matching… maybe one can push/pop/peek both variants and match on two indices with stack slicing?
[Request] Add COUNT() option to pest grammar #572 — COUNT?
Matching an expression and pushing a different one #880 — PUSH_OTHER?

Stack operations are an extra grammar complexity, so the question is whether they are needed for the parsing problems, i.e. whether there are simpler alternatives. For example, for indentation-sensitive languages, one may use two operators for indentation and alignment. The separate lexer may also allow the use of the "lexer trick".

0 replies

tomtau · 2023-07-14T13:16:02Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: grammar versioning

Either inside the grammar file directly or in the derivation expression.
Probably inside the grammar file directly is better.
As suggested in #333:

#[grammar = "grammar.pest", version = "3.0"]

Draft: pest files MAY start with a version_attribute = { "#" ~ "!" ~ "[" ~ "version" ~ "=" ~ semver_bound ~ "]" }. If not specified, it defaults to #[version = ^2.0].
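
As a rough sketch of how tooling might read such a header: the function names are hypothetical, the bound is returned as an unvalidated string, and (as a simplification of the draft rule) `#![` is expected without spaces in between.

```rust
/// Returns the version bound declared on the first line of a grammar,
/// or the draft's default of "^2.0" when no attribute is present.
fn grammar_version(source: &str) -> &str {
    let first_line = source.lines().next().unwrap_or("");
    parse_version_attribute(first_line).unwrap_or("^2.0")
}

/// Matches `#![version = <bound>]`, tolerating spaces around `=`.
/// The bound itself is not checked against semver syntax here.
fn parse_version_attribute(line: &str) -> Option<&str> {
    let rest = line.trim().strip_prefix("#![")?;
    let rest = rest.trim_start().strip_prefix("version")?;
    let rest = rest.trim_start().strip_prefix('=')?;
    let bound = rest.strip_suffix(']')?.trim();
    if bound.is_empty() {
        None
    } else {
        Some(bound)
    }
}
```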

2 replies

homersimpsons Jul 15, 2023

Maybe instead of supporting multiple versions there could be an easy way to upgrade from a given version to another. I guess it is still okay to have some manual changes to do.

tomtau Jul 15, 2023
Maintainer Author

Yes, that was also discussed in the original thread. As long as the new grammar remains semantically-compatible, it should be possible to provide automated tooling to help with grammar conversions.

tomtau · 2023-07-14T13:18:39Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: small bikeshedding

As suggested in #333 (comment) :

making { } optional
make use of the module/namespace system and instead of upper case rule name convention, put the builtin rules or symbols into pest::* modules

0 replies

tomtau · 2023-07-14T13:19:43Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: better error-reporting

Chumsky has a nice delimited_by operator that can lead to better error messages: https://github.com/zesterer/chumsky/blob/main/tutorial.md#parsing-parentheses as more context is expressed in the grammar that those pairs of tokens are related.

And non-token generating symbols in errors can also help: #327

4 replies

Samyak2 Aug 19, 2023

I have found error reporting lacking in pest, although I have not used it a lot.

A symbol that is silent in the parse tree, but shown in error messages would be very useful. This is the same as non-token-generating symbols described in #327. I would like to propose a new modifier, say -, that makes the symbol behave in such a way. Example:

Semicolon = -{ ";" }
Statement = { "a" | "b" }
TopLevel = { (Statement ~ Semicolon)* }

This would show Semicolon in error messages, such as = expected Semicolon but it will not appear in the parse tree.

Thoughts?

tomtau Aug 20, 2023
Maintainer Author

When revisiting the sketch of pest3's surface syntax, I found that there may be two possible ways of aliasing the token sequences: 1) silent rules (similar as before: Semicolon = _{ ";" }), 2) parameter-less "metarules" (Semicolon() = ";")... I'm thinking perhaps the latter one (as well as metarules with parameters) could be useful if it made it to the error reporting?

davawen Aug 21, 2023

I don't think there should be a special modifier for including tokens in messages.
Imo, string literals should always be present in error reporting, because if it's something the grammar expects, well... you expect it?

line = { ... ~ ";" }
//
// | a = 3
// |      ^
// = expected ..., or ";"  <- When wouldn't you want this??

When first writing this I was saying the same should be true for silent rules, but after thinking about it for a bit I think I got a clearer picture.
On one hand, I want to be able to declare token names in error messages (without going into renamed_rules, because it often means duplication with the grammar file), so:

SEMICOLON = _{ ";" } /// you now get `= expected SEMICOLON` in the error above

But silent rules can really be any kind of alias, they serve to avoid repetition while not actually modifying the parse tree:

prefix = _{ PLUS | MINUS }

/// With silent rule errors, you'd get
// | $not_a_valid_prefix
// | ^
// = expected ..., or prefix
/// Instead of the much better
// = expected ..., PLUS, or MINUS

So this leaves me more tempted with your second option.......

But this leaves a bad taste in my mouth!
Serving to "avoid repetition while not actually modifying the parse tree" is exactly what parametrized rules are about for me (if you swap 'parse tree' for 'program', the parallel is pretty obvious between «parameter-less "metarules"»/parametrized rules and procedures/functions)
The way I see this is furthered by how I implement parametrized rules in a little pest preprocessor I wrote:
I simply copy the contents of the "function" and replace the arguments as they come up (C style but respects precedence), so in that sense an empty function is the exact same as a silent rule.

That way, you could separate silent rules that should still have an impact in error reporting and pure aliases that don't exist at parse time (maybe with some kind of syntax to more easily define them from silent rules, I have a small idea but no bikeshedding for now!)

SEMICOLON = _{ ";" }
prefix() = { PLUS | MINUS }
    PLUS = { "+" } MINUS = {"-"}
expr = { prefix ~ number ~ SEMICOLON } // no parentheses needed for prefix

/// $20;
// -> expected PLUS, or MINUS
// -10
// -> expected SEMICOLON

Although I have no idea how hard this would be to implement since I didn't delve at all into pest's source (but I would be very open to contribute once we get an idea of what pest3 is going to be! It's by far the easiest parser library to use in rust and if we can fix its quirks it's going to be the best one too!)
Sorry for the long comment, thanks for reading <3.

EDIT: errors coming from parametrized rules could be very useful too: you could have a built-in delimited_by rule for example. I feel like I'm just making up more questions than answers right now...

davawen Aug 21, 2023

Actually, scratch everything I said about silent rules and functions, I'm just dumb.
Having a special modifier is much better: you can just annotate parametrized rules.
You get a natural looking escalation:

Rule -> creates symbol and is "expected"
"Parsing" Rule -> passes its symbol, but is expected, takes control of its children's errors
Silent Rule -> just an alias

And the same applies to parametrized rules:

No modifier -> creates symbol (and would be generic over its inputs in a typed AST)
"Expected" modifier -> passes its symbols (and arguments) but takes control of its children's error
Silent Rule -> just a function

Then you can have a built-in delimited_by rule with a nice error, or you could almost make it yourself if, given we implement a typed AST, it's replicated for errors:

// number = { ASCII_DIGIT-+ } // non-trivia repetition
// delimited_by(left, inner, right) = -{ left ~ inner ~ right } // modifier
// expr = { delimited_by("(", number, ")") }

// Generated: 
// ignoring string lifetimes for clarity
trait Rule { /* ... */ }
struct number { inner: List<ASCII_DIGIT>, /* ... */ } // would parse using Rule::as_str
struct delimited_by<left: Rule, inner: Rule, right: Rule> { /* ... */ }
struct expr { inner: delimited_by<&str, number, &str>, /* ... */ } 

trait Error { /* ... */ }
struct Error_number { inner: Option<error_List<error_ASCII_DIGIT>> /* ... */ }
struct Error_delimited_by<left: Error, inner: Error, right: Error> { left: Option<left>, inner: Option<inner>, right: Option<right>, /* ... */ }
struct Error_expr { inner: Option<Error_delimited_by<Error_str, Error_number, Error_str>> }

// Usage:
match parse::<expr>(input) {
    Err(e) => e.rename_delimited_by( |d| {
        match d {
            { Some(e), _, _ } => format!("expected starting delimiter {e}"),
            { _, Some(e), _ } => format!("expected inner rule {e}"),
            { _, _, Some(e) } => format!("expected ending delimiter {e}"),
            // or if we use `Result` instead you can have:
            // { Ok(d), _, Err(e) } => format!("expected ending delimiter {e} (started at {})", d.line_col()),
            _ => unreachable!()
        }
    })
}

And string literals would be easily handled by default (or with a simple rename_str call)
Having a separate modifier would be very useful by itself, but combined with a typed AST it can be VERY powerful.

tomtau · 2023-07-14T13:20:28Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: custom hooks

As suggested in #815

WHITESPACE   =  _{ " " | "\t" | NEWLINE }
int          =  @{ (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT+ | ASCII_DIGIT) }
__HOOK_INT   =  _{ int }
ints         =  { SOI ~ __HOOK_INT* ~ EOI }

For all terms beginning with __HOOK, the generator will call user-defined hooks before deciding whether to parse.

    #[derive(Parser)]
    #[grammar = "../examples/hook.pest"]
    #[custom_state(crate::parser::CustomState)]
    pub struct Parser;

    pub struct CustomState {
        pub max_int_visited: usize,
    }

    impl Parser {
        #[allow(non_snake_case)]
        fn hook__HOOK_INT<'a>(state: &mut CustomState, span: Span<'a>) -> bool {
            let val: usize = span.as_str().parse().unwrap();
            if val >= state.max_int_visited {
                state.max_int_visited = val;
                true
            } else {
                false
            }
        }
    }

The state will need to support snapshot / recovery so that the hook can be placed anywhere. But users can also opt out of recovery and modify their grammar to avoid this situation. Examples are in hook.rs and hook.pest.

This is powerful, but it is an "escape hatch" that goes against pest's portability (e.g. how would this work in pest_vm?).

5 replies

Jamalam360 Jul 14, 2023

This feels very hacky/out of place

TheVeryDarkness Jan 12, 2024

@tomtau I have a different plan for hooks than the one above. It's a bit like what bison does.

Maybe people won't call these "hooks"?

I have three kinds of hooks in mind.

Post Hooks

Post hooks are those hooks that are called on success.

One can write this:

#[post(hook="parse_int", ret=u64)]
int = ('0'..'9')+

The generator will call parse_int after the parsing of an int is done, and replace the data structure of int with a u64. Of course, the hook should be visible at the same level as the derived structure.

use crate::int::parse_int;

#[derive(Parser)]
struct Grammar;

Middle Hooks

Middle hooks are those hooks that are called on error.

Middle hooks are almost the same as post hooks, except that they map error types into custom data structures.

#[mid(hook="ParserError::not_a_integer", ret=ParserError::NotAInteger)]
int = ('0'..'9')+

Pre Hooks

Pre hooks are those hooks that are called when starting to parse a node.

Maybe these hooks can be used to track something? I have no example for pre hooks.

tomtau Jan 12, 2024
Maintainer Author

@TheVeryDarkness those processing functions/"hooks" may be OK, as they don't change the parsing process and e.g. pest_vm can work as before.
The "pre hooks" description seems a bit like that event-based parser output I was jotting down here: #885 (reply in thread)

AlphaFoxz Sep 6, 2024

Hi, I'm new here. I realized it's not easy to test every "single rule".
Maybe it would be reasonable to add a feature flag and then generate "wrapped rules".

Preprocess the pest file:

my_rule         = { "keyword" }
__exact_my_rule = { SOI ~ my_rule ~ EOI } // Auto generated

And then it could be easily tested:

#[derive(Parser)]
#[grammar = "parser.pest"]
struct MyParser;

#[cfg(test)]
#[derive(ParserForTesting)]
#[grammar = "parser.pest"]
struct MyParserForTesting;

#[test]
fn test_my_rule() {
    MyParserForTesting::parse(Rule::__exact_my_rule, "keywordandmore").unwrap_err();
}

TheVeryDarkness Sep 18, 2024

In pest_typed and pest3, there are two functions, try_parse_partial and try_parse, where the former can consume some prefix but not necessarily the whole of the input, and the latter forces the parser to consume the whole input; otherwise an error like expected EOF would be raised.
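
The difference can be sketched with two hand-written functions for the `my_rule = { "keyword" }` example. `parse_partial` and `parse_full` are hypothetical stand-ins for illustration and do not reproduce the actual pest_typed or pest3 signatures.

```rust
/// Matches the literal "keyword" at the start and returns the unconsumed rest.
fn parse_partial(input: &str) -> Result<&str, String> {
    input
        .strip_prefix("keyword")
        .ok_or_else(|| "expected \"keyword\"".to_string())
}

/// Same rule, but any leftover input is reported as an error.
fn parse_full(input: &str) -> Result<(), String> {
    let rest = parse_partial(input)?;
    if rest.is_empty() {
        Ok(())
    } else {
        Err(format!("expected EOF, found {rest:?}"))
    }
}
```

With the full variant, an input such as "keywordandmore" fails without needing a generated `SOI ~ my_rule ~ EOI` wrapper rule.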

tomtau · 2023-07-14T13:22:17Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: supporting left recursion

As suggested in #533
There are a few ideas on how to handle left recursion in PEG, either by explicitly marking rules or implicitly via a different parsing mechanism… But different parsing mechanisms may have performance tradeoffs.
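
For context, the usual manual workaround today is to rewrite a left-recursive rule such as `expr = expr "-" num | num` as a repetition and fold the results to the left. The sketch below is hand-written Rust showing that rewrite, not a proposal for how pest3 would implement left recursion.

```rust
/// Parses a run of ASCII digits.
fn num(input: &str) -> Option<(i64, &str)> {
    let end = input.find(|c: char| !c.is_ascii_digit()).unwrap_or(input.len());
    let value = input[..end].parse().ok()?;
    Some((value, &input[end..]))
}

/// Parses `num ("-" num)*` and folds to the left, which yields the same
/// value as the left-recursive `expr = expr "-" num | num` would.
fn expr(input: &str) -> Option<(i64, &str)> {
    let (mut acc, mut rest) = num(input)?;
    while let Some(after_op) = rest.strip_prefix('-') {
        match num(after_op) {
            Some((rhs, next)) => {
                acc -= rhs; // left-associative: (a - b) - c
                rest = next;
            }
            None => break, // backtrack: leave the dangling "-" unconsumed
        }
    }
    Some((acc, rest))
}
```

Native support would let the grammar author write the left-recursive form directly and get the left-associative tree without this manual transformation.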

0 replies

tomtau · 2023-07-14T13:22:40Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: parsing binary data

pest's focus has been on textual data, but it may be something to consider.

0 replies

tomtau · 2023-07-14T13:23:06Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: separating a lexer

As suggested in #580
Separating a lexer has many advantages, but having a single grammar syntax is preferable for pest's accessibility goal.

1 reply

davidsantiago Oct 1, 2023

I made an offhand remark about this on the Discord, and @tomtau suggested I post some thoughts on this thread. I don't have a super deep understanding of parser generators and PEG implementation beyond hazy undergrad compiler classes long ago, so I'm taking my own thoughts with a grain of salt; more like musings. Reading all this discussion around pest3 changes and issue #580, it does appear that there has already been a high-level decision made that enabling this use case is desirable, in order to keep Pest "unified" as a single grammar parsed in one phase. So at the start, I'm not expecting a lot of motion on this if I advocate in the opposite direction. However, I do think (again, without expertise) that this use case can actually be serviced without abandoning that ideal, as I describe below. And even if I'm wrong about that, hopefully a friendly dissent on this can be helpful in other discussions that touch these issues.

So, naively (very naively!), it does seem to me like a few things are true:

Pest has a number of complications and annoying cases either stemming from (or resulting in the complication of) handling of certain common cases in parsing. The aforementioned discussion on Discord kicked off with one, for example. I notice that in this larger pest3 discussion the most popular thread is "Simplified Handling of Whitespace and Comments", wherein the proposed solution is a bifurcation of several key operators according to how they treat trivia. Tom also mentioned that the stack operations are present in Pest as a way to handle the parsing of indentation-sensitive languages, another case which is handled fairly naturally in a parsing strategy where tokenization is separate from parsing. I did also notice that in this discussion thread group there's a sub-thread about adding additional stack operations to fix a number of troubling cases with the stack operations functionality, though I did not dig into them to see how many would be touched by this proposal. Still, this does create the question in my head about what is really "simpler."
If one were to take a pest file and build a tree of its rules, it would be possible to draw a horizontal line splitting the tree vertically in two, and any rule node that didn't have a parent in the bottom half could be called a "token" in some sense. This is really just the same observation as that a language that is normally defined as separate rules for recognizing tokens and grammar rules can be expressed as a unified set of rules in a Pest (or any other PEG) -- where there's a distinction drawn in one viewpoint, it can be unified in Pest. But one could also express it all in Pest and have in their mind the idea that some of the rules describe "tokens" and others are "grammar," without that distinction having any concrete form within Pest or the grammar file.
If one wanted to write "just a tokenizer" in Pest, one could write all the rules describing tokens, and then create a rule called "token = _{...}" that is set to match any of the other token rules by listing them all out (tedious and error-prone though it would be to keep all the rules in the file and the list of sub-rules in this one rule in sync), and then have as the root rule they match on a rule like "tokens = { SOI ~ token* ~ EOI }". I haven't tried it, but it seems like this should work -- happy to get shot down here if I'm wrong. The issue at this point is just that Pest has no way to write more rules against this stream of tokens, so you are on your own for parsing.

I do like the simplicity and unity of Pest's grammar. It's worth fighting to keep. But I do wonder if it actually would remain more so if Pest were to yield to the wind here and pursue solutions that increase the language's power to naturally express real-world grammars without resorting to special casing and implicit rules and quasi-imperative constructs in the grammar.

As I said, I don't understand Pest's internals or the limitations of PEG parsing strategies, but what would be ideal for me, is something like the following. I'd like to be able to ask Pest to parse an input into a "stream" (whatever concrete form that takes) of Rules (to copy the term from the current API), relative to a root Rule, as happens now. Then I'd like to be able to take that stream of Rules, and pass it back into Pest and ask it to parse it again against another root Rule in the same Pest file (or potentially another Pest file, but that'd be a bonus for me -- it seems harder, since the rules reference each other). Possibly I would have modified or filtered that stream of Rules before passing it back. And ideally I could do this any number of times, not just "lexer" and "parser."

Extra bonus for me would be to be able to designate certain Rules in the grammar file as belonging to different "Rule Groups," so that I wouldn't have to list out scores of sub-rules and constantly keep them in sync to create a "token" rule as I described above. Extra extra bonus would be the ability to mark some rules as "trivia" rules so that they automatically get filtered out of the Rule stream during parsing, which I believe would make a lot of cases that would require massaging of the stream in Rust unnecessary.
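
The multi-phase idea, including the trivia filtering, can be sketched by hand in plain Rust. Nothing here is a Pest API: the `Token` type, the tiny assignment grammar, and all function names are assumptions made up for the illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String),
    Equals,
    Number(u32),
    Whitespace,
}

/// Phase 1: turn characters into a flat token stream.
fn lex(input: &str) -> Option<Vec<Token>> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = i;
        let c = chars[i];
        if c.is_whitespace() {
            while i < chars.len() && chars[i].is_whitespace() { i += 1; }
            tokens.push(Token::Whitespace);
        } else if c.is_ascii_digit() {
            while i < chars.len() && chars[i].is_ascii_digit() { i += 1; }
            let text: String = chars[start..i].iter().collect();
            tokens.push(Token::Number(text.parse().ok()?));
        } else if c.is_ascii_alphabetic() {
            while i < chars.len() && chars[i].is_ascii_alphanumeric() { i += 1; }
            tokens.push(Token::Ident(chars[start..i].iter().collect()));
        } else if c == '=' {
            i += 1;
            tokens.push(Token::Equals);
        } else {
            return None; // unknown character
        }
    }
    Some(tokens)
}

/// Phase 2: match `Ident Equals Number` against the filtered token stream.
fn parse_assignment(tokens: &[Token]) -> Option<(String, u32)> {
    match tokens {
        [Token::Ident(name), Token::Equals, Token::Number(n)] => Some((name.clone(), *n)),
        _ => None,
    }
}

/// Runs both phases, dropping trivia tokens in between.
fn parse(input: &str) -> Option<(String, u32)> {
    let tokens: Vec<Token> = lex(input)?
        .into_iter()
        .filter(|t| *t != Token::Whitespace)
        .collect();
    parse_assignment(&tokens)
}
```

The second phase never sees whitespace, so its rule needs no trivia handling at all. That is the property the "trivia rules" bonus is after.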

That's it. With or without the "bonus" stuff, if Pest could work this way, I believe it would enable in a natural and still fairly unified way a number of cases where users are currently having some friction with the language and system, and enable more flexibility, and obviate some of the changes being proposed for pest3 to fix these issues. And this would keep the unified, top-to-bottom PEG grammar structure in the pest file itself, since it just enables multi-phased parsing driven from the Rust side.

Beyond the question of whether users in general would find this to be a desirable way for Pest to work, there exists the further and thornier question of whether Pest could work this way. This is where I fear my ignorance; I have my suspicion that users would like it if Pest worked like this, but as I alluded at the start, I simply have no idea if Pest could possibly work like this. I believe Pest has a highly optimized VM component that grammar files are compiled to, and this would surely represent a need for significant changes to how such a VM worked, as well as the Pest API on the Rust side. That may well be just too much work to try to implement, especially absent more clarity that a change like this could bring a lot of value with respect to other issues that Pest is working to improve on.

However, that is my conceptual, uninformed, and speculative case for the idea that Pest does not fundamentally have to make a choice between a unified language and grammar model and finer-grained multi-phase parsing.

tomtau · 2023-07-14T13:23:55Z

tomtau
Jul 14, 2023
Maintainer Author

API Change: typed AST and alternatives to Pairs API

As suggested in #882 #416 #440 #806
The idea is primarily to incorporate something like pest-ast directly into pest. One potential alternative API may be event-based, i.e. instead of producing a particular data structure, the parser will fire events on rules or parts of rules as per its annotation. The event-based API should be more general and one can implement different wrappers over it -- e.g. the old pairs API can be re-constructed from events.

9 replies

elenakrittik Jul 14, 2023

If i want to build an AST as a result, that approach would require me to define some sort of shared variable (main AST node in this case) and then push parsed nodes/rules from events into that shared variable (which can become real hard with nested nodes), right? Or am i missing something?

tomtau Jul 14, 2023
Maintainer Author

I don't have a concrete proposal yet; I imagine it could work perhaps that the derive macro would generate a trait with default empty implementations:

trait ParserEventProcessor<T> {
  //...
  fn on_rule_x_start(&mut self, ...) {}
 // ...
}

One will then have:

pub struct MyTreeBuilder {
 //...
}

impl ParserEventProcessor<MyAst> for MyTreeBuilder {
  // override whatever is needed
}

So that one can run:

  parser.parse(Rule::x, &mut my_tree_builder, &input)

or potentially:

  parser.parse(Rule::x, my_tree_builder, &input)

if there's a single finalize(mut self) -> T that will be called at the end by the parser to return a constructed output (if any).
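
A slightly fuller, self-contained version of that idea, still only a sketch: the event names, the `NumberCollector` builder, and the hand-written `parse_list` (standing in for generated parser code for `"[" ~ number ~ ("," ~ number)* ~ "]"`) are all hypothetical.

```rust
/// Events fired by the parser; every handler defaults to doing nothing.
trait ParserEventProcessor<T> {
    fn on_list_start(&mut self) {}
    fn on_number(&mut self, _text: &str) {}
    fn on_list_end(&mut self) {}
    /// Called once at the end to hand back whatever was built.
    fn finalize(self) -> T;
}

/// Collects the numbers it is told about and ignores the other events.
#[derive(Default)]
struct NumberCollector {
    numbers: Vec<i64>,
}

impl ParserEventProcessor<Vec<i64>> for NumberCollector {
    fn on_number(&mut self, text: &str) {
        // The stand-in parser only fires this event for digit runs.
        if let Ok(n) = text.parse::<i64>() {
            self.numbers.push(n);
        }
    }

    fn finalize(self) -> Vec<i64> {
        self.numbers
    }
}

/// Stand-in for generated parser code: fires events while matching.
fn parse_list<T, P: ParserEventProcessor<T>>(mut processor: P, input: &str) -> Option<T> {
    let inner = input.strip_prefix('[')?.strip_suffix(']')?;
    processor.on_list_start();
    for item in inner.split(',') {
        let item = item.trim();
        if item.is_empty() || !item.chars().all(|c| c.is_ascii_digit()) {
            return None;
        }
        processor.on_number(item);
    }
    processor.on_list_end();
    Some(processor.finalize())
}
```

Passing the processor by value, as in the second call form, is what lets `finalize(self)` move the built output out at the end.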

marcfir Jul 14, 2023

quick-xml can be a good example. They use an event-based approach for XML parsing. They also combined it with Serde for AST generation, which shows the flexibility.

elenakrittik Jul 15, 2023

Apparently i'm pretty familiar with this approach, i just didn't know it is named so. In the past i developed a Godot add-on for advanced manipulation on XML data and even had to poke into Godot's own implementation of XML reader and - as i now understood - it also uses an event-based approach. I wouldn't say i like it more than ASTs (main thing my add-on did was converting from that unusable (from my perspective) event-based readers to a full XML AST), but it's not that bad either (and i think it's even better than ASTs when you need to modify parsed input ad-hoc). +1 for this proposal then.

tomtau Dec 18, 2023
Maintainer Author

Also this effort could potentially be incorporated: pest-parser/ast#29

tomtau · 2023-07-14T13:25:44Z

tomtau
Jul 14, 2023
Maintainer Author

API Change: crate restructuring or not depending on pest_derive

pest could be like serde, i.e. there would be no need to explicitly declare a dependency on pest_derive and one could just write pest = { version = "…", features = ["derive"] }, but that's currently not possible due to the interdependent crate structure and bootstrapping… Maybe there's a way to move the "pest" crate contents into a separate crate which would be interdependent with the other crates, and make "pest" a sort of meta-crate that re-exports the new crate and "pest_derive"?

Alternatively, as suggested in #333 (comment)

Maybe a CLI tool to expand the .pest into a nicely formatted rust file, with a watch mode when developing instead of a derive? This would mean committing a generated file in git but in exchange we get faster compilation + easy to debug.

0 replies

tomtau · 2023-07-14T13:32:51Z

tomtau
Jul 14, 2023
Maintainer Author

API Change: streaming input

As suggested in #370 #153
To directly parse streams from stdin or sockets.

1 reply

azriel91 May 14, 2024

I'm super keen for something like this, for reading a large file into memory, and optionally discarding parts of the input -- i.e. it's not only streaming the input, but streaming out the parsed parts so the consumer can choose to filter things. I'm not sure how realistic / ergonomic one can make the API for that though.

Another "natural" progression of this is to support an async stream, though that may convolute the API too much.

tomtau · 2023-07-14T13:34:28Z

tomtau
Jul 14, 2023
Maintainer Author

API Change: detailed character offsets in Pair (UTF-16 and UTF-32)

As mentioned in #370
Finding these offsets currently requires iterating over the input string with str::char_indices after parsing.
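To make the current workaround concrete, here is a minimal sketch of converting a byte offset into UTF-16 and UTF-32 offsets by walking the input. The `Offsets` struct and `offsets_at` function are hypothetical names used for illustration, not part of pest's API; only standard-library calls are used.

```rust
/// Character-based positions for a byte offset into a UTF-8 string.
/// (Hypothetical helper type, not part of pest.)
#[derive(Debug, PartialEq)]
pub struct Offsets {
    /// Offset in UTF-16 code units (as used by JavaScript strings).
    pub utf16: usize,
    /// Offset in Unicode scalar values (i.e. UTF-32 code units).
    pub utf32: usize,
}

/// Converts a byte offset by iterating over the input.
/// Returns `None` if `byte_offset` is out of range or not on a char boundary.
pub fn offsets_at(input: &str, byte_offset: usize) -> Option<Offsets> {
    if !input.is_char_boundary(byte_offset) {
        return None;
    }
    let mut utf16 = 0;
    let mut utf32 = 0;
    for (idx, ch) in input.char_indices() {
        if idx >= byte_offset {
            break;
        }
        utf16 += ch.len_utf16(); // 1 or 2 code units per char
        utf32 += 1;
    }
    Some(Offsets { utf16, utf32 })
}
```

Each lookup is linear in the input length, which is the cost that tracking these offsets during parsing would avoid.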

0 replies

tomtau · 2023-07-14T13:34:43Z

tomtau
Jul 14, 2023
Maintainer Author

API Change: multiple error return (i.e. parser not early terminating on errors)

As mentioned e.g. in #711
Currently, pest terminates early on the first error, but it may be desirable for it to continue parsing. This will require the generated parser code to be "resilient", which may be enabled by an optional flag on the annotation.

0 replies

tomtau · 2023-07-14T13:35:48Z

tomtau
Jul 14, 2023
Maintainer Author

API Change: pluggable backends

As mentioned in #178
Perhaps the easiest may be to remain in Rust, but allow adding some code on the generated rules that could make it easier for FFI… to allow cxx or diplomat annotations on the generated rules?

0 replies

tomtau · 2023-07-14T13:38:25Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar or API Change: easier precedence specification

As mentioned in #386
Grammar-based precedence specification may be a bit syntax-heavy, but perhaps there are ways to make it more lightweight (e.g. if precedence is implicit from the line order).
One API-based alternative is some extra "glue" annotations on the parser derivation that will automatically generate the Pratt parser for the given grammar rules' precedence.

#[rule]
#[precedence = [
    left(plus, minus),
    left(times, divided),
    right(power)
]]
fn binary_expression(lhs: Expr, op: OP, rhs: Expr) -> Expr { ... }
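For illustration, here is a rough sketch of what such annotations could boil down to: a precedence table (lowest level first, mirroring the attribute above) that drives a small Pratt-style loop. Everything here (`LEVELS`, `binding_power`, `evaluate`, the `Token` type, and the choice to evaluate integers directly instead of building an `Expr`) is an assumption made for the example, not generated pest code.

```rust
use std::convert::TryFrom;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Assoc { Left, Right }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Token { Num(i64), Op(char) }

/// Precedence levels, lowest first: left(plus, minus), left(times, divided), right(power).
const LEVELS: &[(Assoc, &[char])] = &[
    (Assoc::Left, &['+', '-']),
    (Assoc::Left, &['*', '/']),
    (Assoc::Right, &['^']),
];

/// Left and right binding power of an operator, derived from its level.
fn binding_power(op: char) -> Option<(u8, u8)> {
    let level = LEVELS.iter().position(|(_, ops)| ops.contains(&op))?;
    let base = (level as u8) * 2 + 1;
    Some(match LEVELS[level].0 {
        Assoc::Left => (base, base + 1),
        Assoc::Right => (base + 1, base),
    })
}

/// Parses and evaluates operators whose left binding power is at least `min_bp`.
fn parse_expr(tokens: &[Token], pos: &mut usize, min_bp: u8) -> Option<i64> {
    let mut lhs = match tokens.get(*pos)? {
        Token::Num(n) => *n,
        Token::Op(_) => return None,
    };
    *pos += 1;
    while let Some(Token::Op(op)) = tokens.get(*pos).copied() {
        let (left_bp, right_bp) = binding_power(op)?;
        if left_bp < min_bp {
            break; // this operator belongs to an enclosing call
        }
        *pos += 1;
        let rhs = parse_expr(tokens, pos, right_bp)?;
        lhs = match op {
            '+' => lhs.checked_add(rhs)?,
            '-' => lhs.checked_sub(rhs)?,
            '*' => lhs.checked_mul(rhs)?,
            '/' => lhs.checked_div(rhs)?,
            '^' => lhs.checked_pow(u32::try_from(rhs).ok()?)?,
            _ => return None,
        };
    }
    Some(lhs)
}

/// Evaluates a whole token sequence; `None` on malformed input or overflow.
fn evaluate(tokens: &[Token]) -> Option<i64> {
    let mut pos = 0;
    let value = parse_expr(tokens, &mut pos, 0)?;
    (pos == tokens.len()).then_some(value)
}
```

With this table, `10 - 4 - 3` groups as `(10 - 4) - 3` and `2 ^ 3 ^ 2` groups as `2 ^ (3 ^ 2)`, so associativity follows from the table alone.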

0 replies

tomtau · 2023-07-14T13:38:48Z

tomtau
Jul 14, 2023
Maintainer Author

API Change: better debugging support

One way may be to have a DebugParser annotation that inserts the needed code into the generated parser, which can overcome some of the current limitations of pest_debugger (e.g. inspection of the stack and resuming a different rule from the breakpoint).
That, however, does not work in the WebAssembly context, so perhaps there is a way to have a different parser API (using async?) that is more amenable in the debugging context even in WASM.

0 replies

tomtau · 2023-07-14T13:39:05Z

tomtau
Jul 14, 2023
Maintainer Author

Experimental: egraph-based optimizer

@dragostis mentioned that in simpler grammars, it may be possible to use egraphs to capture the rewriting rules using egg and ruler.

1 reply

QuarticCat Jul 15, 2023

I'm interested in this technique as well. IIRC, the original egg paper only optimized the rebuilding process, the e-matching process is still quite inefficient, not to mention that the idea of equality saturation itself is not performant. Of course, they made some optimizations to e-matching compared to the e-matching paper it is based on. But I think the complexity remains the same. If you decide to use it, you should care about the performance. Besides, there are some alternatives that might help:

egglog - egg team used some ideas from databases to optimize e-matching and implemented it here.
aegraphs - cranelift used a limited version of e-graphs in exchange for performance. It might not suit your case.

tomtau · 2023-07-14T13:40:59Z

tomtau
Jul 14, 2023
Maintainer Author

Experimental: better correctness and performance

I am a fan of this early effort paguroidea by @SchrodingerZhu @QuarticCat @CyanPineapple for a few reasons, mainly:

it’s based on this idea of type-driven parsing: https://www.cl.cam.ac.uk/~nk480/parsing.pdf
Currently, pest has this ad-hoc validator code that tries to capture different edge cases of non-terminating or non-progressing grammars, but it’s a rabbit hole: On the hunt of grammars making the parser run indefinetly #851 while that typed, algebraic approach provides a more principled way of doing static checking of grammars for similar issues (and ambiguity since it’s in the context of CFGs).
I was looking if there was something similar done for PEGs, and came across this https://dl.acm.org/doi/abs/10.1145/3555776.3577620 / https://github.com/lives-group/typed-peg … which seems pretty promising, but pest's grammar isn't a pure PEG.
It uses this deterministic GNF parser idea my former colleague @xnning worked on: https://arxiv.org/abs/2304.05276v2
which is quite simple and performant.
A side-note: @Mubelotix made this great effort of optimizing pest performance: https://github.com/Mubelotix/faster-pest

Finally, pest3's grammar does not need to remain PEG-based, and it may be worth exploring whether ideas used in paguroidea can be applied in pest.

8 replies

SchrodingerZhu Jul 16, 2023

Well, I guess my answer is yes and no. 😸 PEG may usually be used in lexer-free parsing scenarios as there is normally no division between parser and lexer in recursive descent/packrat algorithms. However, the input of PEG can actually be token streams instead of bytes/chars. (see an example from chumsky)

QuarticCat Jul 19, 2023

@tomtau @dragostis There's a tricky problem with typed AST designs. In both pest and paguroidea, left recursions are not allowed. That means to parse a comma-separated list, we have to write it as something like value ~ ("," ~ value)*. It's very common that in this case users want a Vec<Value>. However, we have no idea how to let users describe this demand. And

lalrpop doesn't have this problem because it allows value_list = value | value_list ~ "," ~ value. If we mimic the lalrpop style, we have to return a Value and a Vec<Value> (or any T: FromIterator<Item = Value> or something similar) and let users prepend the first value by themselves, which is expensive. Or we can advise users to use VecDeque instead. I don't like that.
nom doesn't have this problem because it provides a primitive for this pattern. And no matter what, users can take over control of parsing at any time.
PEGTL doesn't have this problem because you can simply hook the value event and build the vec in order. But building a typed AST in PEGTL style is hard on its own.

EDIT: In PEG we can actually do (value ~ ",")* ~ value, the append style. But it incurs extra backtracking. Depending on the definition of value, it can be super expensive unless you have implemented memoization (see packrat parsing).
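As a point of reference for the `value ~ ("," ~ value)*` shape discussed above, here is a hand-written sketch (not pest-generated code) in which the first value and the repeated values are pushed into the same `Vec` as they are matched, so no prepending is needed. `parse_list` and `parse_value` are hypothetical names, and `value` is assumed to be an unsigned integer to keep the example small.

```rust
/// Parses `value ~ ("," ~ value)*`, pushing each element as it is matched
/// so the resulting vector is already in source order.
fn parse_list(input: &str) -> Option<Vec<u32>> {
    let mut values = Vec::new();

    // value
    let (first, mut rest) = parse_value(input)?;
    values.push(first);

    // ("," ~ value)*
    while let Some(after_comma) = rest.strip_prefix(',') {
        // Simplification: a comma without a following value fails the whole list.
        let (next, tail) = parse_value(after_comma)?;
        values.push(next);
        rest = tail;
    }

    // Require that the whole input was consumed.
    rest.is_empty().then_some(values)
}

/// Parses one unsigned integer and returns it with the remaining input.
fn parse_value(input: &str) -> Option<(u32, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    let (digits, rest) = input.split_at(end);
    digits.parse().ok().map(|value| (value, rest))
}
```

The open question from the comment remains how a grammar author would tell a generator that the two `value` occurrences should land in one collection; the sketch only shows that the target code itself is cheap once that intent is known.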

tomtau Jul 19, 2023
Maintainer Author

I recall @dragostis mentioned to me on the call that he hit some issues with a typed AST when he was attempting it (and there may be more tricky bits, like with type aliasing), so he suggested the pragmatic approach may be just a partially typed AST for some common types (like Option)... but this Vec stuff is unfortunate.

Tartasprint Oct 13, 2023

What are the goals of typed ASTs? Is it to check that a grammar is complete/sound? Is it to provide a nice API to the user?
What are the goals of implementing typed ASTs (or some remixed flavor) in pest?

tomtau Oct 13, 2023
Maintainer Author

I'd say more of the latter, i.e. that it'd be possible to go directly to a typed AST from parsing instead of via a string-y syntax tree. So it'd mean less boilerplate or being less error-prone to end users.

As for the implementation goals, I'm not sure, but most likely to make it pragmatic, i.e. see where it can be pushed through without creating too much complexity (in the pest implementation and in the pest grammar or derive annotations)... so it may not need to work for any typed AST, but for a subset that's composed of common Rust std types.

NyalephTheCat · 2023-07-14T14:32:01Z

NyalephTheCat
Jul 14, 2023
Sponsor

Grammar Change: Parametrized rules

Like I explained in #886 it would be really nice to be able to parametrize some rules. For example,

IdentifierReference[Yield, Await] = { Identifier ~ [~Yield, "yield"] ~ [~Await, "await"] }

Could expand to:

IdentifierReference = { Identifier }
IdentifierReference_Yield = { Identifier | "yield" }
IdentifierReference_Await = { Identifier | "await" }
IdentifierReference_Yield_Await = { Identifier | "yield" | "await" }

And could be used like SomeRule = { SOI ~ IdentifierReference[Yield] ~ EOI }
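The expansion above can be sketched as a small function that enumerates every subset of the parameters and emits one concrete rule per subset. This is only an illustration of the naming and expansion scheme; `expand` and its argument layout are hypothetical, and the alternatives are passed in as opaque strings.

```rust
/// Expands a parametrized rule into one concrete rule per parameter subset.
/// `guarded` pairs each parameter name with the alternative it enables.
fn expand(name: &str, base: &str, guarded: &[(&str, &str)]) -> Vec<String> {
    let mut rules = Vec::new();
    // Each bit of `mask` says whether the corresponding parameter is set.
    for mask in 0..(1usize << guarded.len()) {
        let mut rule_name = name.to_string();
        let mut alternatives = vec![base.to_string()];
        for (i, (param, alternative)) in guarded.iter().enumerate() {
            if mask & (1 << i) != 0 {
                rule_name.push('_');
                rule_name.push_str(param);
                alternatives.push(alternative.to_string());
            }
        }
        rules.push(format!("{} = {{ {} }}", rule_name, alternatives.join(" | ")));
    }
    rules
}
```

Calling `expand("IdentifierReference", "Identifier", &[("Yield", "\"yield\""), ("Await", "\"await\"")])` yields four rules in the same order as the listing above. Note that the number of generated rules doubles with each parameter, which may matter for grammars with many of them.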

0 replies

tomtau · 2023-07-14T14:38:05Z

tomtau
Jul 14, 2023
Maintainer Author

Grammar Change: `+=` operator to pest grammar

from #836

Example

base.pest

base_type_spec = { integer_type }

string.pest

base_type_spec += { string_type }

any.pest

base_type_spec += { any_type }

parser.rs

pub mod parser_with_string {
#[derive(Parser)]
#[grammar = "base.pest"]
#[grammar = "string.pest"]
pub struct MyParser;
}

pub mod parser_with_any {
#[derive(Parser)]
#[grammar = "base.pest"]
#[grammar = "any.pest"]
pub struct MyParser;
}

2 replies

NyalephTheCat Jul 14, 2023
Sponsor

That's something I'm really looking for... There's this language that I'm implementing a grammar for, and I'd like to extend it with my own grammar to add support for extra stuff like a switch statement.

0nyr Jul 17, 2023

This would be really awesome ✨

Improved code manageability, progressive implementation and debugging and code re-usability.

elenakrittik · 2023-07-14T17:42:19Z

elenakrittik
Jul 14, 2023

Grammar Change: Grammar and Rule Attributes

Syntax can be the same as Rust's attributes (except the leading # is replaced by some other character for obvious reasons), I guess? Though unlike in Rust, attributes here will be used for metadata exclusively. Instead of requiring code changes to both the actual implementation and pest_derive (or whatever will replace it in pest3) each time we want to add a new feature, attributes will not have any fixed list of allowed values/keywords, and feature implementations will just consume those values/keywords they recognise.

As an example, several features from this discussion can be changed to use pest attributes:

Custom hooks:

WHITESPACE   =  _{ " " | "\t" | NEWLINE }
@[hooked]
int          =  @{ (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT+ | ASCII_DIGIT) }
ints         =  { SOI ~ __HOOK_INT* ~ EOI }

#[derive(Parser)]
#[grammar = "../examples/hook.pest"]
#[custom_state(crate::parser::CustomState)]
pub struct Parser;

pub struct CustomState {
    pub max_int_visited: usize,
}

impl Parser {
    fn hook_int<'a>(state: &mut CustomState, span: Span<'a>) -> bool {
        let val: usize = span.as_str().parse().unwrap();
        if val >= state.max_int_visited {
            state.max_int_visited = val;
            true
        } else {
            false
        }
    }
}

Grammar versioning:

@![version = "3.0"]

# ...

Better debugging support can be complemented by attributes:

@[breakpoint]
myrule = { ... }

I am no expert on the topic, but perhaps Easier precedence climbing can be implemented using attributes too.

1 reply

marcfir Jul 14, 2023

I think this is a good idea. This helps to collect more information in a single place. Especially when processes need custom information after parsing. Language workbenches are an example of this.

oovm · 2023-10-13T13:07:17Z

oovm
Oct 13, 2023

Tag V1(Node Tag)

Tag is an important feature leading to strongly typed AST. It is divided into the following three steps:

node tag
branch tag
group tag

Why do we need node tag?

Node tags indicate which elements should be captured in the AST.

More importantly, they distinguish elements that have the same type (such as identifiers) but play different roles.

Tag V2(Branch Tag)

Now that the first phase has been merged, the next two phases can continue to evolve after the first phase is stable.

Why do we need branch tag?

A strongly typed AST must conform to the ADT model, so we need to distinguish whether it is a struct or an enum.

For example: #832 (comment)
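To illustrate the struct/enum distinction, here is a sketch of the kind of Rust types a tag-driven derivation might target. The type names, field names and the `describe` helper are made up for this example and do not come from pest or from the linked comment.

```rust
/// Choice rule (branch tags): each alternative becomes an enum variant.
#[derive(Debug, PartialEq)]
enum Expr {
    Number(i64),
    Variable(String),
}

/// Sequence rule (node tags): each tagged element becomes a struct field.
/// Both sides could be identifiers in the grammar; the tags tell them apart.
#[derive(Debug, PartialEq)]
struct Assignment {
    target: String,
    value: Expr,
}

/// Consumers match on the enum instead of comparing rule names at runtime.
fn describe(assignment: &Assignment) -> String {
    match &assignment.value {
        Expr::Number(n) => format!("{} = number {}", assignment.target, n),
        Expr::Variable(v) => format!("{} = variable {}", assignment.target, v),
    }
}
```

The `match` is checked for exhaustiveness by the compiler, which is one of the practical benefits of going to an ADT instead of a string-keyed tree.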

Tag V3(Group Tag)

This step is optional.

The main appeal of this feature is that if there are a lot of elements, it is very cumbersome to mark which ones are nodes (semantic elements, such as ID) and which ones are leaves (link elements, such as `"`, `,` and `;`).

Mark all elements at once with a token #(...), and support allocation ratio, ## can cancel.

eg:

#(string_prefix #(string_start #(string_inner) string_end) string_suffix)

Automatically capture all odd-numbered layer elements, that is, [string_prefix, string_inner, string_suffix], the rule name is the tag name

Once these features are implemented, strongly typed ASTs (example) can be automatically derived and parsed based on tags (example).

0 replies

oovm · 2023-10-13T13:13:49Z

oovm
Oct 13, 2023

Custom parser

Supports embedding any parsing function, as long as it conforms to State -> Result<State, State>.

For example, if you want to match a URL, but it is from a third-party library, you can wrap a parser and embed it.

Custom inspector

Supports checking a successfully matched rule, that is, deciding whether to convert Ok(State) -> Err(State).

For example, you may want to match the string first, then scan it a second time to see if there is any illegal escape, and reject it if there is any.
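A minimal sketch of both ideas, assuming a deliberately simplified `State` (input plus byte position) instead of pest's real parser state. `quoted_string`, `inspect` and `escapes_are_legal` are hypothetical names; the set of legal escapes is also just an assumption for the example.

```rust
/// Simplified stand-in for a parser state: the input and a byte position.
#[derive(Debug, PartialEq)]
struct State<'i> {
    input: &'i str,
    pos: usize,
}

/// The shape from the proposal: `State -> Result<State, State>`.
type ParseResult<'i> = Result<State<'i>, State<'i>>;

/// Custom parser: matches a double-quoted string with no embedded quotes.
fn quoted_string<'i>(state: State<'i>) -> ParseResult<'i> {
    let input = state.input;
    if let Some(body) = input[state.pos..].strip_prefix('"') {
        if let Some(len) = body.find('"') {
            // Consume the opening quote, the body and the closing quote.
            return Ok(State { input, pos: state.pos + len + 2 });
        }
    }
    Err(state)
}

/// Custom inspector: runs `parser`, then turns success into failure
/// if `accept` rejects the matched text.
fn inspect<'i>(
    state: State<'i>,
    parser: fn(State<'i>) -> ParseResult<'i>,
    accept: fn(&str) -> bool,
) -> ParseResult<'i> {
    let start = state.pos;
    let matched = parser(state)?;
    if accept(&matched.input[start..matched.pos]) {
        Ok(matched)
    } else {
        // Rewind so the caller can try something else.
        Err(State { input: matched.input, pos: start })
    }
}

/// Second scan over the matched text: only `\n`, `\t` and `\\` are allowed.
fn escapes_are_legal(text: &str) -> bool {
    let mut chars = text.chars();
    while let Some(c) = chars.next() {
        if c == '\\' && !matches!(chars.next(), Some('n' | 't' | '\\')) {
            return false;
        }
    }
    true
}
```

Because the inspector has the same `State -> Result<State, State>` shape as the parser it wraps, the two compose like any other rule.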

0 replies

oovm · 2023-10-13T13:51:12Z

oovm
Oct 13, 2023

Parser parameters

Unlike parameterized rules, parser parameters are used to pass static (&ctx, static during parsing) environment variables.

Very useful when you are writing a template engine, because you need to customize the slot starting symbol.
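As a rough illustration of the template-engine case, the sketch below passes a read-only context holding the slot delimiters into a hand-written scanner. `Ctx`, `slot_names` and the delimiter fields are hypothetical; a real pest3 API for this could look quite different.

```rust
/// Static, read-only context handed to the parser for one parse run.
struct Ctx<'a> {
    slot_open: &'a str,
    slot_close: &'a str,
}

/// Returns the trimmed names of all slots in `template`,
/// using the delimiters configured in `ctx`.
fn slot_names<'t>(ctx: &Ctx, template: &'t str) -> Vec<&'t str> {
    let mut names = Vec::new();
    // Empty delimiters would match everywhere, so treat them as "no slots".
    if ctx.slot_open.is_empty() || ctx.slot_close.is_empty() {
        return names;
    }
    let mut rest = template;
    while let Some(start) = rest.find(ctx.slot_open) {
        let after_open = &rest[start + ctx.slot_open.len()..];
        match after_open.find(ctx.slot_close) {
            Some(end) => {
                names.push(after_open[..end].trim());
                rest = &after_open[end + ctx.slot_close.len()..];
            }
            None => break, // unterminated slot: stop scanning
        }
    }
    names
}
```

The same template text produces different results depending on the context, which is the behaviour that cannot be expressed when the delimiter is a literal baked into the grammar.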

2 replies

tomtau Jan 24, 2024
Maintainer Author

Is this the same as #885 (comment) or something else?

davawen Jan 24, 2024

No, I think this is to parametrize the whole parser from rust.
The example given is of a token character that you'd need to change from the outside.

oovm · 2023-10-13T13:56:49Z

oovm
Oct 13, 2023

Trap rules

Add a keyword. When this rule is encountered, the remaining matches will no longer be tried and failure will be returned directly, allowing customization of failure content.

Maybe you need to distinguish whether this is a branch failure or a global failure.

It is also possible to provide a recovery variable, directly inserting the given EndToken such as )

1 reply

viridia Oct 15, 2023

This feature exists in some other parsers such as winnow and is very useful: https://docs.rs/winnow/latest/winnow/combinator/fn.cut_err.html

The term used for this in Winnow is "cut_err", the term "cut" apparently comes from Prolog.

The "cut" feature means "continue with the current rule, but don't try any alternative rules if this one fails." So it's not automatically an error, it just means that you want to commit to the current rule and disable backtracking.
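The difference between an ordinary failure and a "cut" failure can be sketched with a hand-rolled ordered choice. This is not winnow's API; the `Failure` enum, `choice`, `parenthesized` and `identifier` are names invented for the example.

```rust
#[derive(Debug, PartialEq)]
enum Failure {
    /// Ordinary failure: the caller may try the next alternative.
    Backtrack(String),
    /// Committed failure: no further alternatives are tried.
    Cut(String),
}

/// A parser returns the remaining input on success.
type Parser = fn(&str) -> Result<&str, Failure>;

/// Ordered choice that stops at the first success or the first `Cut`.
fn choice<'i>(input: &'i str, alternatives: &[Parser]) -> Result<&'i str, Failure> {
    let mut last = Failure::Backtrack("no alternatives".to_string());
    for parse in alternatives {
        match parse(input) {
            Ok(rest) => return Ok(rest),
            Err(Failure::Cut(msg)) => return Err(Failure::Cut(msg)),
            Err(backtrack) => last = backtrack,
        }
    }
    Err(last)
}

/// `"(" ~ cut ~ digits ~ ")"`: once `(` has matched, a failure is final.
fn parenthesized(input: &str) -> Result<&str, Failure> {
    let rest = input
        .strip_prefix('(')
        .ok_or_else(|| Failure::Backtrack("expected `(`".to_string()))?;
    let rest = rest.trim_start_matches(|c: char| c.is_ascii_digit());
    rest.strip_prefix(')')
        .ok_or_else(|| Failure::Cut("expected `)`".to_string()))
}

/// Fallback alternative: one or more ASCII letters.
fn identifier(input: &str) -> Result<&str, Failure> {
    let rest = input.trim_start_matches(|c: char| c.is_ascii_alphabetic());
    if rest.len() < input.len() {
        Ok(rest)
    } else {
        Err(Failure::Backtrack("expected identifier".to_string()))
    }
}
```

For the input `(12`, the choice reports "expected `)`" from the committed rule instead of falling through to the less helpful "expected identifier", which is the error-reporting benefit of cutting.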

tomtau · 2024-01-24T06:28:16Z

tomtau
Jan 24, 2024
Maintainer Author

Experimental: Incremental parsing and other goodies in Ohm

I was looking at Ohm, a PEG-based parser generator in the JS/TS ecosystem, and it's got some interesting ideas under the hood:

left-recursion which is considered here: Restoration of the pest3 work effort 🙌 #885 (comment)
Incremental Packrat Parsing: probably not a big deal, but potentially a nice thing for IDEs
Modular Semantic Actions: something akin to object algebras https://nextjournal.com/dubroy/ohm-parsing-made-easy

A couple of things are worth noting about the relationship between Ohm and object algebras. First, when the programmer invokes an operation on a match result, Ohm does not eagerly evaluate that operation on its sub-expressions. Which sub-expressions are evaluated, and the order in which they are evaluated, is determined only by the semantic actions that implement the operation. This is a useful form of modularity that the programmer does not get “for free” when using object algebras in an eager language like JavaScript. Second, Ohm’s operations can be mutually recursive, i.e., it is possible for an operation op1 to use another operation op2 in its implementation, and vice-versa. Support for mutually-recursive operations comes from our semantics abstraction, which does not have an analog in object algebras.

It currently doesn't support context-sensitive parsing but there's this issue to track it: Allow for context-sensitive parsing ohmjs/ohm#158 and SPEG looks like an interesting approach.

(A side-note, its online editor is pretty cool: https://ohmjs.org/editor/ )

0 replies

Replies: 30 comments · 51 replies
