On the hunt for grammars that make the parser run indefinitely #851

Tartasprint · 2023-04-28T22:01:59Z

There are grammars that may cause even a "bug-free" parser to run indefinitely such as:

for_ever_1=@{""*}
for_ever_2=@{SOI*}
for_ever_3=@{for_ever_3}
WHITESPACE = {for_ever_4}
for_ever_4=!{"" ~ "GRAPHICS"}
always_failing_5 = @{ANY~SOI}
for_ever_5 = @{( !(always_failing_5) )*}
for_ever_6=@{(PUSH("") ~ POP)*}
for_ever_7=@{POPALL ~ (POPALL)*}
for_ever_8=@{POPALL ~ (PEEKALL)*}
maybe_for_ever_9=@{(POPALL)*}
maybe_for_ever_10=@{(PEEKALL)*}

Some others can be found in #848, but that is not important for now.
I don't think anyone wants the parser to hang; people may work around it, but nobody wishes for it. In this discussion I'm interested in preventing the parser from hanging indefinitely. Doing so may also mean making progress on #685, although that issue is more concerned with improving performance than with the desperate cases I talk about here.

As of now

Currently cases 1-3 are covered, but cases 4-10 are not. The validation of this kind of problem is handled by pest_meta::validator::validate_ast. In that function, the steps we are interested in are validate_repetitions, validate_left_recursion and validate_whitespace_comment. The first checks whether there is an expression that makes no progress and can be repeated indefinitely by a repetition operator (* alias Rep, + alias RepOnce, {n,} alias RepMin). The second checks for another source of infinite repetition, left recursion. The third is similar to the first, applied to WHITESPACE and COMMENT, since they are implicitly *-repeated by the sequence operator ~. It currently doesn't check for possible left recursion hidden by ~, as example 4 shows.

Potential solutions

I have a few propositions to prevent this from happening:

[OLD_STATIC] Try to change and fix the current functions without changing their structure (adding/removing functions/data structures, changing their signature/members, etc...)
[NEW_STATIC] Rewrite the validation completely, without changing validate_ast, which is the only public function. I detail a possible rewrite later.
[DYN_STOP] Add a dynamic validation step in the parser to stop it if it does not make any progress
[DYN_IGNORE] Add a dynamic validation step in the parser to step over an expression that does not make any progress

Here is a brief comparison of the propositions:

Proposition	Breaks behavior	Breaks API	Validation type	Fixes	May fix	Performant
OLD_STATIC	no	no	static	1-4	5	no, redundant checks
NEW_STATIC	no	no but may	static	1-6	7-10	yes
DYN_STOP	no	probably	dynamic	1-10	and more	dynamic, still performant
DYN_IGNORE	yes	probably	dynamic	1-10	and more	dynamic, less performant

§1 Fixing the current validation steps

§1.0 The OLD_STATIC option is perhaps the easiest, since most of the work is already done. Example 4 can probably be fixed (provided the check for left recursion works); example 5 might be fixed, but that involves creating a new function is_always_failing, adding to the current complexity.

§1.1 But there is a structural problem: validate_left_recursion relies on the is_non_failing and is_non_progressing functions, which assume that the grammar is not left-recursive... I might be wrong; this may not be a problem at all and the left recursion validation step might be correct, but it's hard (for me) to prove, or at least to be convinced, that it is correct. If someone has an explanation for the correctness of that step, I will welcome it. But beware, it might cause headaches 😛.

§1.3 Even if that step turns out to be correct, the validation is inefficient: the functions is_non_failing and is_non_progressing are called at least once for every infinite repetition operator, they compute the values recursively for every sub-expression/sub-rule, and they do not store intermediate results. As an example, checking the expression {((very_complex)*~ANY)*~(very_complex)*} requires computing both is_non_progressing and is_non_failing of very_complex 3 times.
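
To make the cost of recomputation concrete, here is a minimal sketch. It is not pest's actual code: `Expr`, `Rules` and `Analyzer` are hypothetical, heavily simplified stand-ins, and the check assumes a non-recursive grammar. It counts how often a rule body is analysed with and without a per-rule cache.

```rust
use std::collections::HashMap;

// Hypothetical, simplified expression tree (not pest_meta's types).
enum Expr {
    Str(&'static str),         // string literal
    Ident(&'static str),       // reference to another rule
    Seq(Box<Expr>, Box<Expr>), // a ~ b
    Rep(Box<Expr>),            // a*
}

type Rules = HashMap<&'static str, Expr>;

struct Analyzer<'g> {
    rules: &'g Rules,
    cache: HashMap<&'static str, bool>,
    use_cache: bool,
    rule_evals: usize, // how many times a rule body was analysed
}

impl<'g> Analyzer<'g> {
    fn new(rules: &'g Rules, use_cache: bool) -> Self {
        Analyzer { rules, cache: HashMap::new(), use_cache, rule_evals: 0 }
    }

    // Simplified check: true if `e` can succeed without consuming input.
    // Assumes the grammar is not recursive, otherwise it would not terminate.
    fn non_progressing(&mut self, e: &Expr) -> bool {
        match e {
            Expr::Str(s) => s.is_empty(),
            Expr::Ident(name) => {
                if self.use_cache {
                    if let Some(&known) = self.cache.get(name) {
                        return known;
                    }
                }
                self.rule_evals += 1;
                let rules = self.rules;
                let result = self.non_progressing(&rules[name]);
                self.cache.insert(*name, result);
                result
            }
            Expr::Seq(a, b) => self.non_progressing(a) && self.non_progressing(b),
            // A repetition can always match zero times.
            Expr::Rep(_) => true,
        }
    }

    // True if the body of some repetition makes no progress.
    fn has_stuck_repetition(&mut self, e: &Expr) -> bool {
        match e {
            Expr::Str(_) | Expr::Ident(_) => false,
            Expr::Seq(a, b) => {
                let left = self.has_stuck_repetition(a);
                let right = self.has_stuck_repetition(b);
                left || right
            }
            Expr::Rep(inner) => {
                let stuck = self.non_progressing(inner);
                let nested = self.has_stuck_repetition(inner);
                stuck || nested
            }
        }
    }
}
```

With a rule very_complex referenced from three repetitions, the uncached analyzer evaluates its body once per repetition, while the cached one evaluates it once in total. Keying the cache by rule name is just one possible choice; §2.3 discusses storing the result on the expression itself.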

§1.4 Then example 6 cannot be fixed without changing the signature of the functions. Let me explain. The rule for_ever_6=@{(PUSH("") ~ POP)*} runs forever because PUSH("") makes no progress, and then POP makes no progress either, because the last item on the stack is "", which never fails. To detect it we would need something similar to §2.1.

§1.5 Now we could decide to forbid empty pushes, but I imagine that may not be desirable. I don't have useful examples of empty pushes in mind, but they might look like PUSH("non_empty" | "" ) or PUSH(("non_empty")*). Maybe even the dumb PUSH("") could be useful to keep track of seen patterns in a hacky way: @{("A"~PUSH(""))+ ~ (POP~"B")+}. Note that this example does not currently work, as the parser does not seem to push empty strings onto the stack. It is also useless, since r=@{"A"~r~"B" | "" } does the same thing. But still, it's fun.

§2 Change the validation step

§2.0 Looking at the DROP operation gives a better understanding of the problem. DROP does not make any progress on the input, but we cannot say that it will cause the parser to hang: once the stack is empty, it will fail. Of course a problem similar to example 6 might arise, but that's not the point. The point is that there are actually three ways of making progress:

on the input. Ex: "non_empty" for positive input progress and "" for zero input progress. This progress is always positive or zero since there is no backtracking.
on the stack. Ex: POP or DROP for positive stack progress, "AZ" or PEEK for zero stack progress, PUSH("AZ") for negative stack progress.
by failing. When an expression fails, the parser moves on to the next possibility. Ex: an expression that never fails (such as "") will not progress this way, while an expression that always fails will always make progress. An expression that fails sometimes may progress (ex: "A") or may not progress (ex: &("A")).

§2.1 I propose the following definition/implementation of progress for the NEW_STATIC option:

enum Progress {
    AlwaysFail,
    // (is input progressing, is stack progressing)
    MayFail(bool, StackProgress),
    // never fails but is not infinite
    NeverFails(StackProgress),
    // never fails and is infinite
    NeverFailsInfinitely,
}

enum StackProgress {
    // PUSH -> Change(1)
    // " " or PEEK -> Change(0)
    // POP|DROP -> Change(-1)
    Change(i32),
    // for things like POPALL
    Cleaning,
}

impl Progress {
    // Helper: returns true only when we are certain the expression progresses.
    pub fn is_strictly_progressing(&self) -> bool {
        match self {
            Progress::AlwaysFail => true,
            Progress::MayFail(input, stack) => {
                if *input {
                    true // we are progressing on input
                } else {
                    match stack {
                        StackProgress::Change(c) => *c > 0, // or maybe *c <= 0
                        StackProgress::Cleaning => false,   // or maybe true
                    }
                }
            }
            // It never fails, but the parser can't get stuck,
            // since it happens a finite number of times
            Progress::NeverFails(_) => true,
            Progress::NeverFailsInfinitely => false,
        }
    }
}

§2.2 The helper function is called strict because an expression like &("A") has a progress of Progress::MayFail(false, StackProgress::Change(0)), for which it returns false, even though the expression might still fail (and failing counts as progress). The choice made here is strict: the function returns true only if we are 100% sure the expression is progressing. To know whether such an expression is actually progressing, we need more context. The helper function could also return something like true|false|maybe to be more precise.
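
As a minimal sketch of that three-valued variant: `Answer` and `is_progressing` are hypothetical names, and the mapping for MayFail without input progress (always "maybe", ignoring the stack component) is an assumption made to keep the example small.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum StackProgress {
    Change(i32),
    Cleaning,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Progress {
    AlwaysFail,
    MayFail(bool, StackProgress),
    NeverFails(StackProgress),
    NeverFailsInfinitely,
}

// Hypothetical three-valued result instead of a plain bool.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Answer {
    Yes,
    No,
    Maybe,
}

impl Progress {
    fn is_progressing(&self) -> Answer {
        match self {
            Progress::AlwaysFail => Answer::Yes,
            // Progress on the input is certain whenever the expression succeeds.
            Progress::MayFail(true, _) => Answer::Yes,
            // Assumption: without input progress we cannot tell statically,
            // whatever the stack component says.
            Progress::MayFail(false, _) => Answer::Maybe,
            Progress::NeverFails(_) => Answer::Yes,
            Progress::NeverFailsInfinitely => Answer::No,
        }
    }
}
```

With this, &("A") would be reported as Maybe rather than as a flat false, and the caller can decide how to treat the uncertain case.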

§2.3 To check progress statically without computing it more than once, I propose attaching a validation member to each expression.
This can be done either by adding a progress member to the current expression structure (NEW_STATIC, choice 1, option A), or by adding an indirection: a validation member that refers to a struct ValidationStatus { progress }, which makes it easy to do the same for other kinds of validation (NEW_STATIC, 1B).
We can either attach this metadata directly to the public pest_meta::parser::ParserExpr structure to let users benefit from this analysis (breaking change) (NEW_STATIC, 2A), or make a similar new structure, ParserExpr+metadata, that could be either public (positively breaking change) (NEW_STATIC, 2B) or private (non-breaking change) (NEW_STATIC, 2C).
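
A minimal sketch of what options 1B and 2C could look like together. Every type here is a hypothetical stand-in (not pest_meta's), and the progress rule is a deliberately crude placeholder; the point is only that the result is stored on the node and computed once.

```rust
// Stand-in for the existing expression type (hypothetical, simplified).
enum Expr {
    Str(&'static str),
    Seq(Box<Expr>, Box<Expr>),
    Rep(Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Progress {
    Progressing,
    NotProgressing,
}

// Option 1B: an indirection that could later hold other analyses as well.
#[derive(Debug, Default)]
struct ValidationStatus {
    progress: Option<Progress>,
}

// Option 2C: a private mirror of the tree that carries the metadata.
struct AnnotatedExpr {
    kind: AnnotatedKind,
    validation: ValidationStatus,
}

enum AnnotatedKind {
    Str(&'static str),
    Seq(Box<AnnotatedExpr>, Box<AnnotatedExpr>),
    Rep(Box<AnnotatedExpr>),
}

fn annotate(e: &Expr) -> AnnotatedExpr {
    let kind = match e {
        Expr::Str(s) => AnnotatedKind::Str(*s),
        Expr::Seq(a, b) => AnnotatedKind::Seq(Box::new(annotate(a)), Box::new(annotate(b))),
        Expr::Rep(inner) => AnnotatedKind::Rep(Box::new(annotate(inner))),
    };
    AnnotatedExpr { kind, validation: ValidationStatus::default() }
}

// Computes the progress of a node once and stores it on the node.
// Placeholder rule: only input progress on success is considered.
fn progress(node: &mut AnnotatedExpr) -> Progress {
    if let Some(known) = node.validation.progress {
        return known;
    }
    let computed = match &mut node.kind {
        AnnotatedKind::Str(s) => {
            if s.is_empty() { Progress::NotProgressing } else { Progress::Progressing }
        }
        AnnotatedKind::Seq(a, b) => {
            let (pa, pb) = (progress(a), progress(b));
            if pa == Progress::Progressing || pb == Progress::Progressing {
                Progress::Progressing
            } else {
                Progress::NotProgressing
            }
        }
        AnnotatedKind::Rep(inner) => {
            progress(inner); // annotate the body too
            Progress::NotProgressing // a repetition may match zero times
        }
    };
    node.validation.progress = Some(computed);
    computed
}
```

Keeping the annotated tree private means the public expression type stays untouched, at the price of one extra tree copy during validation.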

§2.4 Then the validator would walk the parse tree until every expression has been assigned a progress. I have not thought that part through yet, because it's harder to reason about and I don't want to spend time on it if this is not wanted. I do have vague ideas, but propositions are welcome.

§3 Dynamic checking

§3.0 Now I think we can all agree that static, compile-time validation is awesome, both for the certainty that no errors will pop up and for performance. But since I did not think through all the cases, especially the ones involving left recursion 🤕, I'm not sure all cases can be handled at compile time. Therefore the following thoughts only apply if static validation is not sufficient. You may skim them or even skip to §4 (it's near though). I won't go into too much detail anyway. So we could be satisfied with all the cases caught by the static checks and hope that the ones not covered won't happen (DYN choice NoDyn).

§3.1 But this does not seem wise to me, and I propose making dynamic checks available. Available, because I understand that for performance reasons someone might not want them. But still present, at least for the dev/test side.
I see two possibilities. When a dynamic check fails (an infinite loop was detected), we can simply stop the parser and return an error (breaking change in the API, not in the behavior) (DYN choice Stop).

§3.2 But we could also "step over" the infinite loop. As an example, parsing the expression ("A" | "")* ~ "B" with input "AAB" would give:

parsing "A"
parsing "A"
parsing ""
if we repeat again, we will not have progressed, so we quit the repetition and continue
parsing "B"
finished!
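
The trace above, and the Stop alternative from §3.1, can be sketched with a tiny hand-written matcher for ("A" | "")* ~ "B". This is only an illustration with hypothetical names (`OnNoProgress`, `ParseError`, `parse`), not a proposal for pest's parser state API.

```rust
#[derive(Debug, PartialEq)]
enum OnNoProgress {
    Stop,   // report an error when an iteration makes no progress
    Ignore, // leave the repetition and carry on
}

#[derive(Debug, PartialEq)]
enum ParseError {
    NoMatch,
    InfiniteLoop { pos: usize },
}

// Matches `("A" | "")*` starting at `pos`; returns the position after it.
fn repeat_a_or_empty(input: &str, mut pos: usize, policy: &OnNoProgress) -> Result<usize, ParseError> {
    loop {
        // One iteration of `"A" | ""`: the second alternative always succeeds.
        let next = if input[pos..].starts_with('A') { pos + 1 } else { pos };
        if next == pos {
            // The iteration succeeded without consuming anything,
            // so repeating it again would never end.
            return match policy {
                OnNoProgress::Stop => Err(ParseError::InfiniteLoop { pos }),
                OnNoProgress::Ignore => Ok(pos),
            };
        }
        pos = next;
    }
}

// Matches `("A" | "")* ~ "B"`; returns the number of bytes consumed.
fn parse(input: &str, policy: OnNoProgress) -> Result<usize, ParseError> {
    let pos = repeat_a_or_empty(input, 0, &policy)?;
    if input[pos..].starts_with('B') {
        Ok(pos + 1)
    } else {
        Err(ParseError::NoMatch)
    }
}
```

With input "AAB", `parse("AAB", OnNoProgress::Ignore)` consumes all three bytes, while `parse("AAB", OnNoProgress::Stop)` reports the loop at position 2. Note that under Stop this particular grammar is rejected on every input, since the empty alternative is always reached eventually.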

§3.3 This is the (DYN choice Ignore) option. This choice is very infinite-tolerant and may even allow parsing left-recursive rules; I'm not sure though (there might be hope for that according to #533, but I haven't read the papers yet). A major drawback is that it may result in an inefficient parser: the one that stops is inefficient but knows when to stop, while the other one will go on and on, always performing the checks. On the other hand, this is what the stopping one would do most of the time anyway, if most of the time the input doesn't lead to infinite loops.

§4 Important considerations

§4.0 To do all this, we need very precise semantics for every pest operator. But that belongs in another discussion. Examples of behavior that is not well defined (to me at least) are:

Nesting stack operations in predicates
Empty pushes, as noted in §1.5
How PEEK[a..b] works for a<b, a==b, and a>b

I won't start that discussion now because the current one already took me enough time, I don't have all the edge cases at hand right now, and it's not immediately necessary. But it will be necessary if we want to trust any validation implementation.

Recap

The choices I propose are the following:

OLD_STATIC
NEW_STATIC
- Choice 1 (A or B), Choice 2 (A, B or C)
In case static validation is not enough: NoDyn, Stop, or Ignore.

Thank you for reading this far! Please share your comments/thoughts/advice 😄

tomtau · 2023-04-29T01:16:42Z

Maintainer

Thanks for a detailed write-up!

> the validate_left_recursion relies on the is_non_failing and is_non_progressing functions, that make the assumption that the grammar is non-left recursive... I might be wrong, and this problem is maybe not a problem and the left recursion validation step might be correct, but it's hard (for me) to prove or at least be convinced that it is correct. If someone has an explanation for the correctness of that step, I will welcome it. But beware it might cause headaches 😛.

> Then the validator would run through the parse tree until every expression has been assigned a progress. That part I did not think through yet, because it's harder to think about, and that I don't want to spend time on it if this not wanted. I do have vague ideas but propositions are welcome

Maybe @dragostis @CAD97 @pest-parser/triage may have some input for either of those.

> Please give your comments/thoughts/advice 😄 ?

Given OLD_STATIC won't be able to fix the other cases (I haven't thought about them in detail, so I take your word for it), I think NEW_STATIC is probably the way to go. As for the breaking change, I think it may be fine in the sense that ParserExpr is rather internal or rarely used (if ever) by external crates... but that breaking change may "bubble up" / need a lot of refactoring (and potentially require other breaking changes). So my suggestion would be to try to do NEW_STATIC in a non-breaking or the least breaking way.

As for the dynamic checking option, pest already has this optional setting: https://docs.rs/pest/latest/pest/fn.set_call_limit.html which is a bit simplistic, but it should capture indefinite running cases as well (should there be more edge cases than what's potentially covered by NEW_STATIC).
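
To illustrate the general idea behind such a limit (this is not pest's implementation; `CallBudget`, `LimitReached` and `repeat_empty` are hypothetical names), here is a small sketch: every step of the parser spends one unit of a budget, so a loop like ""* ends with an error instead of running forever.

```rust
#[derive(Debug, PartialEq)]
struct LimitReached;

// Hypothetical budget: `None` means unlimited.
struct CallBudget {
    remaining: Option<usize>,
}

impl CallBudget {
    fn new(limit: Option<usize>) -> Self {
        CallBudget { remaining: limit }
    }

    // Spends one unit; fails once the budget is exhausted.
    fn tick(&mut self) -> Result<(), LimitReached> {
        match &mut self.remaining {
            None => Ok(()),
            Some(0) => Err(LimitReached),
            Some(n) => {
                *n -= 1;
                Ok(())
            }
        }
    }
}

// Toy version of `""*`: each iteration succeeds without consuming input,
// so only the budget can end the loop. With an unlimited budget it never returns.
fn repeat_empty(budget: &mut CallBudget) -> Result<(), LimitReached> {
    loop {
        budget.tick()?;
        // Matching "" always succeeds, so we go around again.
    }
}
```

The limitation is visible in the sketch: a budget cannot distinguish a hung grammar from a legitimately long parse, which is why it is coarser than a progress check.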

Tartasprint · 2023-04-30T23:03:20Z

Author

Thank you for your suggestions. I gave implementing NEW_STATIC a try, but I kept finding more edge cases. In the end I realized that I was trying to prove whether the parser would stop, and after some research it turns out that it is proven to be undecidable... It is no wonder this was giving me headaches :P!

> So my suggestion would be to try to do NEW_STATIC in a non-breaking or the least breaking way.

The problem is back at the early planning stage, and I don't know yet whether solving it (actually, mitigating it) will cause breaking changes, but I think it is possible to break very few things.

Here is an attempt to deal with undecidability.

It might be possible to detect most (all?) of the real-world edge cases. I don't think people who write grammars use all of the expressive power given by pest, but that should be checked. As an example, a small (and probably incomplete) search on GitHub tells me that very few people seem to use stack operations nested in predicates, and that when they do it's mostly PEEK.

Doing that I also noticed there are grammars that do PUSHes but never decrease the stack using POP or DROP (examples: 1, 2, 3, and a last one that doesn't even use PEEK!).

But knowing people don't do things won't solve the problem (also because warning them about not doing some of those things is the problem).

Here are some possible approaches:

1. Catch most of the errors, without false warnings (error catcher)
2. Catch most of the correct grammars, without false positives (correct catcher)
3. Make some heuristic that catches the most probable case (heuristic)

I may be wrong, but 3 is the current approach. As an example, when checking whether the negative predicate !(expr) never fails, the question is “does expr always fail?”. Since there are very few¹ expressions that always fail, the current check answers that !(expr) may fail.

Option 3 has the advantage of always giving a clear answer: the grammar is correct or it is not. Options 1 and 2 can sometimes say “you're mistaken” or “you're correct”, but they will say “I don't know” the rest of the time. And “I don't know” can be misleading.

But I think options 1 and 2 should still be considered, because their answer can be trusted.

Using both option 1 and option 2 could result in the following behavior:

If the grammar is proven incorrect by the error catcher, tell the user.
If the grammar is proven correct by the correct catcher, tell the user.
Otherwise the result is unknown, and the correct catcher should provide an explanation of why it couldn't prove the grammar to be correct.

In case a sub-expression correctness is unknown we may bubble it up to its parent expressions. But that would probably result in knowing nothing about the "root" expression. So I propose to use a heuristic like in option 3, but mark the result with uncertainty. A result could then be implemented with something looking like:

enum Knowledge<T> {
    Provably(T),
    Maybe(T), // Heuristic results go here
    Unknown,
}

This allows us both to trust the validator output and to give hints to the user when the result is unknown.
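
As a sketch of how such results could be combined when bubbling up, here is a conjunction over Knowledge<bool> (think "both sides of a sequence never fail"). The combination rules are an assumption of this example, not something settled above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Knowledge<T> {
    Provably(T),
    Maybe(T), // Heuristic results go here
    Unknown,
}

impl Knowledge<bool> {
    // Conjunction of two facts while keeping track of certainty.
    fn and(self, other: Self) -> Self {
        use Knowledge::*;
        match (self, other) {
            // A proven `false` decides the conjunction on its own.
            (Provably(false), _) | (_, Provably(false)) => Provably(false),
            (Provably(true), Provably(true)) => Provably(true),
            // Otherwise, not knowing one side means not knowing the result.
            (Unknown, _) | (_, Unknown) => Unknown,
            // At least one side is heuristic, so the result is heuristic too.
            (Maybe(a), Maybe(b)) | (Maybe(a), Provably(b)) | (Provably(a), Maybe(b)) => {
                Maybe(a && b)
            }
        }
    }
}
```

For instance, `Provably(true).and(Maybe(true))` yields `Maybe(true)`, while `Unknown.and(Provably(false))` is still `Provably(false)`: certainty is only lost where it actually matters for the answer.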

Giving good hints would require either looking at how people use pest, or counting on people to report bad hints. I think doing a bit of both could be good.

What do you think of this?

¹ I did not search to check whether always-failing expressions are encountered in practice, but they seem unlikely to me. Examples of always-failing expressions are SOI (resp. EOI) preceded (resp. followed) by non-empty input-progressing expressions. Another example is given by two predicates &(expr1)~&(expr2) where the languages generated by expr1 and expr2 have an empty intersection (ex: @{&("A") ~ &("B")}).

tomtau May 1, 2023
Maintainer

I'm not sure if it's undecidable, but I guess some of the validator checks may be equivalent to the grammar emptiness problem (which is undecidable with those stack operations).

Anyway, given that, the approaches you've outlined make sense. In general, it's good to follow POLA, so that the improved validator won't spam errors for grammars that people previously used and built without any issues... On the other hand, I'm not sure if there's a good way to give hints (not sure if I recall correctly that emitting warnings from proc macros may be silenced / not visible under the default cargo building to the end user)... or with a limited success, the extra hints can just be accessible in a dedicated pest tooling (IDE plugins, online editor...)?

Tartasprint May 1, 2023
Author

> On the other hand, I'm not sure if there's a good way to give hints (not sure if I recall correctly that emitting warnings from proc macros may be silenced / not visible under the default cargo building to the end user)

It seems to be the case looking at the issue rust-lang/rust#54140. If this issue is fixed in rust, it would be possible to emit warnings.

One proposition is to have the validation step return something like Ok/Error/Warning, so that this information can be used through a single interface by IDE plugins, the online editor and the proc macro, each of which handles it however it wants/can. In the case of proc macros, that could mean emitting errors but not warnings.
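
A rough sketch of that single interface, with hypothetical names throughout (`Severity`, `Diagnostic`, `for_proc_macro`, `for_editor`): the validator would produce a list of diagnostics, and each front end decides what to do with them.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Severity {
    Warning,
    Error,
}

// Hypothetical validator output: one entry per finding.
#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    severity: Severity,
    rule: String,
    message: String,
}

// A proc macro front end: fail on errors, silently drop warnings.
fn for_proc_macro(diags: &[Diagnostic]) -> Result<(), Vec<String>> {
    let errors: Vec<String> = diags
        .iter()
        .filter(|d| d.severity == Severity::Error)
        .map(|d| format!("{}: {}", d.rule, d.message))
        .collect();
    if errors.is_empty() { Ok(()) } else { Err(errors) }
}

// An editor or IDE front end: show everything, labelled by severity.
fn for_editor(diags: &[Diagnostic]) -> Vec<String> {
    diags
        .iter()
        .map(|d| {
            let label = match d.severity {
                Severity::Warning => "warning",
                Severity::Error => "error",
            };
            format!("[{}] {}: {}", label, d.rule, d.message)
        })
        .collect()
}
```

A grammar with only warnings would then compile through the proc macro as before, while the same findings remain visible in tooling.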

And seeing how good rustc diagnostics are, we could take inspiration from the public (unstable) API proc_macro::Diagnostic.

Unfortunately all of that seems to shout “breaking changes!”.

Tartasprint May 1, 2023
Author

> I'm not sure if it's undecidable, but I guess some of the validator checks may be equivalent to the grammar emptiness problem (which is undecidable with those stack operations).

You're right. Since I wasn't sure the undecidability came from the stack operations of Pest, I did some searching and stumbled upon the paper by Bryan Ford¹ that describes PEGs.

That paper describes a language for describing PEGs and its semantics. Here is a useful definition from section 3.3:

An expression e handles a string x if it either matches or fails on x in the grammar G.
A grammar G handles string x if its start expression s handles x.
G is complete if it handles all strings (the empty one included)

One problem is that the PEGs in the paper do not use stack operations, so they should be given semantics. I leave that to another time.
Another problem is that in Pest there are no starting expressions, but it can be fixed by defining a complete grammar as handling all strings (the empty one included) with any rule of the grammar as the starting one.

From now on I will use that definition. Reformulating the initial problem: "Determine if a Pest grammar is complete".

In section 3.5 it is proven that it is undecidable whether an arbitrary grammar is complete; and since Pest grammars can describe all PEGs, the same result applies to them.

Fortunately, in section 3.6, B. Ford defines well-formed grammars. They are a subset of PEGs, and from the author's comment they seem general enough for practical cases:

> A grammar can have left-recursive rules but still be complete if its degenerate loops are actually unreachable, but we have little need for such grammars in practice.

I'm not sure whether that means complete grammars that are not well-formed always fall into that category, but if that's the case (and there is a proof), only well-formed grammars seem to be useful in practice: unreachable problems are not problems, setting aside that we would like to know when something is unreachable.

The good news is that checking whether a PEG is well-formed is quite easy (or at least it seems so from reading the paper).
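
Here is a rough sketch of how such a check could look. It follows my reading of section 3.6 and should be checked against the paper before being trusted: each expression gets a set of abstract outcomes (succeeds consuming nothing, succeeds consuming input, fails), computed as a fixpoint over the rules, and well-formedness is then derived from those sets. The `Expr` type is a hypothetical toy without pest's stack operators.

```rust
use std::collections::{HashMap, HashSet};

// Toy PEG expressions (hypothetical type, no stack operators).
enum Expr {
    Empty,                     // ""
    Lit,                       // any non-empty terminal
    Rule(&'static str),        // reference to a named rule
    Seq(Box<Expr>, Box<Expr>), // e1 ~ e2
    Alt(Box<Expr>, Box<Expr>), // e1 | e2
    Star(Box<Expr>),           // e*
    Not(Box<Expr>),            // !e
}

type Grammar = HashMap<&'static str, Expr>;
type Outcomes = HashMap<&'static str, u8>;

const ZERO: u8 = 1; // may succeed without consuming input
const ONE: u8 = 2;  // may succeed after consuming input
const FAIL: u8 = 4; // may fail
const OK: u8 = ZERO | ONE;

fn has(set: u8, bits: u8) -> bool {
    set & bits != 0
}

// Outcomes of `e`, given the outcomes found so far for each rule.
fn outcomes(e: &Expr, rules: &Outcomes) -> u8 {
    let pick = |cond: bool, bits: u8| if cond { bits } else { 0 };
    match e {
        Expr::Empty => ZERO,
        Expr::Lit => ONE | FAIL,
        Expr::Rule(name) => rules.get(name).copied().unwrap_or(0),
        Expr::Seq(a, b) => {
            let (oa, ob) = (outcomes(a, rules), outcomes(b, rules));
            (oa & FAIL)
                | pick(has(oa, OK), ob & FAIL)
                | pick(has(oa, ZERO), ob & OK)
                | pick(has(oa, ONE) && has(ob, OK), ONE)
        }
        Expr::Alt(a, b) => {
            let oa = outcomes(a, rules);
            (oa & OK) | pick(has(oa, FAIL), outcomes(b, rules))
        }
        Expr::Star(a) => {
            let oa = outcomes(a, rules);
            pick(has(oa, ONE), ONE) | pick(has(oa, FAIL), ZERO)
        }
        Expr::Not(a) => {
            let oa = outcomes(a, rules);
            pick(has(oa, OK), FAIL) | pick(has(oa, FAIL), ZERO)
        }
    }
}

// Least fixpoint: keep adding outcomes until nothing changes.
fn rule_outcomes(g: &Grammar) -> Outcomes {
    let mut table = Outcomes::new();
    loop {
        let mut changed = false;
        for (name, body) in g {
            let old = table.get(name).copied().unwrap_or(0);
            let new = old | outcomes(body, &table);
            if new != old {
                table.insert(*name, new);
                changed = true;
            }
        }
        if !changed {
            return table;
        }
    }
}

fn wf(e: &Expr, out: &Outcomes, good: &HashSet<&'static str>) -> bool {
    match e {
        Expr::Empty | Expr::Lit => true,
        Expr::Rule(name) => good.contains(name),
        // The right side only matters if the left can succeed on nothing.
        Expr::Seq(a, b) => wf(a, out, good) && (!has(outcomes(a, out), ZERO) || wf(b, out, good)),
        Expr::Alt(a, b) => wf(a, out, good) && wf(b, out, good),
        // A repetition whose body can succeed on nothing would loop forever.
        Expr::Star(a) => wf(a, out, good) && !has(outcomes(a, out), ZERO),
        Expr::Not(a) => wf(a, out, good),
    }
}

// A grammar is accepted when every rule can be shown well-formed.
fn is_well_formed(g: &Grammar) -> bool {
    let out = rule_outcomes(g);
    let mut good: HashSet<&'static str> = HashSet::new();
    loop {
        let before = good.len();
        for (name, body) in g {
            if wf(body, &out, &good) {
                good.insert(*name);
            }
        }
        if good.len() == before {
            return good.len() == g.len();
        }
    }
}
```

On this toy model, a left-recursive rule like r = r ~ "a" and a rule like r = ""* are both rejected, while the right-recursive r = ("a" ~ r) | "" is accepted. Stack operators would need their own outcome rules, which is exactly the semantics question left open above.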

I will try implementing it as a proof of concept, for now without worrying about breaking changes.

¹ Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, Bryan Ford, Massachusetts Institute of Technology, POPL’04, January 14–16, 2004, Venice, Italy

tomtau Jul 7, 2023
Maintainer

Regarding static analysis, I recently came across this https://dl.acm.org/doi/10.1145/3355378.3355388 where they assign a type to each PEG rule, but it's pretty annotation-heavy.

And there's this follow-up https://dl.acm.org/doi/abs/10.1145/3555776.3577620 / https://github.com/lives-group/typed-peg by @rodrigogribeiro @emcardoso which infers types using Z3.

I think that work is good to look at... it's only for "vanilla" PEG though, so it'd require extending it to pest's stack operators.
(I'm guessing that the semantics of those could be defined similarly to how this PEG extension for indentation-sensitivity was defined here https://michaeldadams.org/papers/layout_parsing_2/LayoutParsing2-2014-haskell-authors-copy.pdf )

One other thing is that there's potential work for pest to support left recursion, and in that analysis, left recursion is simply untyped / rejected.

tomtau Jul 16, 2023
Maintainer

https://github.com/taocpp/PEGTL/blob/main/doc/Grammar-Analysis.md
