On the hunt of grammars making the parser run indefinetly #851
Replies: 2 comments 5 replies
-
Thanks for a detailed write-up!
Maybe @dragostis @CAD97 @pest-parser/triage may have some input for either of those.
Given OLD_STATIC won't be able to fix the other cases (I haven't thought about them in detail, so I take your word for it), I think NEW_STATIC is probably the way to go. As for the breaking change, I think it may be fine in the sense that ParserExpr is rather internal or rarely used (if ever) by external crates... but that breaking change may "bubble up" / need a lot of refactoring (and potentially require other breaking changes). So my suggestion would be seeking to do NEW_STATIC in a non-breaking or a least breaking way. As for the dynamic checking option, pest already has this optional setting: https://docs.rs/pest/latest/pest/fn.set_call_limit.html which is a bit simplistic, but it should capture indefinite running cases as well (should there be more edge cases than what's potentially covered by NEW_STATIC). |
Beta Was this translation helpful? Give feedback.
-
Thank you for your suggestions. I gave implementing NEW_STATIC a try, but I kept finding more edge cases. In the end I realized that I was trying to prove whether the parser would stop, and after some research it turns out that it is proven to be undecidable... It is no wonder this was giving me headaches :P!
The problem is back to the early stage of planning, and I don't know yet if solving (actually mitigating it) will cause breaking changes, but I think it is possible to break very few things. Here is an attempt to deal with undecidability. It might be possible to detect most (all?) the real world edge cases. I don't think, but that should be checked, that people who write grammars use all of the expressive power given by pest. As example, doing a small (and probably incomplete) search on Github tells me that very few people seem to use stack operations nested in predicates, and that it's mostly PEEK. Doing that I also noticed there are grammars that do PUSHes but never decrease the stack using POP or DROP (examples: 1, 2, 3, and a last one that doesn't even use PEEK!). But knowing people don't do things won't solve the problem (also because warning them about not doing some of those things is the problem). Here are some possible approaches:
I may be wrong, but 3 is the current approach. As an example when checking if the negative predicate Option 3 has the advantage of always giving a clear answer: the grammar is correct or not. While option 1/2 can sometimes say “you're mistaken/correct”, it will say “I don't know” the rest of time. And “I don't know” can be misleading. But I think options 1 and 2 should still be considered, because their answer can be trusted. Using both option 1 and option 2 could result into the following behavior:
In case a sub-expression correctness is unknown we may bubble it up to its parent expressions. But that would probably result in knowing nothing about the "root" expression. So I propose to use a heuristic like in option 3, but mark the result with uncertainty. A result could then be implemented with something looking like: enum Knowledge<T> {
Provably(T),
Maybe(T), // Heuristic results go here
Unknown,
} This allows both to trust the validator output, and give hints to the user in case the result is unknown. To give good hints would require either looking how people use pest, or counting on the fact people will report bad hints. I think doing a bit of both could be good. What do you think of this ? Footnotes
|
Beta Was this translation helpful? Give feedback.
-
There are grammars that may cause even a "bug-free" parser to run indefinitely such as:
Some others can be found in #848, but it's not important for now.
I think no one is interested in the fact that the parser can hang, they may accommodate to it but they are not wishing it to hang. In this discussion I'm interested to prevent the parser to hang indefinitely. By doing that it may mean progressing on #685, although that issue is more concerned about improving the performance than the desperate cases I talk about here.
As of now
Currently the cases 1-3 are covered, but not cases 4-10. As of now the validation of this kind of problems is handled by
pest_meta::validator::validate_ast
. In that function the steps we are interested in arevalidate_repetitions
,validate_left_recursion
andvalidate_whitespace_comment
. The first checks if there is an expression that does not make progress and can be repeated indefinitely by a repetition operator (*
aliasRep
,+
aliasRepOnce
,{n,}
aliasRepMin
). The second test checks for another source of infinite repetition, left recursion. The third one is similar to the first one applied toWHITESPACE
andCOMMENT
since they are implicitly*
-repeated by the sequence operator~
. It currently doesn't check for possible left recursion hidden by~
as pointed out by example 4.Potential solutions
I have a few propositions to prevent this to happen:
validate_ast
which is the only public function. I detail later a possible rewrite.Here is a brief comparison of the propositions:
§1 Fixing the current validation steps
§1.0 The OLD_STATIC option is perhaps the easiest since most of the work is already done. It is probable that the example 4 can be fixed (provided the check for left recursion works), example 5 might be fixed but involves creating a new function
is_always_failing
adding up to the actual complexity.§1.1 But there is a structural problem: the
validate_left_recursion
relies on theis_non_failing
andis_non progressing
functions, that make the assumption that the grammar is non-left recursive... I might be wrong, and this problem is maybe not a problem and the left recursion validation step might be correct, but it's hard (for me) to prove or at least be convinced that it is correct. If someone has an explanation for the correctness of that step, I will welcome it. But beware it might cause headaches 😛.§1.3 Even if that step turns out to be correct, the validation is inefficient: the functions
is_non_failing
andis_non_progressing
are called at least for every infinite repetition operator, they compute the values recursively for every sub-expression/sub-rule and do not store intermediate results. As an example to be checked the expression{((very_complex)*~ANY)*~(very_complex)*}
requires to computeis_non_progressing
andis_non_failing
ofvery_complex
both 3 times.§1.4 Then the example 6 is not possible to be fixed without changing the signature of the functions. Let me explain. The rule
for_ever_6=@{(PUSH("") ~ POP)*}
runs for ever because thePUSH("")
does no progress and then thePOP
does no progress either because the last item on the in the stack is""
which never fails. To detect it we would need something similar to §2.1.§1.5 Now we could decide to forbid empty pushes, but that may not be desirable, I imagine. I haven't useful examples of empty pushes in mind, but they might look like
PUSH("non_empty" | "" )
orPUSH(("non_empty")*)
. Maybe even the dumbPUSH("")
might be useful to keep track of seen patterns in a hacky way:@{("A"~PUSH(""))+ ~ (POP~"B")+}
. Note that this example does not work currently as the parser does not seem to push empty strings onto the stack. And also that it's useless sincer=@{"A"~r~"B" | "" }
does the same thing. But still, it's fun.§1 Change the validation step
§2.0 Looking at the
DROP
operation we get a better understanding of the problem. TheDROP
operation does not make any progress on the input, but we cannot tell it will cause the parser to hang: once the stack is empty it will fail. Of course a problem similar to example 6 might arise, but that's not the point. The point is there are actually three ways of making progress:"non_empty"
for positive input progress and""
for zero input progress. This progress is always positive or zero since there is no backtracking.POP
orDROP
for positive stack progress,"AZ"
orPEEK
for zero stack progress,PUSH("AZ")
for negative stack progress""
will not progress, an expression that always fails will always make progress. An expression that fails sometimes may progress (ex:"A"
) or may not progress (ex:&("A")
).§2.1 I propose the following definition/implementation of progress for STATIC_NEW possibility:
§2.2 The helper function is called strict because an expression like
&("A")
has a progress ofProgress::MayFail(false,StackProgress::Change(0))
but might still fail. The choice to being made here is strict because the function returnstrue
only if we are 100% sure its progressing. To know if an expression is progressing we need more context. The helper function could also return something liketrue|false|maybe
to be more precise§2.3 To check statically progress without computing it more than once I propose to attach to each expression a validation member.
It can either be done by adding a
progress
member to the current expression structure (STATIC_NEW, choice 1, option A), or by adding an indirection with avalidation
member that refers to astruct ValidationStatus { progress }
to make easily the same with other kinds validation (STATIC_NEW, 1B).We can either attach this metadata directly on the public
parser_meta::ParserExpr
structure to allow users to benefit from this analysis (breaking change) (STATIC_NEW, 2A), or make a similar new structureParserExpr+metadata
that could be either public (positively breaking change) (STATIC_NEW, 2B) or private (non breaking changes) (STATIC_NEW,2C).§2.4 Then the validator would run through the parse tree until every expression has been assigned a progress. That part I did not think through yet, because it's harder to think about, and that I don't want to spend time on it if this not wanted. I do have vague ideas but propositions are welcome.
§3 Dynamic checking
§3.0 Now I think we can all agree static-compile-time validation is awesome, both for certainty that no errors will pop-up and for performance. But since I did not think through all the cases, especially the ones involving left recursion 🤕, I'm not sure all cases can be treated at compile time. Therefore the following thoughts only apply if static validation is no sufficient. You may not think about it or even skip to §4 (it's near though). I won't go in too much detail anyway. So we could be satisfied by all the cases caught by the static checks and hope that the ones not covered won't happen (DYN choice NoDyn).
§3.1 But this does not seem wise to me, and I propose to make available dynamic checks. Available because I understand that for performance reasons someone might not them. But still present, at least for the dev/test side.
Then I found two possibilities, when a dynamic check fails (a infinite loop was detected) we can simply stop the parser and return an error (breaking changes in the API, not in the behavior) (DYN choice Stop).
§3.2 But we could also "step over" the infinite loop. As an example, parsing the expression
("A" | "")* ~ "B"
with input"AAB"
would give:"A"
"A"
""
§3.3 is choice (DYN choice Ignore). This choice is very infinite-tolerant, may even allow parsing of left recursion rules, I'm not sure though (but there might be hope for that according to #533, but I didn't read the papers yet). But a major drawback is that it may result in an inefficient parser: the one that stops is inefficient but knows when to stop while the other one will go on and on always performing the checks. But on another side this is what the stopping one should do most of the time if most of the time the input doesn't lead to infinite loops.
§4 Important considerations
§4.0 To do all this, there is a need of having very precise semantics of every pest operator. But that goes in another discussion. Example of non-well defined (to me at least) behavior are:
PEEK[a..b]
works fora<
b,a==b, and
a>b`I won't start that discussion now because the current one already took me enough time, I don't have all the edge cases right now, and because it's not necessary in the immediate. But it will be necessary if we need to trust any validation implementation.
Recap
The choices I propose are the following:
Thank you for reading till here ! Please give your comments/thoughts/advice 😄 ?
Beta Was this translation helpful? Give feedback.
All reactions