Parser: Use a non-greedy matcher to extract the block args #11355
Conversation
Interestingly, fixing this bug exposes a new one. In my testing, the previous regex could handle up to ~10,000 characters in the args before triggering a `PREG_JIT_STACKLIMIT_ERROR`. The new error is caused by the non-greedy match added in this PR, as non-greedy matches need to backtrack every time they don't match the end of the pattern. So, I guess the question is: are we okay with a limit of 1MiB stored in the block args? 🙂
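For reference, here's a minimal reproduction sketch (my construction, not the PR's test code). The simplified pattern and the exact overflow threshold are assumptions; the threshold varies with PHP build and `pcre.jit` stack configuration.

```php
<?php
// Build a block comment with a very long attrs string.
$attrs = '{"key":"' . str_repeat( 'x', 20000 ) . '"}';
$doc   = '<!-- wp:test/block ' . $attrs . ' -->';

// Non-greedy: PCRE has to test "(.*?)" against the closer at every
// position, which is what can exhaust the JIT stack on long inputs.
$result = preg_match( '/<!--\s+wp:\S+\s+(.*?)\s+-->/s', $doc );

if ( false === $result && PREG_JIT_STACKLIMIT_ERROR === preg_last_error() ) {
    echo "JIT stack exhausted while backtracking\n";
}
```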
Nice patch @pento! After some investigation here I think we may be misled somewhat by PHP's `PREG_JIT_STACKLIMIT_ERROR` error. Definitely want to avoid this. It's interesting to get, and I think it's only present in PHP 7+ (where the PCRE JIT is enabled by default); it's essentially an "out of memory" for the JIT's fixed-size stack.
After triple-checking I couldn't find anything that would suggest there's a catastrophic backtracking attack vector in the RegExp, and I think that's true. On the other hand, PHP limits backtracking to 1,000,000 steps globally (`pcre.backtrack_limit`), meaning that with a really long input we might hit that cap from 300,000 perfectly normal backtracking steps. Definitely something to move away from. So I started questioning what our options are and loaded up the debugger. Why is it so slow to match JSON attributes? What is causing that backtracking? Turns out it's literally just death by a thousand cuts: the lazy matcher has to stop and test for the end of the pattern at every single character. But it shouldn't be that hard, because ideally we should be able to communicate in RegExp form "go until you get to a `}` that actually closes the attributes". We want to have the RegExp skip over as many characters as it can in one step, right? But we can't simply stop at the first `}` we see, since `}` can appear inside the JSON itself.
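To make the cost concrete, here's a minimal comparison sketch (my construction, not project code); both patterns and the generated input are simplified stand-ins for the real block-comment matcher.

```php
<?php
// Contrast a lazy matcher with the "consume runs of non-} characters"
// idea on a long attrs string.
$attrs = '{"content":"' . str_repeat( 'a', 100000 ) . '"}';
$input = '<!-- wp:core/paragraph ' . $attrs . ' /-->';

// Lazy: every one of the ~100,000 positions is a stopping point the
// engine has to test and record, so the step count scales with length.
$lazy = '/\{.*?\}\s+\/?-->/s';

// Unrolled: [^}]+ swallows whole runs of non-} characters in one step,
// }+(?=}) swallows runs of closing braces, and the guarded dot handles
// the leftovers, leaving only a handful of decision points.
$unrolled = '/\{(?:[^}]+|\}+(?=\})|(?!\}\s+\/?-->).)*\}\s+\/?-->/s';

foreach ( array( 'lazy' => $lazy, 'unrolled' => $unrolled ) as $name => $pattern ) {
    $start = microtime( true );
    $ok    = preg_match( $pattern, $input );
    printf(
        "%-8s matched=%d error=%d elapsed=%.4fs\n",
        $name,
        $ok,
        preg_last_error(),
        microtime( true ) - $start
    );
}
```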
We can see here that the `attrs` section of the pattern is where the work, and the backtracking, happens.

That first half of the optimization group, with its `[^}]+` alternation, consumes entire runs of non-`}` characters in a single step. The worst-case scenario here is when we have something like a long run of `}` characters, which that first alternation can't touch…

…here we've added a second alternation, `}+(?=})`, which is "many `}` characters that are followed by another `}`", so runs of closing braces are also consumed in bulk instead of one guarded character at a time. All that to say: in my testing I was able to not only avoid the tradeoffs of the non-greedy approach, but also to make really long attribute sections parse fast. When I reproduced the (supposed) catastrophic backtracking with your test input (it took several seconds before it gave up), I was able to turn that around to 35 steps total and a few ms of runtime.

`<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->`
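For illustration, here's a minimal sketch of running that pattern with `preg_match`. The pattern body is verbatim from the comment above; the delimiter, the `/s` flag, and the sample input are my additions.

```php
<?php
// Pattern body copied verbatim from the comment above; delimiter,
// flags, and the sample input are illustrative assumptions.
$pattern = '/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s';

// Note the "}" inside the JSON string: the guarded-dot alternation is
// what lets the matcher step over it safely.
$input = '<!-- wp:paragraph {"align":"left","content":"a } inside a string"} -->';

if ( preg_match( $pattern, $input, $m ) ) {
    echo $m['name'], "\n"; // paragraph
    // The attrs capture includes trailing whitespace, so trim first.
    var_dump( json_decode( trim( $m['attrs'] ), true ) );
}
```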
See my comment above
Alternate approach to #11355. Fixes: #11066.

Parsing JSON-attributes in the existing parsers can be slow when the attribute list is very long. This is due to the need to lookahead at each character and then backtrack as the parser looks for the closing of the attribute section and the block comment closer. In this patch we're introducing an optimization that can eliminate a significant amount of memory usage and execution time when parsing long attribute sections by recognizing that _when inside the JSON attribute area_ any character that _isn't_ a `}` can be consumed without any further investigation.

I've added a test to make sure we can parse 100,000 characters inside the JSON attributes. The default parser was fine in testing with more than a million, but the spec parser was running out of memory. I'd prefer to keep the tests running on all parsers and definitely let our spec parser define the acceptance criteria, so I left the limit low for now. It'd be great if we found a way to update php-pegjs so that it could generate the code differently. Until then I vote for a weaker specification.

The default parser went from requiring hundreds of thousands of steps and taking seconds to parse the attack to taking something like 35 steps and a few milliseconds. There should be no functional changes to the parser outputs and no breaking changes to the project. This is a performance/operational optimization.
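A sketch of the kind of test described (my construction, not the suite's actual code), using the regex form of the optimization from this thread rather than any particular parser implementation:

```php
<?php
// Generate 100,000 characters of JSON attrs and confirm the matcher
// extracts them without erroring.
$content  = str_repeat( 'a', 100000 );
$attrs    = '{"content":"' . $content . '"}';
$document = '<!-- wp:core/test ' . $attrs . ' /-->';

$pattern = '/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s';

$result = preg_match( $pattern, $document, $m );

assert( 1 === $result );
assert( PREG_NO_ERROR === preg_last_error() );

$decoded = json_decode( trim( $m['attrs'] ), true );
assert( 100000 === strlen( $decoded['content'] ) );
```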
@pento I didn't want to hijack your PR, but I did want to play around and test my suggestions, so I created #11369. As of yesterday we now have a test suite that all parsers can run against, so I hope we end up putting new tests there instead of on specific parser implementations, where they may only get hit by one of the four implementations currently in core (PHP/JS of spec/default).
Comparing the default PHP parser before and after this change, and comparing the default JS parser likewise, my conclusion is that there is no significant change in runtime performance in the normal cases. I don't have the catastrophic cases in the document library.
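For context, a hypothetical shape for such a comparison harness; the corpus path and the stand-in `$parse` workload below are assumptions, not the actual benchmark code:

```php
<?php
// Time a parse workload over a library of saved documents. Swap the
// $parse closure for whichever parser implementation is under test
// (before vs. after the change) and compare the totals.
$parse = function ( $document ) {
    // Stand-in workload: a simplified block-comment matcher.
    preg_match_all( '/<!--\s+\/?wp:[a-z][a-z0-9_\/-]*.*?-->/s', $document, $m );
    return $m[0];
};

$files = glob( __DIR__ . '/document-library/*.html' );

$start = microtime( true );
foreach ( $files as $file ) {
    $parse( file_get_contents( $file ) );
}
printf( "parsed %d documents in %.3fs\n", count( $files ), microtime( true ) - $start );
```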
Closing in favour of #11369.
* Parser: Optimize JSON-attribute parsing (alternate approach to #11355; fixes #11066)
Description
If the block has been serialised with a very long args string, it can cause a `PREG_JIT_STACKLIMIT_ERROR` in PCRE when trying to match the block comment. Fixes #11066.