Parser: Use a non-greedy matcher to extract the block args #11355
Conversation
Interestingly, fixing this bug exposes a new one. In my testing, the previous regex could handle up to ~10,000 characters in the args before triggering a `PREG_JIT_STACKLIMIT_ERROR`. The new error is caused by the non-greedy match added in this PR, as non-greedy matches need to backtrack every time they don't match the end of the pattern. So, I guess the question is: are we okay with a limit of 1MiB stored in the block args? 🙂
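For reference, here's a minimal reproduction sketch (my construction, not the PR's test code). The simplified pattern and the exact overflow threshold are assumptions; the threshold varies with PHP build and `pcre.jit` stack configuration.

```php
<?php
// Build a block comment with a very long attrs string.
$attrs = '{"key":"' . str_repeat( 'x', 20000 ) . '"}';
$doc   = '<!-- wp:test/block ' . $attrs . ' -->';

// Non-greedy: PCRE has to test "(.*?)" against the closer at every
// position, which is what can exhaust the JIT stack on long inputs.
$result = preg_match( '/<!--\s+wp:\S+\s+(.*?)\s+-->/s', $doc );

if ( false === $result && PREG_JIT_STACKLIMIT_ERROR === preg_last_error() ) {
    echo "JIT stack exhausted while backtracking\n";
}
```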
Nice patch @pento! After some investigation here I think we may be misled somewhat by PHP's `PREG_JIT_STACKLIMIT_ERROR` error. Definitely want to avoid this. It's interesting to get, and I think it's only present in PHP 7+ (where the PCRE JIT is enabled by default); it's essentially an "out of memory" for the JIT's fixed-size stack.
After triple-checking I couldn't find anything that would suggest there's a catastrophic backtracking attack vector in the RegExp, and I think that's true. On the other hand, PHP limits backtracking to 1,000,000 steps globally (`pcre.backtrack_limit`), meaning that with a really long input we might hit that cap from 300,000 perfectly normal backtracking steps. Definitely something to move away from. So I started questioning what our options are and loaded up the debugger. Why is it so slow to match JSON attributes? What is causing that backtracking? Turns out it's literally just death by a thousand cuts: the lazy matcher has to stop and test for the end of the pattern at every single character. But it shouldn't be that hard, because ideally we should be able to communicate in RegExp form "go until you get to a `}` that actually closes the attributes". We want to have the RegExp skip over as many characters as it can in one step, right? But we can't simply stop at the first `}` we see, since `}` can appear inside the JSON itself.
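To make the cost concrete, here's a minimal comparison sketch (my construction, not project code); both patterns and the generated input are simplified stand-ins for the real block-comment matcher.

```php
<?php
// Contrast a lazy matcher with the "consume runs of non-} characters"
// idea on a long attrs string.
$attrs = '{"content":"' . str_repeat( 'a', 100000 ) . '"}';
$input = '<!-- wp:core/paragraph ' . $attrs . ' /-->';

// Lazy: every one of the ~100,000 positions is a stopping point the
// engine has to test and record, so the step count scales with length.
$lazy = '/\{.*?\}\s+\/?-->/s';

// Unrolled: [^}]+ swallows whole runs of non-} characters in one step,
// }+(?=}) swallows runs of closing braces, and the guarded dot handles
// the leftovers, leaving only a handful of decision points.
$unrolled = '/\{(?:[^}]+|\}+(?=\})|(?!\}\s+\/?-->).)*\}\s+\/?-->/s';

foreach ( array( 'lazy' => $lazy, 'unrolled' => $unrolled ) as $name => $pattern ) {
    $start = microtime( true );
    $ok    = preg_match( $pattern, $input );
    printf(
        "%-8s matched=%d error=%d elapsed=%.4fs\n",
        $name,
        $ok,
        preg_last_error(),
        microtime( true ) - $start
    );
}
```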
We can see here that the `attrs` section of the pattern is where the work, and the backtracking, happens.

That first half of the optimization group, with its `[^}]+` alternation, consumes entire runs of non-`}` characters in a single step. The worst-case scenario here is when we have something like a long run of `}` characters, which that first alternation can't touch…

…here we've added a second alternation, `}+(?=})`, which is "many `}` characters that are followed by another `}`", so runs of closing braces are also consumed in bulk instead of one guarded character at a time. All that to say: in my testing I was able to not only avoid the tradeoffs of the non-greedy approach, but also to make really long attribute sections parse fast. When I reproduced the (supposed) catastrophic backtracking with your test input (it took several seconds before it gave up), I was able to turn that around to 35 steps total and a few ms of runtime.

`<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->`
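For illustration, here's a minimal sketch of running that pattern with `preg_match`. The pattern body is verbatim from the comment above; the delimiter, the `/s` flag, and the sample input are my additions.

```php
<?php
// Pattern body copied verbatim from the comment above; delimiter,
// flags, and the sample input are illustrative assumptions.
$pattern = '/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s';

// Note the "}" inside the JSON string: the guarded-dot alternation is
// what lets the matcher step over it safely.
$input = '<!-- wp:paragraph {"align":"left","content":"a } inside a string"} -->';

if ( preg_match( $pattern, $input, $m ) ) {
    echo $m['name'], "\n"; // paragraph
    // The attrs capture includes trailing whitespace, so trim first.
    var_dump( json_decode( trim( $m['attrs'] ), true ) );
}
```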
See my comment above
Alternate approach to #11355. Fixes: #11066.

Parsing JSON-attributes in the existing parsers can be slow when the attribute list is very long. This is due to the need to lookahead at each character and then backtrack as the parser looks for the closing of the attribute section and the block comment closer. In this patch we're introducing an optimization that can eliminate a significant amount of memory usage and execution time when parsing long attribute sections by recognizing that _when inside the JSON attribute area_ any character that _isn't_ a `}` can be consumed without any further investigation.

I've added a test to make sure we can parse 100,000 characters inside the JSON attributes. The default parser was fine in testing with more than a million, but the spec parser was running out of memory. I'd prefer to keep the tests running on all parsers and definitely let our spec parser define the acceptance criteria, so I left the limit low for now. It'd be great if we found a way to update php-pegjs so that it could generate the code differently. Until then I vote for a weaker specification.

The default parser went from requiring hundreds of thousands of steps and taking seconds to parse the attack to taking something like 35 steps and a few milliseconds. There should be no functional changes to the parser outputs and no breaking changes to the project. This is a performance/operational optimization.
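A sketch of the kind of test described (my construction, not the suite's actual code), using the regex form of the optimization from this thread rather than any particular parser implementation:

```php
<?php
// Generate 100,000 characters of JSON attrs and confirm the matcher
// extracts them without erroring.
$content  = str_repeat( 'a', 100000 );
$attrs    = '{"content":"' . $content . '"}';
$document = '<!-- wp:core/test ' . $attrs . ' /-->';

$pattern = '/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s';

$result = preg_match( $pattern, $document, $m );

assert( 1 === $result );
assert( PREG_NO_ERROR === preg_last_error() );

$decoded = json_decode( trim( $m['attrs'] ), true );
assert( 100000 === strlen( $decoded['content'] ) );
```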
@pento I didn't want to hijack your PR, but I did want to play around and test my suggestions, so I created #11369. As of yesterday we now have a test suite that all parsers can run against, so I hope we end up putting new tests there instead of on specific parser implementations, where they may only get hit by one of the four implementations currently in core (PHP/JS of spec/default).
Comparing the default PHP parser before and after this change, and comparing the default JS parser likewise, my conclusion is that there is no significant change in runtime performance in the normal cases. I don't have the catastrophic cases in the document library.
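For context, a hypothetical shape for such a comparison harness; the corpus path and the stand-in `$parse` workload below are assumptions, not the actual benchmark code:

```php
<?php
// Time a parse workload over a library of saved documents. Swap the
// $parse closure for whichever parser implementation is under test
// (before vs. after the change) and compare the totals.
$parse = function ( $document ) {
    // Stand-in workload: a simplified block-comment matcher.
    preg_match_all( '/<!--\s+\/?wp:[a-z][a-z0-9_\/-]*.*?-->/s', $document, $m );
    return $m[0];
};

$files = glob( __DIR__ . '/document-library/*.html' );

$start = microtime( true );
foreach ( $files as $file ) {
    $parse( file_get_contents( $file ) );
}
printf( "parsed %d documents in %.3fs\n", count( $files ), microtime( true ) - $start );
```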
Closing in favour of #11369.
* Parser: Optimize JSON-attribute parsing (alternate approach to #11355; fixes #11066)
Description
If the block has been serialised with a very long args string, it can cause a `PREG_JIT_STACKLIMIT_ERROR` in PCRE when trying to match the block comment. Fixes #11066.