
Parser: Use a non-greedy matcher to extract the block args #11355

Closed
pento wants to merge 1 commit into master from fix/11066-parsing-long-args

Conversation

pento
Member

@pento pento commented Nov 1, 2018

Description

If the block has been serialised with a very long args string, it can cause a PREG_JIT_STACKLIMIT_ERROR in PCRE when trying to match the block comment.

Fixes #11066.
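For reference, this is roughly how the failure surfaces in PHP: preg_match() returns false (rather than 0 for "no match") and preg_last_error() reports which limit was hit. A minimal sketch, using a simplified stand-in pattern rather than the parser's actual regex:

<?php
// Simplified stand-in for the block-comment pattern; whether the limit
// actually trips depends on the PCRE build, JIT settings, and input size.
$attrs   = '{"a":"' . str_repeat( 'a', 50000 ) . '"}';
$content = '<!-- wp:paragraph ' . $attrs . ' -->';

$pattern = '/<!--\s+wp:(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:(?!}\s+-->).)*}\s+)?(?<void>\/)?-->/';

if ( false === preg_match( $pattern, $content, $matches ) ) {
	// PREG_JIT_STACKLIMIT_ERROR here is the failure described above.
	var_dump( preg_last_error() );
}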

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • My code has proper inline documentation.

@pento pento added the [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f label Nov 1, 2018
@pento pento added this to the 4.3 milestone Nov 1, 2018
@pento pento requested review from dmsnell and a team November 1, 2018 07:19
@pento
Member Author

pento commented Nov 1, 2018

Interestingly, fixing this bug exposes a new one.

In my testing, the previous regex could handle up to ~10,000 characters in the args before triggering a PREG_JIT_STACKLIMIT_ERROR. With the new regex, it handles just under 1,000,000 characters in the args before triggering a PREG_BACKTRACK_LIMIT_ERROR instead.

The new error is caused by the non-greedy match added in this PR: a non-greedy quantifier has to re-attempt the rest of the pattern after every character it consumes, and each failed attempt counts as backtracking.

So, I guess the question is: are we okay with a limit of 1MiB stored in the block args? 🙂
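For context on the second error: each of those failed attempts counts against pcre.backtrack_limit, which defaults to 1,000,000. A rough reproduction, using an approximation of the non-greedy pattern rather than this branch's exact regex:

<?php
// Roughly reproduces the PREG_BACKTRACK_LIMIT_ERROR described above; the
// exact error can vary with the PCRE/JIT configuration in use.
$attrs   = '{"a":"' . str_repeat( 'a', 2000000 ) . '"}';
$content = '<!-- wp:paragraph ' . $attrs . ' -->';

$lazy = '/<!--\s+wp:paragraph\s+(?<attrs>{(?:(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/';

if ( false === preg_match( $lazy, $content, $m )
	&& PREG_BACKTRACK_LIMIT_ERROR === preg_last_error() ) {
	echo "hit pcre.backtrack_limit\n";
}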

@dmsnell
Member

dmsnell commented Nov 1, 2018

Nice patch @pento!

After some investigation here I think we may be misled somewhat by PHP's error.

PREG_JIT_STACKLIMIT_ERROR

Definitely want to avoid this one. It's interesting that we see it at all; I believe it only exists in PHP 7+, where it effectively means the PCRE JIT ran out of stack space ("out of memory" for the JIT).

PREG_BACKTRACK_LIMIT_ERROR

After triple-checking I couldn't find anything to suggest there's a catastrophic-backtracking attack vector in the RegExp, and I think that holds. On the other hand, PHP limits backtracking to 1,000,000 steps globally (pcre.backtrack_limit), meaning that with a really long input we might hit that ceiling just from a few hundred thousand perfectly ordinary backtracking steps adding up.
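For reference, the limits in question are ordinary INI settings and easy to inspect; the defaults noted in the comments are the stock php.ini values and individual builds may differ:

<?php
// Inspect the PCRE limits referred to above.
var_dump( ini_get( 'pcre.backtrack_limit' ) ); // "1000000" by default
var_dump( ini_get( 'pcre.recursion_limit' ) ); // "100000" by default
var_dump( ini_get( 'pcre.jit' ) );             // "1" when the PCRE JIT is enabled

// Raising the limit (e.g. ini_set( 'pcre.backtrack_limit', '10000000' ))
// would only mask the underlying per-character backtracking cost.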


Moving away from (?!}\s+-->) is something I think we should only do as a last resort. The lookahead gives us clarity and safety where a bare .+ can be fragile and risky.

So I started questioning what our options are and loaded up the debugger. Why is it so slow to match JSON attributes? What is causing all that backtracking? Turns out it's literally just death by a thousand cuts.

But it shouldn't be that hard, because ideally we should be able to express in RegExp form "go until you reach --> and then work back from there", which is what I was trying to do with (?!}\s+-->). While I was still thinking about how to do that, another thought came to me…

We want the RegExp to skip over as many characters as it can, right? But we can't just use [^}], because JSON is a nested data structure and we could have {"a":{}} or even {"a":"}"}. On the other hand, if we don't find a } then we know the character is safe to consume, so why not add a small optimization that passes over every non-} in a fast path and only falls back to the slow path when we do hit a }…

(?<attrs>{(?:[^}]+|(?!}\s+-->).)+?}\s+)?

We can see here that the <attrs> group then becomes

  • an opening {
  • anything that is either:
    • not a closing }
    • or, if it is a }, at least one that isn't simultaneously closing the attribute group and the comment tag
  • a closing } and the tag closer

The first alternative in that group, with its +, turns a thousand single-character matches of a into one match of many as (freeing internal memory for the parse), and it also means that each character no longer needs three steps to check; only the }s still require that.
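As a quick sanity check (illustrative only, not part of the PR), the fast-path alternation above still handles the tricky JSON shapes just mentioned, nested objects and strings containing a literal }:

<?php
// Wraps the <attrs> group from above in a minimal block-comment pattern.
$pattern = '/<!--\s+wp:test\s+(?<attrs>{(?:[^}]+|(?!}\s+-->).)+?}\s+)?-->/';

foreach ( array( '{"a":{}}', '{"a":"}"}', '{"a":"b"}' ) as $attrs ) {
	preg_match( $pattern, "<!-- wp:test $attrs -->", $m );
	echo $attrs, ' => ', var_export( isset( $m['attrs'] ) ? trim( $m['attrs'] ) : null, true ), "\n";
}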

The worst-case scenario here is when something like the a in your test is replaced with a }, which puts us right back in the same spot. This can happen with deeply-nested objects, where each of those objects is closed with a curly. I think we can optimize for those situations as well…

(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?

…here we've added a second alternative, "one or more }s that are followed by another }", and it handles that worst-case scenario. That leaves just one final pathological input, closing curlies separated by spaces (} } } } } } } }), but this is probably a good place to stop: once the simple one-character-at-a-time lookahead is off the hot path, the marginal benefit of further tweaks drops.
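A quick check of that deeply-nested worst case (again illustrative only): a long run of closing curlies at the end of the attributes, which the }+(?=}) alternative soaks up in one step instead of one lookahead per character.

<?php
// Deeply nested object: {"a":{"a":...:0}}}…}
$depth = 5000;
$attrs = str_repeat( '{"a":', $depth ) . '0' . str_repeat( '}', $depth );
$html  = "<!-- wp:test $attrs -->";

$pattern = '/<!--\s+wp:test\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?-->/';

var_dump( preg_match( $pattern, $html, $m ) );                  // int(1)
var_dump( strlen( trim( $m['attrs'] ) ) === strlen( $attrs ) ); // bool(true)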


All that to say: in my testing I was able not only to avoid the tradeoffs of the non-greedy approach, but also to make really long attribute sections parse fast. Where your test input previously produced (supposed) catastrophic backtracking and took several seconds before giving up, I was able to turn that around to 35 steps total and a few ms of runtime.

<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->
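For anyone following along at home, here is a sketch of how that full pattern would be used; the delimiters and the surrounding code are mine, added purely for illustration:

<?php
// The final pattern quoted above, wrapped in delimiters for preg_match();
// the named groups carry all the block metadata.
$pattern = '/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/';

$html = '<!-- wp:core/columns {"columns":3} /-->';

if ( preg_match( $pattern, $html, $m ) ) {
	echo $m['namespace'] . $m['name'], "\n";            // core/columns
	echo trim( $m['attrs'] ), "\n";                     // {"columns":3}
	echo empty( $m['void'] ) ? 'opener' : 'void', "\n"; // void (self-closing block)
}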

Member

@dmsnell dmsnell left a comment


See my comment above

dmsnell added a commit that referenced this pull request Nov 1, 2018
Alternate approach to #11355
Fixes: #11066

Parsing JSON-attributes in the existing parsers can be slow when the
attribute list is very long. This is due to the need to lookahead at
each character and then backtrack as the parser looks for the closing
of the attribute section and the block comment closer.

In this patch we're introducing an optimization that can eliminate
a significant amount of memory-usage and execution time when parsing
long attribute sections by recognizing that _when inside the JSON
attribute area_ any character that _isn't_ a `}` can be consumed
without any further investigation.

I've added a test to make sure we can parse 100,000 characters inside
the JSON attributes. The default parser was fine in testing with more
than a million but the spec parser was running out of memory. I'd
prefer to keep the tests running on all parsers and definitely let our
spec parser define the acceptance criteria so I left the limit low for
now. It'd be great if we found a way to update php-pegjs so that it
could generate the code differently. Until then I vote for a weaker
specification.

The default parser went from requiring hundreds of thousands of steps
and taking seconds to parse the attack to taking something like 35
steps and a few milliseconds.

There should be no functional changes to the parser outputs and no
breaking changes to the project. This is a performance/operational
optimization.
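The long-attributes test described in that commit message might look roughly like the following (a hypothetical sketch, not the actual test from #11369; parse_blocks() is assumed here as the default parser's entry point and the real function name may differ):

<?php
// Hypothetical shape of the 100,000-character attributes test.
$value  = str_repeat( 'a', 100000 );
$html   = '<!-- wp:core/fake {"key":"' . $value . '"} /-->';

$blocks = parse_blocks( $html ); // assumed default-parser entry point

assert( 1 === count( $blocks ) );
assert( 'core/fake' === $blocks[0]['blockName'] );
assert( $value === $blocks[0]['attrs']['key'] );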
@dmsnell
Member

dmsnell commented Nov 1, 2018

@pento I didn't want to hijack your PR but I did want to play around and test my suggestions so I created #11369.

As of yesterday we have a test suite that every parser can be run against, so I hope we end up putting new tests there instead of in a specific parser implementation, where they might only be exercised by one of the four implementations currently in core (spec and default, in both PHP and JS).

@dmsnell
Member

dmsnell commented Nov 1, 2018

Comparing the default PHP parser from master against the one in this branch (using #6831)

[screenshot: performance comparison of the default PHP parser, master vs. this branch]

Comparing the default JS parser from master against the one in this branch (using #6831)

[screenshot: performance comparison of the default JS parser, master vs. this branch]

My conclusion is that there is no significant change in runtime performance in the normal cases. I don't have the catastrophic cases in the document library.

dmsnell added a commit that referenced this pull request Nov 2, 2018
Alternate approach to #11355
Fixes: #11066

@pento pento removed this from the 4.3 milestone Nov 4, 2018
@pento
Member Author

pento commented Nov 4, 2018

Closing in favour of #11369.

@pento pento closed this Nov 4, 2018
@pento pento deleted the fix/11066-parsing-long-args branch November 4, 2018 00:17
dmsnell added a commit that referenced this pull request Nov 4, 2018
* Parser: Optimize JSON-attribute parsing

Alternate approach to #11355
Fixes: #11066
