Parser: Optimize JSON-attribute parsing #11369
Conversation
Comparing the default PHP parser and the default JS parser before and after: my conclusion is that there is no significant change in runtime performance in the normal cases. I don't have the catastrophic cases in the document library.
@@ -347,7 +347,7 @@ function next_token() {
 	 * match back in PHP to see which one it was.
 	 */
 	$has_match = preg_match(
-		'/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
+		'/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
I don't think I understand this term: `}+(?=})`. Among other things, I don't understand why `+` should be applied to that first `}`.
the history is in #11355
I guess we could go in and add some explanatory comments here, but we're creating two new matching groups here based on a "fast parse" pathway.
we know that specifically we want to consume `((?!}\s+-->).)`, but the problem with that is that for every position in the input we're allocating some memory to hold the match group and then we also start backtracking once the match fails. this is unlike the autogenerated PEG parser because PEGs don't backtrack.
therefore, I looked at what was causing things to be slow and realized that we know more about our inputs than the RegExp does; that is, the majority of our characters won't even be `}`. using this knowledge lets us encode a fast path and fall back to the slow path: if the input isn't a `}`, then directly consume it and proceed.
we can compare these two methods with a simplified test. we'll use the simple test data `< {"a": "b"} >`.
make sure to click on the debugger in these to see what's happening inside the RegExp engine.
`(?:<\s+)({(?:(?!}\s+>).)*})(?:\s+>)` - 1 match, 51 steps - https://regex101.com/r/UmdrvA/1
`(?:<\s+)({(?:(?:[^}]+)|(?:(?!}\s+>).))*})(?:\s+>)` - 1 match, 26 steps - https://regex101.com/r/cXcMFw/1
if we make the input bigger we can watch in the debugger how the `[^}]+` can consume a long string of text in one step and skip all the backtracking.
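here's a quick PHP timing sketch of the same comparison (my own, not part of the patch) using the simplified pattern shapes from above on a much bigger input:

```php
<?php
// Timing sketch (not from the patch): compare the lookahead-per-character
// pattern against the fast-path alternation on a long attribute string.
// Pattern shapes mirror the simplified regex101 examples; size is arbitrary.
$input = '< {"a": "' . str_repeat( 'x', 100000 ) . '"} >';

$patterns = array(
	'lookahead per character' => '/(?:<\s+)({(?:(?!}\s+>).)*})(?:\s+>)/s',
	'fast-path alternation'   => '/(?:<\s+)({(?:(?:[^}]+)|(?:(?!}\s+>).))*})(?:\s+>)/s',
);

foreach ( $patterns as $name => $pattern ) {
	$start = microtime( true );
	preg_match( $pattern, $input, $matches );
	printf( "%s: %.4f seconds\n", $name, microtime( true ) - $start );
}
```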
that leaves us a bad case where we end up with deeply-nested JSON though, and in the content we could imagine `}}}}}}}}} -->`. the `}+(?=})` bit then is a second-order optimization to consume a string of closing brackets up to the last one in the string (`(?=` is the positive lookahead to say "match the closer but don't consume it").
by introducing a new alternation group we've made a new possible branch in the execution, but on the other hand it's fast to consume a string of closers. the goal is mainly to remove the backtracking, which opens up a denial-of-service attack and can dramatically bloat the execution.
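here's a tiny PHP demonstration of that brace-run consumption (the input is made up, not from the test suite):

```php
<?php
// On the run of closing braces in nested JSON, }+(?=}) consumes all but
// the final brace in one step; the last brace is left for the literal }
// ahead of the comment closer.
preg_match(
	'/{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+-->/s',
	'<!-- wp:block {"a":{"b":{"c":0}}} -->',
	$matches
);
echo $matches[0]; // {"a":{"b":{"c":0}}} -->
```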
thanks for the thorough explanation <3
* Parser: Make attribute parsing possessive

Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on an invalid block comment delimiter. The following input crashed the parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for backtracking that shouldn't be there. In this patch we're grouping the attribute parsing section of the tokenizing RegExp pattern so that we can make the group itself _possessive_ so that we abort any backtracking.

* add test and fix broken fix
* really fix default JS parser
* add explanatory comment
* add @since comment with updated target version
* version bumps
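In PCRE the possessive idea looks roughly like this. A minimal sketch of the pattern shape only (the `wp:block`-specific pattern and the inputs are mine for illustration, not the exact expression shipped in the follow-up patch):

```php
<?php
// Sketch of the possessive-group fix: the trailing *+ makes the attribute
// alternation possessive, so PCRE abandons the match immediately when the
// comment closer is malformed instead of backtracking through every
// possible split of the attribute text.
$pattern = '/<!--\s+wp:block\s+(?:{(?:[^}]+|}+(?=})|(?!}\s+-->).)*+}\s+)?-->/s';

// The malformed closer from the bug report now fails fast:
var_dump( preg_match( $pattern, '<!-- wp:block {"a":0} / -->' ) ); // int(0)

// A well-formed block still matches:
var_dump( preg_match( $pattern, '<!-- wp:block {"a":0} -->' ) ); // int(1)
```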
Alternate approach to #11355
Fixes: #11066
Status
There are local uncommitted changes after one or both of 'npm install' or 'npm run docs:build'! - can someone help me understand this?
Description
Parsing JSON-attributes in the existing parsers can be slow when the
attribute list is very long. This is due to the need to look ahead at
each character and then backtrack as the parser looks for the closing
of the attribute section and the block comment closer.
In this patch we're introducing an optimization that can eliminate
a significant amount of memory-usage and execution time when parsing
long attribute sections by recognizing that when inside the JSON
attribute area any character that isn't a `}` can be consumed
without any further investigation.
I've added a test to make sure we can parse 100,000 characters inside
the JSON attributes. The default parser was fine in testing with more
than a million but the spec parser was running out of memory. I'd
prefer to keep the tests running on all parsers and definitely let our
spec parser define the acceptance criteria so I left the limit low for
now. It'd be great if we found a way to update php-pegjs so that it
could generate the code differently. Until then I vote for a weaker
specification.
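A sketch of the kind of test described here, assuming the default PHP parser's `WP_Block_Parser` class (the real test lives in the parser test suites and may differ):

```php
<?php
// Build a block comment with ~100,000 characters of JSON attributes and
// check that it parses quickly. The 1.0-second bound is an illustrative
// threshold, not the project's actual acceptance criterion.
$document = '<!-- wp:core/test {"value":"' . str_repeat( 'a', 100000 ) . '"} /-->';

$start  = microtime( true );
$blocks = ( new WP_Block_Parser() )->parse( $document );
$took   = microtime( true ) - $start;

assert( ! empty( $blocks ) );
assert( $took < 1.0 ); // should be milliseconds, not seconds
```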
The default parser went from requiring hundreds of thousands of steps
and taking seconds to parse the attack to taking something like 35
steps and a few milliseconds.
There should be no functional changes to the parser outputs and no
breaking changes to the project. This is a performance/operational
optimization.
Checklist: