Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser: Optimize JSON-attribute parsing #11369

Merged
merged 3 commits into from
Nov 4, 2018
Merged

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Nov 1, 2018

Alternate approach to #11355
Fixes: #11066

Status

  • I'm confused by the failing test saying here are local uncommitted changes after one or both of 'npm install' or 'npm run docs:build'! - can someone help me understand this?

Description

Parsing JSON-attributes in the existing parsers can be slow when the
attribute list is very long. This is due to the need to lookahead at
each character and then backtrack as the parser looks for the closing
of the attribute section and the block comment closer.

In this patch we're introducing an optimization that can eliminate
a significant amount of memory-usage and execution time when parsing
long attribute sections by recognizing that when inside the JSON
attribute area
any character that isn't a } can be consumed
without any further investigation.

I've added a test to make sure we can parse 100,000 characters inside
the JSON attributes. The default parser was fine in testing with more
than a million but the spec parser was running out of memory. I'd
prefer to keep the tests running on all parsers and definitely let our
spec parser define the acceptance criteria so I left the limit low for
now. It'd be great if we found a way to update php-pegjs so that it
could generate the code differently. Until then I vote for a weaker
specification.

The default parser went from requiring hundreds of thousands of steps
and taking seconds to parse the attack to taking something like 35
steps and a few milliseconds.

There should be no functional changes to the parser outputs and no
breaking changes to the project. This is a performance/operational
optimization.

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards. (N/A?)
  • My code has proper inline documentation. (? not sure)

@dmsnell dmsnell added [Type] Bug An existing feature does not function as intended [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f labels Nov 1, 2018
@dmsnell dmsnell requested review from mcsf and pento November 1, 2018 16:39
@dmsnell
Copy link
Member Author

dmsnell commented Nov 1, 2018

Comparing the default PHP parser from master against the one in this branch (using #6831)

comparator-fnuoaerzsg now sh_

Comparing the default JS parser from master against the one in this branch (using #6831)

comparator-fnuoaerzsg now sh_

My conclusion is that there is no significant change in runtime performance in the normal cases. I don't have the catastrophic cases in the document library.

@dmsnell dmsnell force-pushed the fix/11066-parser-exhaustion branch from b680a0c to a5c34d7 Compare November 2, 2018 16:43
Alternate approach to #11355
Fixes: #11066

Parsing JSON-attributes in the existing parsers can be slow when the
attribute list is very long. This is due to the need to lookahead at
each character and then backtrack as the parser looks for the closing
of the attribute section and the block comment closer.

In this patch we're introducing an optimization that can eliminate
a significant amount of memory-usage and execution time when parsing
long attribute sections by recognizing that _when inside the JSON
attribute area_ any character that _isn't_ a `}` can be consumed
without any further investigation.

I've added a test to make sure we can parse 100,000 characters inside
the JSON attributes. The default parser was fine in testing with more
than a million but the spec parser was running out of memory. I'd
prefer to keep the tests running on all parsers and definitely let our
spec parser define the acceptance criteria so I left the limit low for
now. It'd be great if we found a way to update php-pegjs so that it
could generate the code differently. Until then I vote for a weaker
specification.

The default parser went from requiring hundreds of thousands of steps
and taking seconds to parse the attack to taking something like 35
steps and a few milliseconds.

There should be no functional changes to the parser outputs and no
breaking changes to the project. This is a performance/operational
optimization.
@dmsnell dmsnell force-pushed the fix/11066-parser-exhaustion branch from a5c34d7 to 4bc0f2e Compare November 2, 2018 19:06
@pento pento added this to the 4.3 milestone Nov 4, 2018
Copy link
Member

@pento pento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, @dmsnell!

Let's go with this change, it's a solid improvement of #11355.

@dmsnell dmsnell merged commit c36ea9c into master Nov 4, 2018
@dmsnell dmsnell deleted the fix/11066-parser-exhaustion branch November 4, 2018 15:54
@@ -347,7 +347,7 @@ function next_token() {
* match back in PHP to see which one it was.
*/
$has_match = preg_match(
'/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
'/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand this term: }+(?=}). Among other things, I don't understand why + should be applied to that first }.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the history is in #11355

I guess we could go in and add some explanatory comments here but we're creating two new matching groups here based on a "fast parse" pathway

we know that specifically we want to consume ((?!}\s+-->).), but the problem with that is that for every position in the input we're allocating some memory to hold the match group and then we also start backtracking once the match fails. this is unlike the autogenerated PEG parser because PEGs don't backtrack.

therefore, I looked at what was causing things to be slow and realized that we know more about our inputs than the RegExp does, that is, the majority of our characters won't even be }. taking this knowledge lets us encode that fast-path and fallback to the slow path

if the input isn't a }, then directly consume it and proceed.

we can compare these two methods with a simplified test.
we'll use the simple test data < {"a": "b"} >.
make sure to click on the debugger in these to see what's
happening inside the RegExp engine.

(?:<\s+)({(?:(?!}\s+>).)*})(?:\s+>) - 1 match, 51 steps - https://regex101.com/r/UmdrvA/1
(?:<\s+)({(?:(?:[^}]+)|(?:(?!}\s+>).))*})(?:\s+>) - 1 match, 26 steps - https://regex101.com/r/cXcMFw/1

if we make the input bigger we can watch in the debugger how the [}]+ can consume a long string of text in one step and skip all the backtracking.

that leaves us a bad case where we end up with deeply-nested JSON though and in the content we could imagine }}}}}}}}} -->. the }+(?=}) bit then is a second-order optimization to consume a string of closing brackets up to the last one in the string (?= is the positive lookahead to say "match the closer but don't consume it).

by introducing a new alternation group we've made a new possible branch in the execution but on the other hand it's fast to consume a string of closers. the goal is mainly to remove the backtracking which opens a denial-of-service attack and can dramatically bloat the execution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the thorough explanation <3

dmsnell added a commit that referenced this pull request Nov 27, 2018
Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on
an invalid block comment delimiter. The following input crashed the
parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for
backtracking that shouldn't be there. In this patch we're grouping
the attribute parsing section of the tokenizing RegExp pattern so
that we can make the group itself _possessive_ so that we abort
any backtracking.
dmsnell added a commit that referenced this pull request Nov 27, 2018
Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on
an invalid block comment delimiter. The following input crashed the
parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for
backtracking that shouldn't be there. In this patch we're grouping
the attribute parsing section of the tokenizing RegExp pattern so
that we can make the group itself _possessive_ so that we abort
any backtracking.
dmsnell added a commit that referenced this pull request Nov 27, 2018
Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on
an invalid block comment delimiter. The following input crashed the
parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for
backtracking that shouldn't be there. In this patch we're grouping
the attribute parsing section of the tokenizing RegExp pattern so
that we can make the group itself _possessive_ so that we abort
any backtracking.
dmsnell added a commit that referenced this pull request Nov 29, 2018
Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on
an invalid block comment delimiter. The following input crashed the
parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for
backtracking that shouldn't be there. In this patch we're grouping
the attribute parsing section of the tokenizing RegExp pattern so
that we can make the group itself _possessive_ so that we abort
any backtracking.
dmsnell added a commit that referenced this pull request Nov 30, 2018
Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on
an invalid block comment delimiter. The following input crashed the
parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for
backtracking that shouldn't be there. In this patch we're grouping
the attribute parsing section of the tokenizing RegExp pattern so
that we can make the group itself _possessive_ so that we abort
any backtracking.
youknowriad pushed a commit that referenced this pull request Nov 30, 2018
* Parser: Make attribute parsing possessive

Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on
an invalid block comment delimiter. The following input crashed the
parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for
backtracking that shouldn't be there. In this patch we're grouping
the attribute parsing section of the tokenizing RegExp pattern so
that we can make the group itself _possessive_ so that we abort
any backtracking.

* add test and fix broken fix

* really fix default JS parser

* add explanatory comment

* add @SInCE comment with updated target version

* version bumps
youknowriad pushed a commit that referenced this pull request Nov 30, 2018
* Parser: Make attribute parsing possessive

Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on
an invalid block comment delimiter. The following input crashed the
parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for
backtracking that shouldn't be there. In this patch we're grouping
the attribute parsing section of the tokenizing RegExp pattern so
that we can make the group itself _possessive_ so that we abort
any backtracking.

* add test and fix broken fix

* really fix default JS parser

* add explanatory comment

* add @SInCE comment with updated target version

* version bumps
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f [Type] Bug An existing feature does not function as intended
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WordPress 5.0 beta1 - Block now causing Fatal Errors
3 participants