Parser: Optimize JSON-attribute parsing #11369
Conversation
Comparing the default PHP parser and the default JS parser before and after: my conclusion is that there is no significant change in runtime performance in the normal cases. I don't have the catastrophic cases in the document library.
@@ -347,7 +347,7 @@ function next_token() {
 	 * match back in PHP to see which one it was.
 	 */
 	$has_match = preg_match(
-		'/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
+		'/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
I don't think I understand this term: `}+(?=})`. Among other things, I don't understand why `+` should be applied to that first `}`.
the history is in #11355
I guess we could go in and add some explanatory comments here, but we're creating two new matching groups here based on a "fast parse" pathway.
we know that specifically we want to consume `((?!}\s+-->).)`, but the problem with that is that for every position in the input we're allocating some memory to hold the match group and then we also start backtracking once the match fails. this is unlike the autogenerated PEG parser because PEGs don't backtrack.
therefore, I looked at what was causing things to be slow and realized that we know more about our inputs than the RegExp does; that is, the majority of our characters won't even be `}`. using this knowledge lets us encode a fast path and fall back to the slow path: if the input isn't a `}`, then directly consume it and proceed.
we can compare these two methods with a simplified test. we'll use the simple test data `< {"a": "b"} >`.
make sure to click on the debugger in these to see what's happening inside the RegExp engine.
`(?:<\s+)({(?:(?!}\s+>).)*})(?:\s+>)` - 1 match, 51 steps - https://regex101.com/r/UmdrvA/1
`(?:<\s+)({(?:(?:[^}]+)|(?:(?!}\s+>).))*})(?:\s+>)` - 1 match, 26 steps - https://regex101.com/r/cXcMFw/1
if we make the input bigger we can watch in the debugger how the `[^}]+` can consume a long string of text in one step and skip all the backtracking.
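here's a quick PHP timing sketch of the same comparison (my own, not part of the patch) using the simplified pattern shapes from above on a much bigger input:

```php
<?php
// Timing sketch (not from the patch): compare the lookahead-per-character
// pattern against the fast-path alternation on a long attribute string.
// Pattern shapes mirror the simplified regex101 examples; size is arbitrary.
$input = '< {"a": "' . str_repeat( 'x', 100000 ) . '"} >';

$patterns = array(
	'lookahead per character' => '/(?:<\s+)({(?:(?!}\s+>).)*})(?:\s+>)/s',
	'fast-path alternation'   => '/(?:<\s+)({(?:(?:[^}]+)|(?:(?!}\s+>).))*})(?:\s+>)/s',
);

foreach ( $patterns as $name => $pattern ) {
	$start = microtime( true );
	preg_match( $pattern, $input, $matches );
	printf( "%s: %.4f seconds\n", $name, microtime( true ) - $start );
}
```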
that leaves us a bad case where we end up with deeply-nested JSON though, and in the content we could imagine `}}}}}}}}} -->`. the `}+(?=})` bit then is a second-order optimization to consume a string of closing brackets up to the last one in the string (`(?=` is the positive lookahead to say "match the closer but don't consume it").
by introducing a new alternation group we've made a new possible branch in the execution, but on the other hand it's fast to consume a string of closers. the goal is mainly to remove the backtracking, which opens up a denial-of-service attack and can dramatically bloat the execution.
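here's a tiny PHP demonstration of that brace-run consumption (the input is made up, not from the test suite):

```php
<?php
// On the run of closing braces in nested JSON, }+(?=}) consumes all but
// the final brace in one step; the last brace is left for the literal }
// ahead of the comment closer.
preg_match(
	'/{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+-->/s',
	'<!-- wp:block {"a":{"b":{"c":0}}} -->',
	$matches
);
echo $matches[0]; // {"a":{"b":{"c":0}}} -->
```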
thanks for the thorough explanation <3
* Parser: Make attribute parsing possessive

Bug introduced in #11369

Someone discovered high CPU usage due to catastrophic backtracking on an invalid block comment delimiter. The following input crashed the parser on the server:

```html
<!-- wp:block {"a":0} / -->
```

The optimization introduced in #11369 ended up opening a place for backtracking that shouldn't be there. In this patch we're grouping the attribute parsing section of the tokenizing RegExp pattern so that we can make the group itself _possessive_ so that we abort any backtracking.

* add test and fix broken fix
* really fix default JS parser
* add explanatory comment
* add @since comment with updated target version
* version bumps
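In PCRE the possessive idea looks roughly like this. A minimal sketch of the pattern shape only (the `wp:block`-specific pattern and the inputs are mine for illustration, not the exact expression shipped in the follow-up patch):

```php
<?php
// Sketch of the possessive-group fix: the trailing *+ makes the attribute
// alternation possessive, so PCRE abandons the match immediately when the
// comment closer is malformed instead of backtracking through every
// possible split of the attribute text.
$pattern = '/<!--\s+wp:block\s+(?:{(?:[^}]+|}+(?=})|(?!}\s+-->).)*+}\s+)?-->/s';

// The malformed closer from the bug report now fails fast:
var_dump( preg_match( $pattern, '<!-- wp:block {"a":0} / -->' ) ); // int(0)

// A well-formed block still matches:
var_dump( preg_match( $pattern, '<!-- wp:block {"a":0} -->' ) ); // int(1)
```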
Alternate approach to #11355
Fixes: #11066
Status
There are local uncommitted changes after one or both of 'npm install' or 'npm run docs:build'! - can someone help me understand this?
Description
Parsing JSON-attributes in the existing parsers can be slow when the
attribute list is very long. This is due to the need to look ahead at
each character and then backtrack as the parser looks for the closing
of the attribute section and the block comment closer.
In this patch we're introducing an optimization that can eliminate
a significant amount of memory-usage and execution time when parsing
long attribute sections by recognizing that when inside the JSON
attribute area any character that isn't a `}` can be consumed
without any further investigation.
I've added a test to make sure we can parse 100,000 characters inside
the JSON attributes. The default parser was fine in testing with more
than a million but the spec parser was running out of memory. I'd
prefer to keep the tests running on all parsers and definitely let our
spec parser define the acceptance criteria so I left the limit low for
now. It'd be great if we found a way to update php-pegjs so that it
could generate the code differently. Until then I vote for a weaker
specification.
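A sketch of the kind of test described here, assuming the default PHP parser's `WP_Block_Parser` class (the real test lives in the parser test suites and may differ):

```php
<?php
// Build a block comment with ~100,000 characters of JSON attributes and
// check that it parses quickly. The 1.0-second bound is an illustrative
// threshold, not the project's actual acceptance criterion.
$document = '<!-- wp:core/test {"value":"' . str_repeat( 'a', 100000 ) . '"} /-->';

$start  = microtime( true );
$blocks = ( new WP_Block_Parser() )->parse( $document );
$took   = microtime( true ) - $start;

assert( ! empty( $blocks ) );
assert( $took < 1.0 ); // should be milliseconds, not seconds
```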
The default parser went from requiring hundreds of thousands of steps
and taking seconds to parse the attack to taking something like 35
steps and a few milliseconds.
There should be no functional changes to the parser outputs and no
breaking changes to the project. This is a performance/operational
optimization.
Checklist: