
Parser: Optimize JSON-attribute parsing #11369

Merged Nov 4, 2018 (3 commits)
2 changes: 1 addition & 1 deletion packages/block-serialization-default-parser/parser.php
@@ -347,7 +347,7 @@ function next_token() {
* match back in PHP to see which one it was.
*/
$has_match = preg_match(
- '/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
+ '/<!--\s+(?<closer>\/)?wp:(?<namespace>[a-z][a-z0-9_-]*\/)?(?<name>[a-z][a-z0-9_-]*)\s+(?<attrs>{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+)?(?<void>\/)?-->/s',
Contributor:

I don't think I understand this term: `}+(?=})`. Among other things, I don't understand why `+` should be applied to that first `}`.

Member Author:

the history is in #11355

I guess we could go in and add some explanatory comments here, but we're creating two new matching groups based on a "fast parse" pathway.

we know that specifically we want to consume `(?!}\s+-->).`, but the problem with that is that for every position in the input we're allocating some memory to hold the match group, and then we also start backtracking once the match fails. this is unlike the autogenerated PEG parser, because PEGs don't backtrack.

therefore, I looked at what was causing things to be slow and realized that we know more about our inputs than the RegExp does: namely, the majority of our characters won't even be `}`. taking advantage of this knowledge lets us encode the fast path and fall back to the slow path only when necessary:

if the input isn't a `}`, then directly consume it and proceed.
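
the idea can be sketched with just the attrs portion of the pattern (simplified here into a standalone RegExp for illustration; the real tokenizer wraps it in the full comment-delimiter pattern):

```javascript
// Ordered alternation: alternatives are tried left to right, so the cheap
// branches run first and the guarded single-character branch is a last resort.
//
//   [^}]+          fast path: consume every character that cannot close the JSON
//   }+(?=})        consume a run of closing braces, keeping the final one
//   (?!}\s+-->).   slow path: one character at a time, guarded by a lookahead
const attrsPattern = /{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+/;

console.log( attrsPattern.test( '{"a":"b"} ' ) ); // true: plain JSON, fast path only
console.log( attrsPattern.test( '{"a":{"b":1}} ' ) ); // true: nested braces hit }+(?=})
```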

we can compare these two methods with a simplified test, using the simple test data `< {"a": "b"} >`. make sure to click on the debugger in these links to see what's happening inside the RegExp engine.

`(?:<\s+)({(?:(?!}\s+>).)*})(?:\s+>)` - 1 match, 51 steps - https://regex101.com/r/UmdrvA/1
`(?:<\s+)({(?:(?:[^}]+)|(?:(?!}\s+>).))*})(?:\s+>)` - 1 match, 26 steps - https://regex101.com/r/cXcMFw/1
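
for anyone who wants to reproduce the comparison outside regex101, the same two patterns can be run locally (the step counts above come from regex101's debugger, not from this snippet — both patterns find the same match, they just take different paths to get there):

```javascript
// The backtracking-heavy pattern and the fast-path pattern from the
// regex101 comparison, applied to the same simplified test input.
const slowPattern = /(?:<\s+)({(?:(?!}\s+>).)*})(?:\s+>)/;
const fastPattern = /(?:<\s+)({(?:(?:[^}]+)|(?:(?!}\s+>).))*})(?:\s+>)/;

const input = '< {"a": "b"} >';
console.log( slowPattern.exec( input )[ 1 ] ); // '{"a": "b"}'
console.log( fastPattern.exec( input )[ 1 ] ); // '{"a": "b"}'
```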

if we make the input bigger we can watch in the debugger how `[^}]+` can consume a long string of text in one step and skip all the backtracking.

that leaves us a bad case though: with deeply-nested JSON we could end up with content like `}}}}}}}}} -->`. the `}+(?=})` bit then is a second-order optimization to consume a string of closing brackets up to the last one in the string (`(?=` is the positive lookahead, saying "match the closer but don't consume it").
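
here's a sketch of that branch at work, again using just the attrs portion of the pattern (the sample input is mine, chosen to produce a run of closers):

```javascript
// The }+(?=}) branch consumes the run of closing braces }}} in one step,
// leaving the final } for the trailing `}\s+` at the end of the attrs group.
const attrsPattern = /{(?:[^}]+|}+(?=})|(?!}\s+-->).)+?}\s+/;

const nested = '{"a":{"b":{"c":{"d":1}}}} -->';
console.log( attrsPattern.exec( nested )[ 0 ] ); // '{"a":{"b":{"c":{"d":1}}}} '
```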

by introducing a new alternation group we've made a new possible branch in the execution, but on the other hand it's fast to consume a string of closers. the goal is mainly to remove the backtracking, which opens a denial-of-service vector and can dramatically bloat the execution time.

Contributor:

thanks for the thorough explanation <3

$this->document,
$matches,
PREG_OFFSET_CAPTURE,
2 changes: 1 addition & 1 deletion packages/block-serialization-default-parser/src/index.js
@@ -2,7 +2,7 @@ let document;
let offset;
let output;
let stack;
- const tokenizer = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+({(?:(?!}\s+-->)[^])+?}\s+)?(\/)?-->/g;
+ const tokenizer = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+({(?:[^}]+|}+(?=})|(?!}\s+-->)[^])+?}\s+)?(\/)?-->/g;

function Block( blockName, attrs, innerBlocks, innerHTML ) {
return {
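
as a sanity check, the updated tokenizer can be exercised directly (a sketch; the sample block and the group-numbering comments are mine, not part of the diff):

```javascript
// Group numbering in the tokenizer: 1 = closer, 2 = namespace, 3 = name,
// 4 = attrs (including trailing whitespace), 5 = void marker.
const tokenizer = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*\/)?([a-z][a-z0-9_-]*)\s+({(?:[^}]+|}+(?=})|(?!}\s+-->)[^])+?}\s+)?(\/)?-->/g;

const doc = '<!-- wp:core/paragraph {"align":"left"} -->';
const match = tokenizer.exec( doc );
console.log( match[ 2 ] ); // 'core/'
console.log( match[ 3 ] ); // 'paragraph'
console.log( match[ 4 ] ); // '{"align":"left"} '
```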
11 changes: 11 additions & 0 deletions packages/block-serialization-spec-parser/shared-tests.js
@@ -61,6 +61,17 @@ export const jsTester = ( parse ) => () => {
expect.objectContaining( { innerHTML: '<p>Break me</p>' } ),
] ) ) );
} );

describe( 'attack vectors', () => {
test( 'really long JSON attribute sections', () => {
const length = 100000;
const as = 'a'.repeat( length );
let parsed;

expect( () => parsed = parse( `<!-- wp:fake {"a":"${ as }"} /-->` )[ 0 ] ).not.toThrow();
expect( parsed.attrs.a ).toHaveLength( length );
} );
} );
};

const hasPHP = 'test' === process.env.NODE_ENV ? ( () => {