Improve Parser Performance #8044

pento · 2018-07-19T06:12:45Z

Description

The PHP parser has some performance issues on larger posts, I've been doing a bit of tweaking to get some more speed out of it.

The Block_List pre rule now does a simple search for strings that aren't <!--. Its a reasonable assumption that the vast majority of strings that do match this will be the start of a block, so we're probably going to the bs rule once it does match.

The Block_Balanced children rule has a similar simple search, for the same reason.

The Block_Name_Part rule explicitly defines a handful of block names. This feels a bit weird to be doing in the parser, but $( [a-z][a-z0-9_-]* ) is such a slow rule to match, that I think it's worth it.

I've also enabled pegjs' internal caching on the PHP parser, as this appear to give a decent boost, too.

How has this been tested?

Comparing the current parser with this PR, I get some pretty clear performance gains:

File	Current	This PR	Improvement
demo-post.html	29.8ms	7.3ms	4x
shortcode-shortcomings.html	77.9ms	16.5ms	4x
redesigning-chrome-desktop.html	227.9ms	52.6ms	4x
web-at-maximum-fps.html	169.1ms	41.3ms	4x
early-adopting-the-future.html	270.8ms	62.9ms	4x
pygmalian-raw-html.html	421.9ms	254.6ms	1x
moby-dick-parsed.html	5402.3ms	1197.7ms	4x

pento · 2018-07-19T06:26:40Z

I've been running the parser through xdebug, which has been super useful for finding problem areas to focus on.

There are still a few areas that could use improvement:

Something in Block_Balanced is super slow, over 50% of the time spent parsing is inside that rule (not even counting sub-rules).

About a third of the remaining time is spent inside the internal input_substr rule, which I haven't figured a good way to optimise. It's a pity that the pegjs model parses an array instead of char-by-char, making this function necessary.

I'd really like if it were possible to define a rule saying "slurp until you encounter string" that generated fast code. Ideally it would just loop over Gutenberg_PEG_Parser::$input until it encountered the string, instead of the weird function calls that it seems to like.

I got fairly epic performance improvements by hacking a fast check into Gutenberg_PEG_Parser::parse() to do most of the heavy lifting for the Block_List pre rule. It reduced pygmalian-raw-html.html to ~0.1s. I couldn't see a nice way to generate that, though.

pento · 2018-07-19T07:22:41Z

Hacking this into ::parse() (and $prepre handling into peg_join_blocks()) shifts pygmalian-raw-html.html from a 1x improvement to a 8x improvement. (In raw numbers, ~420ms down to ~50ms.)

	$prepre = '';
	for ( $ii = 0; $ii < $this->input_length - 1; $ii++ ) {
		if ( '<' === $this->input[ $ii ] && '!' === $this->input[ $ii + 1 ] ) {
			break;
		}
	}

	if ( $ii === $this->input_length - 1 ) {
		$this->cleanup_state();
		return array( array( 'attrs' => array(), 'innerHTML' => implode( $this->input ) ) );
	}

	if ( $ii > 0 ) {
		$prepre = implode( array_slice( $this->input, 0, $ii ) );
		$this->input = array_slice( $this->input, $ii );
		$this->input_length -= $ii;
	}

gziolo · 2018-07-19T09:07:39Z

I moved blocks/api/post.pegjs to its own package in #7664. It generates new JS file for parser out of updated grammar every time you run npm install, npm run dev or npm run build. Let's make sure it gets properly regenerated with the proposed changes.

nylen · 2018-07-19T16:45:17Z

It's a pity that the pegjs model parses an array instead of char-by-char, making this function necessary.

It sounds like this is the thing to fix here. phpegjs does this to avoid issues with the ways multibyte char handling differs in JS vs PHP, see #1775.

I think that when the grammar rules themselves do not contain multibyte chars (like the Gutenberg grammar), we should be able to work directly with a string instead. How can we test and prove that?

I've also enabled pegjs' internal caching on the PHP parser, as this appear to give a decent boost, too.

How does this change affect memory usage during parsing?

dmsnell · 2018-07-19T21:13:29Z

Thanks for working on this @pento.

I think that this will end up failing the Block_List won't it if we have a normal HTML comment somewhere before the first block?

I've got mixed feelings too about optimizing the grammar itself. I know we have need but I'd rather see us build a new implementation of the PHP parser by hand which conforms to the grammar. We do have ways of automating the tests and performance of alternate parsers.

Since the grammar isn't all that complicated I think we can make a good one by hand that is faster than the generated one. I'd love it if we can keep the intention and specification as clear as they can be.

pento · 2018-07-20T03:25:56Z

I think that when the grammar rules themselves do not contain multibyte chars (like the Gutenberg grammar), we should be able to work directly with a string instead. How can we test and prove that?

Yah, that should be possible. UTF-8 encoding requires that only ASCII characters can have the 0xxxxxxx bit structure, all other characters (including single-byte UTF-8 characters) must be made of bytes that have variations of the 1xxxxxxx bit structure. So, it's not possible for an ASCII character to match any part of a UTF-8 character.

I think that this will end up failing the Block_List won't it if we have a normal HTML comment somewhere before the first block?

Maybe, I didn't test it that much. 🙂

I've got mixed feelings too about optimizing the grammar itself. I know we have need but I'd rather see us build a new implementation of the PHP parser by hand which conforms to the grammar.

This is another viable option if the PCRE parser doesn't work out.

I tend to agree that, rather than optimising the grammar, a better focus would be on improving phpegjs performance for cases that are valuable to us, though that gets a bit tricky with the current grammar. For example, $("<!--" .)* could pretty clearly be optimised to use strpos() to find the next occurrence of "<!--", grab everything that happened before it, and move the currPos forward. Optimising $(!Block .)*, on the other hand, which could benefit from exactly the same optimisation, would be significantly more involved.

Hand-coding a PHP parser is an option that would probably give us better memory usage than PCRE, but I don't know how well it would do on time. It would also be less than fun to maintain, if we were to make significant changes to the grammar.

Anyway, thanks for the thoughts, y'all. I'm going to close this for now, pending the outcome of the significantly more advanced experiments in PCRE parsing.

For some time we've needed a more performant PHP parser for the first stage of parsing the `post_content` document. See #1681 (early exploration) See #8044 (parser performance issue) See #1775 (parser performance, fixed in php-pegjs) I'm proposing this implementation of the spec parser as an alternative to the auto-generated parser from the PEG definition. This is not yet ready to go but I wanted to get the code in a branch so I can iterate on it and garner early feedback. This should eventually provide a setup fixture for #6831 wherein we are testing alternate parser implementations. - designed as a basic recursive-descent - but doesn't recurse on the call-stack, recurses via trampoline - moves linearly through document in one pass - relies on RegExp for tokenization - nested blocks include the nested content in their `innerHTML` this needs to go away - create test fixutre - figure out where to save this file

@Hywan

* Parser: Propose new hand-coded PHP parser For some time we've needed a more performant PHP parser for the first stage of parsing the `post_content` document. See #1681 (early exploration) See #8044 (parser performance issue) See #1775 (parser performance, fixed in php-pegjs) I'm proposing this implementation of the spec parser as an alternative to the auto-generated parser from the PEG definition. This is not yet ready to go but I wanted to get the code in a branch so I can iterate on it and garner early feedback. This should eventually provide a setup fixture for #6831 wherein we are testing alternate parser implementations. - designed as a basic recursive-descent - but doesn't recurse on the call-stack, recurses via trampoline - moves linearly through document in one pass - relies on RegExp for tokenization - nested blocks include the nested content in their `innerHTML` this needs to go away - create test fixutre - figure out where to save this file * Fix issue with containing the nested innerHTML * Also handle newlines as whitespace * Use classes for some static typing * add type hints * remove needless comment * space where space is due * meaningless rename * remove needless function call * harmonize with spec parser * don't forget freeform HTML before blocks * account for oddity in spec-parser * add some polish, fix a thing * comment it * add JS version too * Change `.` to `[^]` because `/s` isn't well supported in JS The `s` flag on the RegExp object informs the engine to treat a dot character as a class that includes the newline character. Without it newlines aren't considered in the dot. Since this flag is new to Javascript and not well supported in different browsers I have removed it in favor of an explicit class of characters that _does_ include the newline, namely the open exclusion of `[^]` which permits all input characters. Hat-top to @Hywan for finding this. * Move code into `/packages` directory, prepare for review * take out names from RegExp pattern to not fail tests * Fix bug in parser: store HTML soup in stack frames while parsing Previously we were sending all "HTML soup" segments of HTML between blocks to the output list before any blocks were processed. We should have been tracking these segments during the parsing and only spit them out when closing a block at the top level. This change stores the index into the input document at which that soup starts if it exists and then produces the freeform block when adding a block to the output from the parse frame stack. * fix whitespace * fix oddity in spec * match styles * use class name filter on server-side parser class * fix whitespace * Document extensibility * fix typo in example code * Push failing parsing test * fix lazy/greedy bug in parser regexp * Docs: Fix typos, links, tweak style. * update from PR feedback * trim docs * Load default block parser, replacing PEG-generated one * Expand `?:` shorthand for PHP 5.2 compat * add fixtures test for default parser * spaces to tabs * could we need no assoc? * fill out return array * put that assoc back in there * isometrize * rename and add 0 * Conditionally include the parser class * Add docblocks * Standardize the package configuration

pento added 2 commits July 19, 2018 15:52

First pass at improving parser performance

f540771

Build the PHP parser

4cdba12

pento added [Status] In Progress Tracking issues with work in progress [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f [Type] Performance Related to performance efforts labels Jul 19, 2018

pento self-assigned this Jul 19, 2018

pento requested review from nylen and dmsnell July 19, 2018 06:12

Swap the Block / Block_End child tests of Block_Balanced back.

1e43228

pento closed this Jul 20, 2018

pento deleted the try/improve-parser-performance branch July 20, 2018 03:26

dmsnell mentioned this pull request Jul 20, 2018

Parser: Propose new hand-coded parser #8083

Merged

4 tasks

mcsf mentioned this pull request Jul 27, 2018

Overview of Short-term Parsing Enhancements #8244

Closed

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Parser Performance #8044

Improve Parser Performance #8044

pento commented Jul 19, 2018 •

edited

Loading

pento commented Jul 19, 2018

pento commented Jul 19, 2018 •

edited

Loading

gziolo commented Jul 19, 2018

nylen commented Jul 19, 2018

dmsnell commented Jul 19, 2018

pento commented Jul 20, 2018

Improve Parser Performance #8044

Improve Parser Performance #8044

Conversation

pento commented Jul 19, 2018 • edited Loading

Description

How has this been tested?

pento commented Jul 19, 2018

pento commented Jul 19, 2018 • edited Loading

gziolo commented Jul 19, 2018

nylen commented Jul 19, 2018

dmsnell commented Jul 19, 2018

pento commented Jul 20, 2018

pento commented Jul 19, 2018 •

edited

Loading

pento commented Jul 19, 2018 •

edited

Loading