Parser: Propose new hand-coded parser #8083

Merged
merged 43 commits into from
Sep 6, 2018

Member

@dmsnell dmsnell commented Jul 20, 2018

For some time we've needed a more performant PHP parser for the first
stage of parsing the post_content document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

Updates

  • This now also includes a copy of the parser in JS whose performance is also quite good.
  • The files have been moved into the /packages directory - I still need some help understanding where it all belongs and how to make the package work

This provides a setup fixture for #6831 wherein we are testing alternate
parser implementations - https://comparator-yizlfvqafz.now.sh

Distinctives

  • designed as a basic recursive-descent
  • but doesn't recurse on the call-stack, recurses via trampoline
  • moves linearly through document in one pass
  • relies on RegExp for tokenization
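A minimal sketch of the combination described in the list above: a RegExp tokenizer driven by a trampoline loop rather than call-stack recursion. This is not the parser from this PR. The `tokenizer` pattern and the `maxDepth` helper are hypothetical and only measure nesting depth.

```javascript
// Hypothetical, simplified tokenizer: block openers, closers and void blocks.
// It is NOT the tokenizer used by the parser in this PR.
const tokenizer = /<!--\s+(\/)?wp:([a-z][a-z0-9_\/-]*)\s+(?:\{.*?\}\s+)?(\/)?-->/g;

// Returns the deepest level of block nesting found in one linear pass.
function maxDepth( document ) {
	tokenizer.lastIndex = 0;
	let depth = 0;
	let deepest = 0;

	// Each step consumes one token and returns the next step, or null when done.
	const step = () => {
		const match = tokenizer.exec( document );
		if ( null === match ) {
			return null; // no more tokens
		}
		const [ , isCloser, , isVoid ] = match;
		if ( isCloser ) {
			depth = Math.max( 0, depth - 1 );
		} else if ( ! isVoid ) {
			depth += 1;
			deepest = Math.max( deepest, depth );
		}
		return step;
	};

	// The trampoline: a flat loop keeps calling steps, so the call stack
	// never grows with the nesting depth of the document.
	let next = step;
	while ( next ) {
		next = next();
	}
	return deepest;
}

const nested = '<!-- wp:a --><!-- wp:b --><!-- wp:c /--><!-- /wp:b --><!-- /wp:a -->';
const nestedDepth = maxDepth( nested );
```

Because the loop is flat, a document with thousands of nested openers is handled without risking a stack overflow, which is the point of trampolining.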

Note: I expect us to discover implementation bugs during the initial rollout of this parser. We have run it through our document library and unit tests, but real posts surely contain more complicated constructions. We can deal with these as they come, but we should expect them.

Todo

  • nested blocks include the nested content in their innerHTML
    this needs to go away
  • create test fixture - https://comparator-yizlfvqafz.now.sh
  • figure out where to save this file
  • phpunit tests

Benchmark

For posterity's sake I ran the merged parser through the parser comparator and compared it against the auto-generated spec parser. Here are the results from my laptop

                                           Time (ms)                  Memory (MB)
File                                 Spec  Default  Speedup     Spec  Default  Comparison
demo-post.html                      29.58     0.23      130    38.56    16.43         43%
early-adopting-the-future.html     263.83     1.01      262    36.84    17.10         46%
moby-dick-parsed.html             5012.13    11.55      434    75.41    25.18         33%
pygmalian-raw-html.html            330.35     0.24     1366   116.72    16.90         14%
redesigning-chrome-desktop.html    211.42     1.22      173    37.22    16.51         44%
shortcode-shortcomings.html         71.28     0.36      198    34.07    16.98         50%
web-at-maximum-fps.html            161.35     0.87      186    33.12    16.32         49%

The tests were done on my late 2013 rMBP quad core 2.6 GHz laptop. According to the Intel Power Gadget the CPU was running at 3.6 GHz the entire time. Each document was parsed with each parser at least 47 times, with the runs in random order. Each run parsed the document between one and five times in a row (chosen at random) before returning the results. Runtime and memory use were measured inside a runner script running in Docker as described in the parser comparator.
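As a quick check on the table, the Comparison column appears to be the default parser's memory use as a percentage of the spec parser's. That reading is inferred from the numbers and is not stated in the PR. The sketch below recomputes it from the MB figures. The Speedup column is not recomputed because it does not follow exactly from the rounded timings shown.

```javascript
// Memory figures in MB, copied from the table: [ file, spec, default ].
const memory = [
	[ 'demo-post.html', 38.56, 16.43 ],
	[ 'early-adopting-the-future.html', 36.84, 17.10 ],
	[ 'moby-dick-parsed.html', 75.41, 25.18 ],
	[ 'pygmalian-raw-html.html', 116.72, 16.90 ],
	[ 'redesigning-chrome-desktop.html', 37.22, 16.51 ],
	[ 'shortcode-shortcomings.html', 34.07, 16.98 ],
	[ 'web-at-maximum-fps.html', 33.12, 16.32 ],
];

// Assumed definition: default parser memory as a rounded percentage of spec parser memory.
function comparison( specMB, defaultMB ) {
	return Math.round( ( defaultMB / specMB ) * 100 );
}

const rows = memory.map( ( [ file, spec, def ] ) => [ file, comparison( spec, def ) ] );
rows.forEach( ( [ file, percent ] ) => console.log( `${ file }: ${ percent }%` ) );
```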

@dmsnell dmsnell added the [Type] Enhancement, [Status] In Progress, and [Feature] Parsing labels Jul 20, 2018
@dmsnell dmsnell requested review from mcsf, pento, mtias and aduth July 20, 2018 13:57
Member Author

dmsnell commented Jul 20, 2018

I'm pretty sure that the next steps from here involve pondering the data structure of the stack. We have enough working knowledge now to know what we need to track and how we can pop that from the stack to the output.

Done

@dmsnell dmsnell changed the title from "Parser: Propose new hand-coded PHP parser" to "Parser: Propose new hand-coded parser" Jul 21, 2018
@dmsnell dmsnell force-pushed the parser/rd-trampoline-php branch from dd4409a to 4191994 Compare July 23, 2018 07:21
@dmsnell dmsnell force-pushed the parser/rd-trampoline-php branch 3 times, most recently from 478b27a to 24977fc Compare August 24, 2018 18:30
Member

@pento pento left a comment

Noice! Let's get this in sooner rather than later, so we can make inroads on the things depending on having a faster parser. 🙂

I've left some comments, here are a few random notes that have occurred to me, as well:

  • It feels a little weird to be putting the PHP parser on NPM, but we don't really use Packagist at all, sooo... 🤷‍♂️ Let's stick with NPM for now, we can potentially explore doing Packagist/composer things later.
  • phpcs.xml.dist needs to be updated to scan the new PHP code. I mentioned a couple of coding standards issues in the comments, but PHPCS should pick up the rest.
  • Combined with switching the parser in gutenberg_parse_blocks(), phpunit/class-parsing-test.php should be updated to use gutenberg_parse_blocks(), rather than Gutenberg_PEG_Parser.

With this performance improvement, it seems like we could change do_blocks() to parse the content, instead of using the dynamic blocks regex.

@@ -0,0 +1,107 @@
# Block Serialization Default Parser

This library contains the default block serialization parser implementations for
Member

You'll need to remove the manual line breaks from the README: we use the Jetpack Markdown parser, which adds a <br/> for single line breaks.

Member Author

@dmsnell dmsnell Aug 26, 2018

this makes me want to cry since it's something I love about markdown and consistent among every other markdown parser I've used.

The implication of the “one or more consecutive lines of text” rule is that Markdown supports “hard-wrapped” text paragraphs. This differs significantly from most other text-to-HTML formatters (including Movable Type’s “Convert Line Breaks” option) which translate every line break character in a paragraph into a <br /> tag.

When you do want to insert a <br /> break tag using Markdown, you end a line with two or more spaces, then type return.

Yes, this takes a tad more effort to create a <br />, but a simplistic “every line break is a <br />” rule wouldn’t work for Markdown. Markdown’s email-style blockquoting and multi-paragraph list items work best — and look better — when you format them with hard breaks.
https://daringfireball.net/projects/markdown/syntax#p

nonetheless, I have destroyed my markdown to make it happy in ee72314cc

😢

@@ -0,0 +1,260 @@
<?php

function bsdp_parse($document ) {
Member

Instead of adding a new _parse() function, can gutenberg_parse_blocks() be updated to use the new parser? We can add a filter in there for easier switching between classes: eg, existing filters in Core that filter a Class name: wp_rest_server_class, customize_dynamic_setting_class.

block_parser_class works for me.

Member Author

@dmsnell dmsnell Aug 26, 2018

see related comment response below.

I'm having some trouble understanding what you wrote @pento. I hope we create a filter to select the parsing function, but won't that depend somewhat on having unique names for each possible parse function?

also, are wp_rest_server_class and customize_dynamic_setting_class in any way related here? are you suggesting we create a class interface for the block parser class?

in lib/block.php I had originally envisioned something like this…

$parser = apply_filters( 'block_parser_class', 'bsdp_parse' );
call_user_func( $parser, $post_content );

I guess you are recommending this instead?

$parser_class = apply_filters( 'block_parser_class', 'bsdp' );
$parser = new $parser_class();
$parser->parse( $post_content );

Member Author

experimented in 064efa58d but I haven't tested it yet

for what it's worth I'd be more comfortable getting this parser in first before making the parser system pluggable just because of the scope of the changes

static $parser;

if ( ! isset( $parser ) ) {
$parser = new BSDP_Parser();
Member

I'm not wild about the BSDP_ prefix. I get why it's there, but perhaps it could be a little more descriptive?

Member

Agreed. Block_Parser()?

Member Author

mainly this is there to prevent namespace collisions. my hope is that a few PRs after this we'll have a filter choose the parser and obviously if we create two or more Block_Parser() classes we'll run into conflicts.

any thoughts on that? even with an encapsulating class we run into some issues here because I don't think we can create a class within a class. the only other way around it, I think, is actual namespacing, which isn't supported on older PHP versions…

Member

Realistically, is a completely new parser going to appear between now and 5.0? It seems like this parser is going to be the one that will go into Core.

If that's the case, we should just use a generic name. WP_Block_Parser will fit into the WordPress naming scheme.


switch ( $token_type ) {
case 'no-more-tokens':
# if not in a block then flush output
Member

Need to use // for single inline comments.

Member Author

double-slashed it in ee72314cc

return false;
}

# Otherwise we have a problem
Member

Block inline comments should be in the form:

/*
 * blah
 *
 * - foo
 * - bar
 */

Member Author

exploded comments in ee72314cc

# Block Serialization Default Parser

This library contains the default block serialization parser implementations for
WordPress documents. It provides native PHP and Javascript parsers that implement
Member

s/Javascript/JavaScript/ 🙂

Member Author

substituted in ee72314cc

@dmsnell dmsnell force-pushed the parser/rd-trampoline-php branch from e246b11 to 7cf7971 Compare August 26, 2018 19:27
@dmsnell dmsnell mentioned this pull request Aug 26, 2018
@@ -0,0 +1,25 @@
{
"name": "@wordpress/block-serialization-default-parser",
"version": "1.0.0",
Member

I would put 1.0.0-rc.0 or something like that to allow Lerna to do its job - it always bumps version so it would try to do 1.0.1 release otherwise ...

Member Author

campaigned for release in 8c7e42c

@@ -88,6 +88,7 @@ const gutenbergPackages = [
'autop',
'blob',
'blocks',
'block-serialization-default-parser',
'block-serialization-spec-parser',
Member

Can we stop bundling the other one if we don't use it in Gutenberg anymore?

Member Author

a good question. I don't want to kill the PEG parser since that maintains the spec in a way no hand-written implementation can.

in my comparator PRs I'm trying to move towards a system that will automatically run the implementations against the specification in something like a CI job so that we can have our formal specification without worrying about the implementation diverging (for example, if someone makes a change to the implementation without changing the spec first)

that is, I think we want to keep the spec-parser wherever we need it - mainly I think we want to strip it from the default load of Gutenberg, but as for whether we still build it, what do you think?

Member

The package with transpiled code is going to be there anyway. It's really up to you and how you want to use it. If you are fine with referencing it as a regular npm package then you don't need it. If you want to consume it as part of e2e test or something which requires all Gutenberg build files then you can leave it as is. I just wanted to raise the awareness.

Member Author

thanks - this is mainly just out of my expertise at this point. if you are willing to make a decision on it or can tell me what we should do then that would help me out.

it seems like several people want these parser tests to be written with jest and somehow in the normal suite - I don't know what that means here for this decision

@@ -369,6 +376,7 @@ function gutenberg_register_scripts_and_styles() {
array(
'wp-autop',
'wp-blob',
'wp-block-serialization-default-parser',
'wp-block-serialization-spec-parser',
Member

I think we no longer need to list wp-block-serialization-spec-parser as a dependency. In addition, we should stop registering it, too.

Member Author

agreed on this one but I wasn't entirely sure how we wanted this to work…

do we want Gutenberg to automatically replace the spec parser with the "default" one at boot through a filter or do we want the "default" to be the default?

I want the auto-generated parser to be available still, especially for things like diagnostics and exploration.

Member

As commented above, it all depends on the way you want to use it. I don't have any strong opinions about it. We should just ensure we don't ship unused code to the end users.

Contributor

Do we have a decision here?

Member Author

I left the spec parser registered but un-enqueued it in 66455b4

lib/blocks.php Outdated
*
* @param string $parser_class Name of block parser class
*/
$parser_class = apply_filters( 'block_parser_class', 'BDSP_Parser' );
Member

We should document it in the extensibility docs. Probably, the main document would be the best fit: https://github.com/WordPress/gutenberg/blob/master/docs/extensibility.md.

Member Author

documented in 8c7e42c

Contributor

This still reads BDSP. :)

Member Author

another great catch - fixed in 66455b4

@@ -378,6 +378,6 @@ const createParse = ( parseImplementation ) =>
*
* @return {Array} Block list.
*/
-export const parseWithGrammar = createParse( grammarParse );
+export const parseWithGrammar = createParse( defaultParse );
Member

Should we offer a filter for JS implementation, too?

Member Author

yes but I wasn't sure if this PR was the right one for it. that is, filtering out the PHP side seemed somewhat straightforward while filtering the JS side seemed more complicated since we have to take into account things like loading the parser bundles and making sure they are available before the editor loads

do you think we need to do it all here in this PR?

Member

It's totally fine as its own PR, I just wanted to ensure we tackle both PHP and JS side of things.
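For illustration only, a JS-side counterpart to the PHP `block_parser_class` filter might look roughly like the sketch below. Everything here is hypothetical: `addParserFilter`, `selectParser` and the stand-in parsers are invented for the example and are not the `@wordpress/hooks` API. The sketch also ignores the bundle-loading concern raised above.

```javascript
// Hypothetical, self-contained filter registry for choosing a parse function.
const parserFilters = [];

function addParserFilter( callback ) {
	parserFilters.push( callback );
}

// Pass the default parse function through every registered filter and
// return whichever implementation comes out the other end.
function selectParser( defaultParse ) {
	return parserFilters.reduce( ( parse, filter ) => filter( parse ), defaultParse );
}

// Stand-in parsers, for illustration only.
const defaultParse = ( document ) => [ { blockName: null, innerHTML: document } ];
const emptyParse = () => [];

const before = selectParser( defaultParse )( 'hello' );

// A plugin swaps in its own parser, much like select_empty_parser does in PHP.
addParserFilter( () => emptyParse );
const after = selectParser( defaultParse )( 'hello' );
```

A real implementation would also have to guarantee that the replacement parser is loaded before the editor starts parsing, which is the complication noted in the reply above.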

@dmsnell dmsnell force-pushed the parser/rd-trampoline-php branch from 6f4be14 to 07ffe45 Compare August 27, 2018 12:55
return 'EmptyParser';
}

add_filter( 'block_parser_class', select_empty_parser, 10, 1 );
Member

I think we provide the name of the function as a string in other examples to ensure it works with PHP 5.2. We might also want to prefix the function name with the plugin name:

add_filter( 'block_parser_class', 'my_plugin_select_empty_parser', 10, 1 );

Member Author

good catch! I never meant to leave out the string - just neglected it - updated in 96ecfb8

Member

gziolo commented Aug 27, 2018

8c7e42c looks great, I left one comment which is a tiny thing that affects only PHP 5.2...

}

function bdsp_select_parser( $prev_parse_class ) {
return 'BSDP_Parser';
Contributor

There's a typo at BSDP. Anyway, given that the apply_filters call in gutenberg_parse_blocks defaults to 'BDSP_Parser', we should remove this bit.

Member Author

good catch! I removed the function in 9c85a60

Contributor

mcsf commented Aug 28, 2018

I'm getting a tokenization bug while testing with a personal post. Digging…

const namespace = namespaceMatch || 'core/';
const name = namespace + nameMatch;
const hasAttrs = !! attrsMatch;
const attrs = hasAttrs ? JSON.parse( attrsMatch ) : null;
Contributor

I know there's a performance hit with try, but we should play it safe with JSON.parse, or generally speaking make sure we can inform the user of bad input and recover (e.g. isolate bad blocks) as best as possible. Thoughts, @dmsnell?

Member Author

added the try in 9c85a60 but left it out of the PHP since in PHP it already returns null on a failed parse
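The commit itself is not shown in this thread, so the following is only a guess at the shape of such a guard. The hypothetical `parseAttrs` helper returns `null` for bad input, mirroring what PHP's `json_decode()` does on a failed parse.

```javascript
// Hypothetical helper: parse a block's attribute JSON, returning null
// instead of throwing when the input is not valid JSON.
function parseAttrs( json ) {
	try {
		return JSON.parse( json );
	} catch ( error ) {
		return null;
	}
}

const good = parseAttrs( '{"ref":313}' );
const bad = parseAttrs( '{"ref":313} /--><!-- wp:block {"ref":482}' );
```

One caveat of this design: the literal JSON text `null` also parses to `null`, so callers cannot tell "bad JSON" from "valid JSON null" without extra bookkeeping.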

Contributor

mcsf commented Aug 29, 2018

@dmsnell: I've pushed a failing test for the parser. The gist of it is that I think the tokenizer is too greedy when looking for the end of an attributes group ({"some":"json"}). Thus, a document with two self-closing attribute-equipped blocks, not necessarily consecutive, breaks the parser:

<!-- wp:block {"ref":313} /-->
<!-- wp:block {"ref":482} /-->

This makes the parser throw a syntax error in the JSON.parse call:

SyntaxError: Unexpected token / in JSON at position 19

We should guarantee handling of any bad JSON here, but that's not the real issue. The issue is in the tokenizer, as the following fragment was returned as a match for attrsMatch:

{\"ref\":313} /--><!-- wp:block {\"ref\":482}

Note that, in contrast, the following input is correctly parsed:

<!-- wp:block {"ref":313} -->
<!-- /wp:block -->
<!-- wp:block {"ref":482} /-->

I used the following debugger patch:

diff --git a/packages/block-serialization-default-parser/src/index.js b/packages/block-serialization-default-parser/src/index.js
index 9c1983f22..007edd2b5 100644
--- a/packages/block-serialization-default-parser/src/index.js
+++ b/packages/block-serialization-default-parser/src/index.js
@@ -172,7 +172,7 @@ function nextToken() {
 	const namespace = namespaceMatch || 'core/';
 	const name = namespace + nameMatch;
 	const hasAttrs = !! attrsMatch;
-	const attrs = hasAttrs ? JSON.parse( attrsMatch ) : null;
+	const attrs = hasAttrs ? safeParse( attrsMatch ) : null;
 
 	// This state isn't allowed
 	// This is an error
@@ -192,6 +192,17 @@ function nextToken() {
 	return [ 'block-opener', name, attrs, startedAt, length ];
 }
 
+function safeParse( json ) {
+	let r;
+	try {
+		r = JSON.parse( json );
+	} catch ( e ) {
+		console.error( `Input of length ${ json.length }`, json );
+		throw e;
+	}
+	return r;
+}
+
 function addFreeform( rawLength ) {
 	const length = rawLength ? rawLength : document.length - offset;

@mcsf mcsf force-pushed the parser/rd-trampoline-php branch from a2dae1e to c154286 Compare August 29, 2018 10:05
@dmsnell dmsnell force-pushed the parser/rd-trampoline-php branch from c154286 to 138614d Compare August 29, 2018 17:50
Member Author

dmsnell commented Aug 29, 2018

the tokenizer is too greedy when looking for the end of an attributes group

excellent find @mcsf! you are right - I let in a greedy match when I had no reason to! that's been taken out by the addition of the ? to make the (?!-->). group un-greedy as it should be. I'm embarrassed that I let it in but so glad you found it and added the failing tests!

un-greedy modifier added in 9c85a60
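To see the difference on the failing input, here is a small sketch using simplified stand-in patterns. These are assumptions for illustration and are not the tokenizer from this PR. The only difference between the two is the `?` after the `*` quantifier.

```javascript
const doc = '<!-- wp:block {"ref":313} /-->\n<!-- wp:block {"ref":482} /-->';

// Greedy: the attribute group runs to the LAST closing brace that is still
// followed by a void-block terminator, swallowing the first block's ending.
const greedy = /<!--\s+wp:block\s+(\{[\s\S]*\})\s+\/-->/;

// Lazy (un-greedy): the attribute group stops at the FIRST such closing brace.
const lazy = /<!--\s+wp:block\s+(\{[\s\S]*?\})\s+\/-->/;

const greedyAttrs = doc.match( greedy )[ 1 ];
const lazyAttrs = doc.match( lazy )[ 1 ];

console.log( greedyAttrs ); // spans both blocks, so JSON.parse would throw
console.log( lazyAttrs ); // only the first block's attributes
```

The greedy capture contains the first block's terminator and the second block's opener, which matches the kind of fragment reported above.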

also I rebased the branch

Contributor

mcsf commented Sep 11, 2018

Concerning the requiring of the PHP implementation, #9791 needs investigating.

Member

aduth commented Sep 17, 2018

Potential regression noted at #9968

dmsnell added a commit that referenced this pull request Sep 18, 2018
Resolves #9968

It was noted that a classic block preceding a void block would
disappear in the editor while if that same classic block preceded
the long-form non-void representation of an empty block then things
would load as expected.

This behavior was determined to originate in the new default parser
in #8083 and the bug was that with void blocks we weren't sending
any preceding HTML soup/freeform content into the output list.

In this patch I've duplicated some code from the block-closing
function of the parser to spit out this content when a void block
is at the top-level of the document.

This bug did not appear when void blocks are nested because it's
the parent block that eats HTML soup. In the case of the top-level
void however we were immediately pushing that void block to the
output list and neglecting the freeform HTML.

I've added a few tests to verify and demonstrate this behavior.
Actually, since I wasn't sure what was wrong I wrote the tests first
to try and understand the behaviors and bugs. There are a few tests
that are thus not entirely essential but worthwhile to have in here.
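
The behavior described in this commit message can be sketched with a deliberately tiny, top-level-only pass. The `parseTopLevelVoids` function and its pattern are hypothetical, it handles only void blocks without attributes, and it is not the code from the fix.

```javascript
// Hypothetical sketch: split a document into freeform fragments and
// top-level void blocks. Nested blocks, non-void blocks and attributes
// are out of scope here.
function parseTopLevelVoids( document ) {
	const voidBlock = /<!--\s+wp:([a-z][a-z0-9_\/-]*)\s+\/-->/g;
	const output = [];
	let offset = 0;
	let match;

	while ( null !== ( match = voidBlock.exec( document ) ) ) {
		// Flush any HTML sitting between the previous token and this void
		// block. Leaving this out drops the preceding freeform content.
		if ( match.index > offset ) {
			output.push( { blockName: null, innerHTML: document.slice( offset, match.index ) } );
		}
		output.push( { blockName: match[ 1 ], innerHTML: '' } );
		offset = match.index + match[ 0 ].length;
	}

	// Trailing freeform content after the last block.
	if ( offset < document.length ) {
		output.push( { blockName: null, innerHTML: document.slice( offset ) } );
	}
	return output;
}

const blocks = parseTopLevelVoids( '<p>classic</p><!-- wp:separator /-->' );
```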
dmsnell added a commit that referenced this pull request Sep 18, 2018
* Parser (Fix): Output freeform content before void blocks

Resolves #9968

Contributor

@mcsf mcsf left a comment

I hadn't realized this before — as my primary testing interface was the WP API (gist), through which everything is serialized into the same shape — but I now fear that we're not providing a consistent interface with the parser in its current state.

See my inline comments. Consumers of gutenberg_parse_blocks may make mistakes because of these discrepancies, and I fear they may already have: #10041.

cc @dmsnell


if ( isset( $stack_top->leading_html_start ) ) {
$this->output[] = array(
'attrs' => array(),
Contributor

(copy-pasting a comment that I added in the more recent #9984) In this same file I'm seeing conflicting shapes for attrs:

'attrs' => array(), // here
'attrs' => new stdClass(), // in `add_freeform`

Member Author

good call - I know there are some lingering inconsistencies too around null vs. {} in the spec grammar. a good follow-up PR that's been on my TODO list

* @since 3.8.0
* @var WP_Block_Parser_Block[]
*/
public $output;
Contributor

I'm concerned about this promise that $output is an array of WP_Block_Parser_Block, since freeform fragments are added as [associative] arrays and not class instances.

Member Author

we can definitely consider wiping the output clean of its classes - I didn't at first because it seemed benign to retain them, but if we sacrifice a little performance we can json_decode( json_encode( $output ) ) and clear it up

Contributor

@jorgefilipecosta mentioned implementing an ArrayObject interface in our classes so that one can traverse our parser output natively, rather than doing the JSON dance. What do you think?

Member Author

That's a good question. It means more divide between the PHP and JS versions of the parser. What's the JSON dance? Wouldn't having ArrayObject be somewhat superfluous?

// this already works with arrays and objects!
$blocks = parse( $document );
$blocks = array_map( $my_transformer, $blocks );

we probably want to fix the bug as a separate thing from adding interfaces. I'm skeptical of the value of the latter if the former is resolved.

Contributor

@mcsf mcsf Sep 20, 2018

By JSON dance I meant json_decode( json_encode( $output ) ), sorry for not being clear.

we probably want to fix the bug as a separate thing from adding interfaces

So this is the actual issue: #10047. It's not the traversal (looking at your array_map example) but rather accessing properties of a block, which can either mean accessing properties of an array or of an object.

Member

@jorgefilipecosta jorgefilipecosta Sep 20, 2018

Classes offer some advantages: we can publish abstract classes that contain the fields plugins can safely access, and other parsers can extend these general classes. Simple arrays don't offer these guarantees.

But now we have a problem: some plugins are dependent on using simple arrays. Even if this bug was already caught, I'm not sure we can change the API to use classes.

So I think our options are to revert and use arrays, or to advance and change our API to use classes. In the second case, to stay back-compatible with existing implementations that access blocks using the array syntax, I think our only solution is ArrayObject. It allows us to temporarily return something that behaves like a class for new implementations and like an array for old implementations. In this case, we would add deprecation messages saying we now return objects.

Member Author

It's not the traversal (looking at your array_map example) but rather accessing properties of a block, which can either mean accessing properties of an array or of an object.

to me this is just evidence that the work to make all attribute reporting consistent is necessary. some attributes are null, some are objects

By JSON dance I meant json_decode( json_encode( $output ) ), sorry for not being clear.

that would be in the parser and wouldn't have to be manually performed. in fact, the classes are only even there for performance, so we can test the change of storing everything in plain old objects vs. converting at the end. if it's a degradation then we can simply remove the classes if we want to preserve the simpler interface.
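
For readers unfamiliar with the "JSON dance", here is the same idea expressed in JavaScript. This is a sketch under stated assumptions: the `Block` class is a hypothetical stand-in for a parser-internal class, and the PHP version discussed above is `json_decode( json_encode( $output ) )`.

```javascript
// Hypothetical stand-in for a parser-internal block class.
class Block {
	constructor( blockName, attrs ) {
		this.blockName = blockName;
		this.attrs = attrs;
	}
}

// Mixed output: one class instance and one plain object, which is the
// kind of inconsistency raised in this thread.
const output = [
	new Block( 'core/paragraph', {} ),
	{ blockName: null, attrs: {} },
];

// Serialize and re-parse so that every entry ends up with the same plain shape.
const normalized = JSON.parse( JSON.stringify( output ) );

const allPlain = normalized.every(
	( block ) => Object.getPrototypeOf( block ) === Object.prototype
);
```

The round trip costs an extra serialization pass, which is the performance trade-off mentioned above.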

mcsf pushed a commit that referenced this pull request Sep 21, 2018
* Parser (Fix): Output freeform content before void blocks

Resolves #9968

It was noted that a classic block preceding a void block would
disappear in the editor, whereas the same classic block preceding
the long-form, non-void representation of an empty block would
load as expected.

This behavior was determined to originate in the new default parser
in #8083 and the bug was that with void blocks we weren't sending
any preceding HTML soup/freeform content into the output list.

In this patch I've duplicated some code from the block-closing
function of the parser to spit out this content when a void block
is at the top-level of the document.

This bug did not appear when void blocks are nested because it's
the parent block that eats HTML soup. In the case of the top-level
void however we were immediately pushing that void block to the
output list and neglecting the freeform HTML.

I've added a few tests to verify and demonstrate this behavior.
Actually, since I wasn't sure what was wrong I wrote the tests first
to try and understand the behaviors and bugs. There are a few tests
that are thus not entirely essential but worthwhile to have in here.
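
A minimal JavaScript sketch of the behavior this commit message describes, assuming a heavily simplified scanner that only recognizes top-level void blocks. The names `VOID_BLOCK`, `freeform`, and `parseTopLevelVoids` are hypothetical and not taken from the parser, and representing freeform content with a `null` block name is an assumption.

```javascript
// Hypothetical, simplified pattern: matches only void blocks without
// attributes, e.g. <!-- wp:more /-->. The real tokenizer covers far more.
const VOID_BLOCK = /<!--\s+wp:([a-z][a-z0-9_\/-]*)\s+\/-->/g;

// Assumption: freeform (HTML soup) content is a block with a null name.
function freeform(html) {
  return { blockName: null, attrs: {}, innerBlocks: [], innerHTML: html };
}

function parseTopLevelVoids(source) {
  const output = [];
  let offset = 0;

  for (const match of source.matchAll(VOID_BLOCK)) {
    // The step the fix is about: flush any freeform HTML that sits
    // between the previous token and this void block before pushing it.
    if (match.index > offset) {
      output.push(freeform(source.slice(offset, match.index)));
    }
    output.push({ blockName: match[1], attrs: {}, innerBlocks: [], innerHTML: "" });
    offset = match.index + match[0].length;
  }

  // Trailing freeform content after the last void block.
  if (offset < source.length) {
    output.push(freeform(source.slice(offset)));
  }
  return output;
}
```

Calling `parseTopLevelVoids("<p>Classic</p><!-- wp:more /-->")` yields two entries: the freeform block holding `<p>Classic</p>`, then the void block. Dropping the `if (match.index > offset)` branch reproduces the reported symptom, where the preceding classic content never reaches the output list.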
dmsnell added a commit that referenced this pull request Sep 22, 2018
Resolves #10041
Resolves #10047

A few inconsistencies have remained in the grammar specification
concerning freeform blocks and blocks without attributes in the
block delimiters. Freeform blocks were returned without block
names and blocks without attributes returned `null` instead of
an empty set of attributes.

Further, the default parser implementation (from #8083) was
returning an array of block objects instead of an array of
generic arrays. This resulted in mismatches in PHP of accessing
properties with `$block[ 'attrs' ]` syntax vs `$block->attrs`
syntax.

In this patch I've updated the specification to remove all of
the type ambiguity and have updated the default parser to match
it. After this patch every block should be accessible as a normal
array in PHP and have all properties: `blockName`, `attrs`,
`innerBlocks`, and `innerHTML`. If no attributes are specified
then `attrs` will be an empty set (in JavaScript `{}` and in
PHP `array()`).
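
As an illustration of the normalized shape described above, here is a small JavaScript sketch. `normalizeBlock` is a hypothetical helper and not part of the parser, and the choice of `null` as the name for freeform blocks is an assumption; the four property names come from the commit message.

```javascript
// Hypothetical helper: coerce a loosely shaped block into an object that
// always carries blockName, attrs, innerBlocks, and innerHTML.
function normalizeBlock(block) {
  return {
    // Assumption: a freeform block is given an explicit null name.
    blockName: block.blockName ?? null,
    // Missing or null attributes become an empty set instead of null.
    attrs: block.attrs ?? {},
    // Nested blocks are normalized recursively.
    innerBlocks: (block.innerBlocks ?? []).map(normalizeBlock),
    innerHTML: block.innerHTML ?? "",
  };
}
```

With a shape like this, a consumer can read `block.attrs` without first checking for `null`, which is the kind of type ambiguity the patch sets out to remove.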
dmsnell added a commit that referenced this pull request Sep 22, 2018
There are numerous needs to process posts and block content from its
structured form without demanding that plugin authors implement their
own parsing systems.

Since the new default parser was implemented in #8083, the server-side
parse is now fast enough to consider doing full parses of our documents,
and with that comes the idea that we can filter block content from the
parser itself.

In this patch I'm exploring an API to allow extending the parser's
behavior by post-processing blocks as they enter the parser's output
array. This new filter gives the ability to transform all of the block's
properties as they finish parsing.

In the case of inner blocks the filter runs as the inner blocks have
finished their own nesting. In the case of top-level blocks the filter
runs after all inner content has finished parsing.

One use case is in #8760 where we want to replace the HTML parts of
blocks while preserving other structure. Another use case could be
removing specific inner blocks or content based on the current user
requesting a post.

This filter exposes a kind of visitor pattern for the nested parse.
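
The visiting order described above can be sketched in JavaScript as a walk over an already-parsed tree. This is only an illustration: `filterBlocks` and its callback signature are hypothetical, the proposed filter runs inside the parser rather than after it, and using `null` to mean "drop this block" is an assumption.

```javascript
// Hypothetical post-order walk: inner blocks are filtered before their
// parent, so the callback sees a block whose children are already final.
function filterBlocks(blocks, visit) {
  const output = [];
  for (const block of blocks) {
    const innerBlocks = filterBlocks(block.innerBlocks, visit);
    const result = visit({ ...block, innerBlocks });
    // Assumption: returning null from the callback removes the block.
    if (result !== null) {
      output.push(result);
    }
  }
  return output;
}

// Example callback: remove one kind of inner block, keep everything else.
function dropParagraphs(block) {
  return block.blockName === "core/paragraph" ? null : block;
}
```

Because children are visited first, a callback that rewrites `innerHTML` or removes blocks never has to recurse on its own; the walker hands it each node once, from the deepest nesting level outward.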

> **THIS IS AN INCOMPLETE PATCH DO NOT MERGE**
mcsf pushed a commit that referenced this pull request Oct 2, 2018
Resolves #10041
Resolves #10047

mcsf pushed a commit that referenced this pull request Oct 5, 2018
Resolves #10041
Resolves #10047

dmsnell added a commit that referenced this pull request Oct 6, 2018
* Parser: Normalize data types and fix default implementation

Resolves #10041
Resolves #10047

dmsnell added a commit that referenced this pull request Oct 10, 2018
Previously we have been using a simplified parse to grab dynamic
blocks and replace them with their rendered content.

Since #8083 we've had a fast default parser which removes the need
for a simplified parse here.

In this patch we're replacing the existing simplified parser in
`do_blocks` with the new default parser. This will open up new
opportunities for working with nested blocks on the server.
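
To illustrate the idea of replacing dynamic blocks with rendered content from a full parse, here is a rough JavaScript analogue. It is not the PHP `do_blocks` implementation: the `renderers` registry, the `renderBlocks` function, and the sample markup are all hypothetical, and only top-level blocks are handled.

```javascript
// Hypothetical registry of dynamic blocks: block name -> render callback.
const renderers = {
  "core/latest-posts": (attrs) =>
    `<ul class="latest-posts" data-count="${attrs.postsToShow ?? 5}"></ul>`,
};

// Walk the parsed top-level blocks: dynamic blocks are replaced by their
// rendered output, everything else passes through as its stored HTML.
function renderBlocks(blocks) {
  return blocks
    .map((block) => {
      const render = renderers[block.blockName];
      return render ? render(block.attrs) : block.innerHTML;
    })
    .join("");
}
```

Working from the parsed structure instead of a separate simplified scan means the render callbacks receive the same `attrs` and nesting information the editor sees.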
dmsnell added a commit that referenced this pull request Nov 9, 2018
Since the introduction of the default parser in #8083 we have had a
subtle bug: parsing failed when empty attributes (`{}`) were specified
in a block's comment delimiter.

The absence of attributes was fine, but _empty_ attributes caused a
failure. This is due to using `+?` in the RegExp tokenizer instead of
`*?` (which allows for no inner content in the JSON string).

This patch updates the quantifier to restore functionality and fix the
bug. The bug didn't appear in practice because we don't intentionally
set `{}` as the attributes (the serializer drops it altogether), and
our tests didn't catch it for similar reasons.
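
The difference between the two quantifiers can be reproduced with a pair of reduced patterns. These are hypothetical simplifications written for this illustration, not the tokenizer's actual expression; `matchAttrs` is likewise an invented helper.

```javascript
// Hypothetical, reduced delimiter patterns for a void block. The only
// difference is the lazy quantifier inside the optional JSON group.
const withPlus = /<!--\s+wp:([a-z][a-z0-9_\/-]*)\s+(\{.+?\}\s+)?\/-->/;
const withStar = /<!--\s+wp:([a-z][a-z0-9_\/-]*)\s+(\{.*?\}\s+)?\/-->/;

// Returns whether the delimiter matched and, if so, its decoded attributes.
function matchAttrs(pattern, input) {
  const m = pattern.exec(input);
  if (!m) {
    return { matched: false };
  }
  // An absent attribute group is reported as an empty set.
  return { matched: true, attrs: m[2] ? JSON.parse(m[2]) : {} };
}
```

With `+?`, the group requires at least one character between the braces, so `<!-- wp:spacer {} /-->` cannot match the group and the whole delimiter fails. With `*?`, the same input matches and decodes to an empty object. Both patterns accept a delimiter with no attributes at all and one with non-empty attributes, which is why the problem only surfaces for a literal `{}`.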
@mcsf mcsf mentioned this pull request Nov 19, 2018
Labels
[Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f [Type] Enhancement A suggestion for improvement.
8 participants