Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tag Processor: Prevent bugs from pre-PHP8 strspn/strcspn behavior #45822

Merged
merged 1 commit into from
Nov 16, 2022

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Nov 16, 2022

What?

Prevent infinite loop in WP_HTML_Tag_Processor when parsing truncated HTML.

Why?

Follows #45537
Replaces #45803

When parsing truncated HTML it was brought to our attention that when passing out-of-bounds indices to strspn() and strcspn() that the behavior is different before and after PHP8. We also realized that when we cleaned up the problems with substr() we left some indices without bounds checking and that led to a different flavor of the same problem.

When parsing the following HTML we run into warnings when calling strspn() and strcspn(). For pre-PHP8 versions this also leads to an infinite loop while in later versions it simply omits a warning.

<!-- wp:gallery {"linkTo":"none"} -->
<figure class="wp-block-gallery has-nested-images ...

In this patch we're adding proper bounds checking wherever we update the internal pointer in the Tag Processor to avoid any further out-of-bounds issues.

While this patch fixes the core issue at stake, it's worth performing a more complete audit of the index usage throughout the class and consider internalizing the string methods to avoid version inconsistencies and provide a more robust mechanism for aborting when passing the end of the provided input document.

Props to @aidvu for quickly identifying this issue.

How?

Inserts bounds-checking wherever we read unchecked string indices.

Testing Instructions

You can use the following script to confirm the behavior in trunk and confirm the fix. Place the script in the root directory of your Gutenberg repo.

<?php
require "lib/experimental/html/index.php";

$html = <<<EOF
<!-- wp:gallery {"linkTo":"none"} -->
<figure class="wp-block-gallery has-nested-images ...
EOF;

$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag();

In trunk on PHP<8 this will create an infinite loop while in the branch it will terminate.
In trunk this should issue a PHP Warning: Uninitialized string offset 92 warning while in this branch there should be no warning.

@dmsnell dmsnell added the [Type] Bug An existing feature does not function as intended label Nov 16, 2022
@dmsnell dmsnell added this to the Gutenberg 14.5 milestone Nov 16, 2022
@codesandbox
Copy link

codesandbox bot commented Nov 16, 2022

CodeSandbox logoCodeSandbox logo  Open in CodeSandbox Web Editor | VS Code | VS Code Insiders

@dmsnell dmsnell force-pushed the tag-processor/avoid-strspn-indexing-issues branch 2 times, most recently from da7fb86 to 08d8a50 Compare November 16, 2022 19:18
Copy link
Member

@fullofcaffeine fullofcaffeine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested using the script provided and the changes here, and it did not enter a deadlock. LGTM!

@dmsnell
Copy link
Member Author

dmsnell commented Nov 16, 2022

thank you @fullofcaffeine! I had to push a small fix for breakage on RCDATA states, but once the tests pass again I'll merge

@dmsnell dmsnell force-pushed the tag-processor/avoid-strspn-indexing-issues branch from 08d8a50 to 9b27586 Compare November 16, 2022 19:31
Follows #45537

When parsing truncated HTML it was brought to our attention that when
passing  out-of-bounds indices to `strspn()` and `strcspn()` that the
behavior is different before and after PHP8. We also realized that when
we cleaned up the problems with `substr()` we left some indices without
bounds checking and that led to a different flavor of the same problem.

When parsing the following HTML we run into warnings when calling `strspn()`
and `strcspn()`. For pre-PHP8 versions this also leads to an infinite loop
while in later versions it simply omits a warning.

```php
<!-- wp:gallery {"linkTo":"none"} -->
<figure class="wp-block-gallery has-nested-images ...
```

In this patch we're adding proper bounds checking wherever we update the
internal pointer in the Tag Processor to avoid any further out-of-bounds
issues.

While this patch fixes the core issue at stake, it's worth performing a
more complete audit of the index usage throughout the class and consider
internalizing the string methods to avoid version inconsistencies and
provide a more robust mechanism for aborting when passing the end of the
provided input document.

Props to @aidvu for quickly identifying this issue.
@dmsnell dmsnell force-pushed the tag-processor/avoid-strspn-indexing-issues branch from 9b27586 to 5f53eaf Compare November 16, 2022 19:41
@dmsnell dmsnell merged commit 5f53eaf into trunk Nov 16, 2022
@dmsnell dmsnell deleted the tag-processor/avoid-strspn-indexing-issues branch November 16, 2022 20:26
@@ -397,6 +397,10 @@ public function next_tag( $query = null ) {
$already_found = 0;

do {
if ( $this->parsed_bytes >= strlen( $this->html ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the drive by review but I just noticed that strlen( $this->html ) is being called throughout this class. Would this be optimized by storing the length in a $html_length member variable in the constructor and the get_updated_html method, and then using it instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the question @westonruter

the answer I believe is no because strlen isn't like a normal PHP function call, it's a PHP op-code and the execution is a direct lookup on the internal string object within the PHP runtime. maybe it saves one indirect memory access, but if that's true I'm not sure it'd be measurable (whereas a function call is often measurable!)

I haven't confirmed this but the docs inside PHP suggest this might even be special-cased and there's truly no difference in the generated CPU instructions. the opcode output is different, but not in a way that sheds light on the CPU instructions (other than that it's not a function call with their characteristic overhead).

so we have a lint rule floating around I think that discourages this style but I believe it's based on a false premise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Type] Bug An existing feature does not function as intended
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants