Parsing robustness #8436

Traumflug · 2021-04-19T00:34:59Z

Fallout of a hacking Sunday. Tackles #7567 and #8417.

Tested against my own project, only, as I'm not aware how the testsuite works.

I'm less sure about commit 0d91578, "don't shortcut on previously seen elements", because I don't know how much performance it costs. It fixes the handling of ignored tags, but probably needs either removal of the code that is now dead or a different solution for ignored tags that appear more than once.

CLAassistant · 2021-04-19T00:35:04Z

All committers have signed the CLA.

Traumflug · 2021-04-19T01:20:13Z

In case somebody wants to enhance the testsuite, here are additional cases:

<script type="text/javascript">
    for (var i = 0; i < imgDefer.length; i++) {
        doNothing();
    }
</script>

<script
  type="text/javascript"
>
  if (a<b && a>c) {
    haha();
  }
</script>

<form
	action="[URL]"
	method="post"
>

<hr	id=a>

Yes, newlines and tab vs. space matter :-)

A failure of any of these cases shows up as non-alphabetical characters in hugo_stats.json.
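
A rough way to automate that check is sketched below. It assumes hugo_stats.json keeps its tag names under htmlElements.tags, and it treats anything other than ASCII letters, digits and hyphens as a sign of parser garbage; both the helper names and the sample garbage entry are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stats mirrors only the parts of hugo_stats.json this check needs
// (assumed layout: {"htmlElements": {"tags": [...], ...}}).
type stats struct {
	HTMLElements struct {
		Tags []string `json:"tags"`
	} `json:"htmlElements"`
}

// isPlainTag reports whether name consists only of ASCII letters,
// digits and hyphens. This definition of "clean" is an assumption.
func isPlainTag(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '-':
			// allowed character
		default:
			return false
		}
	}
	return true
}

// suspiciousTags returns the tag entries that look like parser garbage.
func suspiciousTags(data []byte) ([]string, error) {
	var s stats
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	var bad []string
	for _, t := range s.HTMLElements.Tags {
		if !isPlainTag(t) {
			bad = append(bad, t)
		}
	}
	return bad, nil
}

func main() {
	// Hypothetical file content; the third tag is invented garbage.
	sample := []byte(`{"htmlElements":{"tags":["div","h1","imgdefer.length;"],"classes":[],"ids":[]}}`)
	bad, err := suspiciousTags(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(bad)
}
```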

bep · 2021-04-20T07:21:09Z

> Tested against my own project, only, as I'm not aware how the testsuite works.

go test ./publisher

bep · 2021-04-20T07:54:56Z

Thanks for this.

There are failing tests. It looks mostly to be elements inside pre/textarea, which I don't see any reason to collect (but please prove me wrong).

Also, it would be great if you could:

- Add some test cases for the cases this PR is supposed to fix.
- Run the benchmark inside the ./publisher package and compare it to master (see below).

name                     old time/op    new time/op    delta
ClassCollectorWriter-16    21.2µs ± 2%    47.3µs ± 1%  +123.21%  (p=0.029 n=4+4)

name                     old alloc/op   new alloc/op   delta
ClassCollectorWriter-16    35.3kB ± 0%    86.4kB ± 0%  +144.95%  (p=0.029 n=4+4)

name                     old allocs/op  new allocs/op  delta
ClassCollectorWriter-16       155 ± 0%       329 ± 0%  +112.26%  (p=0.029 n=4+4)

Note that I don't mind it getting slower if really needed, but I fail to understand the motivation behind the removal of elementSet, which I suspect is the reason why it is now so much slower and memory hungry.

bep · 2021-04-20T15:28:03Z

Can you check whether the latest commits re this in the master branch cover your needs?

Traumflug · 2021-04-22T01:37:01Z

Thanks for the hint on how to run these testcases easily. I just pushed a reviewed set of commits, rebased on current master.

> There are failing tests. It looks mostly to be elements inside pre/textarea, which I don't see any reason to collect

Well, the simple reason for these failures was me removing special handling of pre/textarea, but not knowing how to remove these tests as well.

That said, removing these was a mistake; I had misunderstood the intention of this code. It's not about skipping non-styleable tags, but about ignoring whatever is inside them. Accordingly, I reduced the first commit to just adding two more such tags with preformatted content. <title> isn't exactly preformatted, but it is of no use for styling anyway.

> Can you check whether the latest commits re this in the master branch cover your needs?

Well, some, but not all. This time, knowing how to deal with these tests, I managed to find a testcase for each. I committed the testcases separately so they can be seen failing. Luckily, I also found a fix for each.

Also one or two testcases which fell into the keyboard somehow, without a matching reported issue I'd be aware of :-)

> Note that I don't mind it getting slower if really needed, but I fail to understand the motivation behind the removal of elementSet

This time I moved the commit with this removal to the end, preceded by a commit with a testcase that demonstrates the problem (failing before, passing after), so that the earlier commits can be (cherry-)picked on their own.

The problem shows up when such a preformatted tag occurs twice and contains a < (rare in testcases, common in real-world usage). The first time, everything is fine and dandy. The second time, the shortcut that saves processing time skips the part where w.inPreTag gets set. Accordingly, the parsing that follows doesn't recognize the preformatted tag as such and treats the content inside it like normal tags.

Perhaps you have a better idea on how to get the best of both, quick processing as well as recognizing repeated preformatted tags.
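
One possible direction, sketched here with a toy collector rather than Hugo's actual writer: update the "inside a preformatted tag" state before taking the shortcut, and skip only the expensive part for tags seen before. The type, the field names and the list of preformatted tags are all hypothetical, and tokens are assumed to arrive already split and without attributes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// preTags is an assumed subset of tags whose content must not be parsed.
var preTags = map[string]bool{"pre": true, "script": true, "textarea": true}

// collector is a toy stand-in for the real class collector.
type collector struct {
	seen  map[string]bool // raw start tags already processed
	tags  map[string]bool // collected tag names
	inPre string          // name of the open preformatted tag, "" if none
}

func newCollector() *collector {
	return &collector{seen: map[string]bool{}, tags: map[string]bool{}}
}

// feed handles one token such as "<b>" or "</script>".
func (c *collector) feed(tok string) {
	name := strings.ToLower(strings.Trim(tok, "</>"))
	isEnd := strings.HasPrefix(tok, "</")
	if c.inPre != "" {
		// Inside preformatted content only the matching end tag matters.
		if isEnd && name == c.inPre {
			c.inPre = ""
		}
		return
	}
	if isEnd {
		return
	}
	// State update happens BEFORE the shortcut, so a repeated <script>
	// still switches the collector into preformatted mode.
	if preTags[name] {
		c.inPre = name
	}
	if c.seen[tok] {
		return // shortcut: the expensive work was done the first time
	}
	c.seen[tok] = true
	c.tags[name] = true
}

// collected returns the collected tag names in sorted order.
func (c *collector) collected() []string {
	names := make([]string, 0, len(c.tags))
	for n := range c.tags {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	c := newCollector()
	for _, tok := range []string{"<script>", "<b>", "</script>", "<script>", "<i>", "</script>", "<p>"} {
		c.feed(tok)
	}
	fmt.Println(c.collected())
}
```

In this toy version the second <script> takes the shortcut but still sets inPre, so the <i> inside it is ignored just like the <b> inside the first one. Whether the same split between cheap state tracking and cached work fits the real code is an open question.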

These are two of the symptoms reported with #7567. The second test passes already. The first test fails, because the '<' lets the parsing logic start a new tag, which only ends on the actual closing tag. This means garbage in front of the actual closing tag, making parseEndTag() fail to recognize it as a closing tag.

This ends the parsing of 'preformatted' tags not at any tag found, but only at the matching end tag. The key is to look at the end of a tag (strings.HasSuffix()) rather than at the entire tag. The remainder is simplification of the logic flow. For what failed before, see the previous commit. This is related to issue #7567 and fixes the failing test case introduced with the previous commit.
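
The suffix idea can be illustrated with a small sketch. The helper name endsPreformatted is made up, and the code is not taken from the commit; it only shows why checking the end of the buffered text tolerates garbage in front of the closing tag.

```go
package main

import (
	"fmt"
	"strings"
)

// endsPreformatted reports whether buf, the text buffered since a '<'
// inside a preformatted element, ends with the closing tag for name.
// Both sides are lowercased so "</SCRIPT>" closes "script" as well.
func endsPreformatted(buf, name string) bool {
	return strings.HasSuffix(strings.ToLower(buf), "</"+strings.ToLower(name)+">")
}

func main() {
	// Text from the '<' in "i < imgDefer.length" up to the next '>'.
	buf := "< imgDefer.length; i++) {\n        doNothing();\n    }\n</script>"
	fmt.Println(endsPreformatted(buf, "script"))

	// A '<' ... '>' pair inside the script that is not the closing tag.
	fmt.Println(endsPreformatted("<b && a>", "script"))
}
```

Comparing the whole buffer against "</script>" would fail for the first input because of the JavaScript in front of the closing tag; the suffix check does not care what precedes it.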

ghost · 2021-05-16T17:54:17Z

I just rebased the pull request branch to tag v0.83.1 and force-pushed it.

bep · 2021-05-16T19:47:36Z

Thanks, but I've already spent a considerable amount of time simplifying this in another PR. Would welcome any failed test cases after that PR is merged.

github-actions · 2022-05-17T02:15:33Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

bep closed this Apr 20, 2021

bep reopened this Apr 20, 2021

Traumflug added 11 commits May 16, 2021 19:37

publisher: More 'preformatted' tags.

0bfdf12

Reference: https://developer.mozilla.org/en-US/docs/Web/HTML/Element

publisher: Typo, trailing whitespace.

0b1b1a0

publisher: testcases for uppercase/lowercase tags.

8409adc

First case actually fails. Second one tests against a situation I had during development of the previous commit (forgotten ToLower() in parseEndTag()).

publisher: fix tag casing sensitivities.

7963c18

This makes the previously introduced testcase pass.

publisher: be more tolerant on whitespace.

4e9c786

A regular space, a tab and a newline are all equivalent and perfectly valid whitespace characters. Solve this by using strings.Fields(), which considers all of these. This appears to solve #8417.
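
As a minimal sketch of why strings.Fields() helps here: it splits on any run of whitespace, so the tag name comes out the same whether it is followed by a space, a tab or a newline. The helper tagName is hypothetical and only illustrates the idea; it is not the function changed in this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// tagName extracts the element name from a raw start tag such as "<hr\tid=a>".
// Hypothetical helper for illustration only.
func tagName(startTag string) string {
	s := strings.TrimPrefix(startTag, "<")
	s = strings.TrimSuffix(s, ">")
	s = strings.TrimSuffix(s, "/") // tolerate self-closing tags like "<br/>"
	// strings.Fields splits on any run of whitespace: spaces, tabs, newlines.
	fields := strings.Fields(s)
	if len(fields) == 0 {
		return ""
	}
	return fields[0]
}

func main() {
	for _, tag := range []string{
		"<hr\tid=a>",
		"<form\n\taction=\"[URL]\"\n\tmethod=\"post\"\n>",
		"<script\n  type=\"text/javascript\"\n>",
	} {
		fmt.Printf("%q\n", tagName(tag))
	}
}
```

Splitting on a single space character instead would return "hr\tid=a" for the first input, which is the kind of garbage entry described earlier in this thread.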

publisher: testcase for tables defined with uppercase tags.

763020b

Well, uhm, err, ... also failing.

publisher: handle 'thead' as well as 'THEAD' properly.

07499f2

Same for the other table-related tags, of course. This makes the testcase introduced with the previous commit pass.
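
A sketch of the general idea, assuming the table-related tags are looked up in some set by name: normalizing the name before the lookup makes 'THEAD' and 'thead' hit the same entry. The set contents and the helper name are assumptions, not the PR's code.

```go
package main

import (
	"fmt"
	"strings"
)

// tableTags is an assumed list of table-related elements;
// the list used in the PR may differ.
var tableTags = map[string]bool{
	"thead": true, "tbody": true, "tfoot": true,
	"tr": true, "td": true, "th": true,
}

// isTableTag lowercases the name before the lookup, so the check
// does not depend on how the tag was written in the HTML source.
func isTableTag(name string) bool {
	return tableTags[strings.ToLower(name)]
}

func main() {
	fmt.Println(isTableTag("thead"), isTableTag("THEAD"), isTableTag("div"))
}
```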

publisher: yet another failing testcase.

fa701e2

Here, the strategy to avoid processing tags twice fails. On the first round, everything works properly. On the next round, the same tag is no longer recognized as a preformatted tag because of the shortcut taken, and the mess happens again.

publisher: don't shortcut on previously seen elements.

b6a8b07

This fixes the previously introduced testcase; see there for details. Also helps with issue #7567.

ghost mentioned this pull request May 16, 2021

Hugo_stats.json is missing some classes when there is somewhere an apostrophe in string #8530

Closed

bep closed this May 16, 2021

github-actions bot locked as resolved and limited conversation to collaborators May 17, 2022
