Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes #2045

simenghaS2023 · 2023-11-09T17:58:27Z

No description provided.

…oted attributes

jhy · 2023-11-16T00:13:17Z

Hi -- can you tell me more about this change? Is this in a path hit by jsoup when parsing HTML - can you provide an example? Or is it for a custom parser?

Your testcase shows the method being called when the CharacterReader is not already within the quoted string. But the implementation's expectation is that the reader has consumed the starting character and is then chomping to the end.

See in e.g. AttributeValue_doubleQuoted

simenghaS2023 · 2023-11-16T05:23:23Z

Hi Jonathan! The issue I'm trying to address is that when CharacterReader processes an attribute enclosed in double quotes that happens to also have a single quote in it (as a part of the attribute itself), consumeAttributeQuoted() would stop at the single quote and return everything before it as the value of that attribute, rather than consuming everything between the double quotes, which seems to be the expected behavior.

The cause of the issue is with the switch statement starting on line 448 of ChracterReader:

switch (val[pos]) {
    ...
    case '\'':
        if (single) break OUTER;
    case '"':
        if (!single) break OUTER;
    default:
        pos++;
}

Suppose the attribute itself is double quoted (so single == False). If the current character is ', then we would enter case '\'':, where break OUTER; is not executed since single == False. At which point it would fall through to case '"': and break OUTER; will subsequently execute, causing the method to end despite the fact that the current character is not the closing double quote.

I looked at the section in TokeniserState that you mentioned, and it seemed that so long as the tokenizer recognize that what it is about to consume is an attribute's value that is enclosed in a pair of double quote, it would call consumeAttributeQuoted(false) to consume the value. And if the attribute's value happens to contain a single quote, then the issue I have described above would occur.

Just to provide some context to how I encountered this issue, I am a student at CMU Silicon Valley in the US and I am currently taking a class on software testing. We have a group project where we select a component from an open source library and write tests for it. And my group has chosen CharacterReader from jsoup. This issue was discovered by my teammate @ruobingcS2023 when we were writing tests for consumeAttributeQuoted().

I'm not too familiar with the inner mechanisms of jsoup and I'm not sure if this issue has been addressed in other parts of the parsing process. But I thought this behavior probably deserves a second look so I'm bringing it up just in case.

You're busy and I appreciate you taking the time to read this. Please feel free to close this PR if you think this is not an issue. Thanks!

…uoted attributes (#2045) commit 6edd13594c4b5a3eceabffc5542c12c90820bfa8 Author: Jonathan Hedley <[email protected]> Date: Fri Nov 17 11:30:23 2023 +1100 Updated to reflect reader's position at caller TokeniserState And updated names to reflect intention of change commit 0eebaaf Author: thermodynathan <[email protected]> Date: Thu Nov 9 09:52:21 2023 -0800 fix CharacterReader's incorrect handling of single quote in double quoted attributes commit 379dae2 Author: thermodynathan <[email protected]> Date: Thu Nov 9 09:50:27 2023 -0800 add failing test to reveal fault in CharacterReader

jhy · 2023-11-17T00:55:43Z

Thanks for the context.

The CharacterReader is designed for jsoup's TokeniserState and to follow closely to the HTML spec. So any review needs to be in the context of how it's used.

Handling "counter" quotes (i.e. a single quote in a double quoted attribute, and vice-versa) is already tested in parsesRoughAttributeString

See https://html.spec.whatwg.org/multipage/parsing.html#attribute-value-(double-quoted)-state - "Anything else
Append the current input character to the current attribute's value." is how this currently executes - TokeniserState.AttributeValue_doubleQuoted has that as the default action.

So while this doesn't change or fix any behaviour in the parser, it's a nice performance tweak to save going through the extra append() states when it could be read in just one consume(). Particularly one could imagine some nasty value with a tonne of quotes inside.

I made some changes to the test to reflect how it's expected to be used by the caller (particularly that the reader is already positioned in the attribute value). I tried to push that back to this PR to merge, but I think you've renamed your account or your repository on GitHub? As I get an auth error trying to push back, and the original commit IDs can't be found.

So I did a squash merge to master with 061ee0c instead.

I think a review that would have more impact would be to look at the unit test coverage of TokeniserState. That's at 98% method but 75% line. I would bet there are issues / improvements to be made from inspecting those gaps. And by comparing the current implementation to the current spec - the spec has evolved over time and may have diverged in places.

We do a fuzzer which gets the TokeniserState line and branch coverage to ~ 100% but that's largely testing for explosive errors and stuck loops etc -- not that functionality matches the spec.

jhy · 2023-11-17T01:09:54Z

(I would say it is a fix for the intended behaviour of the method - the previous implementations case fall-through to the next if is wrong. Another implementation would be to have an else and break inner)

Simeng-H added 2 commits November 9, 2023 09:50

add failing test to reveal fault in CharacterReader

379dae2

fix CharacterReader's incorrect handling of single quote in double qu…

0eebaaf

…oted attributes

jhy changed the title ~~Fix CharacterReader's handling of single quotes in double quoted attributes~~ Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes Nov 17, 2023

jhy closed this Nov 17, 2023

jhy added this to the 1.17.1 milestone Nov 17, 2023

jhy added the improvement label Nov 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes #2045

Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes #2045

simenghaS2023 commented Nov 9, 2023

jhy commented Nov 16, 2023

simenghaS2023 commented Nov 16, 2023

jhy commented Nov 17, 2023

jhy commented Nov 17, 2023

Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes #2045

Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes #2045

Conversation

simenghaS2023 commented Nov 9, 2023

jhy commented Nov 16, 2023

simenghaS2023 commented Nov 16, 2023

jhy commented Nov 17, 2023

jhy commented Nov 17, 2023