Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes #2045

Closed

Conversation

simenghaS2023
Copy link

No description provided.

@jhy
Copy link
Owner

jhy commented Nov 16, 2023

Hi -- can you tell me more about this change? Is this in a path hit by jsoup when parsing HTML - can you provide an example? Or is it for a custom parser?

Your testcase shows the method being called when the CharacterReader is not already within the quoted string. But the implementation's expectation is that the reader has consumed the starting character and is then chomping to the end.

See in e.g. AttributeValue_doubleQuoted

@simenghaS2023
Copy link
Author

Hi Jonathan! The issue I'm trying to address is that when CharacterReader processes an attribute enclosed in double quotes that happens to also have a single quote in it (as a part of the attribute itself), consumeAttributeQuoted() would stop at the single quote and return everything before it as the value of that attribute, rather than consuming everything between the double quotes, which seems to be the expected behavior.

The cause of the issue is with the switch statement starting on line 448 of ChracterReader:

switch (val[pos]) {
    ...
    case '\'':
        if (single) break OUTER;
    case '"':
        if (!single) break OUTER;
    default:
        pos++;
}

Suppose the attribute itself is double quoted (so single == False). If the current character is ', then we would enter case '\'':, where break OUTER; is not executed since single == False. At which point it would fall through to case '"': and break OUTER; will subsequently execute, causing the method to end despite the fact that the current character is not the closing double quote.

I looked at the section in TokeniserState that you mentioned, and it seemed that so long as the tokenizer recognize that what it is about to consume is an attribute's value that is enclosed in a pair of double quote, it would call consumeAttributeQuoted(false) to consume the value. And if the attribute's value happens to contain a single quote, then the issue I have described above would occur.

Just to provide some context to how I encountered this issue, I am a student at CMU Silicon Valley in the US and I am currently taking a class on software testing. We have a group project where we select a component from an open source library and write tests for it. And my group has chosen CharacterReader from jsoup. This issue was discovered by my teammate @ruobingcS2023 when we were writing tests for consumeAttributeQuoted().

I'm not too familiar with the inner mechanisms of jsoup and I'm not sure if this issue has been addressed in other parts of the parsing process. But I thought this behavior probably deserves a second look so I'm bringing it up just in case.

You're busy and I appreciate you taking the time to read this. Please feel free to close this PR if you think this is not an issue. Thanks!

jhy added a commit that referenced this pull request Nov 17, 2023
…uoted attributes (#2045)

commit 6edd13594c4b5a3eceabffc5542c12c90820bfa8
Author: Jonathan Hedley <[email protected]>
Date:   Fri Nov 17 11:30:23 2023 +1100

    Updated to reflect reader's position at caller TokeniserState

    And updated names to reflect intention of change

commit 0eebaaf
Author: thermodynathan <[email protected]>
Date:   Thu Nov 9 09:52:21 2023 -0800

    fix CharacterReader's incorrect handling of single quote in double quoted attributes

commit 379dae2
Author: thermodynathan <[email protected]>
Date:   Thu Nov 9 09:50:27 2023 -0800

    add failing test to reveal fault in CharacterReader
@jhy
Copy link
Owner

jhy commented Nov 17, 2023

Thanks for the context.

The CharacterReader is designed for jsoup's TokeniserState and to follow closely to the HTML spec. So any review needs to be in the context of how it's used.

Handling "counter" quotes (i.e. a single quote in a double quoted attribute, and vice-versa) is already tested in parsesRoughAttributeString

See https://html.spec.whatwg.org/multipage/parsing.html#attribute-value-(double-quoted)-state - "Anything else
Append the current input character to the current attribute's value." is how this currently executes - TokeniserState.AttributeValue_doubleQuoted has that as the default action.

So while this doesn't change or fix any behaviour in the parser, it's a nice performance tweak to save going through the extra append() states when it could be read in just one consume(). Particularly one could imagine some nasty value with a tonne of quotes inside.

I made some changes to the test to reflect how it's expected to be used by the caller (particularly that the reader is already positioned in the attribute value). I tried to push that back to this PR to merge, but I think you've renamed your account or your repository on GitHub? As I get an auth error trying to push back, and the original commit IDs can't be found.

So I did a squash merge to master with 061ee0c instead.

I think a review that would have more impact would be to look at the unit test coverage of TokeniserState. That's at 98% method but 75% line. I would bet there are issues / improvements to be made from inspecting those gaps. And by comparing the current implementation to the current spec - the spec has evolved over time and may have diverged in places.

We do a fuzzer which gets the TokeniserState line and branch coverage to ~ 100% but that's largely testing for explosive errors and stuck loops etc -- not that functionality matches the spec.

@jhy jhy changed the title Fix CharacterReader's handling of single quotes in double quoted attributes Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes Nov 17, 2023
@jhy jhy closed this Nov 17, 2023
@jhy jhy added this to the 1.17.1 milestone Nov 17, 2023
@jhy
Copy link
Owner

jhy commented Nov 17, 2023

(I would say it is a fix for the intended behaviour of the method - the previous implementations case fall-through to the next if is wrong. Another implementation would be to have an else and break inner)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants