-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perf improvement for TokeniserState & CharacterReader's handling of single quotes in double quoted attributes #2045
Conversation
Hi -- can you tell me more about this change? Is this in a path hit by jsoup when parsing HTML - can you provide an example? Or is it for a custom parser? Your testcase shows the method being called when the CharacterReader is not already within the quoted string. But the implementation's expectation is that the reader has consumed the starting character and is then chomping to the end. See in e.g. AttributeValue_doubleQuoted |
Hi Jonathan! The issue I'm trying to address is that when CharacterReader processes an attribute enclosed in double quotes that happens to also have a single quote in it (as a part of the attribute itself), consumeAttributeQuoted() would stop at the single quote and return everything before it as the value of that attribute, rather than consuming everything between the double quotes, which seems to be the expected behavior. The cause of the issue is with the switch (val[pos]) {
...
case '\'':
if (single) break OUTER;
case '"':
if (!single) break OUTER;
default:
pos++;
} Suppose the attribute itself is double quoted (so I looked at the section in TokeniserState that you mentioned, and it seemed that so long as the tokenizer recognize that what it is about to consume is an attribute's value that is enclosed in a pair of double quote, it would call Just to provide some context to how I encountered this issue, I am a student at CMU Silicon Valley in the US and I am currently taking a class on software testing. We have a group project where we select a component from an open source library and write tests for it. And my group has chosen CharacterReader from jsoup. This issue was discovered by my teammate @ruobingcS2023 when we were writing tests for consumeAttributeQuoted(). I'm not too familiar with the inner mechanisms of jsoup and I'm not sure if this issue has been addressed in other parts of the parsing process. But I thought this behavior probably deserves a second look so I'm bringing it up just in case. You're busy and I appreciate you taking the time to read this. Please feel free to close this PR if you think this is not an issue. Thanks! |
…uoted attributes (#2045) commit 6edd13594c4b5a3eceabffc5542c12c90820bfa8 Author: Jonathan Hedley <[email protected]> Date: Fri Nov 17 11:30:23 2023 +1100 Updated to reflect reader's position at caller TokeniserState And updated names to reflect intention of change commit 0eebaaf Author: thermodynathan <[email protected]> Date: Thu Nov 9 09:52:21 2023 -0800 fix CharacterReader's incorrect handling of single quote in double quoted attributes commit 379dae2 Author: thermodynathan <[email protected]> Date: Thu Nov 9 09:50:27 2023 -0800 add failing test to reveal fault in CharacterReader
Thanks for the context. The CharacterReader is designed for jsoup's TokeniserState and to follow closely to the HTML spec. So any review needs to be in the context of how it's used. Handling "counter" quotes (i.e. a single quote in a double quoted attribute, and vice-versa) is already tested in parsesRoughAttributeString See https://html.spec.whatwg.org/multipage/parsing.html#attribute-value-(double-quoted)-state - "Anything else So while this doesn't change or fix any behaviour in the parser, it's a nice performance tweak to save going through the extra append() states when it could be read in just one consume(). Particularly one could imagine some nasty value with a tonne of quotes inside. I made some changes to the test to reflect how it's expected to be used by the caller (particularly that the reader is already positioned in the attribute value). I tried to push that back to this PR to merge, but I think you've renamed your account or your repository on GitHub? As I get an auth error trying to push back, and the original commit IDs can't be found. So I did a squash merge to master with 061ee0c instead. I think a review that would have more impact would be to look at the unit test coverage of TokeniserState. That's at 98% method but 75% line. I would bet there are issues / improvements to be made from inspecting those gaps. And by comparing the current implementation to the current spec - the spec has evolved over time and may have diverged in places. We do a fuzzer which gets the TokeniserState line and branch coverage to ~ 100% but that's largely testing for explosive errors and stuck loops etc -- not that functionality matches the spec. |
(I would say it is a fix for the intended behaviour of the method - the previous implementations case fall-through to the next if is wrong. Another implementation would be to have an else and break inner) |
No description provided.