Decode attribute content differently from text node content #255

Merged: 1 commit, Jun 8, 2022

mikesamuel
Contributor

As described in issue #254, &para is a complete character
reference when decoding text node content, but not when
decoding attribute content. Decoding both the same way causes
problems for URL attribute values like

/test?param1=foo&param2=bar

As shown via the JS test code in that issue, a small set of
next characters prevents a character reference name match
from being considered complete in attribute content.
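
To make the next-character rule concrete, here is a minimal sketch of the check, assuming the rule in the HTML Standard's named character reference state: inside an attribute, a match that does not end in `;` is rejected when the next character is `=` or an ASCII alphanumeric. The class and method names (`NextCharRule`, `isCompleteMatch`) are hypothetical, and the exact character set tested by this commit is not stated above, so treat this as an illustration rather than the commit's code.

```java
// Hypothetical helper, not part of the library: illustrates the
// next-character test described in the HTML Standard.
final class NextCharRule {
  private NextCharRule() {}

  /** True for ASCII letters and digits only (no Unicode letters). */
  static boolean isAsciiAlnum(char c) {
    return (c >= '0' && c <= '9')
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z');
  }

  /**
   * Decides whether a name match covering s[..end) counts as complete.
   *
   * @param s the text being decoded
   * @param end index just past the last matched character
   * @param inAttribute true when s is attribute content
   */
  static boolean isCompleteMatch(String s, int end, boolean inAttribute) {
    // A match terminated by ';' is always complete.
    if (end > 0 && s.charAt(end - 1) == ';') {
      return true;
    }
    // In text nodes, or at end of input, nothing blocks the match.
    if (!inAttribute || end >= s.length()) {
      return true;
    }
    // In attributes, '=' or an ASCII alphanumeric blocks the match.
    char next = s.charAt(end);
    return !(next == '=' || isAsciiAlnum(next));
  }
}
```

For the URL above, the match on &para is followed by `m`, so the check passes for text node content and fails for attribute content, which leaves `&param2` untouched in the attribute.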

This commit:

  • modifies the decode functions to take an extra parameter,
    boolean inAttribute, and modifies the Trie traversal
    loops so that they do not store a longest match so far when
    that parameter and the next-character tests rule it out
  • modifies the HTML lexer to pass that parameter appropriately
  • for backwards compatibility, leaves the old APIs in place but marks them @deprecated
  • adds unit tests for the decode functions
  • adds a unit test for the specific input from the issue
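
The shape of the change listed above can be sketched roughly as follows. This is a simplified stand-in, not the code in this commit: `DecodeSketch`, its tiny name table, and the choice to have the deprecated overload default to text-node behaviour are all assumptions made for illustration. The next-character test follows the HTML Standard (`=` or ASCII alphanumeric).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a trie-based decoder for a handful of named
// character references, with an inAttribute flag.
final class DecodeSketch {
  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    String value; // non-null when the path to this node spells a full name
  }

  private static final Node ROOT = new Node();

  private static void add(String name, String value) {
    Node n = ROOT;
    for (int i = 0; i < name.length(); i++) {
      n = n.children.computeIfAbsent(name.charAt(i), k -> new Node());
    }
    n.value = value;
  }

  static {
    // Tiny illustrative table; the real HTML table is far larger.
    add("amp", "&");
    add("amp;", "&");
    add("para", "\u00B6");
    add("para;", "\u00B6");
    add("not", "\u00AC");
    add("not;", "\u00AC");
    add("notin;", "\u2209");
  }

  private static boolean blocksMatch(char c) {
    return c == '=' || (c >= '0' && c <= '9')
        || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  }

  /** Old entry point kept for compatibility; assumed to mean text content. */
  @Deprecated
  static String decode(String s) {
    return decode(s, false);
  }

  static String decode(String s, boolean inAttribute) {
    int n = s.length();
    StringBuilder out = new StringBuilder(n);
    int i = 0;
    while (i < n) {
      char c = s.charAt(i);
      if (c != '&') {
        out.append(c);
        i++;
        continue;
      }
      Node node = ROOT;
      String longest = null;
      int longestEnd = -1;
      for (int j = i + 1; j < n; j++) {
        node = node.children.get(s.charAt(j));
        if (node == null) {
          break;
        }
        if (node.value == null) {
          continue;
        }
        boolean terminated = s.charAt(j) == ';';
        // In attributes, skip storing an unterminated match when the
        // next character is '=' or an ASCII alphanumeric.
        boolean blocked = inAttribute && !terminated
            && j + 1 < n && blocksMatch(s.charAt(j + 1));
        if (!blocked) {
          longest = node.value;
          longestEnd = j + 1;
        }
      }
      if (longest != null) {
        out.append(longest);
        i = longestEnd;
      } else {
        out.append('&'); // no usable match: keep the ampersand literally
        i++;
      }
    }
    return out.toString();
  }
}
```

With this sketch, `DecodeSketch.decode("/test?param1=foo&param2=bar", true)` returns the input unchanged, while passing `false` replaces `&para` with the pilcrow sign. A lexer would pass `true` for attribute values and `false` for text nodes.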

This change should make us more conformant with observed
browser behaviour, so it is not expected to cause compatibility
problems for existing users.

Fixes #254

@mikesamuel mikesamuel merged commit 5372c74 into main Jun 8, 2022
Successfully merging this pull request may close these issues.

Wrong sanitised output for link