Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong sanitised output for link #254

Closed
alex-alvarezg opened this issue Mar 22, 2022 · 7 comments · Fixed by #255
Closed

Wrong sanitised output for link #254

alex-alvarezg opened this issue Mar 22, 2022 · 7 comments · Fixed by #255

Comments

@alex-alvarezg
Copy link

The following input

/test/?param1=valueOne&param2=valueTwo
will be sanitized to:

/test/?param1=valueOne¶m2=valueTwo
but should be sanitized to

/test/?param1=valueOne&param2=valueTwo

The following code is used:

    private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder()
            .toFactory()
            .and(Sanitizers.LINKS);

URL_POLICY.sanitize("/test/?param1=valueOne&param2=valueTwo")

@jmanico
Copy link
Member

jmanico commented Mar 22, 2022 via email

@alex-alvarezg
Copy link
Author

Thanks Jim, will give it a try, it used to work with previous versions.

@alex-alvarezg
Copy link
Author

alex-alvarezg commented Mar 24, 2022

Still fails with output:

<a href="/test/?param1&#61;valueOne¶m2&#61;valueTwo" rel="nofollow">click me</a>

it does work correctly if something else is used, instead of param

So we found out that if any part of the string matches a character entity reference (as in this table: https://dev.w3.org/html5/html-author/charref ), it will be converted to the entity

for instance:

For the entity
&para; = ¶

The input will be converted as follows:

&param2 = ¶m2

@mikesamuel
Copy link
Contributor

Ok, so the problem is that &para without a semicolon is decoded to the paragraph symbol.
I think the operable code here is

"par;", "\u2225",
"para", "\u00b6",
"para;", "\u00b6",
"parallel;", "\u2225",

That's derived from

// Source data: https://html.spec.whatwg.org/multipage/named-characters.html
// More readable: https://html.spec.whatwg.org/entities.json

I think we're actually handling this according to spec since https://html.spec.whatwg.org/entities.json still has a line

"&para": { "codepoints": [182], "characters": "\u00B6" },

Do browsers do some less thorough entity matching for URL attribute values? I don't remember that changing but I haven't been following html.spec as closely as I could.

@kusako
Copy link

kusako commented Mar 25, 2022

Hi,
I'm far from an expert on this, but browsers seem to have different behaviour for attribute values.
Maybe because of https://html.spec.whatwg.org/#named-character-reference-state .

In practice something like <a href="/foo?param1=1&param2=2">foo</a> is probably rather common, so personally I think sanitization shouldn't break this.
Also it used to work in 20191001.1, but seems to stop working with 20200713.1 and later.

@mikesamuel
Copy link
Contributor

Thanks for the pointer, @kusako.
Another thing to check is whether the name match is greedy; whether <a href="?x&para="> should decode to include ¶ but <a href="?x&param="> should not.

<doctype html>
<meta charset="utf-8"/>
<style>a { display: block }</style>

<a href="?x&para="></a>
<a href="?x&param="></a>
<a href="?x&para"></a>
<a title="?x&para=" href=.></a>
<a title="?x&param=" href=.></a>
<a title="?x&para" href=.></a>

<script>
(() => {
    const links = document.querySelectorAll('a')
    links.forEach((link) => {
        let { href, title } = link;
        if (title) {
            link.textContent = `title is ${title}`;
        } else {
            link.textContent = `href is ${href}`;
        }
    });
})();
</script>

produces on Chrome, Firefox, Safari:

href is file:///tmp/foo.html?x&para=
href is file:///tmp/foo.html?x&param=
href is file:///tmp/foo.html?x%C2%B6
title is ?x&para=
title is ?x&param=
title is ?x¶

So it seems like the rule is, if there's no semicolon at the end of the entity name, and the next character is not valid in a character reference, and the next character is not =, then decode.

I can probably write some JS to test that further, but changing the decode loop to something like that should address the problem.

@mikesamuel
Copy link
Contributor

It looks like we need to grab = and ASCII alphanumerics but only when decoding an attribute, not when decoding text node content.

Continues character reference name in CDATA

Continues character reference name in title attribute

  1. U+30 - U+39: [0-9]
  2. U+3d: [=]
  3. U+41 - U+5a: [A-Z]
  4. U+61 - U+7a: [a-z]

The above was derived by looking at each basic plane code-point:

<h1>Continues character reference name in CDATA</h1>

<ol id="continuers-cdata"></ol>

<h1>Continues character reference name in title attribute</h1>
<ol id="continuers-attr"></ol>
<script>
(() => {
    let continuers = {
        cdata: document.getElementById("continuers-cdata"),
        attr: document.getElementById("continuers-attr"),
    };
    let starts = {
        cdata: null,
        attr: null,
    }
    let div = document.createElement("div");
    let limit = 0x10000;
    for (let i = 0; i < limit; ++i) {
        div.innerHTML = `<span title="&para${String.fromCharCode(i)}">&para${String.fromCharCode(i)}</span>`;
        let span = div.querySelector("span");
        let { textContent, title } = span;
        step(i, !textContent.startsWith("\u00b6"), 'cdata');
        step(i, !title.startsWith("\u00b6"), 'attr');
    }
    for (let key in continuers) { step(limit, false, key); }

    function step(i, present, key) {
        let start = starts[key];
        if (start !== null) {
            if (!present) {
                let end = i - 1;
                let continuersList = continuers[key];
                starts[key] = null;
                let li = document.createElement("li");
                let range = (start === end)
                    ? `U+${start.toString(16)}`
                    : `U+${start.toString(16)} - U+${end.toString(16)}`;
                let chars = (start === end)
                    ? String.fromCharCode(start)
                    : `${String.fromCharCode(start)}-${String.fromCharCode(end)}`;
                li.appendChild(document.createTextNode(`${range}: [${chars}]`));
                continuersList.appendChild(li);
            }
        } else if (present) {
            starts[key] = i;
        }
    }
})();
</script>

mikesamuel added a commit that referenced this issue Mar 28, 2022
As described in issue #254 `&para` is a full complete character
reference when decoding text node content, but not when
decoding attribute content which causes problems for URL attribute
values like

    /test?param1=foo&param2=bar

As shown via JS test code in that issue, a small set of
next characters prevent a character reference name match
from being considered complete.

This commit:
- modifies the decode functions to take an extra parameter
  `boolean inAttribute`, and modifies the Trie traversal
  loops to not store a longest match so far based on that
  parameter and some next character tests
- modifies the HTML lexer to pass that attribute appropriately
- for backwards compat, leaves the old APIs in place but `@deprecated`
- adds unit tests for the decode functions
- adds a unit test for the specific input from the issue

This change should make us more conformant with observed
browser behaviour so is not expected to cause compatibility
problems for existing users.

Fixes #254
mikesamuel added a commit that referenced this issue Jun 8, 2022
As described in issue #254 `&para` is a full complete character
reference when decoding text node content, but not when
decoding attribute content which causes problems for URL attribute
values like

    /test?param1=foo&param2=bar

As shown via JS test code in that issue, a small set of
next characters prevent a character reference name match
from being considered complete.

This commit:
- modifies the decode functions to take an extra parameter
  `boolean inAttribute`, and modifies the Trie traversal
  loops to not store a longest match so far based on that
  parameter and some next character tests
- modifies the HTML lexer to pass that attribute appropriately
- for backwards compat, leaves the old APIs in place but `@deprecated`
- adds unit tests for the decode functions
- adds a unit test for the specific input from the issue

This change should make us more conformant with observed
browser behaviour so is not expected to cause compatibility
problems for existing users.

Fixes #254
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants