Wrong sanitised output for link #254

alex-alvarezg · 2022-03-22T17:39:35Z

The following input

    /test/?param1=valueOne&param2=valueTwo

will be sanitized to:

    /test/?param1=valueOne¶m2=valueTwo

but should be sanitized to

    /test/?param1=valueOne&param2=valueTwo

The following code is used:

    private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder()
            .toFactory()
            .and(Sanitizers.LINKS);

    URL_POLICY.sanitize("/test/?param1=valueOne&param2=valueTwo");


jmanico · 2022-03-22T18:43:17Z

That's just text, not a link. I am not set up right now and am on the run, but can you try:

    private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder()
            .toFactory()
            .and(Sanitizers.LINKS);

    URL_POLICY.sanitize("<a href=\"/test/?param1=valueOne&param2=valueTwo\">click me</a>");

or similar?



alex-alvarezg · 2022-03-22T19:30:16Z

Thanks Jim, will give it a try, it used to work with previous versions.

alex-alvarezg · 2022-03-24T12:49:38Z

Still fails with output:

<a href="/test/?param1=valueOne¶m2=valueTwo" rel="nofollow">click me</a>

It does work correctly if something other than "param" is used as the parameter name.

So we found out that if the text following an & starts with the name of a character entity reference (as in this table: https://dev.w3.org/html5/html-author/charref ), that part is converted to the entity's character, even without a trailing semicolon.

for instance:

For the entity

&para; = ¶

The input will be converted as follows:

&param2 = ¶m2
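
The behaviour described above can be reproduced with a small stand-alone sketch. This is not the sanitizer's code: the class name `NaiveEntityDecoder` and its three-entry table are hypothetical and only illustrate how a longest-prefix match that accepts the semicolon-less name `para` turns `&param2` into `¶m2`.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a tiny longest-prefix entity decoder, not the library's implementation. */
public class NaiveEntityDecoder {
    // Hypothetical miniature table; the real entity table has far more names.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp;", "&");
        ENTITIES.put("para", "\u00b6");  // legacy name without a semicolon
        ENTITIES.put("para;", "\u00b6");
    }

    static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '&') {
                // Pick the longest entity name that the text after '&' starts with.
                String best = null;
                for (String name : ENTITIES.keySet()) {
                    if (s.startsWith(name, i + 1)
                            && (best == null || name.length() > best.length())) {
                        best = name;
                    }
                }
                if (best != null) {
                    out.append(ENTITIES.get(best));
                    i += 1 + best.length();
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "&param2" starts with "&para", so the first four letters are consumed.
        System.out.println(decode("/test/?param1=valueOne&param2=valueTwo"));
        // No entity name matches after '&' here, so the text is left alone.
        System.out.println(decode("/test/?other1=valueOne&other2=valueTwo"));
    }
}
```

With this toy table the first call reproduces the broken URL from the report, while the second input passes through unchanged.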

mikesamuel · 2022-03-24T17:30:43Z

Ok, so the problem is that &para without a semicolon is decoded to the paragraph symbol.
I think the relevant code here is

java-html-sanitizer/src/main/java/org/owasp/html/HtmlEntities.java

Lines 1724 to 1727 in 33d319f

    "par;", "\u2225",
    "para", "\u00b6",
    "para;", "\u00b6",
    "parallel;", "\u2225",

That's derived from

java-html-sanitizer/src/main/java/org/owasp/html/HtmlEntities.java

Lines 46 to 47 in 33d319f

    // Source data: https://html.spec.whatwg.org/multipage/named-characters.html
    // More readable: https://html.spec.whatwg.org/entities.json

I think we're actually handling this according to spec since https://html.spec.whatwg.org/entities.json still has a line

"&para": { "codepoints": [182], "characters": "\u00B6" },

Do browsers do some less thorough entity matching for URL attribute values? I don't remember that changing but I haven't been following html.spec as closely as I could.

kusako · 2022-03-25T16:53:20Z

Hi,
I'm far from an expert on this, but browsers seem to have different behaviour for attribute values.
Maybe because of https://html.spec.whatwg.org/#named-character-reference-state .

In practice something like <a href="/foo?param1=1&param2=2">foo</a> is probably rather common, so personally I think sanitization shouldn't break this.
Also, it used to work in 20191001.1, but seems to have stopped working with 20200713.1 and later.

mikesamuel · 2022-03-26T00:08:18Z

Thanks for the pointer, @kusako.
Another thing to check is whether the name match is greedy; whether <a href="?x&para="> should decode to include ¶ but <a href="?x&param="> should not.

<!doctype html>
<meta charset="utf-8"/>
<style>a { display: block }</style>

<a href="?x&para="></a>
<a href="?x&param="></a>
<a href="?x&para"></a>
<a title="?x&para=" href=.></a>
<a title="?x&param=" href=.></a>
<a title="?x&para" href=.></a>

<script>
(() => {
    const links = document.querySelectorAll('a')
    links.forEach((link) => {
        let { href, title } = link;
        if (title) {
            link.textContent = `title is ${title}`;
        } else {
            link.textContent = `href is ${href}`;
        }
    });
})();
</script>

produces on Chrome, Firefox, Safari:

href is file:///tmp/foo.html?x&para=
href is file:///tmp/foo.html?x&param=
href is file:///tmp/foo.html?x%C2%B6
title is ?x&para=
title is ?x&param=
title is ?x¶

So it seems like the rule is, if there's no semicolon at the end of the entity name, and the next character is not valid in a character reference, and the next character is not =, then decode.

I can probably write some JS to test that further, but changing the decode loop to something like that should address the problem.

mikesamuel · 2022-03-28T16:27:14Z

It looks like we need to grab = and ASCII alphanumerics but only when decoding an attribute, not when decoding text node content.

Continues character reference name in CDATA

(none)

Continues character reference name in title attribute

U+30 - U+39: [0-9]
U+3d: [=]
U+41 - U+5a: [A-Z]
U+61 - U+7a: [a-z]
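
The ranges listed above can be written as a single predicate. The following is a minimal sketch in Java (the library's language); the class and method names are made up for illustration and are not taken from the sanitizer.

```java
/** Illustrative only: encodes the code-point ranges listed above as a predicate. */
public class ReferenceNameContinuers {
    /**
     * Returns true if c is one of the characters observed to keep a
     * semicolon-less character reference name from matching inside an
     * attribute value: '=', ASCII digits and ASCII letters.
     */
    static boolean continuesNameInAttribute(char c) {
        return c == '='                      // U+3d
                || (c >= '0' && c <= '9')    // U+30 - U+39
                || (c >= 'A' && c <= 'Z')    // U+41 - U+5a
                || (c >= 'a' && c <= 'z');   // U+61 - U+7a
    }

    public static void main(String[] args) {
        for (char c : new char[] {'=', 'm', '7', ';', '&', ' '}) {
            System.out.println("'" + c + "' continues name: " + continuesNameInAttribute(c));
        }
    }
}
```

Per the lists above, text node content needs no such check, since no character was observed to continue the name there.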

The above was derived by looking at each basic plane code-point:

<h1>Continues character reference name in CDATA</h1>

<ol id="continuers-cdata"></ol>

<h1>Continues character reference name in title attribute</h1>
<ol id="continuers-attr"></ol>
<script>
(() => {
    let continuers = {
        cdata: document.getElementById("continuers-cdata"),
        attr: document.getElementById("continuers-attr"),
    };
    let starts = {
        cdata: null,
        attr: null,
    }
    let div = document.createElement("div");
    let limit = 0x10000;
    for (let i = 0; i < limit; ++i) {
        div.innerHTML = `<span title="&para${String.fromCharCode(i)}">&para${String.fromCharCode(i)}</span>`;
        let span = div.querySelector("span");
        let { textContent, title } = span;
        step(i, !textContent.startsWith("\u00b6"), 'cdata');
        step(i, !title.startsWith("\u00b6"), 'attr');
    }
    for (let key in continuers) { step(limit, false, key); }

    function step(i, present, key) {
        let start = starts[key];
        if (start !== null) {
            if (!present) {
                let end = i - 1;
                let continuersList = continuers[key];
                starts[key] = null;
                let li = document.createElement("li");
                let range = (start === end)
                    ? `U+${start.toString(16)}`
                    : `U+${start.toString(16)} - U+${end.toString(16)}`;
                let chars = (start === end)
                    ? String.fromCharCode(start)
                    : `${String.fromCharCode(start)}-${String.fromCharCode(end)}`;
                li.appendChild(document.createTextNode(`${range}: [${chars}]`));
                continuersList.appendChild(li);
            }
        } else if (present) {
            starts[key] = i;
        }
    }
})();
</script>

As described in issue #254 `&para` is a full complete character reference when decoding text node content, but not when decoding attribute content which causes problems for URL attribute values like /test?param1=foo&param2=bar

As shown via JS test code in that issue, a small set of next characters prevent a character reference name match from being considered complete.

This commit:
- modifies the decode functions to take an extra parameter `boolean inAttribute`, and modifies the Trie traversal loops to not store a longest match so far based on that parameter and some next character tests
- modifies the HTML lexer to pass that attribute appropriately
- for backwards compat, leaves the old APIs in place but `@deprecated`
- adds unit tests for the decode functions
- adds a unit test for the specific input from the issue

This change should make us more conformant with observed browser behaviour so is not expected to cause compatibility problems for existing users.

Fixes #254
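
The approach the commit message describes can be sketched roughly as follows. This is an assumption-laden illustration, not the merged change: it uses a small hypothetical map instead of the library's Trie, and the names `AttributeAwareDecoder`, `decode` and `blocksMatch` are invented here. Only the `boolean inAttribute` parameter comes from the commit message.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: attribute-aware entity decoding over a toy table. */
public class AttributeAwareDecoder {
    // Hypothetical miniature table standing in for the real entity Trie.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp;", "&");
        ENTITIES.put("para", "\u00b6");
        ENTITIES.put("para;", "\u00b6");
    }

    /** Characters that, in an attribute, stop a semicolon-less name from matching. */
    static boolean blocksMatch(char c) {
        return c == '=' || (c >= '0' && c <= '9')
                || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    static String decode(String s, boolean inAttribute) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '&') {
                String best = null;
                for (String name : ENTITIES.keySet()) {
                    if (!s.startsWith(name, i + 1)) { continue; }
                    int end = i + 1 + name.length();
                    // In an attribute, do not record a semicolon-less match
                    // when the next character is '=' or an ASCII alphanumeric.
                    if (inAttribute && !name.endsWith(";")
                            && end < s.length() && blocksMatch(s.charAt(end))) {
                        continue;
                    }
                    if (best == null || name.length() > best.length()) { best = name; }
                }
                if (best != null) {
                    out.append(ENTITIES.get(best));
                    i += 1 + best.length();
                    continue;
                }
            }
            out.append(s.charAt(i));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String url = "/test?param1=foo&param2=bar";
        System.out.println(decode(url, true));   // decoded as attribute content
        System.out.println(decode(url, false));  // decoded as text node content
    }
}
```

Under these assumptions the URL survives when decoded as attribute content, while the same string decoded as text node content still has `&para` replaced, which matches the browser behaviour observed earlier in the thread.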

mikesamuel mentioned this issue Mar 28, 2022

Decode attribute content differently from text node content #255

Merged

mikesamuel closed this as completed in #255 Jun 8, 2022

sumitkumar1110 mentioned this issue Apr 5, 2024

Hotfix/zbug 3867 Zimbra/java-html-sanitizer-release-20190610.1#14

Merged
