Wrong sanitised output for link #254
Comments
That's just text, not a link. I am not set up right now and am on the
run, but can you try:
|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder()
    .toFactory()
    .and(Sanitizers.LINKS);
URL_POLICY.sanitize("<a href=\"/test/?param1=valueOne&param2=valueTwo\">click me</a>"); |
or similar?
On 3/22/22 10:39 AM, alex-alvarezg wrote:
The following input
|/test/?param1=valueOne&param2=valueTwo |
will be sanitized to:
|/test/?param1=valueOne¶m2=valueTwo |
but should be sanitized to
|/test/?param1=valueOne&param2=valueTwo |
The following code is used:
|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder()
    .toFactory()
    .and(Sanitizers.LINKS);
URL_POLICY.sanitize("/test/?param1=valueOne&param2=valueTwo"); |
--
Jim Manico
Manicode Security
https://www.manicode.com
|
Thanks Jim, will give it a try. It used to work with previous versions. |
Still fails with output:
It does work correctly if something else is used instead of param. So we found out that if any part of the string matches a character entity reference (as in this table: https://dev.w3.org/html5/html-author/charref ), it will be converted to the entity. For instance: For the entity The input will be converted as follows:
|
Ok, so the problem is that java-html-sanitizer/src/main/java/org/owasp/html/HtmlEntities.java Lines 1724 to 1727 in 33d319f
That's derived from java-html-sanitizer/src/main/java/org/owasp/html/HtmlEntities.java Lines 46 to 47 in 33d319f
I think we're actually handling this according to spec since https://html.spec.whatwg.org/entities.json still has a line
Do browsers do some less thorough entity matching for URL attribute values? I don't remember that changing but I haven't been following html.spec as closely as I could. |
Hi, In practice something like |
Thanks for the pointer, @kusako.
<!doctype html>
<meta charset="utf-8"/>
<style>a { display: block }</style>
<a href="?x&para="></a>
<a href="?x&param="></a>
<a href="?x&para"></a>
<a title="?x&para=" href=.></a>
<a title="?x&param=" href=.></a>
<a title="?x&para" href=.></a>
<script>
(() => {
const links = document.querySelectorAll('a')
links.forEach((link) => {
let { href, title } = link;
if (title) {
link.textContent = `title is ${title}`;
} else {
link.textContent = `href is ${href}`;
}
});
})();
</script>
produces on Chrome, Firefox, Safari:
So it seems like the rule is, if there's no semicolon at the end of the entity name, and the next character is not valid in a character reference, and the next character is not I can probably write some JS to test that further, but changing the decode loop to something like that should address the problem. |
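The rule sketched in this comment lines up with the "named character reference" step of the WHATWG HTML parsing algorithm: inside an attribute value, a match that lacks a trailing semicolon is abandoned when the next character is `=` or an ASCII alphanumeric. A minimal sketch of that check follows. The helper name `isCompleteReference` is hypothetical and is not part of java-html-sanitizer, and the sketch assumes the name has already been found in the entity table.

```javascript
// Hypothetical helper, not part of java-html-sanitizer.
// Decides whether an already-matched reference name such as "para" should be
// decoded, given the character that follows the match ("" at end of input).
function isCompleteReference(endsWithSemicolon, nextChar, inAttribute) {
  // A reference terminated by ';' is always complete.
  if (endsWithSemicolon) { return true; }
  // In text content, legacy names such as "&para" decode without ';'.
  if (!inAttribute) { return true; }
  // In attribute values, '=' or an ASCII alphanumeric continues the name,
  // so the text is left undecoded.
  return !(nextChar === "=" || /^[0-9A-Za-z]$/.test(nextChar));
}
```

With this check, `&para` followed by `m` in an `href` value is left alone, while `&para` at the very end of the value still decodes.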
It looks like we need to grab
Continues character reference name in CDATA
Continues character reference name in title attribute
The above was derived by looking at each basic plane code-point:
<h1>Continues character reference name in CDATA</h1>
<ol id="continuers-cdata"></ol>
<h1>Continues character reference name in title attribute</h1>
<ol id="continuers-attr"></ol>
<script>
(() => {
let continuers = {
cdata: document.getElementById("continuers-cdata"),
attr: document.getElementById("continuers-attr"),
};
let starts = {
cdata: null,
attr: null,
}
let div = document.createElement("div");
let limit = 0x10000;
for (let i = 0; i < limit; ++i) {
div.innerHTML = `<span title="&para${String.fromCharCode(i)}">&para${String.fromCharCode(i)}</span>`;
let span = div.querySelector("span");
let { textContent, title } = span;
step(i, !textContent.startsWith("\u00b6"), 'cdata');
step(i, !title.startsWith("\u00b6"), 'attr');
}
for (let key in continuers) { step(limit, false, key); }
function step(i, present, key) {
let start = starts[key];
if (start !== null) {
if (!present) {
let end = i - 1;
let continuersList = continuers[key];
starts[key] = null;
let li = document.createElement("li");
let range = (start === end)
? `U+${start.toString(16)}`
: `U+${start.toString(16)} - U+${end.toString(16)}`;
let chars = (start === end)
? String.fromCharCode(start)
: `${String.fromCharCode(start)}-${String.fromCharCode(end)}`;
li.appendChild(document.createTextNode(`${range}: [${chars}]`));
continuersList.appendChild(li);
}
} else if (present) {
starts[key] = i;
}
}
})();
</script> |
As described in issue #254 `&para` is a full complete character reference when decoding text node content, but not when decoding attribute content, which causes problems for URL attribute values like

/test?param1=foo&param2=bar

As shown via JS test code in that issue, a small set of next characters prevent a character reference name match from being considered complete.

This commit:
- modifies the decode functions to take an extra parameter `boolean inAttribute`, and modifies the Trie traversal loops to not store a longest match so far based on that parameter and some next character tests
- modifies the HTML lexer to pass that attribute appropriately
- for backwards compat, leaves the old APIs in place but `@deprecated`
- adds unit tests for the decode functions
- adds a unit test for the specific input from the issue

This change should make us more conformant with observed browser behaviour so is not expected to cause compatibility problems for existing users.

Fixes #254
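To illustrate the approach the commit message describes, here is a rough sketch of a longest-match decode loop that takes an `inAttribute` flag. It is written in JavaScript to match the test code in this thread and is not the library's Java implementation. The function name `decodeEntities`, the tiny `ENTITIES` table, and the use of a `Map` in place of a trie are all assumptions made for illustration.

```javascript
// Tiny illustrative table (an assumption); the real table covers the full
// entities.json list. Names without ';' are the legacy forms.
const ENTITIES = new Map([
  ["amp;", "&"], ["amp", "&"],
  ["para;", "\u00b6"], ["para", "\u00b6"],
  ["lt;", "<"], ["lt", "<"],
]);
const MAX_NAME_LENGTH = 5; // longest key in ENTITIES

// Hypothetical stand-in for the decode functions the commit modifies.
function decodeEntities(input, inAttribute) {
  let out = "";
  let i = 0;
  while (i < input.length) {
    if (input[i] !== "&") {
      out += input[i++];
      continue;
    }
    let bestEnd = -1;
    let bestValue = null;
    // Try successively longer names after '&', remembering the longest match.
    for (let len = 1; len <= MAX_NAME_LENGTH; ++len) {
      const end = i + 1 + len;
      if (end > input.length) { break; }
      const name = input.slice(i + 1, end);
      const value = ENTITIES.get(name);
      if (value === undefined) { continue; }
      const next = end < input.length ? input[end] : "";
      // In attributes, a semicolon-less match followed by '=' or an ASCII
      // alphanumeric is not stored as the longest match so far.
      const continued = next === "=" || /^[0-9A-Za-z]$/.test(next);
      if (inAttribute && !name.endsWith(";") && continued) { continue; }
      bestEnd = end;
      bestValue = value;
    }
    if (bestValue !== null) {
      out += bestValue;
      i = bestEnd;
    } else {
      out += "&"; // no usable match: keep the ampersand literally
      ++i;
    }
  }
  return out;
}
```

Under these assumptions, `decodeEntities("/test?param1=foo&param2=bar", true)` returns the input unchanged, while the same call with `false` yields `/test?param1=foo¶m2=bar`, mirroring the text-node behaviour reported in the issue.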