Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed fix for issue #223: forbidden character references in sanitized html #225

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

simon-greatrix
Copy link

No description provided.

@@ -108,13 +111,16 @@ private static void stripBannedCodeunits(StringBuilder sb, int start) {
if (i+1 < n) {
char next = sb.charAt(i+1);
if (Character.isSurrogatePair(ch, next)) {
sb.setCharAt(k++, ch);
sb.setCharAt(k++, next);
// The last two code points in each plane are non-characters that should be elided.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added check for the supplementary plane non-characters

++i;
}
}
continue;
} else if ((ch & 0xfffe) == 0xfffe) {
} else if ((ch & 0xfffe) == 0xfffe || (0xfdd0 <= ch && ch <= 0xfdef)) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added check for BMP non-characters

if (i + 1 == n || plainText.charAt(i + 1) == '{') {
repl = braceReplacement;
if( repl==null ) {
if (ch == '{') {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A none-replacement could still be a new-line needing normalization.

}
}
if (repl != null) {
output.append(plainText, pos, i).append(repl);
pos = i + 1;
}
} else if ((0x93A <= ch && ch <= 0xC4C)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the 2018 i-OS crashing test because:
(a) it should be patched in devices by now
(b) it was always legal HTML
(c) there was no test for the 2020 i-OS crashing flag text
(d) I do not believe it is this libraries job to catch all device specific risks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with this because of (a), but disagree on (d) because preventing denial of service is in scope for this project. For the record, (c) was probably an oversight on my part.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these magic character sequences are in scope, I think it would be excellent if they could be handled via a configuration file so that they can be quickly fixed. In production systems I've worked on, changing configuration is generally something that can be done faster than changing code, because the scope of potential regression issues is much smaller with a configuration file.

// and get involved in UTF-16/UCS-2 confusion.
int codepoint = Character.toCodePoint(ch, next);
output.append(plainText, pos, i);
} else if (RISKY_NORMALIZATION.contains(ch)) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New check which expands on the idea that U+1FEF normalized to back-tick. We now catch all unicode characters than have a compatibility decomposition that is even slightly risky.

}
} else if (0xfe60 <= ch) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is now handled by the risky normalization set check.

*
* @throws IOException if the output cannot be written to
* @throws IllegalArgumentException if the codepoint cannot be represented as a numeric escape.
*/
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this now throws an IllegalArgumentException if the numeric code point is forbidden in HTML.

int hexDigit = (codepoint >>> (digit << 2)) & 0xf;
output.append(HEX_NUMERAL[hexDigit]);
}
output.append(Integer.toHexString(codepoint));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is simpler. Java is guaranteed to use lower case ASCII hex characters to represent the code point, so there seems no need for bespoke code.

/** Set of all Unicode characters which when processed with unicode compatibility decomposition will include a non-alphanumeric ascii character. */
static final Set<Character> RISKY_NORMALIZATION;
static {
HashSet<Character> set = new HashSet<Character>();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For how these code-points were identified, see the unit-tests where the full scan of the BMP is explained and verified against the set here. The code was written this way for fast start up instead of doing the scan of the entire BMP.

@@ -527,7 +527,7 @@ private HtmlToken parseToken() {
break;
}
}
} else if (!Character.isWhitespace(ch)) {
} else if (!isAsciiWhitespace(ch)) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTML requires ASCII whitespace, not Unicode whitespace.

@@ -313,7 +313,10 @@ public static final void testScriptInTable() {
.and(Sanitizers.STYLES)
.and(Sanitizers.IMAGES)
.and(Sanitizers.TABLES);
assertEquals("<table></table>Hallo\r\n\nEnde\n\r", pf.sanitize(input));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line normalization changed the expected output

@@ -392,53 +392,6 @@ public static final void testNbsps() {
codeUnits));
}

@Test
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer supported. See my previous comment for Encoding.java

@@ -305,4 +371,66 @@ void testBadlyDonePostProcessingWillnotAllowInsertingNonceAttributes()
Encoding.encodeHtmlAttribOnto("a nonce=xyz ", attrib);
assertEquals("a nonce&#61;xyz ", attrib.toString());
}

@Test
public static final void testRiskyNormalizationSetContents() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unit test does the complete scan of the BMP for characters with a compatibility decomposition that could have side-effects.

*
* @author Simon Greatrix on 25/01/2021.
*/
public class ElidedCharactersTest extends TestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look at this PR in more detail shortly. In the meantime, these tests seem to be failing on Travis. Does that differ from what you see running locally?

@mikesamuel
Copy link
Contributor

I'll look at this in more detail soon. Thanks so much.

@simon-greatrix
Copy link
Author

When you asked I was like "I would NEVER submit a merge request with failing test!". Then I looked at my IDE and saw that it said "No tests found", because it has forgotten how JUnit 4 works apparently. So, I'll have to change my claim to "I would never KNOWINGLY submit a merge request with failing tests!"

@mikesamuel
Copy link
Contributor

Heh. Yeah, the codebase has some Java5/6 compatibility baggage.

@simon-greatrix
Copy link
Author

Looks OK now - hope those were the correct changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants