-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposed fix for issue #223: forbidden character references in sanitized html #225
base: main
Are you sure you want to change the base?
Proposed fix for issue #223: forbidden character references in sanitized html #225
Conversation
@@ -108,13 +111,16 @@ private static void stripBannedCodeunits(StringBuilder sb, int start) { | |||
if (i+1 < n) { | |||
char next = sb.charAt(i+1); | |||
if (Character.isSurrogatePair(ch, next)) { | |||
sb.setCharAt(k++, ch); | |||
sb.setCharAt(k++, next); | |||
// The last two code points in each plane are non-characters that should be elided. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added check for the supplementary plane non-characters
++i; | ||
} | ||
} | ||
continue; | ||
} else if ((ch & 0xfffe) == 0xfffe) { | ||
} else if ((ch & 0xfffe) == 0xfffe || (0xfdd0 <= ch && ch <= 0xfdef)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added check for BMP non-characters
if (i + 1 == n || plainText.charAt(i + 1) == '{') { | ||
repl = braceReplacement; | ||
if( repl==null ) { | ||
if (ch == '{') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A none-replacement could still be a new-line needing normalization.
} | ||
} | ||
if (repl != null) { | ||
output.append(plainText, pos, i).append(repl); | ||
pos = i + 1; | ||
} | ||
} else if ((0x93A <= ch && ch <= 0xC4C) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the 2018 i-OS crashing test because:
(a) it should be patched in devices by now
(b) it was always legal HTML
(c) there was no test for the 2020 i-OS crashing flag text
(d) I do not believe it is this libraries job to catch all device specific risks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this because of (a), but disagree on (d) because preventing denial of service is in scope for this project. For the record, (c) was probably an oversight on my part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these magic character sequences are in scope, I think it would be excellent if they could be handled via a configuration file so that they can be quickly fixed. In production systems I've worked on, changing configuration is generally something that can be done faster than changing code, because the scope of potential regression issues is much smaller with a configuration file.
// and get involved in UTF-16/UCS-2 confusion. | ||
int codepoint = Character.toCodePoint(ch, next); | ||
output.append(plainText, pos, i); | ||
} else if (RISKY_NORMALIZATION.contains(ch)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New check which expands on the idea that U+1FEF normalized to back-tick. We now catch all unicode characters than have a compatibility decomposition that is even slightly risky.
} | ||
} else if (0xfe60 <= ch) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is now handled by the risky normalization set check.
* | ||
* @throws IOException if the output cannot be written to | ||
* @throws IllegalArgumentException if the codepoint cannot be represented as a numeric escape. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this now throws an IllegalArgumentException if the numeric code point is forbidden in HTML.
int hexDigit = (codepoint >>> (digit << 2)) & 0xf; | ||
output.append(HEX_NUMERAL[hexDigit]); | ||
} | ||
output.append(Integer.toHexString(codepoint)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is simpler. Java is guaranteed to use lower case ASCII hex characters to represent the code point, so there seems no need for bespoke code.
/** Set of all Unicode characters which when processed with unicode compatibility decomposition will include a non-alphanumeric ascii character. */ | ||
static final Set<Character> RISKY_NORMALIZATION; | ||
static { | ||
HashSet<Character> set = new HashSet<Character>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For how these code-points were identified, see the unit-tests where the full scan of the BMP is explained and verified against the set here. The code was written this way for fast start up instead of doing the scan of the entire BMP.
@@ -527,7 +527,7 @@ private HtmlToken parseToken() { | |||
break; | |||
} | |||
} | |||
} else if (!Character.isWhitespace(ch)) { | |||
} else if (!isAsciiWhitespace(ch)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HTML requires ASCII whitespace, not Unicode whitespace.
@@ -313,7 +313,10 @@ public static final void testScriptInTable() { | |||
.and(Sanitizers.STYLES) | |||
.and(Sanitizers.IMAGES) | |||
.and(Sanitizers.TABLES); | |||
assertEquals("<table></table>Hallo\r\n\nEnde\n\r", pf.sanitize(input)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new line normalization changed the expected output
@@ -392,53 +392,6 @@ public static final void testNbsps() { | |||
codeUnits)); | |||
} | |||
|
|||
@Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer supported. See my previous comment for Encoding.java
@@ -305,4 +371,66 @@ void testBadlyDonePostProcessingWillnotAllowInsertingNonceAttributes() | |||
Encoding.encodeHtmlAttribOnto("a nonce=xyz ", attrib); | |||
assertEquals("a nonce=xyz ", attrib.toString()); | |||
} | |||
|
|||
@Test | |||
public static final void testRiskyNormalizationSetContents() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unit test does the complete scan of the BMP for characters with a compatibility decomposition that could have side-effects.
* | ||
* @author Simon Greatrix on 25/01/2021. | ||
*/ | ||
public class ElidedCharactersTest extends TestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look at this PR in more detail shortly. In the meantime, these tests seem to be failing on Travis. Does that differ from what you see running locally?
I'll look at this in more detail soon. Thanks so much. |
When you asked I was like "I would NEVER submit a merge request with failing test!". Then I looked at my IDE and saw that it said "No tests found", because it has forgotten how JUnit 4 works apparently. So, I'll have to change my claim to "I would never KNOWINGLY submit a merge request with failing tests!" |
Heh. Yeah, the codebase has some Java5/6 compatibility baggage. |
Looks OK now - hope those were the correct changes. |
No description provided.