Proposed fix for issue #223: forbidden character references in sanitized html #225

simon-greatrix · 2021-02-07T21:00:44Z

No description provided.

simon-greatrix · 2021-02-07T21:08:01Z

src/main/java/org/owasp/html/Encoding.java

@@ -108,13 +111,16 @@ private static void stripBannedCodeunits(StringBuilder sb, int start) {
          if (i+1 < n) {
            char next = sb.charAt(i+1);
            if (Character.isSurrogatePair(ch, next)) {
-              sb.setCharAt(k++, ch);
-              sb.setCharAt(k++, next);
+              // The last two code points in each plane are non-characters that should be elided.


Added check for the supplementary plane non-characters
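
For readers following along, the supplementary-plane check can be sketched as below. This is a minimal illustration, not the PR's code: the class and method names are hypothetical, and it only assumes that the rule is "the last two code points of every plane (U+xFFFE and U+xFFFF) are non-characters".

```java
final class SupplementaryNonCharacters {
  /**
   * Returns true if (high, low) is a valid surrogate pair that encodes
   * U+xFFFE or U+xFFFF in some supplementary plane.
   * Hypothetical helper, named for illustration only.
   */
  static boolean isPlaneEndNonCharacter(char high, char low) {
    if (!Character.isSurrogatePair(high, low)) {
      return false;
    }
    int codepoint = Character.toCodePoint(high, low);
    // The low 16 bits are 0xFFFE or 0xFFFF for the last two code points of a plane.
    return (codepoint & 0xFFFE) == 0xFFFE;
  }
}
```

A caller stripping banned code units would skip both halves of the pair when this returns true, and copy both halves otherwise.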

simon-greatrix · 2021-02-07T21:08:21Z

src/main/java/org/owasp/html/Encoding.java

              ++i;
            }
          }
          continue;
-        } else if ((ch & 0xfffe) == 0xfffe) {
+        } else if ((ch & 0xfffe) == 0xfffe || (0xfdd0 <= ch && ch <= 0xfdef)) {


Added check for BMP non-characters
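
The condition in the changed line covers two groups of BMP non-characters: U+FFFE and U+FFFF, plus the contiguous block U+FDD0..U+FDEF. A small sketch of the same test applied to a whole string follows; the class, the method names, and the `strip` helper are assumptions made for illustration and do not appear in the PR.

```java
final class BmpNonCharacters {
  /** Same predicate as the changed line in the diff, applied to one code unit. */
  static boolean isBmpNonCharacter(char ch) {
    return (ch & 0xFFFE) == 0xFFFE || (0xFDD0 <= ch && ch <= 0xFDEF);
  }

  /** Hypothetical helper: returns s with BMP non-characters removed. */
  static String strip(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char ch = s.charAt(i);
      if (!isBmpNonCharacter(ch)) {
        sb.append(ch);
      }
    }
    return sb.toString();
  }
}
```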

simon-greatrix · 2021-02-07T21:09:45Z

src/main/java/org/owasp/html/Encoding.java

-          if (i + 1 == n || plainText.charAt(i + 1) == '{') {
-            repl = braceReplacement;
+        if( repl==null ) {
+          if (ch == '{') {


A character with no replacement could still be a newline that needs normalization.

simon-greatrix · 2021-02-07T21:11:32Z

src/main/java/org/owasp/html/Encoding.java

          }
        }
        if (repl != null) {
          output.append(plainText, pos, i).append(repl);
          pos = i + 1;
        }
-      } else if ((0x93A <= ch && ch <= 0xC4C)


I removed the 2018 iOS crash test because:
(a) it should be patched in devices by now
(b) it was always legal HTML
(c) there was no test for the 2020 iOS crashing flag text
(d) I do not believe it is this library's job to catch all device-specific risks

I'm ok with this because of (a), but disagree on (d) because preventing denial of service is in scope for this project. For the record, (c) was probably an oversight on my part.

If these magic character sequences are in scope, I think it would be excellent if they could be handled via a configuration file so that they can be quickly fixed. In production systems I've worked on, changing configuration is generally something that can be done faster than changing code, because the scope of potential regression issues is much smaller with a configuration file.

simon-greatrix · 2021-02-07T21:13:00Z

src/main/java/org/owasp/html/Encoding.java

-            // and get involved in UTF-16/UCS-2 confusion.
-            int codepoint = Character.toCodePoint(ch, next);
-            output.append(plainText, pos, i);
+      } else if (RISKY_NORMALIZATION.contains(ch)) {


New check that expands on the idea that U+1FEF normalizes to a back-tick. We now catch all Unicode characters that have a compatibility decomposition that is even slightly risky.
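
To make the idea concrete, here is a rough sketch of how a single character could be classified using `java.text.Normalizer` with NFKD. It is not the PR's implementation (the PR uses a precomputed set); the class and method names are hypothetical, and "risky" is assumed to mean "the decomposition contains an ASCII character that is not a letter or digit".

```java
import java.text.Normalizer;

final class RiskyNormalizationSketch {
  /**
   * Returns true if the NFKD form of a non-ASCII BMP character contains
   * a non-alphanumeric ASCII character. Assumed definition of "risky".
   */
  static boolean hasRiskyDecomposition(char ch) {
    // ASCII characters are handled elsewhere; lone surrogates have no decomposition.
    if (ch < 0x80 || Character.isSurrogate(ch)) {
      return false;
    }
    String nfkd = Normalizer.normalize(String.valueOf(ch), Normalizer.Form.NFKD);
    for (int i = 0; i < nfkd.length(); i++) {
      char c = nfkd.charAt(i);
      if (c < 0x80 && !isAsciiAlphanumeric(c)) {
        return true;
      }
    }
    return false;
  }

  private static boolean isAsciiAlphanumeric(char c) {
    return ('0' <= c && c <= '9') || ('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z');
  }
}
```

Under this definition U+1FEF (decomposes to a back-tick) and U+FF1C (fullwidth less-than sign) are flagged, while U+00E9 (decomposes to "e" plus a combining accent) is not.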

simon-greatrix · 2021-02-07T21:16:39Z

src/main/java/org/owasp/html/Encoding.java

          }
-        } else if (0xfe60 <= ch) {


This case is now handled by the risky normalization set check.

simon-greatrix · 2021-02-07T21:17:39Z

src/main/java/org/owasp/html/Encoding.java

+   *
+   * @throws IOException              if the output cannot be written to
+   * @throws IllegalArgumentException if the codepoint cannot be represented as a numeric escape.
+   */


Note that this now throws an IllegalArgumentException if the numeric code point is forbidden in HTML.
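
A sketch of what that contract might look like is below. The exact set of forbidden code points is not shown in this thread, so the predicate here is an assumption: surrogates, non-characters, and control characters other than tab, line feed, and form feed. The class and method names are hypothetical.

```java
import java.io.IOException;

final class NumericEscapeSketch {
  /** Assumed rule for code points that may not appear as numeric character references. */
  static boolean isForbiddenInHtml(int cp) {
    if (cp < 0 || cp > Character.MAX_CODE_POINT) return true;
    if (0xD800 <= cp && cp <= 0xDFFF) return true;          // surrogates
    if ((cp & 0xFFFE) == 0xFFFE) return true;               // U+xFFFE, U+xFFFF
    if (0xFDD0 <= cp && cp <= 0xFDEF) return true;          // BMP non-character block
    if (cp < 0x20) return !(cp == '\t' || cp == '\n' || cp == '\f'); // C0 controls, incl. CR and NUL
    return 0x7F <= cp && cp <= 0x9F;                         // DEL and C1 controls
  }

  /** Appends &#x...; or throws IllegalArgumentException for a forbidden code point. */
  static void appendNumericEscape(int codepoint, Appendable output) throws IOException {
    if (isForbiddenInHtml(codepoint)) {
      throw new IllegalArgumentException(
          "Cannot escape U+" + Integer.toHexString(codepoint));
    }
    output.append("&#x").append(Integer.toHexString(codepoint)).append(';');
  }
}
```

Callers that previously assumed any code point could be escaped would need to elide such code points before calling, or catch the exception.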

simon-greatrix · 2021-02-07T21:18:27Z

src/main/java/org/owasp/html/Encoding.java

-        int hexDigit = (codepoint >>> (digit << 2)) & 0xf;
-        output.append(HEX_NUMERAL[hexDigit]);
-      }
+      output.append(Integer.toHexString(codepoint));


This code is simpler. Java is guaranteed to use lower case ASCII hex characters to represent the code point, so there seems no need for bespoke code.
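
As a sanity check on that claim, the nibble-by-nibble approach can be compared against `Integer.toHexString`. The loop below is a reconstruction in the style of the removed lines, not the original code: the contents of `HEX_NUMERAL` and the leading-zero handling are assumptions.

```java
final class HexDigitsSketch {
  // Assumed contents; the original table is not shown in the diff excerpt.
  private static final char[] HEX_NUMERAL = "0123456789abcdef".toCharArray();

  /** Manual conversion: emits one hex digit per nibble, skipping leading zeros. */
  static String manualHex(int codepoint) {
    StringBuilder sb = new StringBuilder();
    boolean started = false;
    for (int digit = 7; digit >= 0; --digit) {
      int hexDigit = (codepoint >>> (digit << 2)) & 0xf;
      // Always emit the final nibble so that zero is written as "0".
      if (hexDigit != 0 || started || digit == 0) {
        started = true;
        sb.append(HEX_NUMERAL[hexDigit]);
      }
    }
    return sb.toString();
  }
}
```

For non-negative inputs such as code points, `manualHex(cp)` and `Integer.toHexString(cp)` produce the same lower-case string, which is why the one-line replacement is safe.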

simon-greatrix · 2021-02-07T21:19:44Z

src/main/java/org/owasp/html/Encoding.java

+  /** Set of all Unicode characters which when processed with unicode compatibility decomposition will include a non-alphanumeric ascii character. */
+  static final Set<Character> RISKY_NORMALIZATION;
+  static {
+    HashSet<Character> set = new HashSet<Character>();


For how these code points were identified, see the unit tests, where the full scan of the BMP is explained and verified against the set here. The set is hard-coded so that startup is fast and does not require a scan of the entire BMP.

simon-greatrix · 2021-02-07T21:21:27Z

src/main/java/org/owasp/html/HtmlLexer.java

@@ -527,7 +527,7 @@ private HtmlToken parseToken() {
            break;
          }
        }
-      } else if (!Character.isWhitespace(ch)) {
+      } else if (!isAsciiWhitespace(ch)) {


HTML requires ASCII whitespace, not Unicode whitespace.
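
The body of `isAsciiWhitespace` is not shown in this excerpt. A plausible sketch, assuming it follows the HTML definition of ASCII whitespace (tab, line feed, form feed, carriage return, space), is:

```java
final class AsciiWhitespaceSketch {
  /** True only for the five characters HTML treats as ASCII whitespace. */
  static boolean isAsciiWhitespace(int ch) {
    switch (ch) {
      case '\t':
      case '\n':
      case '\f':
      case '\r':
      case ' ':
        return true;
      default:
        return false;
    }
  }
}
```

The difference from `Character.isWhitespace` shows up on characters such as U+2003 (em space) and U+000B (vertical tab): Java treats both as whitespace, while this predicate does not.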

simon-greatrix · 2021-02-07T21:23:07Z

src/test/java/org/owasp/html/SanitizersTest.java

@@ -313,7 +313,10 @@ public static final void testScriptInTable() {
      .and(Sanitizers.STYLES)
      .and(Sanitizers.IMAGES)
      .and(Sanitizers.TABLES);
-    assertEquals("<table></table>Hallo\r\n\nEnde\n\r", pf.sanitize(input));


Newline normalization changed the expected output.
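
The new expected string is not visible in this excerpt. As an illustration only, the sketch below assumes the conventional HTML-style normalization in which CR LF and a lone CR both become a single LF; the PR's actual rule may differ, and the names are hypothetical.

```java
final class NewlineNormalizationSketch {
  /** Replaces CR LF and lone CR with LF. Assumed behaviour, for illustration. */
  static String normalizeNewlines(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0, n = s.length(); i < n; i++) {
      char ch = s.charAt(i);
      if (ch == '\r') {
        sb.append('\n');
        // Swallow the LF of a CR LF pair so it is not counted twice.
        if (i + 1 < n && s.charAt(i + 1) == '\n') {
          i++;
        }
      } else {
        sb.append(ch);
      }
    }
    return sb.toString();
  }
}
```

Under that assumption, the text portion of the old expectation, `"Hallo\r\n\nEnde\n\r"`, would normalize to `"Hallo\n\nEnde\n\n"`.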

simon-greatrix · 2021-02-07T21:23:27Z

src/test/java/org/owasp/html/HtmlSanitizerTest.java

@@ -392,53 +392,6 @@ public static final void testNbsps() {
            codeUnits));
  }

-  @Test


No longer supported. See my previous comment for Encoding.java

simon-greatrix · 2021-02-07T21:24:11Z

src/test/java/org/owasp/html/EncodingTest.java

@@ -305,4 +371,66 @@ void testBadlyDonePostProcessingWillnotAllowInsertingNonceAttributes()
    Encoding.encodeHtmlAttribOnto("a nonce=xyz ", attrib);
    assertEquals("a nonce&#61;xyz ", attrib.toString());
  }
+
+  @Test
+  public static final void testRiskyNormalizationSetContents() {


This unit test does the complete scan of the BMP for characters with a compatibility decomposition that could have side-effects.
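
The shape of such a verification can be sketched as follows, assuming Java 8 or later. This is not the test from the PR: the names are hypothetical, "side-effects" is assumed to mean that the NFKD form contains a non-alphanumeric ASCII character, and the hard-coded set is passed in as a parameter rather than read from `Encoding`.

```java
import java.text.Normalizer;
import java.util.Set;
import java.util.TreeSet;

final class BmpScanSketch {
  /** Scans U+0080..U+FFFF for characters whose NFKD form contains non-alphanumeric ASCII. */
  static Set<Character> scanBmp() {
    Set<Character> found = new TreeSet<>();
    for (int cp = 0x80; cp <= 0xFFFF; cp++) {
      char ch = (char) cp;
      if (Character.isSurrogate(ch)) {
        continue; // lone surrogates are not characters
      }
      String nfkd = Normalizer.normalize(String.valueOf(ch), Normalizer.Form.NFKD);
      if (nfkd.chars().anyMatch(c -> c < 0x80 && !Character.isLetterOrDigit(c))) {
        found.add(ch);
      }
    }
    return found;
  }

  /** Characters found by the scan that are absent from a hard-coded set. */
  static Set<Character> missingFrom(Set<Character> hardCoded) {
    Set<Character> missing = new TreeSet<>(scanBmp());
    missing.removeAll(hardCoded);
    return missing;
  }
}
```

A test in this style would assert that `missingFrom(hardCodedSet)` is empty, and symmetrically that the hard-coded set contains nothing the scan did not find.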

mikesamuel · 2021-02-07T21:53:35Z

src/test/java/org/owasp/html/ElidedCharactersTest.java

+ *
+ * @author Simon Greatrix on 25/01/2021.
+ */
+public class ElidedCharactersTest extends TestCase {


I'll look at this PR in more detail shortly. In the meantime, these tests seem to be failing on Travis. Does that differ from what you see running locally?

mikesamuel · 2021-02-07T21:59:05Z

I'll look at this in more detail soon. Thanks so much.

simon-greatrix · 2021-02-07T23:32:23Z

When you asked, I was like "I would NEVER submit a merge request with failing tests!" Then I looked at my IDE and saw that it said "No tests found", because it has apparently forgotten how JUnit 4 works. So I'll have to change my claim to "I would never KNOWINGLY submit a merge request with failing tests!"

mikesamuel · 2021-02-07T23:59:35Z

Heh. Yeah, the codebase has some Java5/6 compatibility baggage.

simon-greatrix · 2021-02-08T00:05:46Z

Looks OK now - hope those were the correct changes.

simon-greatrix added 5 commits January 29, 2021 18:08

end-of-day

4077b19

End of day

348b7aa

Finalizing encoding changes and tests

dedafc7

Remove unwanted imports

5d711d0

Fixing whitespace in Encoding

2964e59

mikesamuel mentioned this pull request Feb 7, 2021

Forbidden numeric character references appear in sanitized HTML #223

Open

mikesamuel self-assigned this Feb 7, 2021

mikesamuel reviewed Feb 7, 2021

simon-greatrix added 3 commits February 7, 2021 23:42

Delete was missing from the required character elisions

594cc4d

Increased Guava version to 30.1 to avoid security issue

121c6c0

Additional usage of guava 27.1-jre replaced with 30.1-jre

d78fc8a
