Improve the error message for invalid characters in domain names after Unicode NFC normalization

These cases were previously handled by the call to idna.encode or idna.alabel, but the error message wasn't consistent with similar checks we do for the local part. See #142.
JoshData · Jun 19, 2024
1 parent 7f1f281
commit 8051347

Showing 3 changed files with 13 additions and 7 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,7 +3,7 @@ In Development
 
 * Email addresses with internationalized local parts could, with rare Unicode characters, be returned as valid but actually be invalid in their normalized form (returned in the `normalized` field). Local parts now re-validated after Unicode NFC normalization to ensure that invalid characters cannot be injected into the normalized address and that characters with length-increasing NFC normalizations cannot cause a local part to exceed the maximum length after normalization.
 * The length check for email addresses with internationalized local parts is now also applied to the original address string prior to Unicode NFC normalization, which may be longer and could exceed the maximum email address length, to protect callers who do not use the returned normalized address.
-* Improved error message for IDNA domains that are too long.
+* Improved error message for IDNA domains that are too long or have invalid characters after Unicode normalization.
 * A new option to parse `My Name <address@domain>` strings, i.e. a display name plus an email address in angle brackets, is now available. It is off by default.
 
 2.1.2 (June 16, 2024)

diff --git a/email_validator/syntax.py b/email_validator/syntax.py
@@ -476,6 +476,16 @@ def validate_email_domain_name(domain: str, test_environment: bool = False, glob
     except idna.IDNAError as e:
         raise EmailSyntaxError(f"The part after the @-sign contains invalid characters ({e}).") from e
 
+    # Check for invalid characters after Unicode normalization which are not caught
+    # by uts46_remap (see tests for examples).
+    bad_chars = {
+        safe_character_display(c)
+        for c in domain
+        if not ATEXT_HOSTNAME_INTL.match(c)
+    }
+    if bad_chars:
+        raise EmailSyntaxError("The part after the @-sign contains invalid characters after Unicode normalization: " + ", ".join(sorted(bad_chars)) + ".")
+
     # The domain part is made up dot-separated "labels." Each label must
     # have at least one character and cannot start or end with dashes, which
     # means there are some surprising restrictions on periods and dashes.

diff --git a/tests/test_syntax.py b/tests/test_syntax.py
@@ -392,12 +392,8 @@ def test_domain_literal() -> None:
         ('me@⒈wouldbeinvalid.com',
          "The part after the @-sign contains invalid characters (Codepoint U+2488 not allowed "
          "at position 1 in '⒈wouldbeinvalid.com')."),
-        ('me@\u037e.com',
-         "The part after the @-sign is invalid (Codepoint U+003B at position 1 "
-         "of ';' not allowed)."),
-        ('me@\u1fef.com',
-         "The part after the @-sign is invalid (Codepoint U+0060 at position 1 "
-         "of '`' not allowed)."),
+        ('me@\u037e.com', "The part after the @-sign contains invalid characters after Unicode normalization: ';'."),
+        ('me@\u1fef.com', "The part after the @-sign contains invalid characters after Unicode normalization: '`'."),
         ('@example.com', 'There must be something before the @-sign.'),
         ('white space@test', 'The email address contains invalid characters before the @-sign: SPACE.'),
         ('test@white space', 'The part after the @-sign contains invalid characters: SPACE.'),
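
The two updated test cases rely on the fact that U+037E (GREEK QUESTION MARK) and U+1FEF (GREEK VARIA) normalize under NFC to the ASCII characters `;` and `` ` ``, which are not allowed in a domain name. A minimal standalone sketch of that kind of check follows. It is not the library's code: `HOSTNAME_CHAR` is a hypothetical stand-in for `ATEXT_HOSTNAME_INTL`, `repr()` stands in for `safe_character_display`, and the explicit `unicodedata.normalize` call is an assumption made so the example runs on its own.

```python
import re
import unicodedata

# Hypothetical stand-in for the library's ATEXT_HOSTNAME_INTL pattern:
# ASCII letters, digits, hyphen and dot, plus any non-ASCII character.
HOSTNAME_CHAR = re.compile(r"[A-Za-z0-9\-.\u0080-\U0010FFFF]")


def find_bad_domain_chars(domain: str) -> list[str]:
    """Return a sorted list of disallowed characters found after NFC normalization."""
    # Assumption: normalize explicitly here; the real code path obtains the
    # normalized domain earlier in validate_email_domain_name.
    normalized = unicodedata.normalize("NFC", domain)
    # Collect into a set so each offending character is reported once.
    bad_chars = {repr(c) for c in normalized if not HOSTNAME_CHAR.match(c)}
    return sorted(bad_chars)


if __name__ == "__main__":
    for domain in ("\u037e.com", "\u1fef.com", "example.com"):
        bad = find_bad_domain_chars(domain)
        if bad:
            print("invalid characters after Unicode normalization: " + ", ".join(bad) + ".")
        else:
            print("no invalid characters")
```

Collecting the offending characters into a set and sorting them keeps the message deterministic, which is what makes exact-string assertions like the ones in `tests/test_syntax.py` practical.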