Parse display name <addr> syntax
Per request in #116, parse display name syntax also, but don't allow it unless a new allow_display_name option is set. Parsing according to the MIME specification probably isn't what's generally wanted since the use case is probably parsing inputs in email composition-like user interfaces. So it's in the spirit of a MIME message but not the letter.

If display name syntax is used, return the unquoted/unescaped display name in the returned object.
JoshData committed Apr 12, 2024
1 parent 20b4400 commit 7e14282
Showing 8 changed files with 223 additions and 43 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@ In Development
 --------------

 * The library now includes an asynchronous version of the main method named validate_email_async, which can be called with await, that runs DNS-based deliverability checks asynchronously.
+* A new option to parse `My Name <address@domain>` strings, i.e. a display name plus an email address in angle brackets, is now included. It is off by default.

 2.1.1 (February 26, 2024)
 -------------------------
16 changes: 9 additions & 7 deletions README.md
@@ -7,8 +7,7 @@ Python 3.8+ by [Joshua Tauberer](https://joshdata.me).
 This library validates that a string is of the form `name@example.com`
 and optionally checks that the domain name is set up to receive email.
 This is the sort of validation you would want when you are identifying
-users by their email address like on a registration/login form (but not
-necessarily for composing an email message, see below).
+users by their email address like on a registration form.

 Key features:

@@ -19,7 +18,8 @@ Key features:
 * Checks deliverability (optional): Does the domain name resolve?
 (You can override the default DNS resolver to add query caching.)
 * Can be called asynchronously with `await`.
-* Supports internationalized domain names and internationalized local parts.
+* Supports internationalized domain names and internationalized local parts,
+and optionally supports display names (e.g. `"My Name" <address@domain>`).
 * Rejects addresses with unsafe Unicode characters, obsolete email address
 syntax that you'd find unexpected, special use domain names like
 `@localhost`, and domains without a dot by default. This is an
@@ -29,9 +29,8 @@ Key features:
 * Python type annotations are used.

 This is an opinionated library. You should definitely also consider using
-the less-opinionated [pyIsEmail](https://github.com/michaelherold/pyIsEmail) and
-[flanker](https://github.com/mailgun/flanker) if they are better for your
-use case.
+the less-opinionated [pyIsEmail](https://github.com/michaelherold/pyIsEmail)
+if it works better for you.

 [![Build Status](https://github.com/JoshData/python-email-validator/actions/workflows/test_and_build.yaml/badge.svg)](https://github.com/JoshData/python-email-validator/actions/workflows/test_and_build.yaml)

@@ -148,6 +147,8 @@ The `validate_email` function also accepts the following keyword arguments

 `allow_domain_literal=False`: Set to `True` to allow bracketed IPv4 and "IPv6:"-prefixed IPv6 addresses in the domain part of the email address. No deliverability checks are performed for these addresses. In the object returned by `validate_email`, the normalized domain will use the condensed IPv6 format, if applicable. The object's `domain_address` attribute will hold the parsed `ipaddress.IPv4Address` or `ipaddress.IPv6Address` object if applicable. You can also set `email_validator.ALLOW_DOMAIN_LITERAL` to `True` to turn this on for all calls by default.

+`allow_display_name=False`: Set to `True` to allow a display name and bracketed address in the input string, like `My Name <address@domain>`. It's implemented in the spirit but not the letter of RFC 5322 3.4, so it may be stricter or more relaxed than what you want. The display name, if present, is provided in the returned object's `display_name` field after being unquoted and unescaped. You can also set `email_validator.ALLOW_DISPLAY_NAME` to `True` to turn this on for all calls by default.
+
 `allow_empty_local=False`: Set to `True` to allow an empty local part (i.e.
 `@example.com`), e.g. for validating Postfix aliases.

@@ -423,6 +424,7 @@ are:
 | `domain` | The canonical internationalized Unicode form of the domain part of the email address. If the returned string contains non-ASCII characters, either the [SMTPUTF8](https://tools.ietf.org/html/rfc6531) feature of your mail relay will be required to transmit the message or else the email address's domain part must be converted to IDNA ASCII first: Use `ascii_domain` field instead. |
 | `ascii_domain` | The [IDNA](https://tools.ietf.org/html/rfc5891) [Punycode](https://www.rfc-editor.org/rfc/rfc3492.txt)-encoded form of the domain part of the given email address, as it would be transmitted on the wire. |
 | `domain_address` | If domain literals are allowed and if the email address contains one, an `ipaddress.IPv4Address` or `ipaddress.IPv6Address` object. |
+| `display_name` | If no display name was present and angle brackets do not surround the address, this will be `None`; otherwise, it will be set to the display name, or the empty string if there were angle brackets but no display name. If the display name was quoted, it will be unquoted and unescaped. |
 | `smtputf8` | A boolean indicating that the [SMTPUTF8](https://tools.ietf.org/html/rfc6531) feature of your mail relay will be required to transmit messages to this address because the local part of the address has non-ASCII characters (the local part cannot be IDNA-encoded). If `allow_smtputf8=False` is passed as an argument, this flag will always be false because an exception is raised if it would have been true. |
 | `mx` | A list of (priority, domain) tuples of MX records specified in the DNS for the domain (see [RFC 5321 section 5](https://tools.ietf.org/html/rfc5321#section-5)). May be `None` if the deliverability check could not be completed because of a temporary issue like a timeout. |
 | `mx_fallback_type` | `None` if an `MX` record is found. If no MX records are actually specified in DNS and instead are inferred, through an obsolete mechanism, from A or AAAA records, the value is the type of DNS record used instead (`A` or `AAAA`). May be `None` if the deliverability check could not be completed because of a temporary issue like a timeout. |
@@ -486,4 +488,4 @@ git push --tags
 License
 -------

-This project is free of any copyright restrictions per the [Unlicense](https://unlicense.org/). (Prior to Feb. 4, 2024, the project was made available under the terms of the [CC0 1.0 Universal public domain dedication](http://creativecommons.org/publicdomain/zero/1.0/).) See [LICENSE](LICENSE) and [CONTRIBUTING.md](CONTRIBUTING.md).
+This project is free of any copyright restrictions per the [Unlicense](https://unlicense.org/). (Prior to Feb. 4, 2024, the project was made available under the terms of the [CC0 1.0 Universal public domain dedication](http://creativecommons.org/publicdomain/zero/1.0/).) See [LICENSE](LICENSE) and [CONTRIBUTING.md](CONTRIBUTING.md).
1 change: 1 addition & 0 deletions email_validator/__init__.py
@@ -33,6 +33,7 @@ def caching_async_resolver(*args, **kwargs):
 ALLOW_SMTPUTF8 = True
 ALLOW_QUOTED_LOCAL = False
 ALLOW_DOMAIN_LITERAL = False
+ALLOW_DISPLAY_NAME = False
 GLOBALLY_DELIVERABLE = True
 CHECK_DELIVERABILITY = True
 TEST_ENVIRONMENT = False
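The new `ALLOW_DISPLAY_NAME` constant sits alongside the other module-level defaults, and the README says setting it to `True` turns the option on for all calls. How `validate_email` consults it is not shown in this diff. A minimal sketch of one common way to do it, assuming a per-call keyword of `None` means "use the module default" (`resolve_option` and `check` are hypothetical names, not part of the library):

```python
from typing import Optional

# Module-level default, mirroring the constant added in this commit.
ALLOW_DISPLAY_NAME = False


def resolve_option(value: Optional[bool], default: bool) -> bool:
    """Return the per-call value, or the module default when it is None."""
    return default if value is None else value


def check(email: str, allow_display_name: Optional[bool] = None) -> str:
    """Hypothetical stand-in for a validator entry point."""
    # Read the global at call time so later changes to it take effect.
    allowed = resolve_option(allow_display_name, ALLOW_DISPLAY_NAME)
    if "<" in email and not allowed:
        raise ValueError("A display name and angle brackets are not permitted.")
    return email
```

With this shape, an explicit `allow_display_name=True` or `False` on a call always wins over the module-wide setting.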
7 changes: 6 additions & 1 deletion email_validator/exceptions_types.py
@@ -62,6 +62,9 @@ class ValidatedEmail:
     mechanism, from A or AAAA records, the value is the type of DNS record used instead (`A` or `AAAA`)."""
     mx_fallback_type: str

+    """The display name in the original input text."""
+    display_name: str
+
     """Tests use this constructor."""
     def __init__(self, **kwargs):
         for k, v in kwargs.items():
@@ -120,6 +123,7 @@ def __eq__(self, other):
             and repr(sorted(self.mx) if getattr(self, 'mx', None) else None)
             == repr(sorted(other.mx) if getattr(other, 'mx', None) else None)
             and getattr(self, 'mx_fallback_type', None) == getattr(other, 'mx_fallback_type', None)
+            and getattr(self, 'display_name', None) == getattr(other, 'display_name', None)
         )

     """This helps producing the README."""
@@ -128,7 +132,8 @@ def as_constructor(self):
             + ",".join(f"\n {key}={repr(getattr(self, key))}"
                        for key in ('normalized', 'local_part', 'domain',
                                    'ascii_email', 'ascii_local_part', 'ascii_domain',
-                                   'smtputf8', 'mx', 'mx_fallback_type')
+                                   'smtputf8', 'mx', 'mx_fallback_type',
+                                   'display_name')
                        if hasattr(self, key)
                        ) \
             + ")"
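Both changes in this file rely on the attribute possibly being absent: `__eq__` compares with `getattr(..., None)` and `as_constructor` filters with `hasattr`. A minimal sketch of that pattern, using a hypothetical `Record` class in place of `ValidatedEmail` and only two fields:

```python
class Record:
    """Hypothetical stand-in for ValidatedEmail with optional attributes."""

    FIELDS = ("normalized", "display_name")

    def __init__(self, **kwargs):
        # Only the attributes passed in are set, as in the test constructor.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __eq__(self, other):
        if not isinstance(other, Record):
            return NotImplemented
        # A missing attribute is treated the same as one set to None.
        return all(
            getattr(self, name, None) == getattr(other, name, None)
            for name in self.FIELDS
        )

    def as_constructor(self):
        # Emit only the attributes that were actually set.
        args = ", ".join(
            f"{name}={getattr(self, name)!r}"
            for name in self.FIELDS
            if hasattr(self, name)
        )
        return f"Record({args})"
```

One consequence of this design is that existing test fixtures built without `display_name` still compare equal to results where it is `None`, while an empty-string display name (angle brackets with no name) remains distinguishable.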
5 changes: 2 additions & 3 deletions email_validator/rfc_constants.py
@@ -13,7 +13,7 @@
 # RFC 3629 section 4, which appear to be the Unicode code points from
 # U+0080 to U+10FFFF.
 ATEXT_INTL = ATEXT + "\u0080-\U0010FFFF"
-ATEXT_INTL_RE = re.compile('[.' + ATEXT_INTL + ']') # ATEXT_INTL plus dots
+ATEXT_INTL_DOT_RE = re.compile('[.' + ATEXT_INTL + ']') # ATEXT_INTL plus dots
 DOT_ATOM_TEXT_INTL = re.compile('[' + ATEXT_INTL + ']+(?:\\.[' + ATEXT_INTL + r']+)*\Z')

 # The domain part of the email address, after IDNA (ASCII) encoding,
@@ -30,10 +30,9 @@
 # Quoted-string local part (RFC 5321 4.1.2, internationalized by RFC 6531 3.3)
 # The permitted characters in a quoted string are the characters in the range
 # 32-126, except that quotes and (literal) backslashes can only appear when escaped
-# by a backslash. When internationalized, UTF8 strings are also permitted except
+# by a backslash. When internationalized, UTF-8 strings are also permitted except
 # the ASCII characters that are not previously permitted (see above).
 # QUOTED_LOCAL_PART_ADDR = re.compile(r"^\"((?:[\u0020-\u0021\u0023-\u005B\u005D-\u007E]|\\[\u0020-\u007E])*)\"@(.*)")
-QUOTED_LOCAL_PART_ADDR = re.compile(r"^\"((?:[^\"\\]|\\.)*)\"@(.*)")
 QTEXT_INTL = re.compile(r"[\u0020-\u007E\u0080-\U0010FFFF]")

 # Length constants
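The rename to `ATEXT_INTL_DOT_RE` makes explicit that this character class also accepts dots. That matters for the display-name check in `syntax.py`, which rejects dots in unquoted display names with an explicit `c == '.'` test. A minimal ASCII-only sketch of that check follows. The `ATEXT` value is written out from the RFC 5322 `atext` rule as an assumption (the library's own constant is not shown in this diff), and `bad_display_name_chars` is a hypothetical helper name:

```python
import re

# RFC 5322 "atext": letters, digits, and these printable symbols.
# Assumption: written out here for illustration; the library's own
# ATEXT constant is not part of this diff.
ATEXT = r"a-zA-Z0-9!#$%&'*+\-/=?^_`{|}~"
ATEXT_DOT_RE = re.compile("[." + ATEXT + "]")  # atext plus dots


def bad_display_name_chars(name: str) -> list:
    """Return the sorted characters not allowed in an unquoted display name."""
    return sorted({
        c for c in name
        # Spaces are allowed; dots are rejected even though the regex accepts them.
        if (not ATEXT_DOT_RE.match(c) and c != " ") or c == "."
    })
```

Under this rule a name such as `J. Smith` would have to be written in quotes, since the dot is only acceptable inside a quoted string.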
164 changes: 140 additions & 24 deletions email_validator/syntax.py
@@ -1,8 +1,7 @@
 from .exceptions_types import EmailSyntaxError
 from .rfc_constants import EMAIL_MAX_LENGTH, LOCAL_PART_MAX_LENGTH, DOMAIN_MAX_LENGTH, \
-    DOT_ATOM_TEXT, DOT_ATOM_TEXT_INTL, ATEXT_RE, ATEXT_INTL_RE, ATEXT_HOSTNAME_INTL, QTEXT_INTL, \
-    DNS_LABEL_LENGTH_LIMIT, DOT_ATOM_TEXT_HOSTNAME, DOMAIN_NAME_REGEX, DOMAIN_LITERAL_CHARS, \
-    QUOTED_LOCAL_PART_ADDR
+    DOT_ATOM_TEXT, DOT_ATOM_TEXT_INTL, ATEXT_RE, ATEXT_INTL_DOT_RE, ATEXT_HOSTNAME_INTL, QTEXT_INTL, \
+    DNS_LABEL_LENGTH_LIMIT, DOT_ATOM_TEXT_HOSTNAME, DOMAIN_NAME_REGEX, DOMAIN_LITERAL_CHARS

 import re
 import unicodedata
@@ -12,31 +11,148 @@


 def split_email(email):
-    # Return the local part and domain part of the address and
-    # whether the local part was quoted as a three-tuple.
+    # Return the display name, unescaped local part, and domain part
+    # of the address, and whether the local part was quoted. If no
+    # display name was present and angle brackets do not surround
+    # the address, display name will be None; otherwise, it will be
+    # set to the display name or the empty string if there were
+    # angle brackets but no display name.

+    # Typical email addresses have a single @-sign and no quote
+    # characters, but the awkward "quoted string" local part form
+    # (RFC 5321 4.1.2) allows @-signs and escaped quotes to appear
+    # in the local part if the local part is quoted.
+
+    # A `display name <addr>` format is also present in MIME messages
+    # (RFC 5322 3.4) and this format is also often recognized in
+    # mail UIs. It's not allowed in SMTP commands or in typical web
+    # login forms, but parsing it has been requested, so it's done
+    # here as a convenience. It's implemented in the spirit but not
+    # the letter of RFC 5322 3.4 because MIME messages allow newlines
+    # and comments as a part of the CFWS rule, but this is typically
+    # not allowed in mail UIs (although comment syntax was requested
+    # once too).
+    #
+    # Display names are either basic characters (the same basic characters
+    # permitted in email addresses, but periods are not allowed and spaces
+    # are allowed; see RFC 5322 Appendix A.1.2), or a quoted string with
+    # the same rules as a quoted local part. (Multiple quoted strings might
+    # be allowed? Unclear.) Optional space (RFC 5322 3.4 CFWS) and then the
+    # email address follows in angle brackets.
+    #
+    # An initial quote is ambiguous between starting a display name or
+    # a quoted local part --- fun.
+    #
+    # We assume the input string is already stripped of leading and
+    # trailing CFWS.

+    def split_string_at_unquoted_special(text, specials):
+        # Split the string at the first character in specials (an @-sign
+        # or left angle bracket) that does not occur within quotes.
+        inside_quote = False
+        escaped = False
+        left_part = ""
+        for c in text:
+            if inside_quote:
+                left_part += c
+                if c == '\\' and not escaped:
+                    escaped = True
+                elif c == '"' and not escaped:
+                    # The only way to exit the quote is an unescaped quote.
+                    inside_quote = False
+                    escaped = False
+                else:
+                    escaped = False
+            elif c == '"':
+                left_part += c
+                inside_quote = True
+            elif c in specials:
+                # When unquoted, stop before a special character.
+                break
+            else:
+                left_part += c
+
+        # The right part is whatever is left.
+        right_part = text[len(left_part):]
+
+        return left_part, right_part

+    def unquote_quoted_string(text):
+        # Remove surrounding quotes and unescape escaped backslashes
+        # and quotes. Escapes are parsed liberally. I think only
+        # backslashes and quotes can be escaped but we'll allow anything
+        # to be.
+        quoted = False
+        escaped = False
+        value = ""
+        for i, c in enumerate(text):
+            if quoted:
+                if escaped:
+                    value += c
+                    escaped = False
+                elif c == '\\':
+                    escaped = True
+                elif c == '"':
+                    if i != len(text) - 1:
+                        raise EmailSyntaxError("Extra character(s) found after close quote: "
+                                               + ", ".join(safe_character_display(c) for c in text[i + 1:]))
+                    break
+                else:
+                    value += c
+            elif i == 0 and c == '"':
+                quoted = True
+            else:
+                value += c
+
+        return value, quoted

+    # Split the string at the first unquoted @-sign or left angle bracket.
+    left_part, right_part = split_string_at_unquoted_special(email, ("@", "<"))
+
+    # If the right part starts with an angle bracket,
+    # then the left part is a display name and the rest
+    # of the right part up to the final right angle bracket
+    # is the email address.
+    if right_part.startswith("<"):
+        # Remove space between the display name and angle bracket.
+        left_part = left_part.rstrip()
+
+        # Unquote and unescape the display name.
+        display_name, display_name_quoted = unquote_quoted_string(left_part)
+
+        # Check that only basic characters are present in a
+        # non-quoted display name.
+        if not display_name_quoted:
+            bad_chars = {
+                safe_character_display(c)
+                for c in display_name
+                if (not ATEXT_RE.match(c) and c != ' ') or c == '.'
+            }
+            if bad_chars:
+                raise EmailSyntaxError("The display name contains invalid characters when not quoted: " + ", ".join(sorted(bad_chars)) + ".")
+
-    # Typical email addresses have a single @-sign, but the
-    # awkward "quoted string" local part form (RFC 5321 4.1.2)
-    # allows @-signs (and escaped quotes) to appear in the local
-    # part if the local part is quoted. If the address is quoted,
-    # split it at a non-escaped @-sign and unescape the escaping.
-    if m := QUOTED_LOCAL_PART_ADDR.match(email):
-        local_part, domain_part = m.groups()
+        # Check for other unsafe characters.
+        check_unsafe_chars(display_name, allow_space=True)

-        # Since backslash-escaping is no longer needed because
-        # the quotes are removed, remove backslash-escaping
-        # to return in the normalized form.
-        local_part = re.sub(r"\\(.)", "\\1", local_part)
+        # Remove the initial and trailing angle brackets.
+        addr_spec = right_part[1:].rstrip(">")

-        return local_part, domain_part, True
+        # Split the email address at the first unquoted @-sign.
+        local_part, domain_part = split_string_at_unquoted_special(addr_spec, ("@",))

+    # Otherwise there is no display name. The left part is the local
+    # part and the right part is the domain.
     else:
-        # Split at the one and only at-sign.
-        parts = email.split('@')
-        if len(parts) != 2:
-            raise EmailSyntaxError("The email address is not valid. It must have exactly one @-sign.")
-        local_part, domain_part = parts
-        return local_part, domain_part, False
+        display_name = None
+        local_part, domain_part = left_part, right_part
+
+    if domain_part.startswith("@"):
+        domain_part = domain_part[1:]
+
+    # Unquote the local part if it is quoted.
+    local_part, is_quoted_local_part = unquote_quoted_string(local_part)
+
+    return display_name, local_part, domain_part, is_quoted_local_part


 def get_length_reason(addr, utf8=False, limit=EMAIL_MAX_LENGTH):
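The parsing flow in `split_email` can be tried outside the library with a standalone sketch. The version below is a simplified re-implementation for illustration only, not the library's code: it omits the character checks (`ATEXT_RE`, `check_unsafe_chars`), raises `ValueError` instead of `EmailSyntaxError`, and uses the hypothetical names `split_at_unquoted_special`, `unquote`, and `parse`.

```python
from typing import Optional, Tuple


def split_at_unquoted_special(text: str, specials: str) -> Tuple[str, str]:
    """Split text before the first character in specials that is outside quotes."""
    inside_quote = False
    escaped = False
    left = ""
    for c in text:
        if inside_quote:
            left += c
            if c == "\\" and not escaped:
                escaped = True
            elif c == '"' and not escaped:
                inside_quote = False  # an unescaped quote closes the string
            else:
                escaped = False
        elif c == '"':
            left += c
            inside_quote = True
        elif c in specials:
            break  # stop before the special character
        else:
            left += c
    return left, text[len(left):]


def unquote(text: str) -> Tuple[str, bool]:
    """Strip surrounding quotes and backslash escapes if text starts with a quote."""
    if not text.startswith('"'):
        return text, False
    value = ""
    escaped = False
    for i, c in enumerate(text[1:], start=1):
        if escaped:
            value += c
            escaped = False
        elif c == "\\":
            escaped = True
        elif c == '"':
            if i != len(text) - 1:
                raise ValueError("Extra characters after the closing quote.")
            return value, True
        else:
            value += c
    # This sketch does not reject a missing closing quote.
    return value, True


def parse(text: str) -> Tuple[Optional[str], str, str]:
    """Return (display_name, local_part, domain) for an already-stripped string."""
    left, right = split_at_unquoted_special(text, "@<")
    if right.startswith("<"):
        # Angle brackets present: the left part is the display name.
        display_name, _ = unquote(left.rstrip())
        addr_spec = right[1:].rstrip(">")
        left, right = split_at_unquoted_special(addr_spec, "@")
    else:
        display_name = None
    local_part, _ = unquote(left)
    domain = right[1:] if right.startswith("@") else right
    return display_name, local_part, domain
```

For example, `parse("user@example.com")` yields a display name of `None`, `parse("<user@example.com>")` yields the empty string, and `parse('"Smith, J." <user@example.com>')` yields `Smith, J.`, which matches the three cases described for the `display_name` field in the README.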
@@ -215,7 +331,7 @@ def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_emp
     bad_chars = {
         safe_character_display(c)
         for c in local
-        if not ATEXT_INTL_RE.match(c)
+        if not ATEXT_INTL_DOT_RE.match(c)
     }
     if bad_chars:
         raise EmailSyntaxError("The email address contains invalid characters before the @-sign: " + ", ".join(sorted(bad_chars)) + ".")