How to parse i18n characters #9
Comments
Non-ASCII is forbidden in legitimate email addresses, at least in 'classic' addresses. There are more recent extensions to SMTP, which I don't know much about, that allow non-ASCII in email headers, but AFAIK the standard protocol is still to use RFC 2047 to encode non-ASCII as ASCII. You seem to have a decoded address there. So one option is to make sure you aren't decoding the addresses from the raw header before giving them to the validator. But of course you are right: our class should be able to extract the address parts even if the personal name is invalid per the RFCs. I don't have time to work on this personally, but maybe @bbottema can take a look at toughening the parser in these cases.
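For illustration, here is a rough sketch of what an RFC 2047 "encoded-word" for a personal name looks like. The class and method names are hypothetical and not part of this library. It produces a single Base64 encoded-word and ignores the 75-character limit per encoded-word and the quoting of ASCII names with special characters, so treat it as a demonstration of the format only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Rfc2047Sketch {
    // Hypothetical helper: wraps a non-ASCII personal name in one
    // Base64 encoded-word of the form =?UTF-8?B?<base64>?=.
    static String encodePersonalName(String name) {
        // Pure-ASCII names are returned unchanged (no quoting is attempted).
        if (name.chars().allMatch(c -> c < 0x80)) {
            return name;
        }
        String b64 = Base64.getEncoder()
                .encodeToString(name.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + b64 + "?=";
    }
}
```

With this sketch, `encodePersonalName("Jos\u00e9")` yields `=?UTF-8?B?Sm9zw6k=?=`, which is the kind of ASCII-only form a raw header would carry in front of `<user@example.com>`.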
@chconnor thanks for explaining. I saw similar behavior using an npm module in Node, so I was guessing that non-ASCII characters are not allowed per the RFC. However, I am actually getting email addresses like this from an email API, and just wanted to extract the actual address (local part + domain) and personal name from the entire address. Seems like no present Java/Node library can perfectly do that :(
Hopefully @bbottema has some time to check it out; shouldn't be hard to catch an appropriate exception and just not-fail when this happens. Or better, I suppose, to check for non-ASCII preemptively and behave accordingly. Seems like an increasing number of mail servers are accepting and passing through UTF-8 type characters, so we should be able to handle it.
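A preemptive check along those lines could look roughly like this. The helper name is hypothetical and this is not code from the library; what to do when the check fails (reject, strip, or pass through) is left to the caller.

```java
final class AsciiCheckSketch {
    // Hypothetical helper: true if every UTF-16 code unit is 7-bit ASCII.
    static boolean isPureAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }
}
```

A caller might run this on the raw address string first and only hand it to the parser when it returns true.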
I would love to add extra support for this, but I recently became a father and have my hands full (literally!). Adding non-standardized support isn't exactly at the top of my list currently.
Oh, sure, pull the father card! :-) I just took a look at it and it's going to be too complicated (and probably not appropriate) for us to handle non-ascii in addresses. I'd suggest pre-processing your addresses before sending them to our class. A brutal but simple way is to just strip out non-ascii characters. If you know the email address is not null, you can just do:
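The snippet from the original comment is not reproduced here. One way to do that kind of stripping, as a minimal sketch with a hypothetical helper name and assuming a non-null input, might be:

```java
final class StripNonAsciiSketch {
    // Hypothetical helper: removes every character outside the 7-bit ASCII range.
    // Assumes address is not null, as the comment above says.
    static String stripNonAscii(String address) {
        return address.replaceAll("[^\\x00-\\x7F]", "");
    }
}
```

Note that this drops the characters entirely, so a personal name such as "Jos\u00e9" becomes "Jos".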
I don't know how you're getting these email addresses: it's possible that whoever is sending them to you is decoding them from properly-encoded RFC 2047 strings, in which case you could ask them to stop doing that and send you the raw addresses from the email headers.
Using Normalizer
This removes diacritics, but keeps base letters.
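The Normalizer approach can be sketched as follows, using the standard `java.text.Normalizer` class. The wrapper class and method names are hypothetical, and this is not necessarily the exact code the commenter posted.

```java
import java.text.Normalizer;

final class DiacriticsSketch {
    // Hypothetical helper: decomposes accented letters (NFD) and then
    // removes the combining marks, leaving the base letters behind.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Letters that have no canonical decomposition (for example "\u00f8" or "\u00df") pass through unchanged, so the result is not guaranteed to be pure ASCII and may still need the stripping step.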
How can the following address be parsed, for example?
The current version throws an exception while parsing the above email address.
And EmailAddressParser.getAddressParts returns null.