How to parse i18n characters #9

Closed
kdabir opened this issue Mar 26, 2018 · 6 comments
@kdabir

kdabir commented Mar 26, 2018

How can I parse the following address, for example?

The current version throws an exception while parsing the above email address.

And EmailAddressParser.getAddressParts returns null

@chconnor
Contributor

Non-ascii is forbidden in legitimate email addresses, at least in 'classic' addresses. There are more recent extensions to SMTP that I don't know much about that allow non-ascii in email headers, but AFAIK the standard protocol is still to use RFC 2047 to encode non-ascii as ascii. You seem to have a decoded address, there. So one option is to make sure you aren't decoding the addresses from the raw header before giving it to the validator.
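To illustrate the RFC 2047 encoding mentioned above, here is a minimal sketch of how a non-ASCII personal name can be turned into an ASCII "encoded-word" using only the JDK. The class name `Rfc2047Sketch`, the method `encodePersonalName`, and the address `jose@example.com` are made up for this example and are not part of this project. The sketch ignores the 75-character limit on encoded-words and does not quote special characters in ASCII names, so treat it as an illustration rather than a complete encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the validator library.
final class Rfc2047Sketch {

    // Returns the name unchanged if it is pure ASCII, otherwise wraps it
    // in a single Base64 ("B") encoded-word with the UTF-8 charset.
    // Assumption: the name is short enough to fit in one encoded-word.
    static String encodePersonalName(String name) {
        boolean ascii = name.chars().allMatch(c -> c < 128);
        if (ascii) {
            return name;
        }
        String b64 = Base64.getEncoder()
                .encodeToString(name.getBytes(StandardCharsets.UTF_8));
        return "=?UTF-8?B?" + b64 + "?=";
    }

    public static void main(String[] args) {
        // "Jos\u00e9" is "José"; the address is a placeholder.
        String header = encodePersonalName("Jos\u00e9") + " <jose@example.com>";
        System.out.println(header);
    }
}
```

A header in this encoded form contains only ASCII, which is what the comment above means by not decoding the raw header before handing it to the validator.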

But of course you are right: our class should be able to extract the address parts, even if the personal name is invalid per the RFCs.

I don't have time to work on this, personally, but maybe @bbottema can take a look at toughening the parser in these cases.

@kdabir
Author

kdabir commented Mar 26, 2018

@chconnor thanks for explaining. I saw similar behavior with an npm module in Node, so I was guessing that non-ASCII characters are not allowed by the RFC.

However, I am actually getting email addresses like this from an email api, and just wanted to extract the actual address (local + domain) and personal name from the entire address. Seems like no present Java/node library can perfectly do that :(

@chconnor
Contributor

Hopefully @bbottema has some time to check it out; shouldn't be hard to catch an appropriate exception and just not-fail when this happens. Or better, I suppose, to check for non-ascii preemptively and behave accordingly. Seems like an increasing number of mail servers are accepting and passing through UTF-8 type characters, so we should be able to handle it.
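The "check for non-ascii preemptively" idea could look roughly like the sketch below. It uses only the JDK; the class name `AsciiCheckSketch` and the method `isPureAscii` are hypothetical and do not exist in this project. What the caller does when the check fails (skip the address, pre-process it, report an error) is left open here.

```java
// Hypothetical guard, not part of the validator library.
final class AsciiCheckSketch {

    // True when every UTF-16 code unit of the input is in the 7-bit ASCII range.
    static boolean isPureAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        // Placeholder addresses; "Jos\u00e9" is "José".
        String[] inputs = {
            "Jose <jose@example.com>",
            "Jos\u00e9 <jose@example.com>"
        };
        for (String input : inputs) {
            if (isPureAscii(input)) {
                System.out.println("safe to hand to the parser: " + input);
            } else {
                System.out.println("contains non-ASCII, needs special handling: " + input);
            }
        }
    }
}
```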

@bbottema
Owner

bbottema commented Mar 27, 2018

I would love to add extra support for this, but I recently became a father and have my hands full (literally!). Adding non-standardized support isn't exactly at the top of my list currently.

@chconnor
Contributor

Oh, sure, pull the father card! :-)

I just took a look at it and it's going to be too complicated (and probably not appropriate) for us to handle non-ascii in addresses. I'd suggest pre-processing your addresses before sending them to our class. A brutal but simple way is to just strip out non-ascii characters. If you know the email address is not null, you can just do:

EmailAddressParser.getAddressParts(emailAddress.replaceAll("[^\\x00-\\x7F]", ""), EmailAddressCriteria.RFC_COMPLIANT, false);
...but that may not be what you want to accomplish since it will erase the personal name altogether. Actually extracting the unicode characters would require a significant re-write of this project, and I will guess that it isn't going to happen any time soon.
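To make the trade-off concrete, this small JDK-only sketch shows what the stripping regex does to two sample inputs. The class name `StripNonAsciiSketch`, the method `stripNonAscii`, and both addresses are invented for the example. It does not call the parser itself, it only shows the pre-processing step.

```java
// Hypothetical pre-processing helper, not part of the validator library.
final class StripNonAsciiSketch {

    // Removes every character outside the 7-bit ASCII range.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // "Jos\u00e9" is "José": only the accented letter is lost.
        System.out.println(stripNonAscii("Jos\u00e9 <jose@example.com>"));
        // "\u5f20\u4f1f" is a two-character Chinese name: the whole name is lost.
        System.out.println(stripNonAscii("\u5f20\u4f1f <zhang@example.com>"));
    }
}
```

The address inside the angle brackets survives in both cases, but a personal name written entirely in a non-Latin script disappears, which is the erasure described above.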

I don't know how you're getting these email addresses: it's possible that whoever is sending them to you is decoding them from properly-encoded RFC2047 strings, in which case you could ask them to stop doing that and send you the raw addresses from the email headers.

@bbottema
Owner

Using Normalizer

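import java.text.Normalizer;

// NFD splits each accented character into a base letter plus combining marks
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// ASCII-only result: drops the combining marks and every other non-ASCII character
string = string.replaceAll("[^\\p{ASCII}]", "");
// or, instead of the line above, keep other Unicode characters and drop only the combining marks:
string = string.replaceAll("\\p{M}", "");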

This removes diacritics, but keeps base letters
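As a self-contained illustration of the Normalizer approach, the sketch below wraps the two steps in a method and runs it on a sample name. The class name `DiacriticsSketch` and the method `stripDiacritics` are hypothetical; `java.text.Normalizer` is the standard JDK class.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the validator library.
final class DiacriticsSketch {

    // Decomposes accented characters (NFD), then removes the combining marks,
    // leaving the base letters in place.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "Jos\u00e9 Mu\u00f1oz" is "José Muñoz"; the address is a placeholder.
        System.out.println(stripDiacritics("Jos\u00e9 Mu\u00f1oz <jose@example.com>"));
    }
}
```

Letters that have no canonical decomposition, such as "Ø" (U+00D8), pass through unchanged with this variant, so the result is not guaranteed to be pure ASCII. Use the `[^\\p{ASCII}]` replacement if the parser must only ever see ASCII.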
