What to do about ZWJ emoji sequences #40071
Comments
Just allowing ZWJ as a (non-initial) identifier character is not a bad option. Sequences of emoji with and without the ZWJ are distinct, so to the extent we allow emoji it seems we should simply allow it.
What about ZWJ outside of emoji sequences? Unicode doesn't specify what emoji sequences are valid to join, but we could look at the character class to disallow it if what's being joined is not an emoji.
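A rough sketch of that character-class check, written in Python purely for illustration. Python's `unicodedata` module does not expose the Unicode `Emoji` property, so the general category `So` (Symbol, other) stands in as an approximate proxy here. The names `is_emoji_like`, `is_skippable`, and `zwj_only_joins_emoji` are hypothetical and do not correspond to anything in the parser.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16 (emoji presentation)


def is_emoji_like(ch):
    # Assumption: category "So" approximates "is an emoji".
    # A real check would consult the Unicode Emoji property instead.
    return unicodedata.category(ch) == "So"


def is_skippable(ch):
    # VS16 and the skin-tone modifiers may sit between an emoji and a ZWJ.
    return ch == VS16 or "\U0001F3FB" <= ch <= "\U0001F3FF"


def zwj_only_joins_emoji(name):
    """Return True if every ZWJ in `name` sits between emoji-like characters."""
    for i, ch in enumerate(name):
        if ch != ZWJ:
            continue
        # Walk left past selectors/modifiers to find the emoji being joined.
        j = i - 1
        while j >= 0 and is_skippable(name[j]):
            j -= 1
        left_ok = j >= 0 and is_emoji_like(name[j])
        right_ok = i + 1 < len(name) and is_emoji_like(name[i + 1])
        if not (left_ok and right_ok):
            return False
    return True
```

With this proxy, the rainbow flag sequence passes, while a ZWJ between two ASCII letters, or a ZWJ at the start of a name, is rejected.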
Sounds good to me.
See also Unicode Standard Annex #31 “Unicode Identifier and Pattern Syntax” Section 2.3:
Maybe do nothing, and the status quo is just fine? Are there any emoji that we need to support (e.g. some ZWJ sequences for math)? I ask because you'll be opening a large can of worms: "This Emoji ZWJ Sequence has not been Recommended For General Interchange (RGI) by Unicode. Expect limited cross-platform support." That warning is shown for e.g. https://emojipedia.org/family-woman-woman-boy-girl/ and copying and pasting that emoji into the REPL works, but it renders incorrectly, as four heads rather than as a single square image. Otherwise I have nothing against gender neutral, or "food baby": https://blog.emojipedia.org/why-is-there-a-pregnant-man-emoji/ Maybe just close this, since we already have the best Unicode support?
Allowing ZWJ/U+200D as a non-initial character would be the simplest option, but if you allow it between arbitrary characters then it does allow a whole new type of obfuscated code, e.g. two identifiers that look the same but differ only by an invisible ZWJ. And if you only allow it in emoji sequences it makes the parser more complicated, though not insurmountably so. (This is what Unicode Annex 31 recommends for languages adopting the emoji profile. Not that we hew particularly close to Annex 31 in any case.) One compromise option would be to allow ZWJ as any non-initial character, but to normalize it away in non-emoji sequences (or normalize it away entirely). That way the complexity is pushed to the symbol normalization, out of the parser.
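To illustrate the obfuscation concern and the "normalize it away entirely" variant, here is a minimal Python sketch. `normalize_identifier` is a hypothetical helper for illustration only, not the parser's actual symbol normalization.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def normalize_identifier(name):
    """Drop every ZWJ from `name` (the "normalize it away entirely" option)."""
    return name.replace(ZWJ, "")


plain = "total"
sneaky = "to" + ZWJ + "tal"  # contains an extra, normally invisible, code point

# Without normalization these are two distinct identifiers.
print(plain == sneaky)  # False

# After normalization they collapse to the same symbol.
print(normalize_identifier(plain) == normalize_identifier(sneaky))  # True
```

Note that this variant also collapses an emoji ZWJ sequence into its constituent emoji, so the joined and unjoined forms would no longer be distinct identifiers.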
I just had a brainwave — what we really want here is to not break identifiers within graphemes. So, an improved rule for identifiers would be:
This way we only need one more bit of state during parsing, for the grapheme-break state in utf8proc, and it will handle all of the emoji rules etcetera for us, and won't allow ZWJ in non-emoji sequences. Edit: actually, because of grapheme rule GB9, this will allow ZWJ in non-emoji sequences at the end of the identifier. So, we would need one more rule: identifiers cannot end with ZWJ.
Some emoji are composed of sequences of other emoji, combined by ZWJ. At the moment, ZWJ is disallowed in the parser, so these emoji cannot be used, e.g.:
This is because 🏳️‍🌈 is really 🏳️[ZWJ]🌈. We should decide what to do here, since use of these sequences is likely to expand in future Unicode versions. One option is of course to just do nothing and continue to disallow these. Another option may be to just normalize out the ZWJ in emoji sequences and treat that equivalently to the constituent emoji next to each other (since that's what they look like if ZWJ sequences are not supported by the font renderer).
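For reference, the rainbow flag can be broken down into its code points with a short Python snippet. This is only an illustration; `describe` is a hypothetical helper, and the string is written with escapes so that the ZWJ is visible in the source.

```python
import unicodedata

# Rainbow flag: white flag, emoji presentation selector, ZWJ, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"


def describe(s):
    """Return (code point, Unicode name) pairs for each character in `s`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in s]


for code_point, name in describe(rainbow_flag):
    print(code_point, name)
# U+1F3F3 WAVING WHITE FLAG
# U+FE0F VARIATION SELECTOR-16
# U+200D ZERO WIDTH JOINER
# U+1F308 RAINBOW
```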