Extended bare key ranges to include all emojis #1002

Open
ChristianSi wants to merge 2 commits into base: main

@ChristianSi ChristianSi commented Oct 27, 2023

Also explain better what's allowed in bare keys and remove the emoji example, since (though it's possible) we don't advise using emojis as bare keys.

Fixes #954.

@ChristianSi ChristianSi changed the title from "Extended bare key ranges to include all emojis." to "Extended bare key ranges to include all emojis" on Oct 27, 2023
@ChristianSi
Contributor Author

ChristianSi commented Oct 27, 2023

This is an alternative fix for #954. It extends the bare key range to allow arbitrary emojis and generally many harmless symbols that were so far excluded. At the same time it continues to allow arbitrary words in arbitrary languages in bare keys, which I consider very important for reasons of fairness and proper internationalization. And the pragmatic approach of simply defining ranges is preserved, instead of switching to an approach based on Unicode character classes that would make implementation considerably more complicated for those who don't have access to a fully-fledged Unicode support library.

The language of the written spec is also adapted to better describe what's now allowed in bare keys, and the emoji example has been deleted. (Using more or less arbitrary emojis as bare keys is now possible, but it's not something we recommend, as words make more meaningful keys.)

The extended ranges still try to exclude, in so far as reasonably possible, characters that are "problematic" in bare keys since they look similar to TOML's own meaningful punctuation (especially quotation marks, hashes, equals signs, commas, parentheses). This is just a "reasonable best effort" and not meant to be totally comprehensive – truly ensuring that visual inspection of a file gives identical results to what a parser sees is in any case impossible with Unicode (even in the total absence of bare keys).
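
To illustrate the range-based approach described in this comment, here is a minimal Python sketch. The ranges in `ILLUSTRATIVE_RANGES` are placeholders chosen for the example, not the ranges this PR defines, and `is_bare_key` is a hypothetical helper name. The point is only that a membership test over a table of code point ranges needs no Unicode database:

```python
# Illustrative code point ranges only; these are NOT the ranges from this PR.
ILLUSTRATIVE_RANGES = [
    (0x2D, 0x2D),        # hyphen-minus
    (0x30, 0x39),        # ASCII digits
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # underscore
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6),        # Latin-1 letters, stopping before U+00D7 (multiplication sign)
    (0xD8, 0xF6),        # Latin-1 letters, stopping before U+00F7 (division sign)
    (0x1F600, 0x1F64F),  # the Emoticons block
]


def is_bare_key(key):
    """Return True if every character of a non-empty key falls in some range."""
    return bool(key) and all(
        any(lo <= ord(ch) <= hi for lo, hi in ILLUSTRATIVE_RANGES) for ch in key
    )


print(is_bare_key("naïve-key"))  # letters and a hyphen
print(is_bare_key("a=b"))        # '=' is TOML punctuation and is in no range
```

A category-based rule would instead need a lookup such as `unicodedata.category`, whose answers depend on the Unicode version shipped with the implementation.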

@ChristianSi
Contributor Author

Here's a list of all the new characters added by this PR: https://gist.github.com/ChristianSi/f3d97247c79d234326c47779227b1ff0.

Note that I didn't include the first few emojis (#️ – hash sign, *️ – asterisk, ©️ – copyright, ®️ – registered, ‼️ – double exclamation mark, ⁉️ – exclamation question mark), since they seem more like punctuation or special markers than emojis to me, and of course the hash sign is reserved for comments in TOML. All the other emojis listed in https://unicode.org/Public/UNIDATA/emoji/emoji-data.txt should be included, however.

Let me know what you think!
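
For reference, the six characters named in the comment above can be inspected with Python's standard `unicodedata` module. This is only a sketch for looking them up; `describe` is a hypothetical helper, and the list covers just the characters mentioned in the comment. These base characters are typically rendered as emoji only when followed by the variation selector U+FE0F.

```python
import unicodedata

# The six base characters the comment above says were left out.
EXCLUDED = ["#", "*", "\u00a9", "\u00ae", "\u203c", "\u2049"]


def describe(ch):
    """Format one character as 'U+XXXX NAME (category)'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


for ch in EXCLUDED:
    print(describe(ch))
```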

@ChristianSi
Contributor Author

@pradyunsg Can you take a look, please? Time to get 1.1 closer to a release candidate!

@arp242
Contributor

arp242 commented Nov 6, 2023

It's not really about "emoji" but about consistent character ranges; i.e. "this type is allowed, and this type isn't".

Here's just the first thing I looked at. It currently skips:

% uni print 2768..2775
'❨'  U+2768  10088  e2 9d a8    ❨   MEDIUM LEFT PARENTHESIS ORNAMENT (Open_Punctuation)
'❩'  U+2769  10089  e2 9d a9    ❩   MEDIUM RIGHT PARENTHESIS ORNAMENT (Close_Punctuation)
'❪'  U+276A  10090  e2 9d aa    ❪   MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT (Open_Punctuation)
'❫'  U+276B  10091  e2 9d ab    ❫   MEDIUM FLATTENED RIGHT PARENTHESIS ORNAMENT (Close_Punctuation)
'❬'  U+276C  10092  e2 9d ac    ❬   MEDIUM LEFT-POINTING ANGLE BRACKET ORNAMENT (Open_Punctuation)
'❭'  U+276D  10093  e2 9d ad    ❭   MEDIUM RIGHT-POINTING ANGLE BRACKET ORNAMENT (Close_Punctuation)
'❮'  U+276E  10094  e2 9d ae    ❮   HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT (Open_Punctuation)
'❯'  U+276F  10095  e2 9d af    ❯   HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT (Close_Punctuation)
'❰'  U+2770  10096  e2 9d b0    ❰   HEAVY LEFT-POINTING ANGLE BRACKET ORNAMENT (Open_Punctuation)
'❱'  U+2771  10097  e2 9d b1    ❱   HEAVY RIGHT-POINTING ANGLE BRACKET ORNAMENT (Close_Punctuation)
'❲'  U+2772  10098  e2 9d b2    ❲    LIGHT LEFT TORTOISE SHELL BRACKET ORNAMENT (Open_Punctuation)
'❳'  U+2773  10099  e2 9d b3    ❳    LIGHT RIGHT TORTOISE SHELL BRACKET ORNAMENT (Close_Punctuation)
'❴'  U+2774  10100  e2 9d b4    ❴   MEDIUM LEFT CURLY BRACKET ORNAMENT (Open_Punctuation)
'❵'  U+2775  10101  e2 9d b5    ❵   MEDIUM RIGHT CURLY BRACKET ORNAMENT (Close_Punctuation)

But there are many parentheses; I marked the allowed ones with Y here:

% uni search 'LEFT PARENTHESIS'
N    '('  U+0028  40     28          (     LEFT PARENTHESIS (Open_Punctuation)
Y    '◌'  U+1AC1  6849   e1 ab 81    ᫁   COMBINING LEFT PARENTHESIS ABOVE LEFT (Nonspacing_Mark)
Y    '◌'  U+1AC3  6851   e1 ab 83    ᫃   COMBINING LEFT PARENTHESIS BELOW LEFT (Nonspacing_Mark)
Y    '⁽'  U+207D  8317   e2 81 bd    ⁽   SUPERSCRIPT LEFT PARENTHESIS (Open_Punctuation)
Y    '₍'  U+208D  8333   e2 82 8d    ₍   SUBSCRIPT LEFT PARENTHESIS (Open_Punctuation)
Y    '⎛'  U+239B  9115   e2 8e 9b    ⎛   LEFT PARENTHESIS UPPER HOOK (Math_Symbol)
Y    '⎜'  U+239C  9116   e2 8e 9c    ⎜   LEFT PARENTHESIS EXTENSION (Math_Symbol)
Y    '⎝'  U+239D  9117   e2 8e 9d    ⎝   LEFT PARENTHESIS LOWER HOOK (Math_Symbol)
N    '❨'  U+2768  10088  e2 9d a8    ❨   MEDIUM LEFT PARENTHESIS ORNAMENT (Open_Punctuation)
N    '❪'  U+276A  10090  e2 9d aa    ❪   MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT (Open_Punctuation)
Y    '⹙'  U+2E59  11865  e2 b9 99    ⹙   TOP HALF LEFT PARENTHESIS (Open_Punctuation)
Y    '⹛'  U+2E5B  11867  e2 b9 9b    ⹛   BOTTOM HALF LEFT PARENTHESIS (Open_Punctuation)
Y    '﴾'  U+FD3E  64830  ef b4 be    ﴾   ORNATE LEFT PARENTHESIS (Close_Punctuation)
Y    '︵' U+FE35  65077  ef b8 b5    ︵   PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS (Open_Punctuation)
Y    '﹙' U+FE59  65113  ef b9 99    ﹙   SMALL LEFT PARENTHESIS (Open_Punctuation)
Y    '(' U+FF08  65288  ef bc 88    (   FULLWIDTH LEFT PARENTHESIS (Open_Punctuation)
Y    '�'  U+E0028 917544 f3 a0 80 a8 󠀨  TAG LEFT PARENTHESIS (Format)

So basically all except these that are specifically excluded.
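
A name search like the one above can be approximated without the `uni` tool, using Python's standard `unicodedata` module. This is a rough sketch, not a reimplementation of `uni`: `search_names` is a hypothetical helper, the results depend on the Unicode version bundled with the Python interpreter, and categories appear in their short form (`Ps` for Open_Punctuation, `Pe` for Close_Punctuation). It does not produce the Y/N column, which depends on the ranges in this PR.

```python
import sys
import unicodedata


def search_names(fragment):
    """Return (code point, name, category) for every character whose name contains fragment."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # unnamed code points yield ""
        if fragment in name:
            hits.append((cp, name, unicodedata.category(ch)))
    return hits


for cp, name, cat in search_names("LEFT PARENTHESIS"):
    print(f"U+{cp:04X} {name} ({cat})")
```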

"You can have super- and subscript parens, small parens, and fullwidth parens, but not regular parens or ornamental parens" is unexplainable, just as "you can use this smiling emoji but not that other smiling emoji" is.

@ChristianSi
Contributor Author

@arp242: It's actually quite simple to explain: "You can use arbitrary words in arbitrary languages as bare keys. If you want to use more than one word as a bare key, use dashes or underscores to connect them, as whitespace is not allowed. Better don't use other characters in bare keys, as that may or may not work."

That's in fact more or less how I explain it now in the README.

These additional non-letters are simply allowed in order to simplify the range definitions, it's not that anybody is supposed to use them. So there is no point in, and no need for, detailed explanations.

@arp242
Contributor

arp242 commented Nov 10, 2023

> Better don't use other characters in bare keys, as that may or may not work.

I can "explain" everything with this.

People don't read specs, nor should they. They try stuff and see what happens. What you get now is that they try something, which works, and then they change it to something else and that doesn't work, which makes no sense.

@ChristianSi
Contributor Author

Why should people try using something like LEFT PARENTHESIS UPPER HOOK in a key and then, when it works, move on to MEDIUM LEFT PARENTHESIS ORNAMENT, only to be disappointed that it doesn't work? That doesn't strike me as a very likely scenario. I think you're worrying too much.

@ChristianSi
Contributor Author

@pradyunsg Are you around to take a look?

@ChristianSi
Contributor Author

@pradyunsg Kind reminder that this is still open, and that maybe you need additional maintainers to support you?

@ChristianSi
Contributor Author

@pradyunsg Ping?

@abelbraaksma
Contributor

abelbraaksma commented May 19, 2024

> And the pragmatic approach of simply defining ranges is preserved, instead of switching to an approach based on Unicode character classes that would make implementation considerably more complicated for those who don't have access to a fully-fledged Unicode support library.

I think this is the key improvement here. Allowing only certain character classes was never gonna work (issues with different versions of Unicode, or, if specified, way too specific and unwieldy to be maintainable long-term in any practical way).

My opinions have already been discussed in the linked issue. The main point was this one: blowing up the spec for the sake of one single character class specification, and that is addressed here. While this PR includes many characters that arguably shouldn't be part of a bare key, ultimately, that's up to users (and there are many languages, including certain .NET languages, that support all characters in type and member names, so there's precedent): if they want to write code that includes emojis or brackets, they can.

The more pragmatic and welcome change here is to simply be inclusive of other scripts and languages (which was the aim of the original PR that extended bare keys). Anything else is just what fishermen call "by-catch" ;).

Great work for getting this ready!

Comment on lines +111 to +115
are generally accepted, while not all symbols and punctuation marks are. If you
want to use a bare key made up of several words, use a suitable separator
character (such as an underscore or hyphen) between the words, as spaces are not
allowed. Note that bare keys are allowed to be composed of only digits, e.g.
1234, but are always interpreted as strings.

I like this. We tried to come up with some advisory language in the earlier attempt to extend bare keys, but decided against it. I think it is good to have it in 👍.
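
The advice in the README lines under review (separators between words, digit-only keys read as strings) can be checked against an existing parser. The sketch below uses Python's `tomllib`, which implements TOML 1.0.0 and therefore knows nothing about the extended ranges proposed here; the fallback to the `tomli` backport is an assumption for older Python versions, and `parses` is a hypothetical helper.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: the tomli backport is installed

# A digits-only bare key, plus multi-word keys joined by underscore and hyphen.
doc = tomllib.loads('1234 = "digits"\nfirst_name = "Ada"\nlast-name = "Lovelace"\n')
print(doc)  # the key 1234 comes back as the string "1234"


def parses(text):
    """Return True if text is accepted as a TOML document."""
    try:
        tomllib.loads(text)
        return True
    except tomllib.TOMLDecodeError:
        return False


print(parses('first name = "Ada"'))  # a space inside a bare key is rejected
```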

Successfully merging this pull request may close these issues.

Not all emojis work as bare keys
3 participants