Allow non-ASCII identifiers #2457

pyfisch · 2018-06-03T17:25:37Z

Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Rust identifiers.

It uses the feedback from rust-lang/rust#28979, #2455 and was updated several times based on the comments received in this PR.

The different options for identifier Unicode support can be summarized in this list, ordered from least complex to most complex:

ASCII identifiers (currently implemented in stable Rust)
Immutable Identifiers (implemented in C++)
Unicode Identifiers (JavaScript, currently implemented in feature(non_ascii_idents))
Normalized Identifiers (Python)
Confusable Detection (proposed in this RFC)

Level 2 already provides good Unicode support for most uses. Normalizing identifiers and confusable detection improves security and usability.

sfackler · 2018-06-03T17:30:21Z

text/0000-unicode-idents.md

+
+> **<sup>Lexer:<sup>**  
+> IDENTIFIER_OR_KEYWORD:  
+> &nbsp;&nbsp; XID_Start&nbsp;XID_Continue<sup>\*</sup>  


How does this compare to Go's identifier grammar: https://golang.org/ref/spec#Identifiers? It doesn't use XID_Start and XID_Continue, but instead the "Letter" and "Number, decimal digit" character classes.

XID_Start includes "Letter" but also includes "Number, Letter" (for example Roman numerals like Ⅻ) and a few characters for compatibility and to ensure that XID_Start is closed under NFKC.

XID_Continue includes all of XID_Start and "Number, decimal digit" , marks to build combining characters from parts (e + ^ → ê), 10 "Punctuation, Connector" and again a few for compatibility and NFKC.

There is golang/go#194 where combining characters are discussed as they are needed to write some Asian languages. See also: https://golang.org/doc/faq#unicode_identifiers

est31 · 2018-06-03T21:29:39Z

text/0000-unicode-idents.md

+
+Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).
+
+A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context. 


I don't think the lint should be allow by default. It should at least be warn by default. Starting to use non ascii idents should be a conscious choice, not an accidental one.

As for checking imported idents as well, I think this should be a separate lint. You might want to be able to import something from a foreign language crate but not want to have foreign language idents in your own code.

Reposting my comment from #2455 (comment), I think that there should be different lints, with different levels, as some languages should be allow/warn by default, and some others should be deny by default.

mark-i-m · 2018-06-03T23:05:41Z

text/0000-unicode-idents.md

+* Identifiers starting with numbers or "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ...
+* Emojis: 🙂, 🦀, 💩, ...
+
+Similar Unicode identifiers are normalized: `a1` and `a₁` refer to the same variable. This also applies to accented characters which can be represented in different ways.


FWIW, this sentence does not render properly on my phone. I get a1 and a<box>...

One other drawback that is not mentioned in the RFC is that unicode characters don't render properly on all terminals and fonts.

mark-i-m · 2018-06-03T23:08:50Z

I wanted to emphasize how much of an ergonomics regression it would be for me if this was added and actually used. The combination of "hard to type" and "often incorrectly rendered" would make most libraries that use this useless for me. (Needless to say, I would be very unhappy if non-ascii characters were ever used in libstd)...

burdges · 2018-06-03T23:23:10Z

I think the NFKC form restriction here sounds essential, so very nice catch there.

We should keep the skeleton test proposed here too. Yet, I'm not convinced it addresses homograph attacks because confusions might arise with identifiers not in scope. We cannot run the skeleton test across all scopes obviously.

In practice, there are almost no scenarios where more than a few scripts get used together, even when doing mathematics, so enabling each script separately sounds viable, ala

#[unicode_idents(Cyrillic)]
#[unicode_idents(Greek)]
#[unicode_idents(Hebrew)]

If the compiler encounters a allowable but currently disallowed script, then it might issue a warning giving both the line and column, as well as propose the correct #[unicode_idents(..)] line. Also, there are many ancient scripts like say Linear A that should never occur in identifiers, so ignoring them sounds helpful.

dscorbett · 2018-06-03T22:20:38Z

text/0000-unicode-idents.md

+> 
+> For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have larger difficulties to use Latin to write their native words.
+
+Additionally some math oriented projects may want to use identifiers closely resembling mathematical writing.


NFKC removes the mathematical distinction of font style, as in ⟨ℍ⟩ vs. ⟨ℋ⟩.

@dscorbett yeah the idea is that while you can use stylized characters you should not go completely overboard and use all of them in the same formula. This applies also to the distinguished in natural language orthographies.

dscorbett · 2018-06-03T22:28:36Z

text/0000-unicode-idents.md

+
+`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed.
+
+Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).


NFKC folds together some characters distinguished in natural language orthographies, such as Tifinagh ⟨ⵡ⟩ and ⟨ⵯ⟩.

dscorbett · 2018-06-03T22:53:52Z

text/0000-unicode-idents.md

+> **<sup>Lexer:<sup>**  
+> IDENTIFIER_OR_KEYWORD:  
+> &nbsp;&nbsp; XID_Start&nbsp;XID_Continue<sup>\*</sup>  
+> &nbsp;&nbsp; | `_` XID_Continue<sup>+</sup>  


The guide-level explanation says that identifiers may contain “more letters, digits and some connecting punctuation”, but XID_Continue does not include connecting punctuation.

XID_Continue includes characters from the Pc class called "Punctuation, Connector".

So it does. I must have been looking at XID_Start.

dscorbett · 2018-06-03T22:58:45Z

text/0000-unicode-idents.md

+# Motivation
+[motivation]: #motivation
+
+Rust is written by many people who are not fluent in the English language. Using identifiers in ones native language eases writing and reading code for these developers.


Some languages use U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER, which are not in XID_Continue. Should they be allowed too? See section 2.3 of UAX #31.

To have proper support then they should be allowed, as it affects the rendering of text, which may cause it to have different meanings (I'm not sure, but it seems likely to me). However, this can cause issues like in C++ and Swift for example， different number of zero-width joiners or spaces are different identifiers, and that leads to a readability nightmare.

That is why UAX #31 suggests allowing them only in certain positions, to avoid having multiple distinct identifiers look identical.

So AIUI these are used for:

Forcing explicit viramas in Indic scripts in consonant clusters. There is a semantic difference but it's exceedingly minor and mostly comes in play for Sanskrit. It's a pretty minor difference; AIUI it's used in certain kinds of compound words and is very easily omitted

Certain vowel presentational forms in Bengali and Oriya. Also minor.

Forcing letters to take different (word-medial, etc) forms in the Perso-Arabic script, used for:

Arabic affixes, when shown in isolation or when used with non-arabic words. Imagine you had to write something like "Rust's" where "Rust" is in the Latin script but the 's is in Arabic, you need the Arabic suffix to not use a word-initial text form. An example of this is the ب prefix preposition

Arabic abbreviations

AFAICT these are all somewhat optional (though preferred). The abbreviations one might be the most used.

(that said, there are much larger problems with using an RTL script in rust)

ZWNJ has important use in Persian at least https://en.m.wikipedia.org/wiki/Zero-width_non-joiner

dscorbett · 2018-06-03T23:04:21Z

text/0000-unicode-idents.md

+Examples of valid identifiers are:
+
+* English language words: `color`, `image_width`, `line2`, `Photo`, `_unused`, ...
+* ASCII words in foreign languages: `die_eisenbahn`, `el_tren`, `artikel_1_grundgesetz`


Since the primary beneficiaries are developers who prefer not using English identifiers, these examples wouldn’t be in foreign languages for all readers.

Yeah I tried to follow the spirit of the section to explain as if you were "teaching it to another Rust programmer" and since the explanation in English I found it appropriate to refer to these languages as foreign. (Do you know a better word?)

Normally I would say: "Du kannst den Variablen auch deutsche Namen geben, Umlaute funktionieren auch." 😉

How about “other”? Compare the last example, which says “other scripts” instead of “foreign scripts”.

I think that “non-English languages” works. But that distinction from the previous point isn’t really relevant in the first place. Identifiers made of ASCII letters that do not form a “real” word in any language are also valid. Maybe replace both points with “ASCII letters and digits”?

dscorbett · 2018-06-03T23:14:19Z

text/0000-unicode-idents.md

+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][TR31]. Rust compilers shall use at least Revision 27 of the standard.


The revision should be specified exactly. Otherwise, the same identifier in the same Rust version could be valid in one compiler but another.

I don't want to lock Rust compilers to a specific Unicode revision. An alternative would be to state the supported Unicode revision for each Rust edition.

New versions of Rust should use new versions of Unicode, of course, but any given version of Rust should have an unambiguous definition of identifiers. Your alternative is good.

Here is why the current text is wrong. Let’s say, for example, that revision 28 adds ⟨⍙⟩ to XID_Start, just in time for Rust 1.40.0. Would ⟨⍙abc⟩ be a valid identifier in Rust 1.40.0? It would be valid in compilers that use revision 28, but not in compilers that still use revision 27. Both compilers would be correct and yet disagree about something as basic as identifier validity.

FWIW we’ve already been updating Unicode for the standard library (e.g. char::to_lowercase), and have https://doc.rust-lang.org/std/char/constant.UNICODE_VERSION.html to indicate which version is in use.

UAX 31 should be referenced for information/context, but I think that its revision number can be removed entirely. This RFC defines identifier syntax based on the XID_Start and XID_Continue properties, which are specified exactly for a given version of Unicode (e.g. 10.0.0).

To maintain Rust’s stability promise, we need to ensure that:

If a given string is a valid identifier in a given version of Rust, it needs to stay a valid identifier in later versions of Rust.

If two strings are valid identifiers in a given version of Rust, whether they compare equal after normalization needs to be unchanged in later versions of Rust.

My reading of UAX 31 and 15 is that we can do that and still update the to newer versions of Unicode in newer versions of Rust backward-compatibly without involving Rust editions. We should have some way communicate to users the version mapping, and I think that the existing std::char::UNICODE_VERSION (with docs archive like https://doc.rust-lang.org/1.15.0/std/char/constant.UNICODE_VERSION.html) is already satisfactory for that.

This translates 1. above into the requirement UAX31-R1b Stable Identifier:

To meet this requirement, an implementation shall guarantee that identifiers are stable across versions of the Unicode Standard: that is, once a string qualifies as an identifier, it does so in all future versions.

Per https://www.unicode.org/reports/tr31/#Backward_Compatibility I believe that XID_Start XID_Continue* | "_" XID_Continue+ meets that requirement. (Since XID_Start / XID_Continue include Other_ID_Start / Other_ID_Continue respectively. I’d appreciate if someone could double-check and confirm my understanding.)

Additionally, code points that are not assigned in a given Unicode version cannot be in XID_Start or XID_Continue in that Unicode version. And https://www.unicode.org/reports/tr15/#Versioning defines:

It is crucial that Normalization Forms remain stable over time. That is, if a string that does not have any unassigned characters is normalized under one version of Unicode, it must remain normalized under all future versions of Unicode.

Updating to a newer Unicode does not change the result of NFKC(X) or NFKC(Y), so NFKC(X) = NFKC(Y) is also unchanged and requirement 2. above is also met.

@dscorbett

Let’s say, for example, that revision 28 adds ⟨⍙⟩ to XID_Start, just in time for Rust 1.40.0. Would ⟨⍙abc⟩ be a valid identifier in Rust 1.40.0?

It would a Unicode version rather than in a revision of UAX 31 that adds it, but yes, if Rust 1.40.0 is the one that upgrades to that version of Unicode (regardless of when that version of Unicode was released), then ⍙abc would be a valid ident in Rust 1.40.0 but not 1.39.0. This is similar to using new a language feature newly in 1.40.0.

Right. It was a rhetorical question to show why specifying the Unicode version exactly is necessary.

Bear in mind, the unicode annexes also do get updated every few versions, often to accommodate for new kinds of code points but sometimes to just fix things.

dscorbett · 2018-06-03T23:22:43Z

text/0000-unicode-idents.md

+
+Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile.
+
+The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X in the current scope execute the function `skeleton(X)`. If there exist two distinct identifiers X and Yin the same crate where `skeleton(X) = skeleton(Y)` report it.


skeleton does not handle default-ignorable code points, but many are in XID_Continue. The lint should delete them before running skeleton.

I see why this would be a good idea but if we decide to do it I would like to consult with the Unicode people first to find out why they did not do it in this way.

I suspect they didn’t do it that way because UTS #39’s confusability data’s “primary goal is to include characters that would be Status=Allowed”, and default-ignorable code points are Status=Restricted.

Typo: Yin -> Y in (missing space)

dscorbett · 2018-06-03T23:25:23Z

text/0000-unicode-idents.md

+Examples of invalid identifiers are:
+
+* Keywords: `impl`, `fn`, `_` (underscore), ...
+* Identifiers starting with numbers or "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ...


third√of7 does not start with a number or non-letter.

dscorbett · 2018-06-03T23:30:59Z

text/0000-unicode-idents.md

+1. Require all identifiers to be in NFKC or NFC form.
+2. Two identifiers are only equal if their codepoints are equal.
+3. Perform NFC mapping instead of NFKC mapping for identifiers.
+4. Only a number of common scripts could be supported.


Other possible variants are the restriction levels of UTS #39.

This is listed as variant 5.

sfackler · 2018-06-03T23:49:52Z

Swift supports unicode identifier as a another point of comparison for the exact allowed behavor: https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html#//apple_ref/doc/uid/TP40014097-CH30-ID412

@mark-i-m are those concerns about widespread use of non-ascii identifiers borne out in your experience in other languages that allow non-ascii identifiers?

mark-i-m · 2018-06-04T00:27:54Z

@mark-i-m are those concerns about widespread use of non-ascii identifiers borne out in your experience in other languages that allow non-ascii identifiers?

Yes, in particular,

I used to TA in an intro programming class (in Java). One of the things we always had problems with was that when students copied code from browsers (e.g. class examples) there were often random unicode characters that were hard to find and get rid of, even in an IDE. This problem is worse in terminal-based editors in my experience. Java doesn't allow unicode identifier TMK, but the experience left a decidedly negative taste in my mouth for editing files with unicode characters.
I took an optimization class recently and we used a lot of Julia. Julia does allow unicode characters as idents. In fact, the prof often used greek letters for variable names in sample code. In Jupyter notebooks, you can add unicode characters with \name<tab>, which opens a drop-down to insert the actual character. This always slowed me down a lot, and I ended up just putting placeholders everywhere and find/replacing with the unicode characters later. In vim, this is just plain painful because there is no drop-down. There is probably a better way to do this, but my defacto way of getting a unicode character in vim is
- to google the character
- find a reputable-looking website that has the unicode character I want
- copy/paste from firefox into my terminal window with vim.
- hope that I have copied the right character and nothing else.
This is very suboptimal IMHO...

shingtaklam1324 · 2018-06-04T04:16:51Z

As well as the other concerns that have been raised in the thread, there is something which I experienced while doing some work in Ruby, which is that for some languages, there are multiple encoding systems, and all of which have significant use. One example is Chinese, which has Unicode, and also GBK which is used about as much as Unicode. This can cause issues as code encoded in GBK is valid code right now with ASCII idents, but with this addition, if a user decides to use GBK to encode their (library for example), then the user uses Unicode, this can cause issues. This is a valid concern as AFAIK, GBK is a superset of Unicode 1.1, and it should be supported somehow.

Rendering these kinds of code on terminals which do not support past their designed and required character ranges (ASCII, GBK ...), can be an issue, as they cause random glyphs to occur in place of the actual characters.

Raise two more questions. Suggest restriction levels as an alternative design. Describe the Go language identifier syntax.

pyfisch · 2018-06-04T07:35:05Z

@est31

Starting to use non ascii idents should be a conscious choice, not an accidental one.

In my experience it is quite rare to accidentially use non ASCII idents. Almost all developers are aware of the distinction between basic latin (ASCII) and other characters so they wont introduce them if they don't intend to do so. Typos containing Unicode characters are also unlikely as one usually needs to make the same mistake at least twice (definition and use).

@est31

You might want to be able to import something from a foreign language crate
The expectation is that all open-source crates (a business can of course adhere to the same policy) for general use have an ASCII-only interface. But ultimately this is the library authors decision. For this reason I have not proposed distinct lints.

@shingtaklam1324 @burdges

I think that there should be different lints, with different levels, as some languages should be allow/warn by default, and some others should be deny by default.

I did not propose different lints based on how the recommended script list or enabling each script individually to limit implementation complexity and make it easier for each developer to use the scripts he/she wants to use.

Limiting scripts is intended to improve security. As there is already a Unicode on/off switch the problem applies only to projects that want to use non select Latin scripts but are afraid that other scripts sneak in.

@shingtaklam1324

Regarding GBK and other encodings. Rustc currently only accepts UTF-8 (and thus also ASCII-only) encoded files. Because of this an editor to write Rust code should be set to UTF-8 and files in GBK need to converted to UTF-8.

@mark-i-m

here is probably a better way to do this, but my defacto way of getting a unicode character in vim is

to google the character

find a reputable-looking website that has the unicode character I want

copy/paste from firefox into my terminal window with vim.

hope that I have copied the right character and nothing else.

This must be very annoying. When I need to use Greek letters I open my Greek letter sheet in a separate window and copy the characters as needed which works reasonably well. On the other hand it did not seem to annoy you enough to look up a better way: 1 or 2

shingtaklam1324 · 2018-06-04T10:05:46Z

@pyfisch

rustc does actually accept GBK code if all characters are within Unicode 1.1, as it would be valid Unicode. I should have clarified that. This could cause issues with rustc in the future, but right now it is fine. It shouldn't be much of an issue, as Rust should throw an error if any Chinese Characters are used in the code at all, but adding this shouldn't break any existing GBK code AFAIK

SimonSapin · 2018-06-04T14:46:30Z

text/0000-unicode-idents.md

+To disallow any Unicode identifiers in a project (for example to ease collaboration or for security reasons) limiting the accepted identifiers to ASCII add this lint to the `lib.rs` or `main.rs` file of your project:
+
+```rust
+#![forbid(unicode_idents)]


To split hairs: this should be non_ascii_idents.

Rust source files are always Unicode. The U+0061 to U+007A range is part of Unicode.

Came here to say this 😄

SimonSapin · 2018-06-04T15:00:12Z

text/0000-unicode-idents.md

+
+For open source crates it is recommended to write them in English and use ASCII-only. An exception should be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should provide an ASCII-only API.
+
+Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overuse it.


I would perhaps weaken the language in these two paragraphs, with phrases like “suggested” and “should consider”. English is indeed the de-facto international language and ASCII-only idents are indeed more friendly to an international audience (if only for typing), but it is not a Rust RFC’s place to judge which other concerns are or are not acceptable reasons to do otherwise, or how much use is overuse.

SimonSapin · 2018-06-04T15:04:39Z

text/0000-unicode-idents.md

+
+`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed.
+
+Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).


s/there/their/ ?

est31 · 2018-06-04T15:14:08Z

In my experience it is quite rare to accidentially use non ASCII idents. Almost all developers are aware of the distinction between basic latin (ASCII) and other characters so they wont introduce them if they don't intend to do so.

But a PR sent to you could contain special characters without you noticing. Surely, the confusable check might catch it, but it might also have a bug. I am not generally against giving an option to allow unicode letters, but this danger still looms as long as the lint is default on.

Another point: when I want a codebase to be in german, most times I don't, I don't think I'd want to allow non-german characters or non latin scripts. Here, narrowing the lint down to a specific script would be really useful.

SimonSapin · 2018-06-04T16:09:09Z

text/0000-unicode-idents.md

+* Are crates with Unicode names allowed and can they be published to crates.io?
+* Are `unicode_idents` and `confusable_unicode_idents` good names?
+* Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]?
+* Should *rustc* accept files in a different encoding than *UTF-8*?


This one is easy: no. Why would it? UTF-8 by definition supports all of Unicode. Also, source files encodings seem off-topic for this RFC.

SimonSapin · 2018-06-04T16:50:52Z

Sorry I’m splitting hairs again, but please replace “Unicode idents” with “Non-ASCII idents” in various places (PR title, RFC filename, lint name, etc.)

Rust syntax is already based on Unicode. ASCII-range letters are part of Unicode.

SimonSapin · 2018-06-04T16:58:41Z

text/0000-unicode-idents.md

+
+`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed.
+
+Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).


I’d like this to go further and specify:

Parsers for Rust syntax normalize idents to NFKC

APIs such as proc_macro::Ident::new normalize to NFKC

As a consequence, identifiers are considered equal if their NFKC forms are equal (module hygiene concerns, which are out of scope for this RFC), and APIs such as proc_macro::Ident::to_string return a normalized string.

joshtriplett · 2018-06-04T17:49:08Z

So, first of all, this seems well-thought-out; if we're going to allow Unicode identifiers, this seems like the right approach to do so.

That said, I personally would prefer not to do this at all, and to require that identifiers remain ASCII.

Also, as another alternative to include (in the alternatives section): we could require that any identifiers exported from a crate must always be ASCII, but identifiers that remain entirely within the same crate may be Unicode.

EDIT: this comment is obsolete, please see my comments starting from #2457 (comment) instead, which discuss specific concerns and tradeoffs.

est31 · 2018-06-04T17:58:40Z

I did not propose different lints based on how the recommended script list or enabling each script individually to limit implementation complexity and make it easier for each developer to use the scripts he/she wants to use.

Maybe I'm missing something, maybe this is super complicated as well, but I think that confusability checks as well as normalisation algorithms are far harder to implement than a simple lookup to which script a character belongs to.

SimonSapin · 2018-06-04T18:11:02Z

That said, I personally would prefer not to do this at all, and to require that identifiers remain ASCII.

Could you expand a bit more on what that preference is, and why?

My preference it to keep idents ASCII and based on English words, and keep comments in English, in projects that I create, maintain, contribute to, or use. (And English is not my native language.) But that doesn’t require an ASCII-only restriction to be baked into the language. It’s more of a social question than a technical one in my opinion.

we could require that any identifiers exported from a crate must always be ASCII

The crate boundary is rather arbitrary here. A single system with a region-specific purpose could be split into multiple crates, for example to manage recompilation speed.

joshtriplett · 2018-06-04T18:25:05Z

@SimonSapin It's a tradeoff between "should people be able to write identifiers in languages that can't be represented in ASCII" versus "should people be able to read arbitrary code". Both of those are important, and I don't want to discount either, but I'd favor the latter.

Another useful alternative that I think seems worth documenting: allowing unicode identifiers, but leaving them as deny by default, and requiring an explicit allow in files using them.

SimonSapin · 2018-06-04T18:52:09Z

@joshtriplett Yes this is an important trade off, but it is up to the author of a given piece of code to make that choice. It is not our place to dictate what language people should speak or write in contexts we can’t even think of. Even if it were, a character set restriction is a poor way to do it.

est31 · 2018-06-04T19:07:19Z

@SimonSapin anything that would speak against a deny by default lint?

Thanks to SimonSapin for the suggestions.

nox · 2018-10-20T16:33:08Z

Oooooh, TIL. Thanks for the pointer to the reference, I only knew about deny and my brain didn't even register that forbid is a different verb altogether. 🤷🏻‍♂️

Manishearth · 2018-10-20T22:08:05Z

@ubsan mentioned that _ is currently an identifer (one that is a reserved keyword) in the grammar, and this RFC accidentally makes that no longer the case. Fixed -- this RFC now allows _ to be an identifier.

(This is mostly an implementation detail, but might as well not change things we don't need to)

strega-nil · 2018-10-21T00:26:25Z

@Manishearth I actually don't know that that's true, I just dislike differentiating between _ and all other keywords.

On another thing entirely: frig yes this RFC is friggin awesome. I'm so excited that this is happening :D

Centril · 2018-10-21T10:02:12Z

So long as no observable behavior changes wrt. _ then that is fine by me; in other words, this should continue to be an error:

macro_rules! m { ($i:ident) => {} }
m!(_); // error: no rules expected the token `_`

Manishearth · 2018-10-21T12:29:50Z

Yeah, I mentioned in the RFC that this is just to match current behavior.

rfcbot · 2018-10-29T09:44:11Z

The final comment period, with a disposition to merge, as per the review above, is now complete.

Centril · 2018-10-29T10:29:59Z

Huzzah! This RFC has been merged!

Tracking issue: rust-lang/rust#55467

earthengine · 2018-11-01T04:46:52Z

Mmmm... I would say, if we would allow non-ASCII identifiers, I didn't see a reason to not allow emojis as well. Emojis are today's international language, when being well designed they can be very well presented the intensions of the code.

fn check_☎(☎_no: &str) -> Result<(),Error> {
}

Of cause people will abuse it. But we can at least design a limited set of such symbols?

Tom-Phinney · 2018-11-01T04:58:14Z

We now do allow non-ASCII identifiers; it already landed. The identifiers we allow are most of the subset recommended by the Unicode Consortium, as detailed in many prior posts in this thread. Emoji are not among the consortium's recommendations. IIUC, any proposal to add them needs its own RFC.

See also Rust RFC 2457: rust-lang/rfcs#2457

CAD97 · 2021-12-22T16:15:16Z

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

bjorn3 · 2021-12-22T16:17:00Z

For homoglyphs there are several lints to prevent that. These lints have existed since long before that paper was released. For bidir control characters we got a new lint as part of the coordinated disclosure of CVE-2021-42574.

crlf0710 · 2022-01-24T02:43:39Z

And CVE-2021-42574 is about string literals, not identifiers. (Those control characters are not allowed in identifiers)

The feature `non_ascii_idents` gives Rune feature parity with Rust in supporting Unicode identifiers. Refer to: rust-lang/rfcs#2457

Initial draft of unicode-idents RFC

ec728b3

sfackler reviewed Jun 3, 2018

View reviewed changes

pyfisch mentioned this pull request Jun 3, 2018

Allow non-ASCII identifiers #2455

Closed

Centril added the T-lang Relevant to the language team, which will review and decide on the RFC. label Jun 3, 2018

est31 reviewed Jun 3, 2018

View reviewed changes

mark-i-m reviewed Jun 3, 2018

View reviewed changes

dscorbett reviewed Jun 3, 2018

View reviewed changes

Include expected Usage Notes and minor changes

4c1bda9

Raise two more questions. Suggest restriction levels as an alternative design. Describe the Go language identifier syntax.

SimonSapin reviewed Jun 4, 2018

View reviewed changes

zbraniecki mentioned this pull request Jun 4, 2018

Allow non-ASCII identifiers projectfluent/fluent#117

Open

Improve descriptions and fix typos

619f5b4

Thanks to SimonSapin for the suggestions.

Allow bare underscore identifiers

c4dff64

rfcbot added the finished-final-comment-period The final comment period is finished for this RFC. label Oct 29, 2018

rfcbot removed the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Oct 29, 2018

Centril mentioned this pull request Oct 29, 2018

Tracking issue for RFC 2457, "Allow non-ASCII identifiers" rust-lang/rust#55467

Closed

23 tasks

RFC 2457

0c78631

Centril merged commit 7bf6206 into rust-lang:master Oct 29, 2018

SimonSapin mentioned this pull request Oct 29, 2018

XID_Start / XID_Continue might not be quite right rust-lang/rust#4928

Closed

pyfisch deleted the unicode-idents branch October 29, 2018 13:51

Xanewok mentioned this pull request Nov 2, 2018

Fix crashes when editing wide characters spanning 2 u16s (UTF-16) rust-lang/rls#1112

Merged

Centril added A-syntax Syntax related proposals & ideas A-non-ascii-idents Proposals relating to non-ascii identifiers labels Nov 23, 2018

crlf0710 mentioned this pull request Nov 13, 2019

Plans for development of key crates supporting standard library and/or compiler. rust-lang/wg-governance#27

Closed

Manishearth mentioned this pull request Apr 6, 2021

Allow non-ascii-idents in extern blocks rust-lang/rust#83942

Open

evolutics added a commit to evolutics/iftree that referenced this pull request May 24, 2021

Respect non-ASCII identifiers in sanitization for clearer names

04e8fc0

See also Rust RFC 2457: rust-lang/rfcs#2457

workingjubilee mentioned this pull request Feb 6, 2024

Lint against non-NFC items? rust-lang/rust#120723

Open

lotem added a commit to lotem/rune that referenced this pull request Aug 21, 2024

Add feature: non-ASCII identifiers

2bb699e

The feature `non_ascii_idents` gives Rune feature parity with Rust in supporting Unicode identifiers. Refer to: rust-lang/rfcs#2457

lotem mentioned this pull request Aug 21, 2024

Support non-ASCII identifiers rune-rs/rune#813

Merged

lotem added a commit to lotem/rune that referenced this pull request Aug 21, 2024

Add feature: non-ASCII identifiers

6a218c8

The feature `non_ascii_idents` gives Rune feature parity with Rust in supporting Unicode identifiers. Refer to: rust-lang/rfcs#2457


		Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).

		A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context.


		`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed.

		Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).


		Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile.

		The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X in the current scope execute the function `skeleton(X)`. If there exist two distinct identifiers X and Yin the same crate where `skeleton(X) = skeleton(Y)` report it.


		For open source crates it is recommended to write them in English and use ASCII-only. An exception should be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should provide an ASCII-only API.

		Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overuse it.

Allow non-ASCII identifiers #2457

Allow non-ASCII identifiers #2457

Conversation

pyfisch commented Jun 3, 2018 • edited by Centril Loading

Choose a reason for hiding this comment

pyfisch Jun 3, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shingtaklam1324 Jun 4, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mark-i-m commented Jun 3, 2018

burdges commented Jun 3, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sfackler commented Jun 3, 2018

mark-i-m commented Jun 4, 2018

shingtaklam1324 commented Jun 4, 2018

pyfisch commented Jun 4, 2018 • edited Loading

shingtaklam1324 commented Jun 4, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

est31 commented Jun 4, 2018

Choose a reason for hiding this comment

SimonSapin commented Jun 4, 2018

Choose a reason for hiding this comment

joshtriplett commented Jun 4, 2018 • edited Loading

est31 commented Jun 4, 2018

SimonSapin commented Jun 4, 2018

joshtriplett commented Jun 4, 2018

SimonSapin commented Jun 4, 2018 • edited Loading

est31 commented Jun 4, 2018

nox commented Oct 20, 2018

Manishearth commented Oct 20, 2018

strega-nil commented Oct 21, 2018 • edited Loading

Centril commented Oct 21, 2018

Manishearth commented Oct 21, 2018

rfcbot commented Oct 29, 2018

Centril commented Oct 29, 2018

earthengine commented Nov 1, 2018 • edited Loading

Tom-Phinney commented Nov 1, 2018

CAD97 commented Dec 22, 2021

bjorn3 commented Dec 22, 2021

crlf0710 commented Jan 24, 2022

pyfisch commented Jun 3, 2018 •

edited by Centril

Loading

pyfisch Jun 3, 2018 •

edited

Loading

shingtaklam1324 Jun 4, 2018 •

edited

Loading

pyfisch commented Jun 4, 2018 •

edited

Loading

joshtriplett commented Jun 4, 2018 •

edited

Loading

SimonSapin commented Jun 4, 2018 •

edited

Loading

strega-nil commented Oct 21, 2018 •

edited

Loading

earthengine commented Nov 1, 2018 •

edited

Loading