Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") #28979
Comments
/cc @rust-lang/lang |
nominating |
cc @SimonSapin Apparently we implement this: http://www.unicode.org/reports/tr31/ or something like it. I would like to see this stabilised, but it will take some work to persuade ourselves that we are doing the right thing. |
I have no idea what the right thing is here. In addition to Unicode recommendations, we might want to look at what other languages actually do, and what related bug reports or criticism they get. Or was this already done when the feature was first introduced? |
@SimonSapin There's also a problem with normalization of identifiers and mapping unicode |
Yes, #2253 is the big issue I know of that makes me worry about premature stabilization of non-ASCII identifiers. (The discussion there is broader and arguably could be forked into two threads; e.g. we could take one normalization path for identifiers and another for string literal contents.) |
We may want to migrate this discussion to the RFCs repo, e.g. at rust-lang/rfcs#802 |
I agree that this is a feature that deserves to be put through the RFC process. |
I've repurposed this issue to track stabilization (or deprecation, etc) of the |
After discussion in the lang team meeting, we decided that yes, an RFC would be the proper way forward here. We need something that collects the solutions from other languages, analyzes their pros/cons, and suggests the appropriate choice for Rust. This is controversial and complex enough that it should be brought to the community at large -- especially as many of us hacking on Rust on a daily basis don't have a lot of experience with non-ASCII anyhow. |
triage: P-low Marking as low as there is no RFC at present and hence no actionable content. |
cc #7539 |
In JavaScript, Perl 5 and Perl 6 this feature is available.
JavaScript:
function Слово(стойност) {
    this.стойност = стойност;
}
var здрасти = new Слово("Здравей, свят");
console.log(здрасти.стойност) //Здравей, свят
Perl >=5.12:
use utf8;
use feature 'say'; # say is not enabled by default
binmode(STDOUT, ':encoding(UTF-8)'); # print non-ASCII output as UTF-8
{
    package Слово;
    sub new {
        my $self = bless {}, shift;
        $self->{стойност} = shift;
        $self
    }
};
my $здрасти = Слово->new("здравей, свят");
say ucfirst($здрасти->{стойност}); #Здравей, свят
Perl6 (this is not just the next version of Perl; it is a new language):
class Слово {
    has $.стойност;
}
my $здрасти = Слово.new(стойност => 'здравей, свят');
say $здрасти.стойност.tc; #Здравей, свят
I would be happy to see it in Rust too. |
For what it’s worth identifiers in ECMAScript 2015 are based on the Default Identifier Syntax from Unicode Standard Annex #31. Perl with
|
Yes! Thanks @SimonSapin! |
For Python it’s So it looks like many programming languages that allow non-ASCII identifiers are based on the same standard, but in the details they each do something slightly different… |
I would personally love to see support for math-related identifiers. For example, ∅ (and set operators, like ∩ and ∪). Translating equations from research papers/specifications into code is often a terrible process resulting in verbose and difficult to read code. Being able to use the same identifiers in the code that are in the paper's math equations would simplify implementation and would make the code easier to check and compare against the paper's equations. |
What's the point of this feature exactly? Aside from adding the possibility of creating a truly ugly mix of different languages in your code (English is the only truly international language), it gives no benefit to the language functionality-wise. Or is it support for Unicode for the sake of supporting Unicode? |
I'd like to cross-link this comment: #4928 (comment) |
I haven't seen the possibility of enabling homoglyph-based attacks mentioned here (if somebody has mentioned them, please ignore the noise), but I just filed a clippy issue to request a lint that warns on code like this:
#![feature(non_ascii_idents)]
fn main() {
let a = 2;
let а = 3;
assert_eq!(a, 2); // OK
assert_eq!(а, 3); // OK
}
In a nutshell, those two This "feature" can be used to introduce exploits in Rust programs that are harder to detect, in particular given that shadowing let bindings is considered idiomatic Rust by many, myself included. P.S.: this "feature" might be useful in underhanded Rust contests, although that |
@gnzlbg I believe there's already some support for confusables detection to stop people swapping out your semicolons for Greek question marks and such, but I don't know if it applies to identifiers. If it does, then that solves that problem; if it doesn't, at least we have the tooling to do it ready to go. I'm a little concerned that this is a candidate for being closed and the code removed from the compiler because it's not had significant movement for a while and requires an RFC. I care a fair amount about Rust being a language of the 21st century, which means Unicode, and about Rust being friendly to non-English-speaking programmers. What I lack is the ability to actually write an RFC. |
Yes, I think that if, as suggested by @oli-obk in the clippy issue, the Rust implementation instead just used the latest official confusables list (http://www.unicode.org/Public/security/revision-06/confusables.txt), homoglyph-based attacks could be prevented. This list would need to be kept in sync, but that is something that can be automated as part of the build system. |
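To make the confusables idea concrete, here is a minimal sketch of a skeleton-based check. It assumes a tiny hand-picked table of five Cyrillic look-alikes standing in for the real confusables data, and the names `confusable_table`, `skeleton` and `are_confusable` are hypothetical, not compiler or clippy APIs. The full algorithm in the Unicode security report also normalizes its input, which this sketch skips.

```rust
use std::collections::HashMap;

/// Tiny illustrative stand-in for the Unicode confusables data.
/// Assumption: a real implementation would generate this table from confusables.txt.
fn confusable_table() -> HashMap<char, char> {
    HashMap::from([
        ('\u{0430}', 'a'), // CYRILLIC SMALL LETTER A
        ('\u{0435}', 'e'), // CYRILLIC SMALL LETTER IE
        ('\u{043E}', 'o'), // CYRILLIC SMALL LETTER O
        ('\u{0440}', 'p'), // CYRILLIC SMALL LETTER ER
        ('\u{0441}', 'c'), // CYRILLIC SMALL LETTER ES
    ])
}

/// Replace every character by its look-alike prototype; unknown characters are kept.
fn skeleton(ident: &str, table: &HashMap<char, char>) -> String {
    ident
        .chars()
        .map(|c| table.get(&c).copied().unwrap_or(c))
        .collect()
}

/// Two identifiers are flagged when they differ as strings but share a skeleton.
fn are_confusable(a: &str, b: &str, table: &HashMap<char, char>) -> bool {
    a != b && skeleton(a, table) == skeleton(b, table)
}

fn main() {
    let table = confusable_table();
    // Latin "a" versus Cyrillic "а" (U+0430), as in the snippet above.
    println!("{}", are_confusable("a", "\u{0430}", &table));
}
```

A lint built this way would compare each new binding against the identifiers already in scope and warn on a skeleton collision.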
If you care about this, there are other languages that support unicode in their identifiers, and these languages have processes similar to the RFC process. You could start by checking those. Who knows, maybe you can just merge them together with the feedback in this issue, and get a pre-RFC in the internals forum going? From that point on, it is just about incorporating/arguing feedback with others, and before you know it you will have an RFC ready. |
Fix grammar documentation wrt Unicode identifiers The grammar defines identifiers in terms of XID_start and XID_continue, but this is referring to the unstable non_ascii_idents feature. The documentation implies that non_ascii_idents is forthcoming, but this is left over from pre-1.0 documentation; in reality, non_ascii_idents has been without even an RFC for several years now, and will not be stabilized anytime soon. Furthermore, according to the tracking issue at rust-lang#28979 , it's highly questionable whether or not this feature will use XID_start or XID_continue even when or if non_ascii_idents is stabilized. This commit fixes this by respecifying identifiers as the usual [a-zA-Z_][a-zA-Z0-9_]*
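As a rough illustration of the ASCII-only rule that commit describes, the pattern [a-zA-Z_][a-zA-Z0-9_]* can be checked by hand as below. The function name `is_ascii_ident` is hypothetical, and the sketch follows the pattern literally, so it does not account for keywords or for the special status of a lone `_` in Rust.

```rust
/// Returns true if `s` matches [a-zA-Z_][a-zA-Z0-9_]*.
fn is_ascii_ident(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // First character: ASCII letter or underscore.
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {
            // Remaining characters: ASCII letters, digits or underscores.
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        // Empty string or an invalid first character.
        _ => false,
    }
}

fn main() {
    for candidate in ["foo_1", "_tmp", "1abc", "", "стойност"] {
        println!("{:?} -> {}", candidate, is_ascii_ident(candidate));
    }
}
```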
In a way I hope we stick with ASCII identifiers forever. Handling Unicode identifiers is such a massive interoperability pain. One of the more bizarre consequences of NFKC mapping is that things like this map to the same identifier:
>>> ℌ = 1
>>> H
1
>>> Ⅸ = 42
>>> IX
42
>>> ℕ = 23
>>> N
23
>>> import math
>>> ℯ = math.e
>>> e
2.718281828459045
>>> ℨ = 2
>>> Z
2 |
@mitsuhiko The real world has that kind of pain. We can't just ignore this problem because it's hard to deal with and involves a feature that you personally have no use for. |
Also, the current RFC explicitly proposes NFC over NFKC, after a lot of discussion about examples very similar to those. |
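The difference between the two normalization choices can be sketched with the five characters from the Python session above. The function `nfkc_fold_sample` is a hypothetical stand-in that hard-codes only those five compatibility mappings; real NFKC needs the full Unicode data tables, which the Rust standard library does not ship. Under NFC these characters are left unchanged, so the identifiers stay distinct.

```rust
/// Applies only the five compatibility mappings shown in the Python session.
/// Assumption: illustrative subset, not a real NFKC implementation.
fn nfkc_fold_sample(ident: &str) -> String {
    let mut out = String::new();
    for c in ident.chars() {
        match c {
            '\u{210C}' => out.push('H'),      // BLACK-LETTER CAPITAL H
            '\u{2168}' => out.push_str("IX"), // ROMAN NUMERAL NINE
            '\u{2115}' => out.push('N'),      // DOUBLE-STRUCK CAPITAL N
            '\u{212F}' => out.push('e'),      // SCRIPT SMALL E
            '\u{2128}' => out.push('Z'),      // BLACK-LETTER CAPITAL Z
            other => out.push(other),
        }
    }
    out
}

fn main() {
    let fancy = "\u{2115}";
    let plain = "N";
    // Without folding (as under NFC) the two names are different strings.
    println!("distinct as written: {}", fancy != plain);
    // With compatibility folding (as under NFKC) they collide.
    println!("equal after folding: {}", nfkc_fold_sample(fancy) == plain);
}
```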
Closing in favor of #55467. |
Update references to closed issue Issue rust-lang#28979 was closed with a link to rust-lang#55467.
It's 2021, and we should be more inclusive in language design and in what's allowed in identifiers. Coming from the Python world, Python 3's support for Unicode identifiers, function names and module names is truly great progress, and I wish the same for the Rust community. |
@arcturusannamalai please see the final comment here, this work is still ongoing, and including that is the plan. |
CPython 3's support for non-ASCII identifiers is pretty spotty, variable and hard to determine. See https://tjol.eu/blog/unicode-identifiers.html One interesting problem is that with Unicode changes between versions, valid identifiers in Python source code can become invalid. |
I'm actually comfortable programming in 1s and 0s too; I just happen to want non-ASCII. But English suffices for systems-level work. |
Just to be clear, this feature landed in stable Rust in 1.53.0, almost two years ago #83799 |
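For illustration, the JavaScript example from earlier in the thread can be written in stable Rust along these lines. This is a sketch that assumes a toolchain at 1.53.0 or later; no feature gate is needed.

```rust
// Non-ASCII identifiers for a type, a field, a parameter and a local variable.
struct Слово {
    стойност: String,
}

impl Слово {
    fn new(стойност: &str) -> Self {
        Слово {
            стойност: стойност.to_string(),
        }
    }
}

fn main() {
    let здрасти = Слово::new("Здравей, свят");
    println!("{}", здрасти.стойност);
}
```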
Non-ASCII identifiers are currently feature gated. Handling of them should be fixed and the feature gate removed.