[libstd_unicode] Change UNICODE_VERSION to use u32 #42998
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
CC'ing from file's blame/history: @petrochenkov, @cuviper, @alexcrichton, @SimonSapin |
It's public as |
Oh, but it's not stable... |
Yep, gated: |
Unicode thought |
That's why I'm going with Also, if anyone's curious, semver is also going with
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
    pub pre: Vec<Identifier>,
    pub build: Vec<Identifier>,
}
I don't know if that's a concern in this context or not. |
What's the benefit to narrowing from |
I want to use the same type for all unicode-version-based types in UNIC, which are not singleton and would go into a table. I think we're going to optimize those mappings tables eventually (with lookup tables, PHF or something similar), but I wanted to not have tables of size 4-8 times what's needed without the optimization. |
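The "4-8 times" figure can be checked with a small sketch. It assumes each table entry stores one three-component version and nothing else; `version_bytes` is a hypothetical helper for illustration, not part of UNIC or the standard library.

```rust
use std::mem::size_of;

/// Bytes needed to store three version components of integer type `T`.
/// Arrays have no padding between elements, so `[T; 3]` occupies
/// exactly `3 * size_of::<T>()` bytes.
fn version_bytes<T>() -> usize {
    size_of::<[T; 3]>()
}

fn main() {
    println!("u64 components: {} bytes", version_bytes::<u64>());
    println!("u16 components: {} bytes", version_bytes::<u16>());
    println!("u8 components:  {} bytes", version_bytes::<u8>());
}
```

Under that assumption, `u64` components take four times the space of `u16` components and eight times the space of `u8` components, which matches the range quoted above.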
Ah, right, that was likely the reason! |
☔ The latest upstream changes (presumably #42999) made this pull request unmergeable. Please resolve the merge conflicts. |
Updated patch:
Here's the API:
/// Type for Unicode Version.
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
pub struct UnicodeVersion(
    pub u16, // Major version
    pub u16, // Minor version
    pub u16 // Micro (or Update) version
);
impl UnicodeVersion {
    /// Major version
    pub fn major(&self) -> u16 {
        self.0
    }
    /// Minor version
    pub fn minor(&self) -> u16 {
        self.1
    }
    /// Micro (or Update) version
    pub fn micro(&self) -> u16 {
        self.2
    }
}
Besides the notes mentioned in the commit message (copied above), here's another reason I'd like to see this change: We've been improving the API for the UCD Age Property in UNIC, and we landed on this design, which reused Having a more user-friendly And this would become possible without writing extra from/into conversions:
Although it's possible to do the same with an unnamed tuple type, the API is not going to be user-friendly, because users would have to use So, I would put this suggestion in the make-rust-more-user-friendly category. What do you think? |
Trivial accessors for public fields seem fairly pointless. If we do add a new type, why not use a not-tuple-like struct with named fields? Regardless, I think the Unicode version number is not worth stabilizing new APIs beyond a tuple constant. |
I agree with both these points. Public fields + accessors is not a pattern we use elsewhere. Either the type encapsulates its details or not. Can tuple-structs have named fields? |
Not naming the fields is what makes a tuple-struct not be just a struct. |
I believe the only problem with a named-struct here is the verbose construction: (At least that's the reason I kept it as a tuple-struct. I wish there was a way to have nameless static constructors for named-structs.) Agree with all the points. So, is it obvious that we don't want a named-struct here? Also, @SimonSapin, what do you think about the integer size, should we keep |
I don’t have much of an opinion on the integer size. I vaguely remember reading that Unicode is now planning on a new version every year, this leaves many years before we’re close to overflowing Regarding a nameless constructor, the convention is to call it But yes, more strongly than the integer size, I feel that there shouldn’t be a |
Cool! So, looks like I won't be re-using this type anywhere. With that, there's almost no gain in reducing the size to |
Updated the PR to only change the integer type. So, we can either land or just let it be and close the PR. Not a big deal either way. Thanks everyone for the feedback! :) |
I think that it's possible (if not likely) that some ridiculous version naming will be used in the future, such as the Ubuntu style. As this doesn't impose any performance drawbacks, I think it's fine to keep it u64, just in case the version changes into something like |
Why does this constant exist in the first place? It doesn't seem to be used anywhere in this repository at least. |
@sfackler I added it because the information about which Unicode version is included in each Rust version should be available somewhere, for the purpose of |
Yeah, sticking that in the crate/module level docs seems reasonable to me. |
Unfortunately, both |
I think it's actually useful to have this value available in the API, as it allows third-party libraries and their users to make assertions, or at least emit warnings, when a version mismatch exists. Also, this value, along with the character Age property (unic::ucd::age impl), allows users to know whether some character is supported by standard lib functions or not. Emojis, for example, are a common, fast-growing set of characters that are usually hard to handle on some platforms because this data is missing. I have talked a bit about these issues in this UNIC's Unicode and Rust doc. |
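A minimal sketch of the kind of check described above, assuming versions are plain `(major, minor, micro)` tuples. The `Version` alias and the `supports` function are hypothetical names for illustration; looking up a character's actual age (for example via UNIC) is not shown.

```rust
/// Hypothetical alias: (major, minor, micro).
type Version = (u32, u32, u32);

/// Returns true if a character introduced in Unicode `char_age` is covered
/// by a library built against Unicode `lib_version`.
/// Tuples compare lexicographically, so major is compared first,
/// then minor, then micro.
fn supports(lib_version: Version, char_age: Version) -> bool {
    char_age <= lib_version
}

fn main() {
    let lib_version: Version = (10, 0, 0); // illustrative value only
    println!("age 9.0.0 supported:  {}", supports(lib_version, (9, 0, 0)));
    println!("age 11.0.0 supported: {}", supports(lib_version, (11, 0, 0)));
}
```

The same comparison works for a struct that derives `Ord` with its fields declared in major, minor, micro order.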
Create named struct `UnicodeVersion` to use instead of tuple type for `UNICODE_VERSION` value. This allows users to access the fields with meaningful field names: `major`, `minor`, and `micro`. Per request, an empty private field is added to the struct, so it can be extended in the future without API breakage.
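The design in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the module name `unicode`, the private field name `_priv`, the `u32` field width, and the version value `10.0.0` are all chosen here for the example.

```rust
mod unicode {
    /// Named-field version type. The private `_priv` field means code
    /// outside this module cannot build the struct with a literal, so
    /// fields can be added later without breaking callers.
    #[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
    pub struct UnicodeVersion {
        pub major: u32,
        pub minor: u32,
        pub micro: u32,
        _priv: (),
    }

    /// Illustrative value only.
    pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {
        major: 10,
        minor: 0,
        micro: 0,
        _priv: (),
    };
}

fn main() {
    let v = unicode::UNICODE_VERSION;
    println!("Unicode {}.{}.{}", v.major, v.minor, v.micro);
    // Outside `mod unicode`, a struct literal does not compile because
    // `_priv` is private:
    // let w = unicode::UnicodeVersion { major: 11, minor: 0, micro: 0, _priv: () };
}
```

Callers read the fields by name, while construction stays inside the defining module, which is the trade-off the thread settles on between the tuple-struct and a fully public struct.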
@sfackler, I have updated the PR and added the And, based on the discussions here, and looking at how other systems deal with this issue, I have put together a proposal for the UTC to define and stabilize the format for Unicode Version. Here's the submission: https://docs.google.com/document/d/1F5ysN477tzz8ZOl5DP40al1KLqnHpPEfiENmXDEbVsY/. I'll be glad to hear your feedback and address them before the meeting next week, when (most probably) it will be discussed. Also, I think it's a good idea to wait a few more days on this PR until UTC reviews the submission. We may finally have an answer to the integer-size question here, after all. |
This looks good to me - up to you if we r+ this now or wait a bit. |
Thanks, @sfackler. Let's wait for a week or two, then. I'll post an update then. |
This PR has been without comments for a week. I guess the informal RFC period is over? ^^ |
We've been waiting for a review of the proposal to specify and stabilize the value format for Unicode version by the UTC. Now, looks like there was no time to get to this proposal at this meeting. With that, I think we can land this diff as it is, and refine more (e.g. make the type instantiable) on further development on the standard side. r? @sfackler |
label- S-waiting-on-author |
@bors r+ |
📌 Commit 42f8861 has been approved by |
[libstd_unicode] Change UNICODE_VERSION to use u32 Looks like there's no strong reason to keep these values at `u64`. With the current plans for the Unicode Standard, `u8` should be enough for the next 200 years. To stay on the safe side, I'm using `u16` here. I don't see a reason to go with anything machine-dependent/more-efficient.
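The "next 200 years" estimate can be reproduced with a rough calculation. It assumes one new major version per year, as mentioned earlier in the thread, and a current major version of 10 (an assumption made for this sketch); `releases_until_overflow` is a hypothetical helper.

```rust
/// Number of further major releases that fit before a counter with the
/// given maximum value overflows, starting from `current_major`.
fn releases_until_overflow(current_major: u64, max_value: u64) -> u64 {
    max_value.saturating_sub(current_major)
}

fn main() {
    let current_major = 10; // assumed starting point for the estimate
    // With one major release per year, each count is also a number of years.
    println!("u8:  {}", releases_until_overflow(current_major, u8::MAX as u64));
    println!("u16: {}", releases_until_overflow(current_major, u16::MAX as u64));
    println!("u32: {}", releases_until_overflow(current_major, u32::MAX as u64));
}
```

Under these assumptions `u8` leaves room for 245 more major releases, so any of the wider types discussed here is comfortably sufficient for the major component.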
☀️ Test successful - status-appveyor, status-travis |
[libstd_unicode] Expose UnicodeVersion type In <rust-lang#42998>, we added an uninstantiable type for the internal `UNICODE_VERSION` value, `UnicodeVersion`, but it was not made public to the outside of the crate, resulting in the value becoming less useful. Here we make the type accessible from the outside. Also add a run-pass test to make sure the type and value can be accessed as intended.