[libstd_unicode] Change UNICODE_VERSION to use u32 #42998

behnam · 2017-06-30T23:29:35Z

Looks like there's no strong reason to keep these values at u64.

With the current plans for the Unicode Standard, u8 should be enough for the next 200 years. To stay on the safe side, I'm using u16 here. I don't see a reason to go with anything machine-dependent/more-efficient.

rust-highfive · 2017-06-30T23:29:49Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

behnam · 2017-06-30T23:30:16Z

CC'ing from file's blame/history: @petrochenkov, @cuviper, @alexcrichton, @SimonSapin

cuviper · 2017-06-30T23:43:19Z

It's public as std::char::UNICODE_VERSION, so wouldn't this be a breaking change?

cuviper · 2017-06-30T23:44:25Z

Oh, but it's not stable...

behnam · 2017-06-30T23:44:36Z

It's public as std::char::UNICODE_VERSION, so wouldn't this be a breaking change?

Yep, gated: 🔬 This is a nightly-only experimental API. (unicode #27783)

retep998 · 2017-07-01T01:37:31Z

Unicode thought u16 would be enough to fit any character. A few years later they realized how horribly wrong they were and we ended up with the mess that is UTF-16 surrogates. I wonder if we'll ever look back at this decision in the future and realize u16 wasn't enough to store any version.

behnam · 2017-07-01T01:43:23Z

I wonder if we'll ever look back at this decision in the future and realize u16 wasn't enough to store any version.

That's why I'm going with u16 here, instead of u64.

Also, if anyone's curious, semver is also going withe u64: https://github.com/steveklabnik/semver-parser/blob/master/src/version.rs#L10

pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
    pub pre: Vec<Identifier>,
    pub build: Vec<Identifier>,
}

I don't know if that's a concern in this context or not.

cuviper · 2017-07-01T02:19:03Z

What's the benefit to narrowing from u64? I don't see the point of worrying about a few bytes in a singleton object.

behnam · 2017-07-01T02:34:15Z

I want to use the same type for all unicode-version-based types in UNIC, which are not singleton and would go into a table. I think we're going to optimize those mappings tables eventually (with lookup tables, PHF or something similar), but I wanted to not have tables of size 4-8 times what's needed without the optimization.

petrochenkov · 2017-07-01T08:20:49Z

Also, if anyone's curious, semver is also going withe u64

Ah, right, that was likely the reason!
(Semver was in the standard library back then.)

bors · 2017-07-03T19:38:54Z

☔ The latest upstream changes (presumably #42999) made this pull request unmergeable. Please resolve the merge conflicts.

behnam · 2017-07-03T22:09:11Z

Updated patch:

Create UnicodeVersion tuple struct type to use for UNICODE_VERSION
with helper pub API, to make the type and value more useful in
third-party libraries.
Use u16 for version components, as u64 is just an overkill.
There's no expectation for Unicode Versions to even reach one thousand
in the next hundered years. (This is different from package versions,
which may become something auto-generated and exceed human-friendly
range of integer values.)

Here's the API:

/// Type for Unicode Version.
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
pub struct UnicodeVersion(
    pub u16, // Major version
    pub u16, // Minor version
    pub u16 // Micro (or Update) version
);

impl UnicodeVersion {
    /// Major version
    pub fn major(&self) -> u16 {
        self.0
    }

    /// Minor version
    pub fn minor(&self) -> u16 {
        self.1
    }

    /// Micro (or Update) version
    pub fn micro(&self) -> u16 {
        self.2
    }
}

Besides notes mentioned in the commit message (copies above), here's another reason I like to see this change:

We've been improving the API for the UCD Age Property in UNIC, and we landed on this design, which reused UnicodeVersion in the Assigned variant: open-i18n/rust-unic#23 (comment)

Having a more user-friendly UnicodeVersion type here allows third-party libraries, like UNIC, to reuse the type/concept and not have to redefine a new one.

And this would become possible without writing extra from/into conversions:

assert_eq!(
    unic::UNICODE_VERSION,
    std::char::UNICODE_VERSION
);

Although it's possible to do the same with an unnamed tuple type, the API is not going to be user-friendly, because users would have to use .0/.1/.2, or write from/into conversions to local types.

So, I would put this suggestion in the make-rust-more-user-friendly category.

What do you think?

SimonSapin · 2017-07-03T23:03:18Z

Trivial accessors for public fields seem fairly pointless. If we do add a new type, why not use a not-tuple-like struct with named fields?

Regardless, I think the Unicode version number is not worth stabilizing new APIs beyond a tuple constant.

brson · 2017-07-04T00:46:10Z

Trivial accessors for public fields seem fairly pointless. If we do add a new type, why not use a not-tuple-like struct with named fields?

Regardless, I think the Unicode version number is not worth stabilizing new APIs beyond a tuple constant.

I agree with both these points. Public fields + accessors is not a pattern we use elsewhere. Either the type encapsulates its details or not.

Can tuple-structs have named fields?

SimonSapin · 2017-07-04T00:57:08Z

Can tuple-structs have named fields?

Not naming the fields is what makes a tuple-struct not be just a struct.

behnam · 2017-07-04T01:30:42Z

I believe the only problem with a named-struct here the verbose construction: UnicodeVersion { major: 10 , minor: 0, micro: 0 } instead of UnicodeVersion(10, 0, 0).

(At least that's the reason I kept it as tuple-struct. I wish there was a way to have a nameless static constructors for named-structs.)

Agree with all the points. So, is it obvious that we don't want a named-struct here?

Also, @SimonSapin, what do you think about the integer size, should we keep u64, or can we change to u16?

SimonSapin · 2017-07-04T01:45:13Z

I don’t have much of an opinion on the integer size. I vaguely remember reading that Unicode is now planning on a new version every year, this leaves many years before we’re close to overflowing u16.

Regarding a nameless constructor, the convention is to call it new.

But yes, more strongly than the integer size, I feel that there shouldn’t be a UnicodeVersion type in std which only supports one version of Unicode. Every item (function, method, type, impl, …) that we stabilize in std has a long-term cost: we’ll have to maintain them basically forever. In that regard, a (u16, u16, u16) constant is much "lighter" than a constant + a dedicated type (which will need to be documented) + seven trait impls. If you define that type in UNIC it’ll be much easier for you to evolve it later.

behnam · 2017-07-04T09:45:54Z

Cool! So, looks like I won't be re-using this type anywhere. With that, there's almost no gain in reducing the size to u16. So, unless there's any interest to go ahead with the size change, I think we can close this PR.

behnam · 2017-07-05T19:21:21Z

Updated the PR to only change the integer type. So, we can either land or just let it be and close the PR. Not a big deal either way.

Thanks everyone for the feedback! :)

ishitatsuyuki · 2017-07-08T06:46:36Z

I think that it's possible (if not likely) to use some ridiculous version naming in the future, such as the Ubuntu style. As this doesn't impose any performance drawbacks, I think it's fine to keep it u64, just in the case that the version changes into something like 20170401.0.0.

sfackler · 2017-07-11T15:17:50Z

Why does this constant exist in the first place? It doesn't seem to be used anywhere in this repository at least.

SimonSapin · 2017-07-11T15:47:28Z

@sfackler I added because the information should be available somewhere of what Unicode version is included in each Rust version for the purpose of std APIs like to_lowercase that are based on Unicode tables. Though I don’t really know of a use case to use it from Rust code, so as far as I’m concerned it could also be a generated doc-comment.

sfackler · 2017-07-11T19:34:58Z

Yeah, sticking that in the crate/module level docs seems reasonable to me.

SimonSapin · 2017-07-11T19:47:10Z

Unfortunately, both include!("generated.rs"); with an outer doc-comment in foo.rs and #![doc = include_str!("generated.txt")] cause compile errors.

behnam · 2017-07-11T19:56:53Z

I think it's actually useful to have this value available in the API, as it allows third-party libraries and their users to make assertions, or at least throw warnings when a version mismatch exists.

Also, this value, along with the character Age property (unic::ucd::age impl) allow users to know if some character is supported by standard lib functions, or not. Emojis, for example, are a common fast-growing sets of characters that are usually hard to handle in some platforms because of this data missing.

I have talked a bit about these issues in this UNIC's Unicode and Rust doc.

Create named struct `UnicodeVersion` to use instead of tuple type for `UNICODE_VERSION` value. This allows user to access the fields with meaningful field names: `major`, `minor`, and `micro`. Per request, an empty private field is added to the struct, so it can be extended in the future without API breakage.

behnam · 2017-07-24T19:29:11Z

@sfackler, I have updated the PR and added the UnicodeVersion type as an uninstantiable struct.

And, based on the discussions here, and looking at how other systems deal with this issue, I have put together a proposal for the UTC to define and stabilize the format for Unicode Version. Here's the submission: https://docs.google.com/document/d/1F5ysN477tzz8ZOl5DP40al1KLqnHpPEfiENmXDEbVsY/. I'll be glad to hear your feedback and address them before the meeting next week, when (most probably) it will be discussed.

Also, I think it's a good idea to wait a few more days on this PR until UTC reviews the submission. We may finally have an answer to the integer-size question here, after all.

sfackler · 2017-07-26T16:09:09Z

This looks good to me - up to you if we r+ this now or wait a bit.

behnam · 2017-07-26T16:29:05Z

Thanks, @sfackler. Let's wait for a week or two, then. I'll post an update then.

oli-obk · 2017-08-05T06:17:09Z

This PR has been without comments for a week. I guess the informal RFC period is over? ^^

behnam · 2017-08-05T06:57:44Z

This PR has been without comments for a week. I guess the informal RFC period is over? ^^

We've been waiting for a review of the proposal to specify and stabilize the value format for Unicode version by the UTC.

Now, looks like there was no time to get to this proposal at this meeting. With that, I think we can land this diff as it is, and refine more (e.g. make the type instantiable) on further development on the standard side.

r? @sfackler

behnam · 2017-08-07T20:36:40Z

label- S-waiting-on-author
label+ S-waiting-on-review

sfackler · 2017-08-08T04:29:47Z

@bors r+

bors · 2017-08-08T04:29:48Z

📌 Commit 42f8861 has been approved by sfackler

bors · 2017-08-08T06:48:50Z

⌛ Testing commit 42f8861 with merge d69cdca...

[libstd_unicode] Change UNICODE_VERSION to use u32 Looks like there's no strong reason to keep these values at `u64`. With the current plans for the Unicode Standard, `u8` should be enough for the next 200 years. To stay on the safe side, I'm using `u16` here. I don't see a reason to go with anything machine-dependent/more-efficient.

bors · 2017-08-08T09:16:49Z

☀️ Test successful - status-appveyor, status-travis
Approved by: sfackler
Pushing d69cdca to master...

In <rust-lang#42998>, we added an uninstantiable type for the internal `UNICODE_VERSION` value, `UnicodeVersion`, but it was not made public to the outside of the crate, resulting in the value becoming less useful. Here we make the type accessible from the outside. Also add a run-pass test to make sure the type and value can be accessed as intended.

[libstd_unicode] Expose UnicodeVersion type In <rust-lang#42998>, we added an uninstantiable type for the internal `UNICODE_VERSION` value, `UnicodeVersion`, but it was not made public to the outside of the crate, resulting in the value becoming less useful. Here we make the type accessible from the outside. Also add a run-pass test to make sure the type and value can be accessed as intended.

rust-highfive assigned brson Jun 30, 2017

behnam mentioned this pull request Jun 30, 2017

[libstd_unicode] Upgrade to Unicode 10.0.0 #42999

Merged

Mark-Simulacrum added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 2, 2017

behnam force-pushed the uni-ver-type branch from 648fe0d to 577408d Compare July 3, 2017 21:47

behnam force-pushed the uni-ver-type branch from 577408d to ae49221 Compare July 5, 2017 19:19

behnam force-pushed the uni-ver-type branch from 5844b94 to 42f8861 Compare July 21, 2017 18:09

sfackler added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jul 24, 2017

carols10cents added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 31, 2017

clarfonthey mentioned this pull request Aug 5, 2017

Lint tuple structs that could be named-field structs rust-lang/rust-clippy#1926

Open

Mark-Simulacrum added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 8, 2017

bors merged commit 42f8861 into rust-lang:master Aug 8, 2017

behnam deleted the uni-ver-type branch August 30, 2017 20:51

behnam mentioned this pull request Sep 19, 2017

[libstd_unicode] Expose UnicodeVersion type #44689

Merged

SimonSapin mentioned this pull request Apr 6, 2018

Tracking issue for UnicodeVersion and UNICODE_VERSION #49726

Closed

[libstd_unicode] Change UNICODE_VERSION to use u32 #42998

[libstd_unicode] Change UNICODE_VERSION to use u32 #42998

Conversation

behnam commented Jun 30, 2017

rust-highfive commented Jun 30, 2017

behnam commented Jun 30, 2017

cuviper commented Jun 30, 2017

cuviper commented Jun 30, 2017

behnam commented Jun 30, 2017 • edited Loading

retep998 commented Jul 1, 2017

behnam commented Jul 1, 2017

cuviper commented Jul 1, 2017

behnam commented Jul 1, 2017

petrochenkov commented Jul 1, 2017

bors commented Jul 3, 2017

behnam commented Jul 3, 2017

SimonSapin commented Jul 3, 2017

brson commented Jul 4, 2017

SimonSapin commented Jul 4, 2017

behnam commented Jul 4, 2017 • edited Loading

SimonSapin commented Jul 4, 2017

behnam commented Jul 4, 2017

behnam commented Jul 5, 2017

ishitatsuyuki commented Jul 8, 2017

sfackler commented Jul 11, 2017

SimonSapin commented Jul 11, 2017

sfackler commented Jul 11, 2017

SimonSapin commented Jul 11, 2017

behnam commented Jul 11, 2017

behnam commented Jul 24, 2017

sfackler commented Jul 26, 2017

behnam commented Jul 26, 2017

oli-obk commented Aug 5, 2017

behnam commented Aug 5, 2017

behnam commented Aug 7, 2017

sfackler commented Aug 8, 2017

bors commented Aug 8, 2017

bors commented Aug 8, 2017

bors commented Aug 8, 2017

behnam commented Jun 30, 2017 •

edited

Loading

behnam commented Jul 4, 2017 •

edited

Loading