Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libstd_unicode] Change UNICODE_VERSION to use u32 #42998

Merged
merged 2 commits into from
Aug 8, 2017

Conversation

behnam
Copy link
Contributor

@behnam behnam commented Jun 30, 2017

Looks like there's no strong reason to keep these values at u64.

With the current plans for the Unicode Standard, u8 should be enough for the next 200 years. To stay on the safe side, I'm using u16 here. I don't see a reason to go with anything machine-dependent/more-efficient.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@behnam
Copy link
Contributor Author

behnam commented Jun 30, 2017

CC'ing from file's blame/history: @petrochenkov, @cuviper, @alexcrichton, @SimonSapin

@cuviper
Copy link
Member

cuviper commented Jun 30, 2017

It's public as std::char::UNICODE_VERSION, so wouldn't this be a breaking change?

@cuviper
Copy link
Member

cuviper commented Jun 30, 2017

Oh, but it's not stable...

@behnam
Copy link
Contributor Author

behnam commented Jun 30, 2017

It's public as std::char::UNICODE_VERSION, so wouldn't this be a breaking change?

Yep, gated: 🔬 This is a nightly-only experimental API. (unicode #27783)

@retep998
Copy link
Member

retep998 commented Jul 1, 2017

Unicode thought u16 would be enough to fit any character. A few years later they realized how horribly wrong they were and we ended up with the mess that is UTF-16 surrogates. I wonder if we'll ever look back at this decision in the future and realize u16 wasn't enough to store any version.

@behnam
Copy link
Contributor Author

behnam commented Jul 1, 2017

I wonder if we'll ever look back at this decision in the future and realize u16 wasn't enough to store any version.

That's why I'm going with u16 here, instead of u64.

Also, if anyone's curious, semver is also going withe u64: https://github.com/steveklabnik/semver-parser/blob/master/src/version.rs#L10

pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
    pub pre: Vec<Identifier>,
    pub build: Vec<Identifier>,
}

I don't know if that's a concern in this context or not.

@cuviper
Copy link
Member

cuviper commented Jul 1, 2017

What's the benefit to narrowing from u64? I don't see the point of worrying about a few bytes in a singleton object.

@behnam
Copy link
Contributor Author

behnam commented Jul 1, 2017

I want to use the same type for all unicode-version-based types in UNIC, which are not singleton and would go into a table. I think we're going to optimize those mappings tables eventually (with lookup tables, PHF or something similar), but I wanted to not have tables of size 4-8 times what's needed without the optimization.

@petrochenkov
Copy link
Contributor

Also, if anyone's curious, semver is also going withe u64

Ah, right, that was likely the reason!
(Semver was in the standard library back then.)

@Mark-Simulacrum Mark-Simulacrum added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 2, 2017
@bors
Copy link
Contributor

bors commented Jul 3, 2017

☔ The latest upstream changes (presumably #42999) made this pull request unmergeable. Please resolve the merge conflicts.

@behnam
Copy link
Contributor Author

behnam commented Jul 3, 2017

Updated patch:

  • Create UnicodeVersion tuple struct type to use for UNICODE_VERSION
    with helper pub API, to make the type and value more useful in
    third-party libraries.

  • Use u16 for version components, as u64 is just an overkill.
    There's no expectation for Unicode Versions to even reach one thousand
    in the next hundered years. (This is different from package versions,
    which may become something auto-generated and exceed human-friendly
    range of integer values.)


Here's the API:

/// Type for Unicode Version.
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
pub struct UnicodeVersion(
    pub u16, // Major version
    pub u16, // Minor version
    pub u16 // Micro (or Update) version
);

impl UnicodeVersion {
    /// Major version
    pub fn major(&self) -> u16 {
        self.0
    }

    /// Minor version
    pub fn minor(&self) -> u16 {
        self.1
    }

    /// Micro (or Update) version
    pub fn micro(&self) -> u16 {
        self.2
    }
}

Besides notes mentioned in the commit message (copies above), here's another reason I like to see this change:

We've been improving the API for the UCD Age Property in UNIC, and we landed on this design, which reused UnicodeVersion in the Assigned variant: open-i18n/rust-unic#23 (comment)

Having a more user-friendly UnicodeVersion type here allows third-party libraries, like UNIC, to reuse the type/concept and not have to redefine a new one.

And this would become possible without writing extra from/into conversions:

assert_eq!(
    unic::UNICODE_VERSION,
    std::char::UNICODE_VERSION
);

Although it's possible to do the same with an unnamed tuple type, the API is not going to be user-friendly, because users would have to use .0/.1/.2, or write from/into conversions to local types.

So, I would put this suggestion in the make-rust-more-user-friendly category.

What do you think?

@SimonSapin
Copy link
Contributor

Trivial accessors for public fields seem fairly pointless. If we do add a new type, why not use a not-tuple-like struct with named fields?

Regardless, I think the Unicode version number is not worth stabilizing new APIs beyond a tuple constant.

@brson
Copy link
Contributor

brson commented Jul 4, 2017

Trivial accessors for public fields seem fairly pointless. If we do add a new type, why not use a not-tuple-like struct with named fields?

Regardless, I think the Unicode version number is not worth stabilizing new APIs beyond a tuple constant.

I agree with both these points. Public fields + accessors is not a pattern we use elsewhere. Either the type encapsulates its details or not.

Can tuple-structs have named fields?

@SimonSapin
Copy link
Contributor

Can tuple-structs have named fields?

Not naming the fields is what makes a tuple-struct not be just a struct.

@behnam
Copy link
Contributor Author

behnam commented Jul 4, 2017

I believe the only problem with a named-struct here the verbose construction: UnicodeVersion { major: 10 , minor: 0, micro: 0 } instead of UnicodeVersion(10, 0, 0).

(At least that's the reason I kept it as tuple-struct. I wish there was a way to have a nameless static constructors for named-structs.)

Agree with all the points. So, is it obvious that we don't want a named-struct here?

Also, @SimonSapin, what do you think about the integer size, should we keep u64, or can we change to u16?

@SimonSapin
Copy link
Contributor

I don’t have much of an opinion on the integer size. I vaguely remember reading that Unicode is now planning on a new version every year, this leaves many years before we’re close to overflowing u16.

Regarding a nameless constructor, the convention is to call it new.

But yes, more strongly than the integer size, I feel that there shouldn’t be a UnicodeVersion type in std which only supports one version of Unicode. Every item (function, method, type, impl, …) that we stabilize in std has a long-term cost: we’ll have to maintain them basically forever. In that regard, a (u16, u16, u16) constant is much "lighter" than a constant + a dedicated type (which will need to be documented) + seven trait impls. If you define that type in UNIC it’ll be much easier for you to evolve it later.

@behnam
Copy link
Contributor Author

behnam commented Jul 4, 2017

Cool! So, looks like I won't be re-using this type anywhere. With that, there's almost no gain in reducing the size to u16. So, unless there's any interest to go ahead with the size change, I think we can close this PR.

@behnam
Copy link
Contributor Author

behnam commented Jul 5, 2017

Updated the PR to only change the integer type. So, we can either land or just let it be and close the PR. Not a big deal either way.

Thanks everyone for the feedback! :)

@ishitatsuyuki
Copy link
Contributor

I think that it's possible (if not likely) to use some ridiculous version naming in the future, such as the Ubuntu style. As this doesn't impose any performance drawbacks, I think it's fine to keep it u64, just in the case that the version changes into something like 20170401.0.0.

@sfackler
Copy link
Member

Why does this constant exist in the first place? It doesn't seem to be used anywhere in this repository at least.

@SimonSapin
Copy link
Contributor

@sfackler I added because the information should be available somewhere of what Unicode version is included in each Rust version for the purpose of std APIs like to_lowercase that are based on Unicode tables. Though I don’t really know of a use case to use it from Rust code, so as far as I’m concerned it could also be a generated doc-comment.

@sfackler
Copy link
Member

Yeah, sticking that in the crate/module level docs seems reasonable to me.

@SimonSapin
Copy link
Contributor

Unfortunately, both include!("generated.rs"); with an outer doc-comment in foo.rs and #![doc = include_str!("generated.txt")] cause compile errors.

@behnam
Copy link
Contributor Author

behnam commented Jul 11, 2017

I think it's actually useful to have this value available in the API, as it allows third-party libraries and their users to make assertions, or at least throw warnings when a version mismatch exists.

Also, this value, along with the character Age property (unic::ucd::age impl) allow users to know if some character is supported by standard lib functions, or not. Emojis, for example, are a common fast-growing sets of characters that are usually hard to handle in some platforms because of this data missing.

I have talked a bit about these issues in this UNIC's Unicode and Rust doc.

Create named struct `UnicodeVersion` to use instead of tuple type for
`UNICODE_VERSION` value. This allows user to access the fields with
meaningful field names: `major`, `minor`, and `micro`.

Per request, an empty private field is added to the struct, so it can be
extended in the future without API breakage.
@behnam
Copy link
Contributor Author

behnam commented Jul 24, 2017

@sfackler, I have updated the PR and added the UnicodeVersion type as an uninstantiable struct.

And, based on the discussions here, and looking at how other systems deal with this issue, I have put together a proposal for the UTC to define and stabilize the format for Unicode Version. Here's the submission: https://docs.google.com/document/d/1F5ysN477tzz8ZOl5DP40al1KLqnHpPEfiENmXDEbVsY/. I'll be glad to hear your feedback and address them before the meeting next week, when (most probably) it will be discussed.

Also, I think it's a good idea to wait a few more days on this PR until UTC reviews the submission. We may finally have an answer to the integer-size question here, after all.

@sfackler sfackler added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jul 24, 2017
@sfackler
Copy link
Member

This looks good to me - up to you if we r+ this now or wait a bit.

@behnam
Copy link
Contributor Author

behnam commented Jul 26, 2017

Thanks, @sfackler. Let's wait for a week or two, then. I'll post an update then.

@carols10cents carols10cents added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 31, 2017
@oli-obk
Copy link
Contributor

oli-obk commented Aug 5, 2017

This PR has been without comments for a week. I guess the informal RFC period is over? ^^

@behnam
Copy link
Contributor Author

behnam commented Aug 5, 2017

This PR has been without comments for a week. I guess the informal RFC period is over? ^^

We've been waiting for a review of the proposal to specify and stabilize the value format for Unicode version by the UTC.

Now, looks like there was no time to get to this proposal at this meeting. With that, I think we can land this diff as it is, and refine more (e.g. make the type instantiable) on further development on the standard side.

r? @sfackler

@behnam
Copy link
Contributor Author

behnam commented Aug 7, 2017

label- S-waiting-on-author
label+ S-waiting-on-review

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Aug 8, 2017
@sfackler
Copy link
Member

sfackler commented Aug 8, 2017

@bors r+

@bors
Copy link
Contributor

bors commented Aug 8, 2017

📌 Commit 42f8861 has been approved by sfackler

@bors
Copy link
Contributor

bors commented Aug 8, 2017

⌛ Testing commit 42f8861 with merge d69cdca...

bors added a commit that referenced this pull request Aug 8, 2017
[libstd_unicode] Change UNICODE_VERSION to use u32

Looks like there's no strong reason to keep these values at `u64`.

With the current plans for the Unicode Standard, `u8` should be enough for the next 200 years. To stay on the safe side, I'm using `u16` here. I don't see a reason to go with anything machine-dependent/more-efficient.
@bors
Copy link
Contributor

bors commented Aug 8, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: sfackler
Pushing d69cdca to master...

@bors bors merged commit 42f8861 into rust-lang:master Aug 8, 2017
@behnam behnam deleted the uni-ver-type branch August 30, 2017 20:51
behnam added a commit to behnam/rust that referenced this pull request Sep 19, 2017
In <rust-lang#42998>, we added an
uninstantiable type for the internal `UNICODE_VERSION` value,
`UnicodeVersion`, but it was not made public to the outside of the
crate, resulting in the value becoming less useful. Here we make the
type accessible from the outside.

Also add a run-pass test to make sure the type and value can be accessed
as intended.
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Sep 19, 2017
[libstd_unicode] Expose UnicodeVersion type

In <rust-lang#42998>, we added an
uninstantiable type for the internal `UNICODE_VERSION` value,
`UnicodeVersion`, but it was not made public to the outside of the
crate, resulting in the value becoming less useful. Here we make the
type accessible from the outside.

Also add a run-pass test to make sure the type and value can be accessed
as intended.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.