Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat gc=No characters as numeric #51609

Merged
merged 1 commit into from
Aug 1, 2018
Merged

Treat gc=No characters as numeric #51609

merged 1 commit into from
Aug 1, 2018

Conversation

dscorbett
Copy link
Contributor

char::is_numeric and char::is_alphanumeric are documented to be defined “in terms of the Unicode General Categories 'Nd', 'Nl', 'No'”, but unicode.py does not group 'No' with the other 'N' categories. These functions therefore currently return false for characters like ⟨¾⟩ and ⟨①⟩.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Mark-Simulacrum (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 17, 2018
@kennytm kennytm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jun 17, 2018
@Mark-Simulacrum
Copy link
Member

r? @SimonSapin

@SimonSapin
Copy link
Contributor

#15283 changed from literally testing Nd, Nl, and No to testing for category N which is correct according to Unicode but unicode.py has been buggy there ever since it was added in #13700.

Changing the behavior now (even if back from a bug) after 4 years might be considered breaking, but we already do that when updating to a new version of Unicode. (Though only code points that were unassigned in the earlier version are affected.)

@rust-lang/libs, any thoughts on the stability aspect?

@SimonSapin SimonSapin added the S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). label Jun 17, 2018
@SimonSapin
Copy link
Contributor

By the way, I’m very late for this conversation since this method has been stable since 1.0, but I wondered what it would be used for. In https://github.com/search?l=Rust&p=1&q=.is_numeric%28%29&type=Code most of the results seem to be one of:

  • Results for exercism.io, Advent of Code, Project Euler, etc
  • Code that is buggy and should use is_ascii_digit instead
  • Copies of Rust’s own test cases

If this method was proposed now I’d argue against including it in the standard library since it’s usually not what people are looking for.

@sfackler
Copy link
Member

Should we just deprecate this entirely? I agree that is_ascii_digit is almost of the time what you're actually looking for.

@kennytm kennytm added I-nominated and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 18, 2018
@SimonSapin
Copy link
Contributor

I’d be ok with deprecating this and (some?) other Unicode character database-based char::is_* methods (stabilized in #20395, some have been carried since ac13f0d in 2011, neither thread with much discussion of use cases). But let’s discuss it separately from this thread.

Back to this PR: regardless of deprecation this function will stick around so we need to decide on its behavior. I think it’s clear that this PLR is a bug fix that restores the behavior to what was intended (and initially implemented). However the buggy behavior has been around for 4 years, and even documented with examples. (This PR is changing the doctest.)

@clarfonthey
Copy link
Contributor

clarfonthey commented Jun 22, 2018

I think that the doctests should be treated as more important documentation than the text itself. IMHO the text should be updated to match the doctest, not the other way around.

I agree with deprecating, too. I honestly can't think of any specific reason why code which requires this specific definition of numeric wouldn't be using a crate like unic which offers lookup for all Unicode categories.

@alexcrichton
Copy link
Member

I would personally be in favor of landing this patch and continuing to keep these methods. While this contradicts the doctests it doesn't contradict the documentation nor the naive interpretation of this and other is_* accessors (aka lookups for the unicode tables).

While this can be a footgun sometimes it doesn't mean that it's always a footgun. If we really want to remove the method then we could consider renaming it to is_unicode_numeric or something like that. That's a pretty heavy hammer, though, for what seems like a bug that doesn't come up all that often

@stokhos stokhos added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed I-nominated S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 29, 2018
@stokhos
Copy link

stokhos commented Jul 6, 2018

Ping from triage @dscorbett , we haven't heard from you for a while, will you have time to like into this PR

@dscorbett
Copy link
Contributor Author

I agree with @alexcrichton: the documentation defines the function in terms of Nd, Nl, and No, so that is what the functions should do. The two contradictory cases in the doctests are not as important, because readers are, I think, more likely to read the main documentation carefully than the tests; but if they do read the tests, they might just conclude that Unicode doesn’t consider those two characters numeric. (In fact, I initially didn’t remove the test for '①', because I assumed it was correct and that its generic category must be So.)

Contra @SimonSapin, Unicode does not guarantee that general categories are stable, even for assigned code points. For example, in past versions, U+16EE RUNIC ARLAUG SYMBOL changed from No to Nl, U+2160 ROMAN NUMERAL ONE from So to Nl, and U+19DA NEW TAI LUE THAM DIGIT ONE from Nd to No. It is therefore a mistake for a client of these functions to rely on any particular behavior for any given code point.

I searched for uses of is_alphanumeric and is_numeric. It looks like most uses either assume the input is ASCII, which are therefore buggy no matter what these functions do, or are simplistic tokenizers, which would benefit from accepting gc=No.

@TimNN
Copy link
Contributor

TimNN commented Jul 17, 2018

Ping from triage! @alexcrichton / @SimonSapin: How should this PR move forward?

@alexcrichton
Copy link
Member

Do others from @rust-lang/libs feel opposed to merging this? I've outlined above why I think it's fine to land this and why I don't think we're ready yet to deprecate this method

@kennytm kennytm added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 17, 2018
@alexcrichton
Copy link
Member

Ok! Sounds like not a huge amount of thoughts, so let's...

@bors: r+

@bors
Copy link
Contributor

bors commented Aug 1, 2018

📌 Commit 5150ff0 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Aug 1, 2018
@bors
Copy link
Contributor

bors commented Aug 1, 2018

⌛ Testing commit 5150ff0 with merge c415676...

bors added a commit that referenced this pull request Aug 1, 2018
Treat gc=No characters as numeric

[`char::is_numeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_numeric) and [`char::is_alphanumeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_alphanumeric) are documented to be defined “in terms of the Unicode General Categories 'Nd', 'Nl', 'No'”, but unicode.py does not group 'No' with the other 'N' categories. These functions therefore currently return `false` for characters like ⟨¾⟩ and ⟨①⟩.
@bors
Copy link
Contributor

bors commented Aug 1, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing c415676 to master...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.