Treat gc=No characters as numeric #51609

dscorbett · 2018-06-17T18:06:34Z

char::is_numeric and char::is_alphanumeric are documented to be defined “in terms of the Unicode General Categories 'Nd', 'Nl', 'No'”, but unicode.py does not group 'No' with the other 'N' categories. These functions therefore currently return false for characters like ⟨¾⟩ and ⟨①⟩.

rust-highfive · 2018-06-17T18:06:44Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Mark-Simulacrum (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

Mark-Simulacrum · 2018-06-17T18:30:30Z

r? @SimonSapin

SimonSapin · 2018-06-17T23:55:04Z

#15283 changed from literally testing Nd, Nl, and No to testing for category N which is correct according to Unicode but unicode.py has been buggy there ever since it was added in #13700.

Changing the behavior now (even if back from a bug) after 4 years might be considered breaking, but we already do that when updating to a new version of Unicode. (Though only code points that were unassigned in the earlier version are affected.)

@rust-lang/libs, any thoughts on the stability aspect?

SimonSapin · 2018-06-18T00:10:22Z

By the way, I’m very late for this conversation since this method has been stable since 1.0, but I wondered what it would be used for. In https://github.com/search?l=Rust&p=1&q=.is_numeric%28%29&type=Code most of the results seem to be one of:

Results for exercism.io, Advent of Code, Project Euler, etc
Code that is buggy and should use is_ascii_digit instead
Copies of Rust’s own test cases

If this method was proposed now I’d argue against including it in the standard library since it’s usually not what people are looking for.

sfackler · 2018-06-18T00:17:16Z

Should we just deprecate this entirely? I agree that is_ascii_digit is almost of the time what you're actually looking for.

SimonSapin · 2018-06-19T17:18:08Z

I’d be ok with deprecating this and (some?) other Unicode character database-based char::is_* methods (stabilized in #20395, some have been carried since ac13f0d in 2011, neither thread with much discussion of use cases). But let’s discuss it separately from this thread.

Back to this PR: regardless of deprecation this function will stick around so we need to decide on its behavior. I think it’s clear that this PLR is a bug fix that restores the behavior to what was intended (and initially implemented). However the buggy behavior has been around for 4 years, and even documented with examples. (This PR is changing the doctest.)

clarfonthey · 2018-06-22T21:40:19Z

I think that the doctests should be treated as more important documentation than the text itself. IMHO the text should be updated to match the doctest, not the other way around.

I agree with deprecating, too. I honestly can't think of any specific reason why code which requires this specific definition of numeric wouldn't be using a crate like unic which offers lookup for all Unicode categories.

alexcrichton · 2018-06-29T01:42:18Z

I would personally be in favor of landing this patch and continuing to keep these methods. While this contradicts the doctests it doesn't contradict the documentation nor the naive interpretation of this and other is_* accessors (aka lookups for the unicode tables).

While this can be a footgun sometimes it doesn't mean that it's always a footgun. If we really want to remove the method then we could consider renaming it to is_unicode_numeric or something like that. That's a pretty heavy hammer, though, for what seems like a bug that doesn't come up all that often

stokhos · 2018-07-06T19:33:52Z

Ping from triage @dscorbett , we haven't heard from you for a while, will you have time to like into this PR

dscorbett · 2018-07-06T20:33:38Z

I agree with @alexcrichton: the documentation defines the function in terms of Nd, Nl, and No, so that is what the functions should do. The two contradictory cases in the doctests are not as important, because readers are, I think, more likely to read the main documentation carefully than the tests; but if they do read the tests, they might just conclude that Unicode doesn’t consider those two characters numeric. (In fact, I initially didn’t remove the test for '①', because I assumed it was correct and that its generic category must be So.)

Contra @SimonSapin, Unicode does not guarantee that general categories are stable, even for assigned code points. For example, in past versions, U+16EE RUNIC ARLAUG SYMBOL changed from No to Nl, U+2160 ROMAN NUMERAL ONE from So to Nl, and U+19DA NEW TAI LUE THAM DIGIT ONE from Nd to No. It is therefore a mistake for a client of these functions to rely on any particular behavior for any given code point.

I searched for uses of is_alphanumeric and is_numeric. It looks like most uses either assume the input is ASCII, which are therefore buggy no matter what these functions do, or are simplistic tokenizers, which would benefit from accepting gc=No.

TimNN · 2018-07-17T17:36:54Z

Ping from triage! @alexcrichton / @SimonSapin: How should this PR move forward?

alexcrichton · 2018-07-17T21:18:00Z

Do others from @rust-lang/libs feel opposed to merging this? I've outlined above why I think it's fine to land this and why I don't think we're ready yet to deprecate this method

alexcrichton · 2018-08-01T17:19:32Z

Ok! Sounds like not a huge amount of thoughts, so let's...

@bors: r+

bors · 2018-08-01T17:19:33Z

📌 Commit 5150ff0 has been approved by alexcrichton

bors · 2018-08-01T17:44:32Z

⌛ Testing commit 5150ff0 with merge c415676...

Treat gc=No characters as numeric [`char::is_numeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_numeric) and [`char::is_alphanumeric`](https://doc.rust-lang.org/std/primitive.char.html#method.is_alphanumeric) are documented to be defined “in terms of the Unicode General Categories 'Nd', 'Nl', 'No'”, but unicode.py does not group 'No' with the other 'N' categories. These functions therefore currently return `false` for characters like ⟨¾⟩ and ⟨①⟩.

bors · 2018-08-01T19:54:00Z

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing c415676 to master...

Treat gc=No characters as numeric

5150ff0

rust-highfive assigned Mark-Simulacrum Jun 17, 2018

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 17, 2018

kennytm added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jun 17, 2018

rust-highfive assigned SimonSapin and unassigned Mark-Simulacrum Jun 17, 2018

SimonSapin added the S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). label Jun 17, 2018

kennytm added I-nominated and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 18, 2018

kennytm added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 17, 2018

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Aug 1, 2018

bors merged commit 5150ff0 into rust-lang:master Aug 1, 2018

dscorbett deleted the is_numeric branch August 1, 2018 20:01

dscorbett mentioned this pull request Aug 20, 2018

Treat gc=No as a subset of gc=N unicode-rs/unicode-segmentation#44

Merged

cuviper mentioned this pull request Sep 24, 2018

[1.30 beta] More chars considered alphanumeric in 1.30 #54481

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Treat gc=No characters as numeric #51609

Treat gc=No characters as numeric #51609

dscorbett commented Jun 17, 2018

rust-highfive commented Jun 17, 2018

Mark-Simulacrum commented Jun 17, 2018

SimonSapin commented Jun 17, 2018

SimonSapin commented Jun 18, 2018

sfackler commented Jun 18, 2018

SimonSapin commented Jun 19, 2018

clarfonthey commented Jun 22, 2018 •

edited

Loading

alexcrichton commented Jun 29, 2018

stokhos commented Jul 6, 2018

dscorbett commented Jul 6, 2018

TimNN commented Jul 17, 2018

alexcrichton commented Jul 17, 2018

alexcrichton commented Aug 1, 2018

bors commented Aug 1, 2018

bors commented Aug 1, 2018

bors commented Aug 1, 2018

Treat gc=No characters as numeric #51609

Treat gc=No characters as numeric #51609

Conversation

dscorbett commented Jun 17, 2018

rust-highfive commented Jun 17, 2018

Mark-Simulacrum commented Jun 17, 2018

SimonSapin commented Jun 17, 2018

SimonSapin commented Jun 18, 2018

sfackler commented Jun 18, 2018

SimonSapin commented Jun 19, 2018

clarfonthey commented Jun 22, 2018 • edited Loading

alexcrichton commented Jun 29, 2018

stokhos commented Jul 6, 2018

dscorbett commented Jul 6, 2018

TimNN commented Jul 17, 2018

alexcrichton commented Jul 17, 2018

alexcrichton commented Aug 1, 2018

bors commented Aug 1, 2018

bors commented Aug 1, 2018

bors commented Aug 1, 2018

clarfonthey commented Jun 22, 2018 •

edited

Loading