Reconsider digit separators #1485

Closed
jonmeow opened this issue Jul 21, 2022 · 12 comments
Labels
leads question A question for the leads team

Comments

@jonmeow
Contributor

jonmeow commented Jul 21, 2022

At present Carbon restricts integer digit separators to every 3 digits, going back to https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0143.md.

The Indian convention had been raised as a counterexample. However, it looks like CJK cultures were overlooked, maybe due to conflicting information in https://en.wikipedia.org/wiki/Decimal_separator#Digit_grouping (which says eastern countries have switched to 3 digit groups). https://www.statisticalconsultants.co.nz/blog/how-the-world-separates-its-digits.html reports that China groups every 4 digits.

Given the wider range of conventions, it may be worth supporting more variations (e.g., support 3 different conventions for digit groupings), or otherwise loosening restrictions. While that could end up with ambiguous placement for some numbers, larger numbers would be less ambiguous because the groupings would repeat.
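
To make the "support 3 different conventions" idea concrete, here is a minimal sketch of a checker for decimal literals. It is not Carbon's lexer: the function names and the `CONVENTIONS` table are hypothetical, and the three conventions (every 3, every 4, and the Indian 3-then-2 grouping) are assumptions taken from the discussion above.

```python
import re

# Hypothetical table: (size of the last group, size of every earlier group).
CONVENTIONS = {
    "western": (3, 3),  # 1_000_000
    "cjk": (4, 4),      # 100_0000
    "indian": (3, 2),   # 10_00_000
}


def matches_grouping(literal, last, other):
    """Return True if a decimal literal follows one grouping convention."""
    groups = literal.split("_")
    # Every group must be a non-empty run of ASCII digits; this also rejects
    # leading, trailing, and doubled underscores.
    if not all(re.fullmatch(r"[0-9]+", g) for g in groups):
        return False
    if len(groups) == 1:
        return True  # no separators at all is accepted by every convention
    if len(groups[-1]) != last:
        return False
    if any(len(g) != other for g in groups[1:-1]):
        return False
    # The leading group may be shorter than a full group, but not longer.
    return len(groups[0]) <= other


def matching_conventions(literal):
    """List the names of all conventions that accept the literal."""
    return [
        name
        for name, (last, other) in CONVENTIONS.items()
        if matches_grouping(literal, last, other)
    ]
```

With this sketch, `matching_conventions("1_000")` returns both `"western"` and `"indian"`, which is the ambiguity for small numbers mentioned above, while `matching_conventions("10_00_000")` returns only `"indian"`.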

Note, I think this arose from this tweet

@lexi-nadia

Besides international variations, there are also microformats. For example:

let mac_address: i64 = 0xa1_b2_c3_d4_e5_f6;
let uuid: i128 = 0x123e4567_e89b_12d3_a456_426614174000;
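
As an illustration in another language (Python, not Carbon), the sketch below shows that these microformat groupings are purely visual: the value is the same with or without separators. The `group_hex` helper is hypothetical and only exists here to render a value back into a chosen grouping.

```python
from typing import Sequence

# Python accepts an underscore between any two digits of an integer literal.
mac_address = 0xa1_b2_c3_d4_e5_f6
uuid = 0x123e4567_e89b_12d3_a456_426614174000


def group_hex(value: int, sizes: Sequence[int]) -> str:
    """Render value as lowercase hex, split left to right into the given group sizes."""
    width = sum(sizes)
    digits = format(value, "0{}x".format(width))
    if len(digits) != width:
        raise ValueError("value does not fit in the requested number of digits")
    parts = []
    pos = 0
    for size in sizes:
        parts.append(digits[pos:pos + size])
        pos += size
    return "0x" + "_".join(parts)
```

For example, `group_hex(mac_address, [2] * 6)` reproduces the byte-wise MAC spelling, and `group_hex(uuid, [8, 4, 4, 4, 12])` reproduces the 8-4-4-4-12 UUID spelling.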

@mo-xiaoming

As a Chinese developer, I can say

  1. yes, in our culture, we're used to 4 digit groups
  2. However, as a developer, I'm quite comfortable with 3 digit groups (stockholm syndrome?)
  3. @lexi-nadia has a very good point on hex numbers

So, maybe adding this kind of variation is worthwhile

@nigeltao

I'm not saying you should do this, just throwing out a related idea...

In Carbon, 0x1A is valid but 0x1a is not. Unlike C/C++, hex digits are case sensitive.

In Wuffs, both are valid (from the compiler's point of view) but the formatter (the equivalent of clang-format, gofmt, rustfmt, etc) canonicalizes it as 0x1A. The convention is to run the formatter regularly (e.g. in on-file-save or pre-commit hooks) and so, in practice, you only see 0x1A and never 0x1a. But this still lets you copy/paste 0x1a from a StackOverflow post, even if that post discusses a different programming language.

FWIW, Wuffs' formatter's canonicalization of numeric literals also inserts underscores at every 6 digits for decimal and at every 4 digits for hexadecimal: it's 3_141592 and 0xDEAD_BEEF. The point being that there is one canonical spelling of every numeric literal, just like there's one canonical indentation style (and no endless tabs vs spaces debate). Whether it's every 3 or 6 digits, for decimal, isn't that important. As said about Go: "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite".
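
The canonicalization described here can be sketched roughly as follows. This is not Wuffs' actual formatter code: the `canonicalize` function is hypothetical, it only handles plain decimal and `0x` integer literals, and it does no validation of its input.

```python
def canonicalize(literal: str) -> str:
    """Respell a decimal or 0x-prefixed integer literal in one canonical form.

    Assumed rules, taken from the description above: hex digits are
    upper-cased and grouped every 4, decimal digits are grouped every 6.
    """
    text = literal.replace("_", "")  # drop whatever grouping the author used
    if text[:2] in ("0x", "0X"):
        prefix, digits, width = "0x", text[2:].upper(), 4
    else:
        prefix, digits, width = "", text, 6
    # Group from the right so that any short group ends up at the front.
    groups = []
    while digits:
        groups.append(digits[-width:])
        digits = digits[:-width]
    return prefix + "_".join(reversed(groups))
```

Under these assumptions `canonicalize("3141592")` gives `3_141592` and `canonicalize("0xdeadbeef")` gives `0xDEAD_BEEF`, matching the two spellings quoted above.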

@chandlerc
Contributor

> In Carbon, 0x1A is valid but 0x1a is not. Unlike C/C++, hex digits are case sensitive.
>
> In Wuffs, both are valid (from the compiler's point of view) but the formatter (the equivalent of clang-format, gofmt, rustfmt, etc) canonicalizes it as 0x1A. The convention is to run the formatter regularly (e.g. in on-file-save or pre-commit hooks) and so, in practice, you only see 0x1A and never 0x1a. But this still lets you copy/paste 0x1a from a StackOverflow post, even if that post discusses a different programming language.

I think this is a pretty separate question, so if you'd like to pursue it I would move it. FWIW, we can have a near perfect recovery here in the frontend and suggest edits, so I think the difference isn't huge, but it is a difference.

> FWIW, Wuffs' formatter's canonicalization of numeric literals also inserts underscores at every 6 digits for decimal and at every 4 digits for hexadecimal: it's 3_141592 and 0xDEAD_BEEF. The point being that there is one canonical spelling of every numeric literal, just like there's one canonical indentation style (and no endless tabs vs spaces debate). Whether it's every 3 or 6 digits, for decimal, isn't that important. As said about Go: "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite".

Given the semantically meaningful different groupings mentioned here, I think this question should include not canonicalizing in the formatter. FWIW, I'm sufficiently convinced by things like credit card numbers, UUIDs, and MAC addresses that we should have this flexibility even outside of any ideas around regional differences or different bases.

@nigeltao

nigeltao commented Jul 29, 2022

FWIW, being hexadecimal, UUIDs and MAC addresses aren't unusably bad if you enforce underscores every 4 digits. The natural microformat boundaries are already multiples of two bytes. Even if the natural UUID grouping involves the last 12 hex digits, that's still easy to see here:

let mac_address: i64 = 0xa1b2_c3d4_e5f6;
let uuid: i128 = 0x123e_4567_e89b_12d3_a456_4266_1417_4000;

In the MAC address case, "what's the 3rd byte" is still much easier to eyeball with "underscore every 4" than with no underscores at all.

As for credit card numbers, do people actually process them as numbers (as opposed to strings)?

@chandlerc
Contributor

> FWIW, being hexadecimal, UUIDs and MAC addresses aren't unusably bad if you enforce underscores every 4 digits. The natural microformat boundaries are already multiples of two bytes. Even if the natural UUID grouping involves the last 12 hex digits, that's still easy to see here:
>
> let mac_address: i64 = 0xa1b2_c3d4_e5f6;
> let uuid: i128 = 0x123e_4567_e89b_12d3_a456_4266_1417_4000;
>
> In the MAC address case, "what's the 3rd byte" is still much easier to eyeball with "underscore every 4" than with no underscores at all.

I still find the versions above significantly more readable than these. I agree that no digit separators would be even worse, but I don't think that's really the question. I think the readability gain of format-specific grouping is worthwhile based on the examples here.

@zygoloid added the "good first issue" label (Possibly a good first issue for newcomers) Aug 9, 2022
@zygoloid
Contributor

zygoloid commented Aug 9, 2022

We seem to have good evidence here that we should reconsider this decision, and a good level of consensus for making a change. The next step would be for someone to write a proposal presenting these arguments.

@jonmeow added the "leads question" label (A question for the leads team) Aug 10, 2022
@ethomag

ethomag commented Aug 11, 2022

Maybe I misinterpreted this (in docs/design/lexical_conventions/numeric_literals.md)

> For real-number literals, digit separators can appear in the decimal and hexadecimal integer portions (prior to the period and after the optional e or mandatory p)

I don't understand the restriction of having digit separators only to the left of the decimal point for real numbers and I could not find any rationale behind it in the docs. Consider:

let nanosecond: f64 = 0.000000001;

vs

let nanosecond: f64 = 0.000_000_001;

I think that improves readability as much as digit separators in the integer part.
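
For comparison only (this is Python, not Carbon), some languages already allow separators after the decimal point, and the grouped and ungrouped spellings denote the same value. The `group_fraction` helper below is hypothetical; it groups fractional digits left to right from the point, in threes by default.

```python
# Python accepts underscores between digits on either side of the point.
plain = 0.000000001
grouped = 0.000_000_001


def group_fraction(literal: str, width: int = 3) -> str:
    """Insert underscores into the fractional digits of a plain decimal literal."""
    whole, _, frac = literal.partition(".")
    if not frac:
        return literal  # nothing after the point, so nothing to group
    groups = [frac[i:i + width] for i in range(0, len(frac), width)]
    return whole + "." + "_".join(groups)
```

Here `group_fraction("0.000000001")` yields `0.000_000_001`, so each group lines up with milli, micro, and nano in the same way that integer groups line up with thousands and millions.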

@jonmeow
Contributor Author

jonmeow commented Aug 11, 2022

Created a proposal on #1983 -- let me know if I've misunderstood the leads' direction there; I can always flip the alternatives around if the leads want a different choice.

> I don't understand the restriction of having digit separators only to the left of the decimal point for real numbers and I could not find any rationale behind it in the docs.

AFAICT your interpretation is correct, although the proposal has some conflicting examples in ties. Anyways, I think #1983 should produce clear rationale either way.

@ethomag

ethomag commented Aug 11, 2022

Thanks @jonmeow for your reply. My concern was not about ties, but strictly readability. I think scientific notation is symmetric around the decimal point. To be able to group decimal digits in the integer part so that you can easily eyeball which parts are grams, kilograms etc is something that can aid avoiding making mistakes when defining constants. I just think the same argument holds for milligrams, micrograms etc.

I could not find any rationale that I could understand in the referred links, but it seems you have already considered this. I was just naïvely thinking that this was something that was overlooked.

I am truly amazed by your work, it's quite a challenge you have taken on!

@chandlerc
Contributor

(removing good-first-issue label as this is now in progress)

@chandlerc removed the "good first issue" label (Possibly a good first issue for newcomers) Aug 12, 2022
zygoloid added a commit that referenced this issue Aug 25, 2022
[Proposal #143: Numeric literals](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0143.md) added digit separators with strict rules for placement. It missed some use-cases. In order to address this, remove placement rules for numeric literals.

Related issue: #1485 

Co-authored-by: Chandler Carruth
Co-authored-by: Richard Smith
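
As a rough illustration of what "remove placement rules" could mean for integer literals, the sketch below assumes the only remaining requirement is that each underscore sits between two digits. That reading is an assumption, not a quote from #1983, and the function name is hypothetical. Upper-case-only hex digits follow the earlier comment that `0x1a` is not valid in Carbon.

```python
import re

# Assumed relaxed rule: a separator only has to sit between two digits.
_DECIMAL = re.compile(r"[0-9](?:_?[0-9])*")
_HEX = re.compile(r"0x[0-9A-F](?:_?[0-9A-F])*")


def is_valid_integer_literal(literal: str) -> bool:
    """Check separator placement under the assumed relaxed rule.

    This ignores any other lexical rules a real lexer would apply
    (for example restrictions on leading zeros or other bases).
    """
    return bool(_DECIMAL.fullmatch(literal) or _HEX.fullmatch(literal))
```

Under this assumption the 3-digit, 4-digit, and Indian groupings, as well as the MAC address and UUID spellings from this thread (written with upper-case hex digits), all pass, while `1__000`, `_1`, `1_`, and `0x_1A` are rejected.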
@jonmeow
Contributor Author

jonmeow commented Aug 25, 2022

I believe this is resolved by #1983 though I still need to update the design (but I think we can call the leads question closed).

@jonmeow jonmeow closed this as completed Aug 25, 2022
7 participants