Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise and bug-fix to_digit #132428

Closed
daniel-pfeiffer opened this issue Oct 31, 2024 · 9 comments
Closed

Optimise and bug-fix to_digit #132428

daniel-pfeiffer opened this issue Oct 31, 2024 · 9 comments
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@daniel-pfeiffer
Copy link

daniel-pfeiffer commented Oct 31, 2024

I tried this code, with absurd radices:

'0'.to_digit(0);
'0'.to_digit(1);

I expected to see this happen: panic

Instead, this happened: None Some(0)

Solution

This function was quite convoluted, which lead to unnecessary operations. Radix bounds checking was forgotten at the lower end.

OTOH radix bounds checking is often not necessary to repeat for every single digit, so I propose 2 variants, the consequence of which, see at the bottom:

pub const fn to_digit(self, radix: u32) -> Option<u32> {
    assert!(radix >= 2, "to_digit: radix is too low (minimum 2)");
    assert!(radix <= 36, "to_digit: radix is too high (maximum 36)");
    self.to_digit_unchecked(radix)
}

/// ### Soundness
///
/// Callers of this function are responsible that this precondition is satisfied:
///
/// - The radix must be between 2 and 36 inclusive.
///
/// Failing that, the returned value may be weird and is not guaranteed to stay the same in future.
#[inline]
// semi-branchless variant
pub const fn to_digit_unchecked(selfie: char, radix: u32) -> Option<u32> {
    let is_digit = (selfie <= '9') as u32;
    let digit =
        is_digit * // branchless if
            // If not a digit, a number greater than radix will be created.
            (selfie as u32).wrapping_sub('0' as u32)
        + (1 - is_digit) * // branchless else
            // Force the 6th bit to be set to ensure ascii is lower case.
            (selfie as u32 | 0b10_0000).wrapping_sub('a' as u32).saturating_add(10);
    // FIXME: once then_some is const fn, use it here
    if digit < radix { Some(digit) } else { None }
}
// conventional variant
pub const fn to_digit_unchecked(self, radix: u32) -> Option<u32> {
    let digit =
        match self {
            // If not a digit, a number greater than radix will be created.
            ..='9' => (self as u32).wrapping_sub('0' as u32),
            // Force the 6th bit to be set to ensure ascii is lower case.
            _ => (self as u32 | 0b10_0000).wrapping_sub('a' as u32).saturating_add(10)
        };
	/* Or if you prefer this style
        if self <= '9' {
            // If not a digit, a number greater than radix will be created.
            (self as u32).wrapping_sub('0' as u32)
        } else {
            // Force the 6th bit to be set to ensure ascii is lower case.
            (self as u32 | 0b10_0000).wrapping_sub('a' as u32).saturating_add(10)
        };
	*/
    // FIXME: once then_some is const fn, use it here
    if digit < radix { Some(digit) } else { None }
}

I have checked all callers of to_digit, to see where it is already safe, or can be made safe, to switch to to_digit_unchecked. Maybe each time a comment should be added to the place that guarantees a valid radix:

bounds already checked outside of loop, so can switch:

  • library/core/src/num/mod.rs 2x

literal base in a variable, only valid values, so can switch:

  • compiler/rustc_session/src/errors.rs 1x
  • rustc_parse/src/lexer/mod.rs 1x

in a loop, should add bounds check and switch:

  • library/core/src/net/parser.rs 2x

literal radices, where the compiler can hopefully eliminate the asserts, so no need, but might do it to save compile time:

  • compiler/rustc_lexer/src/unescape.rs 4x
  • librustdoc/html/render/print_item.rs 6x
  • rustc_parse_format/src/lib.rs 1x
  • many test files
@daniel-pfeiffer daniel-pfeiffer added the C-bug Category: This is a bug. label Oct 31, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 31, 2024
@workingjubilee
Copy link
Member

...PRs welcome? it's usually easiest, when you have a fix on-hand, to just have a fix to review.

@workingjubilee
Copy link
Member

workingjubilee commented Nov 1, 2024

That said, the documentation here:

# Errors

Returns None if the char does not refer to a digit in the given radix.

# Panics

Panics if given a radix larger than 36.

suggests the correct results should probably be either Option A:

assert_eq!('0'.to_digit(0), None);
assert_eq!('1'.to_digit(1), Some(1)); // https://en.wikipedia.org/wiki/Unary_numeral_system

or Option B:

assert_eq!('0'.to_digit(0), None);
assert_eq!('0'.to_digit(1), None);

@daniel-pfeiffer
Copy link
Author

daniel-pfeiffer commented Nov 1, 2024

It conformed to panicking, when it said it would. However the radix bounds check is there to prevent wrong radices. Unary is a totally different scheme, which is not implemented by this function. All numbers Rust deals with in any base are in positional notation. There 0 is always the smallest digit. And incrementing the biggest digit gives 10. The smallest base for which this is possible is two.

If Rust wants to special case tiny radices, it must actively do so and explain what’s going on! Otherwise they must be forbidden just like too big ones are. The current state of things is not consistent.

Night’s sleep has inspired me to offer a branchless variant. Not benchmarked, hoping that compiler specialists will know whether this is even more optimised.

My understanding was that a PR can’t just introduce a new function to the public API. And whether to switch all those callers to it, surely needs somebody’s prior approval.

I have never compiled rustc, and possibly don’t have enough disk-space. So if I were to issue a PR it would only be syntax checked.

@scottmcm
Copy link
Member

scottmcm commented Nov 1, 2024

If you wish to propose to_digit_unchecked, the process is to open an ACP https://std-dev-guide.rust-lang.org/development/feature-lifecycle.html#suitability-for-the-standard-library

@lolbinarycat lolbinarycat added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Nov 1, 2024
@lolbinarycat
Copy link
Contributor

It's worth noting that '1'.to_digit(1) returns None as expected. Personally, I think the current behavior matches what the documentation specifies, even if it isn't very useful. Inserting a new panic condition here could possibly cause subtle bugs in code that needs to not panic and takes the radix as a user input.

also, your "Soundness" header should instead be called "Correctness". soundness in rust means whether some unsafe code allows safe code to cause UB (which can happen if a function or trait is not marked unsafe) when it needs to be. this function does not cause UB when its preconditions are violated, thus it is a correctness precondition, not a saftey precondition. also, it is a safe function, so it cannot have saftey preconditions.

@workingjubilee
Copy link
Member

If Rust wants to special case tiny radices,

It's the opposite of a special case.

Panicking would be a special case.

@workingjubilee
Copy link
Member

All numbers Rust deals with in any base are in positional notation. There 0 is always the smallest digit. And incrementing the biggest digit gives 10.

According to the definition you have provided, then:

  • when the radix given is 1, the provided answer for 0 is 0.
  • incrementing the biggest digit would give 10, but that is not allowed, so it returns None.

The smallest base for which this is possible is two.

As we can see, when only 1 symbol is allowed, we can still have 0.

So you have successfully persuaded me there is not a problem.

Ask a silly question, get a silly answer. Rust is memory-safe, not foolproof.

@workingjubilee workingjubilee closed this as not planned Won't fix, can't repro, duplicate, stale Nov 1, 2024
@saethlin saethlin added C-feature-request Category: A feature request, i.e: not implemented / a PR. and removed C-bug Category: This is a bug. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Nov 2, 2024
@daniel-pfeiffer
Copy link
Author

daniel-pfeiffer commented Nov 4, 2024

As we can see, when only 1 symbol is allowed, we can still have 0.

Unary works differently from binary and above. It only has the digit 1. A unary with digit 0 can’t work: it couldn’t count anything. Nor could a nullary system with what? No digits at all? Also a rubbish system.

There’s a good reason from_str_radix does both these checks. And to_digit is inconsistent with it.

@jhpratt
Copy link
Member

jhpratt commented Nov 4, 2024

As there's an ACP open, let's keep discussion there.

@rust-lang rust-lang locked as resolved and limited conversation to collaborators Nov 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants