Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display impl for u128 and i128 is slow #44583

Closed
dtolnay opened this issue Sep 15, 2017 · 11 comments
Closed

Display impl for u128 and i128 is slow #44583

dtolnay opened this issue Sep 15, 2017 · 11 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@dtolnay
Copy link
Member

dtolnay commented Sep 15, 2017

@henninglive has a significantly faster one in dtolnay/itoa#10 (comment). Let's provide the faster one in std::fmt!

Unoptimized:
	test bench_itoa::bench_u128_max ... bench:       3,009 ns/iter (+/- 99)

Manual linking:
	test bench_itoa::bench_u128_max ... bench:       1,537 ns/iter (+/- 59)

Inlined:
	test bench_itoa::bench_u128_max ... bench:       1,289 ns/iter (+/- 54)
@henninglive
Copy link

henninglive commented Sep 15, 2017

The implementation is copied from std::fmt, the only change is that I have inlined the __udivmodti4() implementation to work around #44545 until LLVM is smart enough combine the calls. This is probably not relevant for std.

@dtolnay
Copy link
Member Author

dtolnay commented Sep 15, 2017

I guess I haven't looked at the two implementations carefully. Can we not inline __udivmodti4() in std?

@henninglive
Copy link

henninglive commented Sep 15, 2017

It’s possible, but we probably don’t want to do that. __udivmodti4() is an intrinsic passed to LLVM to be used to implement division for i128/u128 on architectures without native 128-bit support. It is an implementation detail of the architecture and is currently defined separately from the compiler itself. Std is going to nearly as fast as the inlined version when #44545 is eventually fixed. We should proberly wait for that or we could let std link __udivmodti4() as it did in the past, rust-lang/rfcs#914.

@cuviper
Copy link
Member

cuviper commented Sep 15, 2017

For platforms that don't have native 128-bit division, it may be faster to break it into parts that can do native division from there. Bases 2, 8, and 16 can split directly to 64-bit parts, and base-10 can split into three base-1013 parts. If some 32-bit targets don't have native 64-bit division, they may do even better with 5 base-108 parts stored in 32-bit, etc.

@cuviper
Copy link
Member

cuviper commented Sep 15, 2017

Hmm, I just read the implementation, and it already chunks it by 10000, so that much is probably pretty good already.

@henninglive
Copy link

I haven’t tried it, but I believe the current implementation is going to be pretty terrible on platforms without u64. If I understand this code correctly, the __udivmodti4() implementation as going use __udivmoddi4 and its brothers to do u64 operations, which a lot of non inlined functions calls to divide two u128s.

@dtolnay
Copy link
Member Author

dtolnay commented Sep 16, 2017

I have yet another implementation in dtolnay/itoa#12. This one is 13x faster than std::fmt on my machine.

@Mark-Simulacrum Mark-Simulacrum added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Sep 17, 2017
@Mark-Simulacrum Mark-Simulacrum added the E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. label May 27, 2018
@steveklabnik
Copy link
Member

Triage: this appears to have gotten even more extreme over time:

running 2 tests
test itoa   ... bench:         216 ns/iter (+/- 28)
test stdlib ... bench:       1,377 ns/iter (+/- 123)

The reproduction involves cargo features and so cannot be done on the playground, details below.

Cargo.toml:

[package]
name = "itoatest"
version = "0.1.0"
authors = ["Steve Klabnik <[email protected]>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies.itoa]
version = "*"
features = ["std", "i128"]

in benches/bench.rs:

#![feature(test)]
extern crate test;
extern crate itoa;

use test::Bencher;

#[bench]
fn stdlib(b: &mut Bencher) {
    b.iter(|| {
        let s = format!("{}", u128::max_value());
        std::hint::black_box(s);
    });
}

#[bench]
fn itoa(b: &mut Bencher) {

    b.iter(|| {
        let mut s = String::new();
        itoa::fmt(&mut s, u128::max_value()).unwrap();
        std::hint::black_box(s);
    });
}

bors added a commit to rust-lang-ci/rust that referenced this issue Oct 4, 2020
Use less divisions in display u128/i128

This PR is an absolute mess, and I need to test if it improves the speed of fmt::Display for u128/i128, but I think it's correct.
It hopefully is more efficient by cutting u128 into at most 2 u64s, and also chunks by 1e16 instead of just 1e4.

Also I specialized the implementations for uints to always be non-false because it bothered me that it was checked at all

Do not merge until I benchmark it and also clean up the god awful mess of spaghetti.
Based on prior work in rust-lang#44583

cc: `@Dylan-DPC`

Due to work on `itoa` and suggestion in original issue:
r? `@dtolnay`
@mdibaiee
Copy link
Contributor

mdibaiee commented Jan 9, 2022

Is this still an open issue? I see a PR for this has been merged.

@mati865
Copy link
Contributor

mati865 commented Jan 15, 2022

Using @steveklabnik example (with added #![feature(bench_black_box)]) I've got this:

running 2 tests
test itoa   ... bench:          34 ns/iter (+/- 0)
test stdlib ... bench:          54 ns/iter (+/- 0)

I think there is still some room for improvements but it's much better now.

@dtolnay
Copy link
Member Author

dtolnay commented Jan 15, 2022

That's likely the best std::fmt can do. The remaining 20ns discrepancy is not algorithmic, it is the overhead of the Formatter machinery.

@dtolnay dtolnay closed this as completed Jan 15, 2022
@dtolnay dtolnay added T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jan 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants