Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2x benchmark loss in rayon-hash from multiple codegen-units #47665

Open
cuviper opened this issue Jan 22, 2018 · 14 comments
Open

2x benchmark loss in rayon-hash from multiple codegen-units #47665

cuviper opened this issue Jan 22, 2018 · 14 comments
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@cuviper
Copy link
Member

cuviper commented Jan 22, 2018

I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.

$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

$ cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 5.51 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial   ... bench:   7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel   ... bench:  10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial     ... bench:   7,733,258 ns/iter (+/- 134,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.48 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial   ... bench:   6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel   ... bench:   8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial     ... bench:   6,144,775 ns/iter (+/- 75,377)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from CGU.

From profiling and disassembly, this seems to mostly be a lost inlining opportunity. In the slower build, the profile is split 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, 28% Sum::sum, and the hot loop in the first looks like:

 16.72 │126d0:   cmpq   $0x0,(%r12,%rbx,8)
  6.73 │126d5: ↓ jne    126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
 16.65 │126d7:   inc    %rbx
  0.00 │126da:   cmp    %rbp,%rbx
  7.20 │126dd: ↑ jb     126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
  0.05 │126df: ↓ jmp    1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
 26.93 │126e1:   mov    0x0(%r13,%rbx,4),%eax
  4.26 │126e6:   movq   $0x1,0x38(%rsp)
  2.27 │126ef:   mov    %rax,0x40(%rsp)
  1.88 │126f4:   mov    %r14,%rdi
  4.58 │126f7: → callq  15b90 <<u64 as core::iter::traits::Sum>::sum>
  0.68 │126fc:   movq   $0x1,0x38(%rsp)
  2.5812705:   mov    %r15,0x40(%rsp)
  0.62 │1270a:   movq   $0x1,0x48(%rsp)
  0.3112713:   mov    %rax,0x50(%rsp)
  0.4912718:   movb   $0x0,0x58(%rsp)
  2.50 │1271d:   xor    %esi,%esi
  0.41 │1271f:   mov    %r14,%rdi
  0.8512722: → callq  153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
  1.3012727:   mov    %rax,%r15
  2.16 │1272a: ↑ jmp    126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>

With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop:

  2.28 │1426d:   test   %rbx,%rbx
14270: ↓ je     14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  5.4014272:   mov    (%rbx),%ebx
  2.7514274:   add    %rbx,%rax
  1.4714277:   lea    (%rdx,%rsi,4),%rbx
  0.21 │1427b:   nopl   0x0(%rax,%rax,1)
 29.5614280:   cmp    %rdi,%rsi
  0.0414283: ↓ jae    14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  2.8714285:   add    $0x4,%rbx
 20.2214289:   cmpq   $0x0,(%rcx,%rsi,8)
  1.48 │1428e:   lea    0x1(%rsi),%rsi
  8.0014292: ↑ je     14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
 25.2514294: ↑ jmp    1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>
@TimNN TimNN added I-slow Issue: Problems and improvements with respect to performance of generated code. A-codegen Area: Code generation labels Jan 23, 2018
@nikomatsakis
Copy link
Contributor

@cuviper I wonder if we can add any #[inline] hints to help resolve this?

@cuviper
Copy link
Member Author

cuviper commented Jan 25, 2018

@nikomatsakis you mean in libstd? Neither Chain::fold nor u64::sum have #[inline] attributes today, but usually that's not needed for generic functions. I guess being generic means they didn't (couldn't) get pre-compiled in libstd, but I suppose they must have gotten separate codegen units in the end.

But I hope we don't have to generally recommend #[inline] all over the place...

@Mark-Simulacrum
Copy link
Member

I'm not entirely sure this isn't expected; 2x is a bit much though. We might need to wait on the heuristics that @michaelwoerister has planned.

@nikomatsakis
Copy link
Contributor

@cuviper I'm not sure that being generic affects their need for an inlining hint. It is true that those functions will be instantiated in the downstream crate, but I think they are still isolated into their own codegen unit. (ThinLTO is of course supposed to help here.) I'm not sure what's the best fix though.

@cuviper cuviper added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 27, 2018
@cuviper
Copy link
Member Author

cuviper commented Feb 14, 2018

FWIW, the gap has closed some (perhaps from LLVM 6?), with parallel codegen now "only" 40% slower:

$ rustc -Vv
rustc 1.25.0-nightly (4d2d3fc5d 2018-02-13)
binary: rustc
commit-hash: 4d2d3fc5dadf894a8ad709a5860a549f2c0b1032
commit-date: 2018-02-13
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 6.0

$ cargo bench --bench set_sum
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling [...]
    Finished release [optimized] target(s) in 5.37 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,420,138 ns/iter (+/- 34,709)
test rayon_set_sum_serial   ... bench:   7,718,603 ns/iter (+/- 141,127)
test std_set_sum_parallel   ... bench:   8,886,208 ns/iter (+/- 137,102)
test std_set_sum_serial     ... bench:   7,845,670 ns/iter (+/- 106,685)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.40 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,089,784 ns/iter (+/- 167,357)
test rayon_set_sum_serial   ... bench:   6,196,760 ns/iter (+/- 335,661)
test std_set_sum_parallel   ... bench:   8,497,259 ns/iter (+/- 128,929)
test std_set_sum_serial     ... bench:   6,151,935 ns/iter (+/- 76,954)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

The profile of rayon_set_sum_parallel is now 49.5% bridge_unindexed_producer_consumer, 48% Sum::sum, with this hot loop:

 19.74 │14fa0:   cmpq   $0x0,(%r12,%rbx,8)
  6.54 │14fa5: ↓ jne    14fb2 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1a2>
 21.06 │14fa7:   add    $0x1,%rbx
       │14fab:   cmp    %rbp,%rbx
  4.04 │14fae: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
  0.04 │14fb0: ↓ jmp    14fde <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1ce>
 28.36 │14fb2:   mov    0x0(%r13,%rbx,4),%eax
  4.34 │14fb7:   movq   $0x1,0x58(%rsp)
  5.45 │14fc0:   mov    %rax,0x60(%rsp)
  0.84 │14fc5:   mov    %r14,%rdi
  3.80 │14fc8: → callq  16380 <<u64 as core::iter::traits::Sum>::sum>
  2.52 │14fcd:   add    %rax,%r15
  2.07 │14fd0:   add    $0x1,%rbx
       │14fd4:   cmp    %rbp,%rbx
  0.26 │14fd7: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>

versus codegen-units=1:

  0.6314092:   lea    (%rdi,%rcx,4),%rbp
  0.0114096:   nopw   %cs:0x0(%rax,%rax,1)
 25.16 │140a0:   cmpq   $0x0,(%rsi,%rcx,8)
  7.68 │140a5: ↓ jne    140b6 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x176>
 28.34 │140a7:   add    $0x1,%rcx
  0.21 │140ab:   add    $0x4,%rbp
       │140af:   cmp    %rdx,%rcx
  2.63 │140b2: ↑ jb     140a0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x160>
  0.04 │140b4: ↓ jmp    140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
 24.54 │140b6:   test   %rbp,%rbp
  0.00 │140b9: ↓ je     140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
  0.32 │140bb:   add    $0x1,%rcx
  5.10 │140bf:   mov    0x0(%rbp),%ebp
  3.31 │140c2:   add    %rbp,%rax
       │140c5:   cmp    %rdx,%rcx
  1.40 │140c8: ↑ jb     14092 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x152>

@michaelwoerister
Copy link
Member

@rust-lang/cargo, maybe cargo bench should default to one codegen-unit?

@hanna-kruppe
Copy link
Contributor

If cargo bench uses one CGU but cargo build --release uses multiple CGUs, benchmark results are less representative of the code that actually runs at the end of the day.

@michaelwoerister
Copy link
Member

Yeah, I was thinking that too after I wrote that. --release really sounds like it should be the profile configured for best possible performance. We need an additional opt-dev profile that has "good enough" performance.

@matklad
Copy link
Member

matklad commented Feb 15, 2018

@michaelwoerister today Cargo uses separate profiles for cargo build --release and cargo bench, but, by default, they use the same options. So it is possible to override number of cgus for benchmarks specifically manually.

However, long-term there is a desire to phase out all special profiles except release and dev (at least in the narrow sense of profiles, as "flags for the compiler"). See also rust-lang/rfcs#2282 which suggests adding user-defined profiles.

@nikomatsakis
Copy link
Contributor

The question of whether or not to have an opt-dev profile and so forth has been around for some time. My current opinion is that we can probably keep it two options, but I would do this by altering the "debug" profile: instead of doing no optimization, it should do some minimal amount of optimization. It will take some time to determine precisely what that means, but I strongly suspect that we can get "decently performing code" by doing some mild inlining and few other minor optimizations. So the default might be something more like -O1 or -Og.

In that case, we might consider some other profiles (e.g., no optimization at all, or mild optimization), but I sort of suspect that the use cases are thinner. In my experience, the only reason to want -O0 was for debuginfo, and let's premise that our default profile keeps debuginfo intact. Something like opt-dev I guess corresponds to "I don't care about debuginfo but i'm not benchmarking" -- that might be better served by customizing the "default' build settings for you project?

Still, it seems like there is ultimately maybe a need for three profiles:

  • super-debug: no opts, I want perfect debuginfo
  • debug (default): do some optimization. By default, preserves debuginfo, but maybe you can "tune it up" if you're rather have faster code.
  • release: pull out all the stops.

Probably though this conversation should be happening somewhere else. =)

@aturon
Copy link
Member

aturon commented Feb 15, 2018

Probably though this conversation should be happening somewhere else

Yeah, this is closely connected to the work on revamping profiles in Cargo.

@cuviper
Copy link
Member Author

cuviper commented Feb 15, 2018

I just benchmarked as part of updating to rayon-1.0, and it's back to 2x slowdown, with Chain::fold appearing in the profile again. I guess the prior inlining improvement was just lucky, whether they're grouped in the same codegen unit.

@workingjubilee
Copy link
Member

Is this still an issue of concern?

@cuviper
Copy link
Member Author

cuviper commented Jul 18, 2022

Well rayon-hash is deprecated, but I guess the comparison may still be informative?

$ rustc -Vv
rustc 1.64.0-nightly (263edd43c 2022-07-17)
binary: rustc
commit-hash: 263edd43c5255084292329423c61a9d69715ebfa
commit-date: 2022-07-17
host: x86_64-unknown-linux-gnu
release: 1.64.0-nightly
LLVM version: 14.0.6

$ cargo bench --bench set_sum
[...]
running 6 tests
test hashbrown_set_sum_parallel  ... bench:     214,551 ns/iter (+/- 5,456)
test hashbrown_set_sum_serial    ... bench:   1,828,014 ns/iter (+/- 14,492)
test rayon_hash_set_sum_parallel ... bench:     464,519 ns/iter (+/- 53,670)
test rayon_hash_set_sum_serial   ... bench:   6,327,378 ns/iter (+/- 57,630)
test std_hash_set_sum_parallel   ... bench:   4,392,456 ns/iter (+/- 1,768,937)
test std_hash_set_sum_serial     ... bench:   2,904,370 ns/iter (+/- 18,217)

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
[...]
running 6 tests
test hashbrown_set_sum_parallel  ... bench:     217,752 ns/iter (+/- 43,351)
test hashbrown_set_sum_serial    ... bench:   1,624,025 ns/iter (+/- 23,969)
test rayon_hash_set_sum_parallel ... bench:     494,929 ns/iter (+/- 50,196)
test rayon_hash_set_sum_serial   ... bench:   4,843,448 ns/iter (+/- 72,878)
test std_hash_set_sum_parallel   ... bench:   2,632,228 ns/iter (+/- 628,283)
test std_hash_set_sum_serial     ... bench:   1,773,990 ns/iter (+/- 66,189)

I was originally sad about rayon_set_sum_parallel, and now that's pretty close, actually slower with CGU=1 but within +/-. It also looks consistent and great for hashbrown_set_sum_parallel. The rest have a pretty large benefit from CGU=1.

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

9 participants