2x benchmark loss in rayon-hash from multiple codegen-units #47665

cuviper · 2018-01-22T21:13:13Z

I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.

$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

$ cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 5.51 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial   ... bench:   7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel   ... bench:  10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial     ... bench:   7,733,258 ns/iter (+/- 134,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.48 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial   ... bench:   6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel   ... bench:   8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial     ... bench:   6,144,775 ns/iter (+/- 75,377)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from CGU.

From profiling and disassembly, this seems to mostly be a lost inlining opportunity. In the slower build, the profile is split 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, 28% Sum::sum, and the hot loop in the first looks like:

 16.72 │126d0:   cmpq   $0x0,(%r12,%rbx,8)
  6.73 │126d5: ↓ jne    126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
 16.65 │126d7:   inc    %rbx
  0.00 │126da:   cmp    %rbp,%rbx
  7.20 │126dd: ↑ jb     126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
  0.05 │126df: ↓ jmp    1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
 26.93 │126e1:   mov    0x0(%r13,%rbx,4),%eax
  4.26 │126e6:   movq   $0x1,0x38(%rsp)
  2.27 │126ef:   mov    %rax,0x40(%rsp)
  1.88 │126f4:   mov    %r14,%rdi
  4.58 │126f7: → callq  15b90 <<u64 as core::iter::traits::Sum>::sum>
  0.68 │126fc:   movq   $0x1,0x38(%rsp)
  2.58 │12705:   mov    %r15,0x40(%rsp)
  0.62 │1270a:   movq   $0x1,0x48(%rsp)
  0.31 │12713:   mov    %rax,0x50(%rsp)
  0.49 │12718:   movb   $0x0,0x58(%rsp)
  2.50 │1271d:   xor    %esi,%esi
  0.41 │1271f:   mov    %r14,%rdi
  0.85 │12722: → callq  153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
  1.30 │12727:   mov    %rax,%r15
  2.16 │1272a: ↑ jmp    126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>

With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop:

  2.28 │1426d:   test   %rbx,%rbx
       │14270: ↓ je     14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  5.40 │14272:   mov    (%rbx),%ebx
  2.75 │14274:   add    %rbx,%rax
  1.47 │14277:   lea    (%rdx,%rsi,4),%rbx
  0.21 │1427b:   nopl   0x0(%rax,%rax,1)
 29.56 │14280:   cmp    %rdi,%rsi
  0.04 │14283: ↓ jae    14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  2.87 │14285:   add    $0x4,%rbx
 20.22 │14289:   cmpq   $0x0,(%rcx,%rsi,8)
  1.48 │1428e:   lea    0x1(%rsi),%rsi
  8.00 │14292: ↑ je     14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
 25.25 │14294: ↑ jmp    1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>

The text was updated successfully, but these errors were encountered:

nikomatsakis · 2018-01-25T15:19:35Z

@cuviper I wonder if we can add any #[inline] hints to help resolve this?

cuviper · 2018-01-25T18:42:32Z

@nikomatsakis you mean in libstd? Neither Chain::fold nor u64::sum have #[inline] attributes today, but usually that's not needed for generic functions. I guess being generic means they didn't (couldn't) get pre-compiled in libstd, but I suppose they must have gotten separate codegen units in the end.

But I hope we don't have to generally recommend #[inline] all over the place...

Mark-Simulacrum · 2018-01-25T21:56:34Z

I'm not entirely sure this isn't expected; 2x is a bit much though. We might need to wait on the heuristics that @michaelwoerister has planned.

nikomatsakis · 2018-01-26T14:58:11Z

@cuviper I'm not sure that being generic affects their need for an inlining hint. It is true that those functions will be instantiated in the downstream crate, but I think they are still isolated into their own codegen unit. (ThinLTO is of course supposed to help here.) I'm not sure what's the best fix though.

cuviper · 2018-02-14T19:13:16Z

FWIW, the gap has closed some (perhaps from LLVM 6?), with parallel codegen now "only" 40% slower:

$ rustc -Vv
rustc 1.25.0-nightly (4d2d3fc5d 2018-02-13)
binary: rustc
commit-hash: 4d2d3fc5dadf894a8ad709a5860a549f2c0b1032
commit-date: 2018-02-13
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 6.0

$ cargo bench --bench set_sum
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling [...]
    Finished release [optimized] target(s) in 5.37 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,420,138 ns/iter (+/- 34,709)
test rayon_set_sum_serial   ... bench:   7,718,603 ns/iter (+/- 141,127)
test std_set_sum_parallel   ... bench:   8,886,208 ns/iter (+/- 137,102)
test std_set_sum_serial     ... bench:   7,845,670 ns/iter (+/- 106,685)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.40 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,089,784 ns/iter (+/- 167,357)
test rayon_set_sum_serial   ... bench:   6,196,760 ns/iter (+/- 335,661)
test std_set_sum_parallel   ... bench:   8,497,259 ns/iter (+/- 128,929)
test std_set_sum_serial     ... bench:   6,151,935 ns/iter (+/- 76,954)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

The profile of rayon_set_sum_parallel is now 49.5% bridge_unindexed_producer_consumer, 48% Sum::sum, with this hot loop:

 19.74 │14fa0:   cmpq   $0x0,(%r12,%rbx,8)
  6.54 │14fa5: ↓ jne    14fb2 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1a2>
 21.06 │14fa7:   add    $0x1,%rbx
       │14fab:   cmp    %rbp,%rbx
  4.04 │14fae: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
  0.04 │14fb0: ↓ jmp    14fde <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1ce>
 28.36 │14fb2:   mov    0x0(%r13,%rbx,4),%eax
  4.34 │14fb7:   movq   $0x1,0x58(%rsp)
  5.45 │14fc0:   mov    %rax,0x60(%rsp)
  0.84 │14fc5:   mov    %r14,%rdi
  3.80 │14fc8: → callq  16380 <<u64 as core::iter::traits::Sum>::sum>
  2.52 │14fcd:   add    %rax,%r15
  2.07 │14fd0:   add    $0x1,%rbx
       │14fd4:   cmp    %rbp,%rbx
  0.26 │14fd7: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>

versus codegen-units=1:

  0.63 │14092:   lea    (%rdi,%rcx,4),%rbp
  0.01 │14096:   nopw   %cs:0x0(%rax,%rax,1)
 25.16 │140a0:   cmpq   $0x0,(%rsi,%rcx,8)
  7.68 │140a5: ↓ jne    140b6 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x176>
 28.34 │140a7:   add    $0x1,%rcx
  0.21 │140ab:   add    $0x4,%rbp
       │140af:   cmp    %rdx,%rcx
  2.63 │140b2: ↑ jb     140a0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x160>
  0.04 │140b4: ↓ jmp    140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
 24.54 │140b6:   test   %rbp,%rbp
  0.00 │140b9: ↓ je     140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
  0.32 │140bb:   add    $0x1,%rcx
  5.10 │140bf:   mov    0x0(%rbp),%ebp
  3.31 │140c2:   add    %rbp,%rax
       │140c5:   cmp    %rdx,%rcx
  1.40 │140c8: ↑ jb     14092 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x152>

michaelwoerister · 2018-02-15T09:27:16Z

@rust-lang/cargo, maybe cargo bench should default to one codegen-unit?

hanna-kruppe · 2018-02-15T09:30:38Z

If cargo bench uses one CGU but cargo build --release uses multiple CGUs, benchmark results are less representative of the code that actually runs at the end of the day.

michaelwoerister · 2018-02-15T09:33:28Z

Yeah, I was thinking that too after I wrote that. --release really sounds like it should be the profile configured for best possible performance. We need an additional opt-dev profile that has "good enough" performance.

matklad · 2018-02-15T09:35:47Z

@michaelwoerister today Cargo uses separate profiles for cargo build --release and cargo bench, but, by default, they use the same options. So it is possible to override number of cgus for benchmarks specifically manually.

However, long-term there is a desire to phase out all special profiles except release and dev (at least in the narrow sense of profiles, as "flags for the compiler"). See also rust-lang/rfcs#2282 which suggests adding user-defined profiles.

nikomatsakis · 2018-02-15T17:05:22Z

The question of whether or not to have an opt-dev profile and so forth has been around for some time. My current opinion is that we can probably keep it two options, but I would do this by altering the "debug" profile: instead of doing no optimization, it should do some minimal amount of optimization. It will take some time to determine precisely what that means, but I strongly suspect that we can get "decently performing code" by doing some mild inlining and few other minor optimizations. So the default might be something more like -O1 or -Og.

In that case, we might consider some other profiles (e.g., no optimization at all, or mild optimization), but I sort of suspect that the use cases are thinner. In my experience, the only reason to want -O0 was for debuginfo, and let's premise that our default profile keeps debuginfo intact. Something like opt-dev I guess corresponds to "I don't care about debuginfo but i'm not benchmarking" -- that might be better served by customizing the "default' build settings for you project?

Still, it seems like there is ultimately maybe a need for three profiles:

super-debug: no opts, I want perfect debuginfo
debug (default): do some optimization. By default, preserves debuginfo, but maybe you can "tune it up" if you're rather have faster code.
release: pull out all the stops.

Probably though this conversation should be happening somewhere else. =)

aturon · 2018-02-15T17:07:59Z

Probably though this conversation should be happening somewhere else

Yeah, this is closely connected to the work on revamping profiles in Cargo.

cuviper · 2018-02-15T18:44:42Z

I just benchmarked as part of updating to rayon-1.0, and it's back to 2x slowdown, with Chain::fold appearing in the profile again. I guess the prior inlining improvement was just lucky, whether they're grouped in the same codegen unit.

workingjubilee · 2022-07-18T03:37:38Z

Is this still an issue of concern?

cuviper · 2022-07-18T19:03:21Z

Well rayon-hash is deprecated, but I guess the comparison may still be informative?

$ rustc -Vv
rustc 1.64.0-nightly (263edd43c 2022-07-17)
binary: rustc
commit-hash: 263edd43c5255084292329423c61a9d69715ebfa
commit-date: 2022-07-17
host: x86_64-unknown-linux-gnu
release: 1.64.0-nightly
LLVM version: 14.0.6

$ cargo bench --bench set_sum
[...]
running 6 tests
test hashbrown_set_sum_parallel  ... bench:     214,551 ns/iter (+/- 5,456)
test hashbrown_set_sum_serial    ... bench:   1,828,014 ns/iter (+/- 14,492)
test rayon_hash_set_sum_parallel ... bench:     464,519 ns/iter (+/- 53,670)
test rayon_hash_set_sum_serial   ... bench:   6,327,378 ns/iter (+/- 57,630)
test std_hash_set_sum_parallel   ... bench:   4,392,456 ns/iter (+/- 1,768,937)
test std_hash_set_sum_serial     ... bench:   2,904,370 ns/iter (+/- 18,217)

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
[...]
running 6 tests
test hashbrown_set_sum_parallel  ... bench:     217,752 ns/iter (+/- 43,351)
test hashbrown_set_sum_serial    ... bench:   1,624,025 ns/iter (+/- 23,969)
test rayon_hash_set_sum_parallel ... bench:     494,929 ns/iter (+/- 50,196)
test rayon_hash_set_sum_serial   ... bench:   4,843,448 ns/iter (+/- 72,878)
test std_hash_set_sum_parallel   ... bench:   2,632,228 ns/iter (+/- 628,283)
test std_hash_set_sum_serial     ... bench:   1,773,990 ns/iter (+/- 66,189)

I was originally sad about rayon_set_sum_parallel, and now that's pretty close, actually slower with CGU=1 but within +/-. It also looks consistent and great for hashbrown_set_sum_parallel. The rest have a pretty large benefit from CGU=1.

TimNN added I-slow Issue: Problems and improvements with respect to performance of generated code. A-codegen Area: Code generation labels Jan 23, 2018

eddyb mentioned this issue Jan 25, 2018

Performance regression with nightly benchmarks. #47062

Closed

nagisa mentioned this issue Jan 25, 2018

codegen-units + ThinLTO is not as good as codegen-units = 1 #47745

Open

cuviper added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 27, 2018

matthiaskrgr mentioned this issue Mar 1, 2018

clarify effects of lto, thinlto and codegen-units #48518

Open

pitdicker mentioned this issue Mar 7, 2018

Rng::iter rust-random/rand#275

Closed

3 tasks

workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2x benchmark loss in rayon-hash from multiple codegen-units #47665

2x benchmark loss in rayon-hash from multiple codegen-units #47665

cuviper commented Jan 22, 2018

nikomatsakis commented Jan 25, 2018

cuviper commented Jan 25, 2018

Mark-Simulacrum commented Jan 25, 2018

nikomatsakis commented Jan 26, 2018

cuviper commented Feb 14, 2018

michaelwoerister commented Feb 15, 2018

hanna-kruppe commented Feb 15, 2018

michaelwoerister commented Feb 15, 2018

matklad commented Feb 15, 2018

nikomatsakis commented Feb 15, 2018

aturon commented Feb 15, 2018

cuviper commented Feb 15, 2018

workingjubilee commented Jul 18, 2022

cuviper commented Jul 18, 2022

2x benchmark loss in rayon-hash from multiple codegen-units #47665

2x benchmark loss in rayon-hash from multiple codegen-units #47665

Comments

cuviper commented Jan 22, 2018

nikomatsakis commented Jan 25, 2018

cuviper commented Jan 25, 2018

Mark-Simulacrum commented Jan 25, 2018

nikomatsakis commented Jan 26, 2018

cuviper commented Feb 14, 2018

michaelwoerister commented Feb 15, 2018

hanna-kruppe commented Feb 15, 2018

michaelwoerister commented Feb 15, 2018

matklad commented Feb 15, 2018

nikomatsakis commented Feb 15, 2018

aturon commented Feb 15, 2018

cuviper commented Feb 15, 2018

workingjubilee commented Jul 18, 2022

cuviper commented Jul 18, 2022