Reduce the impact of Vec::reserve calls that do not cause any allocation #83357

Merged: bors merged 2 commits into rust-lang:master from saethlin:vec-reserve-inlining on Mar 30, 2021

Conversation

@saethlin (Member) commented Mar 21, 2021:

I think a lot of callers expect Vec::reserve to be nearly free when no resizing is required, but unfortunately that isn't the case. LLVM makes remarkably poor inlining choices along the path from Vec::reserve to RawVec::grow_amortized, so depending on the surrounding context you either get a huge blob of RawVec's resizing logic inlined into some seemingly unrelated function, or not enough inlining happens and/or the actual check in needs_to_grow ends up behind a function call. My goal is to make the codegen for Vec::reserve match the mental model that callers seem to have: when there is already sufficient capacity, it is reliably just a sub, cmp, ja sequence.
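Concretely, the idea is to keep the capacity check small enough to always inline, and to outline the growth and error handling behind a cold function. A minimal sketch of that shape, using hypothetical free functions rather than the actual RawVec::reserve code:

```rust
// Sketch only: illustrates the intended codegen shape, not the real patch.
#[inline]
fn reserve_fast(v: &mut Vec<u8>, additional: usize) {
    // Fast path: spare capacity is `capacity - len`, so when no growth is
    // needed this compiles down to a subtraction, a comparison, and a
    // not-taken branch.
    if v.capacity() - v.len() < additional {
        do_reserve_and_handle(v, additional);
    }
}

// Outlined slow path, marked cold so that the resizing and error-handling
// logic stays out of the caller's hot code.
#[cold]
fn do_reserve_and_handle(v: &mut Vec<u8>, additional: usize) {
    v.reserve(additional);
}

fn main() {
    let mut v: Vec<u8> = Vec::with_capacity(2048);
    reserve_fast(&mut v, 1234); // sufficient capacity: only the check runs
    assert!(v.capacity() >= 1234);
}
```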

This patch has the following impact on the serde_json benchmarks (https://github.com/serde-rs/json-benchmark/tree/ca3efde8a5b75ff59271539b67452911860248c7), run with `cargo +stage1 run --release -- -n 1024`:

Before:

```
                                DOM                  STRUCT
======= serde_json ======= parse|stringify ===== parse|stringify ====
data/canada.json         340 MB/s   490 MB/s   630 MB/s   370 MB/s
data/citm_catalog.json   460 MB/s   540 MB/s  1010 MB/s   550 MB/s
data/twitter.json        330 MB/s   840 MB/s   640 MB/s   630 MB/s

======= json-rust ======== parse|stringify ===== parse|stringify ====
data/canada.json         580 MB/s   990 MB/s
data/citm_catalog.json   720 MB/s   660 MB/s
data/twitter.json        570 MB/s   960 MB/s
```

After:

```
                                DOM                  STRUCT
======= serde_json ======= parse|stringify ===== parse|stringify ====
data/canada.json         330 MB/s   510 MB/s   610 MB/s   380 MB/s
data/citm_catalog.json   450 MB/s   640 MB/s   970 MB/s   830 MB/s
data/twitter.json        330 MB/s   880 MB/s   670 MB/s   960 MB/s

======= json-rust ======== parse|stringify ===== parse|stringify ====
data/canada.json         560 MB/s  1130 MB/s
data/citm_catalog.json   710 MB/s   880 MB/s
data/twitter.json        530 MB/s  1230 MB/s
```

That's approximately a one-third increase in throughput on two of the benchmarks, and no effect on the third. (The benchmark suite has enough jitter that I could pick a run where there are no regressions, so I'm not convinced the small regressions here are meaningful.)

This also produces perf increases on the order of 3-5% in a few other microbenchmarks that I'm tracking. It might be useful to see if this has a cascading effect on inlining choices in some large codebases.

Compiling this simple program demonstrates the change in codegen that causes the perf impact:

```rust
fn main() {
    reserve(&mut Vec::new());
}

#[inline(never)]
fn reserve(v: &mut Vec<u8>) {
    v.reserve(1234);
}
```

Before:

```asm
00000000000069b0 <scratch::reserve>:
    69b0:       53                      push   %rbx
    69b1:       48 83 ec 30             sub    $0x30,%rsp
    69b5:       48 8b 47 08             mov    0x8(%rdi),%rax
    69b9:       48 8b 4f 10             mov    0x10(%rdi),%rcx
    69bd:       48 89 c2                mov    %rax,%rdx
    69c0:       48 29 ca                sub    %rcx,%rdx
    69c3:       48 81 fa d1 04 00 00    cmp    $0x4d1,%rdx
    69ca:       77 73                   ja     6a3f <scratch::reserve+0x8f>
    69cc:       48 81 c1 d2 04 00 00    add    $0x4d2,%rcx
    69d3:       72 75                   jb     6a4a <scratch::reserve+0x9a>
    69d5:       48 89 fb                mov    %rdi,%rbx
    69d8:       48 8d 14 00             lea    (%rax,%rax,1),%rdx
    69dc:       48 39 ca                cmp    %rcx,%rdx
    69df:       48 0f 47 ca             cmova  %rdx,%rcx
    69e3:       48 83 f9 08             cmp    $0x8,%rcx
    69e7:       be 08 00 00 00          mov    $0x8,%esi
    69ec:       48 0f 47 f1             cmova  %rcx,%rsi
    69f0:       48 85 c0                test   %rax,%rax
    69f3:       74 17                   je     6a0c <scratch::reserve+0x5c>
    69f5:       48 8b 0b                mov    (%rbx),%rcx
    69f8:       48 89 0c 24             mov    %rcx,(%rsp)
    69fc:       48 89 44 24 08          mov    %rax,0x8(%rsp)
    6a01:       48 c7 44 24 10 01 00    movq   $0x1,0x10(%rsp)
    6a08:       00 00
    6a0a:       eb 08                   jmp    6a14 <scratch::reserve+0x64>
    6a0c:       48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
    6a13:       00
    6a14:       48 8d 7c 24 18          lea    0x18(%rsp),%rdi
    6a19:       48 89 e1                mov    %rsp,%rcx
    6a1c:       ba 01 00 00 00          mov    $0x1,%edx
    6a21:       e8 9a fe ff ff          call   68c0 <alloc::raw_vec::finish_grow>
    6a26:       48 8b 7c 24 20          mov    0x20(%rsp),%rdi
    6a2b:       48 8b 74 24 28          mov    0x28(%rsp),%rsi
    6a30:       48 83 7c 24 18 01       cmpq   $0x1,0x18(%rsp)
    6a36:       74 0d                   je     6a45 <scratch::reserve+0x95>
    6a38:       48 89 3b                mov    %rdi,(%rbx)
    6a3b:       48 89 73 08             mov    %rsi,0x8(%rbx)
    6a3f:       48 83 c4 30             add    $0x30,%rsp
    6a43:       5b                      pop    %rbx
    6a44:       c3                      ret
    6a45:       48 85 f6                test   %rsi,%rsi
    6a48:       75 08                   jne    6a52 <scratch::reserve+0xa2>
    6a4a:       ff 15 38 c4 03 00       call   *0x3c438(%rip)        # 42e88 <_GLOBAL_OFFSET_TABLE_+0x490>
    6a50:       0f 0b                   ud2
    6a52:       ff 15 f0 c4 03 00       call   *0x3c4f0(%rip)        # 42f48 <_GLOBAL_OFFSET_TABLE_+0x550>
    6a58:       0f 0b                   ud2
    6a5a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
```

After:

```asm
0000000000006910 <scratch::reserve>:
    6910:       48 8b 47 08             mov    0x8(%rdi),%rax
    6914:       48 8b 77 10             mov    0x10(%rdi),%rsi
    6918:       48 29 f0                sub    %rsi,%rax
    691b:       48 3d d1 04 00 00       cmp    $0x4d1,%rax
    6921:       77 05                   ja     6928 <scratch::reserve+0x18>
    6923:       e9 e8 fe ff ff          jmp    6810 <alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle>
    6928:       c3                      ret
    6929:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
```

@rust-highfive (Collaborator) commented:

r? @kennytm

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label Mar 21, 2021
@rust-log-analyzer: This comment has been minimized.

@the8472 (Member) commented Mar 21, 2021:

> This also produces perf increases on the order of 3-5% in a few other microbenchmarks that I'm tracking.

alloc has a whole array of benchmarks for vec too. You can run them via `./x.py bench library/alloc --test-args vec::` and diff the results with `cargo benchcmp`.

@saethlin (Member, Author) commented Mar 21, 2021:

> alloc has a whole array of benchmarks for vec too

old:

```
test vec::bench_chain_chain_collect                      ... bench:       1,551 ns/iter (+/- 18)
test vec::bench_chain_collect                            ... bench:       1,671 ns/iter (+/- 4)
test vec::bench_chain_extend_ref                         ... bench:       1,669 ns/iter (+/- 12)
test vec::bench_chain_extend_value                       ... bench:       1,647 ns/iter (+/- 10)
test vec::bench_clone_0000                               ... bench:           4 ns/iter (+/- 0)
test vec::bench_clone_0010                               ... bench:          18 ns/iter (+/- 1) = 555 MB/s
test vec::bench_clone_0100                               ... bench:          61 ns/iter (+/- 5) = 1639 MB/s
test vec::bench_clone_1000                               ... bench:         450 ns/iter (+/- 10) = 2222 MB/s
test vec::bench_clone_from_01_0000_0000                  ... bench:          10 ns/iter (+/- 0)
test vec::bench_clone_from_01_0000_0010                  ... bench:          22 ns/iter (+/- 0) = 454 MB/s
test vec::bench_clone_from_01_0000_0100                  ... bench:          74 ns/iter (+/- 4) = 1351 MB/s
test vec::bench_clone_from_01_0000_1000                  ... bench:         484 ns/iter (+/- 42) = 2066 MB/s
test vec::bench_clone_from_01_0010_0000                  ... bench:          10 ns/iter (+/- 0)
test vec::bench_clone_from_01_0010_0010                  ... bench:          22 ns/iter (+/- 1) = 454 MB/s
test vec::bench_clone_from_01_0010_0100                  ... bench:          73 ns/iter (+/- 5) = 1369 MB/s
test vec::bench_clone_from_01_0100_0010                  ... bench:          22 ns/iter (+/- 0) = 454 MB/s
test vec::bench_clone_from_01_0100_0100                  ... bench:          73 ns/iter (+/- 3) = 1369 MB/s
test vec::bench_clone_from_01_0100_1000                  ... bench:         493 ns/iter (+/- 27) = 2028 MB/s
test vec::bench_clone_from_01_1000_0100                  ... bench:          73 ns/iter (+/- 5) = 1369 MB/s
test vec::bench_clone_from_01_1000_1000                  ... bench:         495 ns/iter (+/- 64) = 2020 MB/s
test vec::bench_clone_from_10_0000_0000                  ... bench:          77 ns/iter (+/- 0)
test vec::bench_clone_from_10_0000_0010                  ... bench:         117 ns/iter (+/- 9) = 854 MB/s
test vec::bench_clone_from_10_0000_0100                  ... bench:         559 ns/iter (+/- 23) = 1788 MB/s
test vec::bench_clone_from_10_0000_1000                  ... bench:       4,135 ns/iter (+/- 175) = 2418 MB/s
test vec::bench_clone_from_10_0010_0000                  ... bench:          76 ns/iter (+/- 0)
test vec::bench_clone_from_10_0010_0010                  ... bench:         118 ns/iter (+/- 2) = 847 MB/s
test vec::bench_clone_from_10_0010_0100                  ... bench:         569 ns/iter (+/- 74) = 1757 MB/s
test vec::bench_clone_from_10_0100_0010                  ... bench:         121 ns/iter (+/- 1) = 826 MB/s
test vec::bench_clone_from_10_0100_0100                  ... bench:         568 ns/iter (+/- 80) = 1760 MB/s
test vec::bench_clone_from_10_0100_1000                  ... bench:       4,109 ns/iter (+/- 216) = 2433 MB/s
test vec::bench_clone_from_10_1000_0100                  ... bench:         553 ns/iter (+/- 73) = 1808 MB/s
test vec::bench_clone_from_10_1000_1000                  ... bench:       4,125 ns/iter (+/- 197) = 2424 MB/s
test vec::bench_dedup_new_100                            ... bench:          49 ns/iter (+/- 1) = 8163 MB/s
test vec::bench_dedup_new_1000                           ... bench:         527 ns/iter (+/- 13) = 7590 MB/s
test vec::bench_dedup_new_10000                          ... bench:       5,014 ns/iter (+/- 33) = 7977 MB/s
test vec::bench_dedup_new_100000                         ... bench:     324,742 ns/iter (+/- 1,872) = 1231 MB/s
test vec::bench_dedup_old_100                            ... bench:         102 ns/iter (+/- 0) = 3921 MB/s
test vec::bench_dedup_old_1000                           ... bench:         782 ns/iter (+/- 17) = 5115 MB/s
test vec::bench_dedup_old_10000                          ... bench:      10,994 ns/iter (+/- 86) = 3638 MB/s
test vec::bench_dedup_old_100000                         ... bench:     365,853 ns/iter (+/- 2,049) = 1093 MB/s
test vec::bench_extend_0000_0000                         ... bench:          10 ns/iter (+/- 0)
test vec::bench_extend_0000_0010                         ... bench:          35 ns/iter (+/- 2) = 285 MB/s
test vec::bench_extend_0000_0100                         ... bench:          99 ns/iter (+/- 9) = 1010 MB/s
test vec::bench_extend_0000_1000                         ... bench:         534 ns/iter (+/- 24) = 1872 MB/s
test vec::bench_extend_0010_0010                         ... bench:          85 ns/iter (+/- 1) = 117 MB/s
test vec::bench_extend_0100_0100                         ... bench:         185 ns/iter (+/- 3) = 540 MB/s
test vec::bench_extend_1000_1000                         ... bench:       1,255 ns/iter (+/- 3) = 796 MB/s
test vec::bench_extend_from_slice_0000_0000              ... bench:           8 ns/iter (+/- 0)
test vec::bench_extend_from_slice_0000_0010              ... bench:          24 ns/iter (+/- 1) = 416 MB/s
test vec::bench_extend_from_slice_0000_0100              ... bench:          70 ns/iter (+/- 4) = 1428 MB/s
test vec::bench_extend_from_slice_0000_1000              ... bench:         451 ns/iter (+/- 3) = 2217 MB/s
test vec::bench_extend_from_slice_0010_0010              ... bench:          64 ns/iter (+/- 0) = 156 MB/s
test vec::bench_extend_from_slice_0100_0100              ... bench:         172 ns/iter (+/- 1) = 581 MB/s
test vec::bench_extend_from_slice_1000_1000              ... bench:         915 ns/iter (+/- 19) = 1092 MB/s
test vec::bench_extend_recycle                           ... bench:          68 ns/iter (+/- 0)
test vec::bench_from_elem_0000                           ... bench:           3 ns/iter (+/- 0)
test vec::bench_from_elem_0010                           ... bench:          19 ns/iter (+/- 0) = 526 MB/s
test vec::bench_from_elem_0100                           ... bench:          66 ns/iter (+/- 2) = 1515 MB/s
test vec::bench_from_elem_1000                           ... bench:         512 ns/iter (+/- 2) = 1953 MB/s
test vec::bench_from_fn_0000                             ... bench:           3 ns/iter (+/- 0)
test vec::bench_from_fn_0010                             ... bench:          19 ns/iter (+/- 2) = 526 MB/s
test vec::bench_from_fn_0100                             ... bench:          73 ns/iter (+/- 0) = 1369 MB/s
test vec::bench_from_fn_1000                             ... bench:         620 ns/iter (+/- 5) = 1612 MB/s
test vec::bench_from_iter_0000                           ... bench:           5 ns/iter (+/- 0)
test vec::bench_from_iter_0010                           ... bench:          20 ns/iter (+/- 1) = 500 MB/s
test vec::bench_from_iter_0100                           ... bench:          64 ns/iter (+/- 4) = 1562 MB/s
test vec::bench_from_iter_1000                           ... bench:         456 ns/iter (+/- 22) = 2192 MB/s
test vec::bench_from_slice_0000                          ... bench:           6 ns/iter (+/- 0)
test vec::bench_from_slice_0010                          ... bench:          28 ns/iter (+/- 1) = 357 MB/s
test vec::bench_from_slice_0100                          ... bench:          79 ns/iter (+/- 2) = 1265 MB/s
test vec::bench_from_slice_1000                          ... bench:         535 ns/iter (+/- 6) = 1869 MB/s
test vec::bench_in_place_collect_droppable               ... bench:       1,838 ns/iter (+/- 14)
test vec::bench_in_place_recycle                         ... bench:         134 ns/iter (+/- 0)
test vec::bench_in_place_u128_0010_i0                    ... bench:          49 ns/iter (+/- 0)
test vec::bench_in_place_u128_0010_i1                    ... bench:          17 ns/iter (+/- 0)
test vec::bench_in_place_u128_0100_i0                    ... bench:          70 ns/iter (+/- 0)
test vec::bench_in_place_u128_0100_i1                    ... bench:         112 ns/iter (+/- 0)
test vec::bench_in_place_u128_1000_i0                    ... bench:         392 ns/iter (+/- 4)
test vec::bench_in_place_u128_1000_i1                    ... bench:         732 ns/iter (+/- 4)
test vec::bench_in_place_xu32_0010_i0                    ... bench:          18 ns/iter (+/- 0)
test vec::bench_in_place_xu32_0010_i1                    ... bench:          16 ns/iter (+/- 0)
test vec::bench_in_place_xu32_0100_i0                    ... bench:          58 ns/iter (+/- 1)
test vec::bench_in_place_xu32_0100_i1                    ... bench:          27 ns/iter (+/- 0)
test vec::bench_in_place_xu32_1000_i0                    ... bench:         147 ns/iter (+/- 3)
test vec::bench_in_place_xu32_1000_i1                    ... bench:         167 ns/iter (+/- 2)
test vec::bench_in_place_xxu8_0010_i0                    ... bench:          18 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_0010_i1                    ... bench:          12 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_0100_i0                    ... bench:          21 ns/iter (+/- 2)
test vec::bench_in_place_xxu8_0100_i1                    ... bench:          12 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_1000_i0                    ... bench:          60 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_1000_i1                    ... bench:          36 ns/iter (+/- 0)
test vec::bench_in_place_zip_iter_mut                    ... bench:         123 ns/iter (+/- 0)
test vec::bench_in_place_zip_recycle                     ... bench:          33 ns/iter (+/- 0)
test vec::bench_map_fast                                 ... bench:       3,898 ns/iter (+/- 22)
test vec::bench_map_regular                              ... bench:       3,895 ns/iter (+/- 18)
test vec::bench_nest_chain_chain_collect                 ... bench:       1,146 ns/iter (+/- 2)
test vec::bench_new                                      ... bench:           0 ns/iter (+/- 0)
test vec::bench_range_map_collect                        ... bench:         523 ns/iter (+/- 3)
test vec::bench_rev_1                                    ... bench:       1,225 ns/iter (+/- 4)
test vec::bench_rev_2                                    ... bench:       1,221 ns/iter (+/- 3)
test vec::bench_with_capacity_0000                       ... bench:           1 ns/iter (+/- 0)
test vec::bench_with_capacity_0010                       ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test vec::bench_with_capacity_0100                       ... bench:           9 ns/iter (+/- 0) = 11111 MB/s
test vec::bench_with_capacity_1000                       ... bench:          44 ns/iter (+/- 1) = 22727 MB/s
```

new:

```
test vec::bench_chain_chain_collect                      ... bench:       1,192 ns/iter (+/- 11)
test vec::bench_chain_collect                            ... bench:       1,149 ns/iter (+/- 6)
test vec::bench_chain_extend_ref                         ... bench:       1,155 ns/iter (+/- 6)
test vec::bench_chain_extend_value                       ... bench:       1,152 ns/iter (+/- 6)
test vec::bench_clone_0000                               ... bench:           4 ns/iter (+/- 0)
test vec::bench_clone_0010                               ... bench:          18 ns/iter (+/- 1) = 555 MB/s
test vec::bench_clone_0100                               ... bench:          55 ns/iter (+/- 4) = 1818 MB/s
test vec::bench_clone_1000                               ... bench:         452 ns/iter (+/- 16) = 2212 MB/s
test vec::bench_clone_from_01_0000_0000                  ... bench:           9 ns/iter (+/- 0)
test vec::bench_clone_from_01_0000_0010                  ... bench:          23 ns/iter (+/- 1) = 434 MB/s
test vec::bench_clone_from_01_0000_0100                  ... bench:          72 ns/iter (+/- 2) = 1388 MB/s
test vec::bench_clone_from_01_0000_1000                  ... bench:         507 ns/iter (+/- 21) = 1972 MB/s
test vec::bench_clone_from_01_0010_0000                  ... bench:          10 ns/iter (+/- 0)
test vec::bench_clone_from_01_0010_0010                  ... bench:          21 ns/iter (+/- 1) = 476 MB/s
test vec::bench_clone_from_01_0010_0100                  ... bench:          72 ns/iter (+/- 11) = 1388 MB/s
test vec::bench_clone_from_01_0100_0010                  ... bench:          23 ns/iter (+/- 1) = 434 MB/s
test vec::bench_clone_from_01_0100_0100                  ... bench:          72 ns/iter (+/- 6) = 1388 MB/s
test vec::bench_clone_from_01_0100_1000                  ... bench:         509 ns/iter (+/- 44) = 1964 MB/s
test vec::bench_clone_from_01_1000_0100                  ... bench:          73 ns/iter (+/- 10) = 1369 MB/s
test vec::bench_clone_from_01_1000_1000                  ... bench:         509 ns/iter (+/- 23) = 1964 MB/s
test vec::bench_clone_from_10_0000_0000                  ... bench:          67 ns/iter (+/- 1)
test vec::bench_clone_from_10_0000_0010                  ... bench:         108 ns/iter (+/- 18) = 925 MB/s
test vec::bench_clone_from_10_0000_0100                  ... bench:         543 ns/iter (+/- 63) = 1841 MB/s
test vec::bench_clone_from_10_0000_1000                  ... bench:       4,165 ns/iter (+/- 202) = 2400 MB/s
test vec::bench_clone_from_10_0010_0000                  ... bench:          67 ns/iter (+/- 2)
test vec::bench_clone_from_10_0010_0010                  ... bench:         106 ns/iter (+/- 20) = 943 MB/s
test vec::bench_clone_from_10_0010_0100                  ... bench:         555 ns/iter (+/- 60) = 1801 MB/s
test vec::bench_clone_from_10_0100_0010                  ... bench:         112 ns/iter (+/- 17) = 892 MB/s
test vec::bench_clone_from_10_0100_0100                  ... bench:         560 ns/iter (+/- 50) = 1785 MB/s
test vec::bench_clone_from_10_0100_1000                  ... bench:       4,177 ns/iter (+/- 169) = 2394 MB/s
test vec::bench_clone_from_10_1000_0100                  ... bench:         573 ns/iter (+/- 34) = 1745 MB/s
test vec::bench_clone_from_10_1000_1000                  ... bench:       4,183 ns/iter (+/- 232) = 2390 MB/s
test vec::bench_dedup_new_100                            ... bench:          55 ns/iter (+/- 3) = 7272 MB/s
test vec::bench_dedup_new_1000                           ... bench:         767 ns/iter (+/- 16) = 5215 MB/s
test vec::bench_dedup_new_10000                          ... bench:       6,200 ns/iter (+/- 120) = 6451 MB/s
test vec::bench_dedup_new_100000                         ... bench:     324,144 ns/iter (+/- 2,379) = 1234 MB/s
test vec::bench_dedup_old_100                            ... bench:          77 ns/iter (+/- 1) = 5194 MB/s
test vec::bench_dedup_old_1000                           ... bench:         551 ns/iter (+/- 7) = 7259 MB/s
test vec::bench_dedup_old_10000                          ... bench:       8,669 ns/iter (+/- 47) = 4614 MB/s
test vec::bench_dedup_old_100000                         ... bench:     344,084 ns/iter (+/- 1,660) = 1162 MB/s
test vec::bench_extend_0000_0000                         ... bench:           9 ns/iter (+/- 0)
test vec::bench_extend_0000_0010                         ... bench:          36 ns/iter (+/- 0) = 277 MB/s
test vec::bench_extend_0000_0100                         ... bench:          90 ns/iter (+/- 4) = 1111 MB/s
test vec::bench_extend_0000_1000                         ... bench:         586 ns/iter (+/- 32) = 1706 MB/s
test vec::bench_extend_0010_0010                         ... bench:          93 ns/iter (+/- 3) = 107 MB/s
test vec::bench_extend_0100_0100                         ... bench:         201 ns/iter (+/- 3) = 497 MB/s
test vec::bench_extend_1000_1000                         ... bench:       1,256 ns/iter (+/- 9) = 796 MB/s
test vec::bench_extend_from_slice_0000_0000              ... bench:           6 ns/iter (+/- 0)
test vec::bench_extend_from_slice_0000_0010              ... bench:          23 ns/iter (+/- 0) = 434 MB/s
test vec::bench_extend_from_slice_0000_0100              ... bench:          68 ns/iter (+/- 13) = 1470 MB/s
test vec::bench_extend_from_slice_0000_1000              ... bench:         461 ns/iter (+/- 20) = 2169 MB/s
test vec::bench_extend_from_slice_0010_0010              ... bench:          74 ns/iter (+/- 0) = 135 MB/s
test vec::bench_extend_from_slice_0100_0100              ... bench:         178 ns/iter (+/- 5) = 561 MB/s
test vec::bench_extend_from_slice_1000_1000              ... bench:         941 ns/iter (+/- 12) = 1062 MB/s
test vec::bench_extend_recycle                           ... bench:          68 ns/iter (+/- 1)
test vec::bench_from_elem_0000                           ... bench:           2 ns/iter (+/- 0)
test vec::bench_from_elem_0010                           ... bench:          17 ns/iter (+/- 0) = 588 MB/s
test vec::bench_from_elem_0100                           ... bench:          62 ns/iter (+/- 8) = 1612 MB/s
test vec::bench_from_elem_1000                           ... bench:         518 ns/iter (+/- 7) = 1930 MB/s
test vec::bench_from_fn_0000                             ... bench:           2 ns/iter (+/- 0)
test vec::bench_from_fn_0010                             ... bench:          19 ns/iter (+/- 1) = 526 MB/s
test vec::bench_from_fn_0100                             ... bench:          79 ns/iter (+/- 18) = 1265 MB/s
test vec::bench_from_fn_1000                             ... bench:         636 ns/iter (+/- 4) = 1572 MB/s
test vec::bench_from_iter_0000                           ... bench:           5 ns/iter (+/- 0)
test vec::bench_from_iter_0010                           ... bench:          20 ns/iter (+/- 1) = 500 MB/s
test vec::bench_from_iter_0100                           ... bench:          64 ns/iter (+/- 4) = 1562 MB/s
test vec::bench_from_iter_1000                           ... bench:         451 ns/iter (+/- 7) = 2217 MB/s
test vec::bench_from_slice_0000                          ... bench:           6 ns/iter (+/- 0)
test vec::bench_from_slice_0010                          ... bench:          27 ns/iter (+/- 1) = 370 MB/s
test vec::bench_from_slice_0100                          ... bench:          78 ns/iter (+/- 3) = 1282 MB/s
test vec::bench_from_slice_1000                          ... bench:         557 ns/iter (+/- 29) = 1795 MB/s
test vec::bench_in_place_collect_droppable               ... bench:       1,764 ns/iter (+/- 8)
test vec::bench_in_place_recycle                         ... bench:         135 ns/iter (+/- 0)
test vec::bench_in_place_u128_0010_i0                    ... bench:          53 ns/iter (+/- 1)
test vec::bench_in_place_u128_0010_i1                    ... bench:          17 ns/iter (+/- 0)
test vec::bench_in_place_u128_0100_i0                    ... bench:          73 ns/iter (+/- 1)
test vec::bench_in_place_u128_0100_i1                    ... bench:         103 ns/iter (+/- 1)
test vec::bench_in_place_u128_1000_i0                    ... bench:         392 ns/iter (+/- 2)
test vec::bench_in_place_u128_1000_i1                    ... bench:         726 ns/iter (+/- 3)
test vec::bench_in_place_xu32_0010_i0                    ... bench:          18 ns/iter (+/- 0)
test vec::bench_in_place_xu32_0010_i1                    ... bench:          11 ns/iter (+/- 0)
test vec::bench_in_place_xu32_0100_i0                    ... bench:          56 ns/iter (+/- 2)
test vec::bench_in_place_xu32_0100_i1                    ... bench:          22 ns/iter (+/- 0)
test vec::bench_in_place_xu32_1000_i0                    ... bench:         138 ns/iter (+/- 5)
test vec::bench_in_place_xu32_1000_i1                    ... bench:         163 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_0010_i0                    ... bench:          18 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_0010_i1                    ... bench:          12 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_0100_i0                    ... bench:          24 ns/iter (+/- 1)
test vec::bench_in_place_xxu8_0100_i1                    ... bench:          12 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_1000_i0                    ... bench:          68 ns/iter (+/- 0)
test vec::bench_in_place_xxu8_1000_i1                    ... bench:          38 ns/iter (+/- 1)
test vec::bench_in_place_zip_iter_mut                    ... bench:         139 ns/iter (+/- 0)
test vec::bench_in_place_zip_recycle                     ... bench:          33 ns/iter (+/- 0)
test vec::bench_map_fast                                 ... bench:       3,902 ns/iter (+/- 11)
test vec::bench_map_regular                              ... bench:       3,895 ns/iter (+/- 26)
test vec::bench_nest_chain_chain_collect                 ... bench:       1,155 ns/iter (+/- 5)
test vec::bench_new                                      ... bench:           0 ns/iter (+/- 0)
test vec::bench_range_map_collect                        ... bench:         523 ns/iter (+/- 5)
test vec::bench_rev_1                                    ... bench:       1,231 ns/iter (+/- 8)
test vec::bench_rev_2                                    ... bench:       1,220 ns/iter (+/- 6)
test vec::bench_with_capacity_0000                       ... bench:           1 ns/iter (+/- 0)
test vec::bench_with_capacity_0010                       ... bench:           9 ns/iter (+/- 0) = 1111 MB/s
test vec::bench_with_capacity_0100                       ... bench:           9 ns/iter (+/- 0) = 11111 MB/s
test vec::bench_with_capacity_1000                       ... bench:          36 ns/iter (+/- 0) = 27777 MB/s
```

benchcmp:

```
 name                                    old ns/iter          new ns/iter          diff ns/iter   diff %  speedup
 vec::bench_chain_chain_collect          1,551                1,192                        -359  -23.15%   x 1.30
 vec::bench_chain_collect                1,671                1,149                        -522  -31.24%   x 1.45
 vec::bench_chain_extend_ref             1,669                1,155                        -514  -30.80%   x 1.45
 vec::bench_chain_extend_value           1,647                1,152                        -495  -30.05%   x 1.43
 vec::bench_clone_0100                   61 (1639 MB/s)       55 (1818 MB/s)                 -6   -9.84%   x 1.11
 vec::bench_clone_from_01_0000_0000      10                   9                              -1  -10.00%   x 1.11
 vec::bench_clone_from_01_0000_0010      22 (454 MB/s)        23 (434 MB/s)                   1    4.55%   x 0.96
 vec::bench_clone_from_01_0000_0100      74 (1351 MB/s)       72 (1388 MB/s)                 -2   -2.70%   x 1.03
 vec::bench_clone_from_01_0000_1000      484 (2066 MB/s)      507 (1972 MB/s)                23    4.75%   x 0.95
 vec::bench_clone_from_01_0010_0010      22 (454 MB/s)        21 (476 MB/s)                  -1   -4.55%   x 1.05
 vec::bench_clone_from_01_0010_0100      73 (1369 MB/s)       72 (1388 MB/s)                 -1   -1.37%   x 1.01
 vec::bench_clone_from_01_0100_0010      22 (454 MB/s)        23 (434 MB/s)                   1    4.55%   x 0.96
 vec::bench_clone_from_01_0100_0100      73 (1369 MB/s)       72 (1388 MB/s)                 -1   -1.37%   x 1.01
 vec::bench_clone_from_01_0100_1000      493 (2028 MB/s)      509 (1964 MB/s)                16    3.25%   x 0.97
 vec::bench_clone_from_01_1000_1000      495 (2020 MB/s)      509 (1964 MB/s)                14    2.83%   x 0.97
 vec::bench_clone_from_10_0000_0000      77                   67                            -10  -12.99%   x 1.15
 vec::bench_clone_from_10_0000_0010      117 (854 MB/s)       108 (925 MB/s)                 -9   -7.69%   x 1.08
 vec::bench_clone_from_10_0000_0100      559 (1788 MB/s)      543 (1841 MB/s)               -16   -2.86%   x 1.03
 vec::bench_clone_from_10_0010_0000      76                   67                             -9  -11.84%   x 1.13
 vec::bench_clone_from_10_0010_0010      118 (847 MB/s)       106 (943 MB/s)                -12  -10.17%   x 1.11
 vec::bench_clone_from_10_0010_0100      569 (1757 MB/s)      555 (1801 MB/s)               -14   -2.46%   x 1.03
 vec::bench_clone_from_10_0100_0010      121 (826 MB/s)       112 (892 MB/s)                 -9   -7.44%   x 1.08
 vec::bench_clone_from_10_0100_0100      568 (1760 MB/s)      560 (1785 MB/s)                -8   -1.41%   x 1.01
 vec::bench_clone_from_10_0100_1000      4,109 (2433 MB/s)    4,177 (2394 MB/s)              68    1.65%   x 0.98
 vec::bench_clone_from_10_1000_0100      553 (1808 MB/s)      573 (1745 MB/s)                20    3.62%   x 0.97
 vec::bench_clone_from_10_1000_1000      4,125 (2424 MB/s)    4,183 (2390 MB/s)              58    1.41%   x 0.99
 vec::bench_dedup_new_100                49 (8163 MB/s)       55 (7272 MB/s)                  6   12.24%   x 0.89
 vec::bench_dedup_new_1000               527 (7590 MB/s)      767 (5215 MB/s)               240   45.54%   x 0.69
 vec::bench_dedup_new_10000              5,014 (7977 MB/s)    6,200 (6451 MB/s)           1,186   23.65%   x 0.81
 vec::bench_dedup_old_100                102 (3921 MB/s)      77 (5194 MB/s)                -25  -24.51%   x 1.32
 vec::bench_dedup_old_1000               782 (5115 MB/s)      551 (7259 MB/s)              -231  -29.54%   x 1.42
 vec::bench_dedup_old_10000              10,994 (3638 MB/s)   8,669 (4614 MB/s)          -2,325  -21.15%   x 1.27
 vec::bench_dedup_old_100000             365,853 (1093 MB/s)  344,084 (1162 MB/s)       -21,769   -5.95%   x 1.06
 vec::bench_extend_0000_0000             10                   9                              -1  -10.00%   x 1.11
 vec::bench_extend_0000_0010             35 (285 MB/s)        36 (277 MB/s)                   1    2.86%   x 0.97
 vec::bench_extend_0000_0100             99 (1010 MB/s)       90 (1111 MB/s)                 -9   -9.09%   x 1.10
 vec::bench_extend_0000_1000             534 (1872 MB/s)      586 (1706 MB/s)                52    9.74%   x 0.91
 vec::bench_extend_0010_0010             85 (117 MB/s)        93 (107 MB/s)                   8    9.41%   x 0.91
 vec::bench_extend_0100_0100             185 (540 MB/s)       201 (497 MB/s)                 16    8.65%   x 0.92
 vec::bench_extend_from_slice_0000_0000  8                    6                              -2  -25.00%   x 1.33
 vec::bench_extend_from_slice_0000_0010  24 (416 MB/s)        23 (434 MB/s)                  -1   -4.17%   x 1.04
 vec::bench_extend_from_slice_0000_0100  70 (1428 MB/s)       68 (1470 MB/s)                 -2   -2.86%   x 1.03
 vec::bench_extend_from_slice_0000_1000  451 (2217 MB/s)      461 (2169 MB/s)                10    2.22%   x 0.98
 vec::bench_extend_from_slice_0010_0010  64 (156 MB/s)        74 (135 MB/s)                  10   15.62%   x 0.86
 vec::bench_extend_from_slice_0100_0100  172 (581 MB/s)       178 (561 MB/s)                  6    3.49%   x 0.97
 vec::bench_extend_from_slice_1000_1000  915 (1092 MB/s)      941 (1062 MB/s)                26    2.84%   x 0.97
 vec::bench_from_elem_0000               3                    2                              -1  -33.33%   x 1.50
 vec::bench_from_elem_0010               19 (526 MB/s)        17 (588 MB/s)                  -2  -10.53%   x 1.12
 vec::bench_from_elem_0100               66 (1515 MB/s)       62 (1612 MB/s)                 -4   -6.06%   x 1.06
 vec::bench_from_elem_1000               512 (1953 MB/s)      518 (1930 MB/s)                 6    1.17%   x 0.99
 vec::bench_from_fn_0000                 3                    2                              -1  -33.33%   x 1.50
 vec::bench_from_fn_0100                 73 (1369 MB/s)       79 (1265 MB/s)                  6    8.22%   x 0.92
 vec::bench_from_fn_1000                 620 (1612 MB/s)      636 (1572 MB/s)                16    2.58%   x 0.97
 vec::bench_from_iter_1000               456 (2192 MB/s)      451 (2217 MB/s)                -5   -1.10%   x 1.01
 vec::bench_from_slice_0010              28 (357 MB/s)        27 (370 MB/s)                  -1   -3.57%   x 1.04
 vec::bench_from_slice_0100              79 (1265 MB/s)       78 (1282 MB/s)                 -1   -1.27%   x 1.01
 vec::bench_from_slice_1000              535 (1869 MB/s)      557 (1795 MB/s)                22    4.11%   x 0.96
 vec::bench_in_place_collect_droppable   1,838                1,764                         -74   -4.03%   x 1.04
 vec::bench_in_place_u128_0010_i0        49                   53                              4    8.16%   x 0.92
 vec::bench_in_place_u128_0100_i0        70                   73                              3    4.29%   x 0.96
 vec::bench_in_place_u128_0100_i1        112                  103                            -9   -8.04%   x 1.09
 vec::bench_in_place_xu32_0010_i1        16                   11                             -5  -31.25%   x 1.45
 vec::bench_in_place_xu32_0100_i0        58                   56                             -2   -3.45%   x 1.04
 vec::bench_in_place_xu32_0100_i1        27                   22                             -5  -18.52%   x 1.23
 vec::bench_in_place_xu32_1000_i0        147                  138                            -9   -6.12%   x 1.07
 vec::bench_in_place_xu32_1000_i1        167                  163                            -4   -2.40%   x 1.02
 vec::bench_in_place_xxu8_0100_i0        21                   24                              3   14.29%   x 0.88
 vec::bench_in_place_xxu8_1000_i0        60                   68                              8   13.33%   x 0.88
 vec::bench_in_place_xxu8_1000_i1        36                   38                              2    5.56%   x 0.95
 vec::bench_in_place_zip_iter_mut        123                  139                            16   13.01%   x 0.88
 vec::bench_with_capacity_1000           44 (22727 MB/s)      36 (27777 MB/s)                -8  -18.18%   x 1.22
```

Honestly I don't know what to make of all this. Overall it looks okay, but I'm going to spend some time, maybe tomorrow, tracking down the regressions. Most of them seem repeatable, for what that's worth. But I think some fraction of them are unrelated to the fact that this patch pessimizes reserve calls that do cause an allocation.

@jyn514 added the I-slow (Issue: Problems and improvements with respect to performance of generated code.) and T-libs (Relevant to the library team, which will review and decide on the PR/issue.) labels Mar 22, 2021
The next review thread is on this hunk from the diff:

```rust
// Therefore, we move all the resizing and error-handling logic from grow_amortized and
// handle_reserve behind a call, while making sure that this function is likely to be
// inlined as just a comparison and a call if the comparison fails.
#[cold]
```
A Member commented:

Why mark this as cold instead of inline(never)? I would expect this to be called somewhat frequently in a normal program.

@the8472 (Member) commented Mar 22, 2021:

From what I have read in other comments, cold simply means LLVM's coldcc. So it minimizes the impact on the caller; isn't that the goal here? We assume the caller code to be hotter.

A Contributor commented:

In some places inside vec (and some other modules) it's annotated via

```rust
#[cold]
#[inline(never)]
```
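As a generic illustration of that pattern (a hypothetical example, not taken from the vec source), the pair of attributes is typically put on a rarely taken error path so the happy path stays compact:

```rust
// Hypothetical example of the #[cold] + #[inline(never)] pattern on an
// error path; illustrative only.
#[cold]
#[inline(never)]
fn capacity_overflow() -> ! {
    panic!("capacity overflow");
}

fn new_capacity(len: usize, additional: usize) -> usize {
    // The overflow branch is outlined and marked cold, so this function
    // compiles to little more than an add and a branch.
    len.checked_add(additional)
        .unwrap_or_else(|| capacity_overflow())
}

fn main() {
    assert_eq!(new_capacity(8, 8), 16);
}
```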

@saethlin (Member, Author) commented:

@jyn514 That's kind of right, and I'm taking a new look at this PR with the same goal of reducing the amount of time spent in reserve calls that don't cause any capacity change.

It looks to me like most of the Vec::reserve calls are guarded by a capacity check (such as the one in push), but some aren't. This manifests in the serde_json benchmarks because serializers do a lot of Write::write_all calls, which are implemented by Vec::extend_from_slice, which is implemented (eventually) by append_elements, which has an unguarded self.reserve. I'm assessing adding an `if self.len() + additional > self.capacity()` guard around the call in append_elements, and I'm going to look at either guarding all the reserve calls or sinking the check into Vec::reserve and deduplicating the check around many of the call sites in Vec.
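For illustration, a minimal sketch of that kind of guard (hypothetical function and names, not the actual append_elements code, which copies without re-checking capacity):

```rust
// Guarded reserve: only call `reserve` when the spare capacity really is
// insufficient, so the common case pays only for the comparison.
fn write_all(buf: &mut Vec<u8>, data: &[u8]) {
    if buf.len() + data.len() > buf.capacity() {
        buf.reserve(data.len());
    }
    // In the real code the copy itself would not re-check capacity;
    // `extend_from_slice` is used here just to keep the sketch safe and runnable.
    buf.extend_from_slice(data);
}

fn main() {
    let mut buf = Vec::with_capacity(64);
    write_all(&mut buf, b"hello, world");
    assert_eq!(buf, b"hello, world");
}
```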

@scottmcm (Member) commented Mar 31, 2024:

> From what I have read in other comments, cold simply means LLVM's coldcc.

Note that #[cold] is the cold function attribute: https://llvm.org/docs/LangRef.html#function-attributes

Changing the ABI (and thus having incompatible fn pointers) is extern "rust-cold" (#97544).

EDIT: Sorry, didn't realize this thread was so old 🤦 It was 5th on the "most recently updated" list I was looking at.

@kennytm added the S-waiting-on-author (Status: This is awaiting some action, such as code changes or more information, from the author.) label and removed the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label Mar 24, 2021
@saethlin (Member, Author) commented:

I've looked over the impact this patch has on the standard library benchmarks. Of those I've looked into, my overall conclusion is that these benchmarks mostly need help themselves, and the changes they report have relatively little relevance to the changes in this PR. I'll be looking into cleaning them up in the future.

@saethlin changed the title from "Mark RawVec::reserve as inline and outline the resizing logic" to "Reduce the impact of Vec::reserve calls that do not cause any allocation" Mar 24, 2021
@dtolnay (Member) left a comment:

Thanks, this is great!

@dtolnay (Member) commented Mar 30, 2021:

@bors r+ rollup=never

@bors (Contributor) commented Mar 30, 2021:

📌 Commit 73d7734 has been approved by dtolnay

@bors added the S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) label and removed the S-waiting-on-author (Status: This is awaiting some action, such as code changes or more information, from the author.) label Mar 30, 2021
@bors (Contributor) commented Mar 30, 2021:

⌛ Testing commit 73d7734 with merge 32d3276...

@bors (Contributor) commented Mar 30, 2021:

☀️ Test successful - checks-actions
Approved by: dtolnay
Pushing 32d3276 to master...

@bors added the merged-by-bors (This PR was explicitly merged by bors.) label Mar 30, 2021
@bors merged commit 32d3276 into rust-lang:master Mar 30, 2021
@rustbot added this to the 1.53.0 milestone Mar 30, 2021
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 30, 2021
Clean up Vec's benchmarks

The Vec benchmarks need a lot of love. I sort of noticed this in rust-lang#83357 but the overall situation is much less awesome than I thought at the time. The first commit just removes a lot of asserts and does a touch of other cleanup.

A number of these benchmarks are poorly named. For example, `bench_map_fast` is not in fact fast, `bench_rev_1` and `bench_rev_2` are vague, `bench_in_place_zip_iter_mut` doesn't call `zip`, `bench_in_place*` don't do anything in-place... Should I fix these, or is there tooling that depends on the names not changing?

I've also noticed that `bench_rev_1` and `bench_rev_2` are remarkably fragile. It looks like poking other code in `Vec` can cause the codegen of this benchmark to switch to a version that has almost exactly half its current throughput and I have absolutely no idea why.

Here's the fast version:
```asm
  0.69 │110:   movdqu -0x20(%rbx,%rdx,4),%xmm0
  1.76 │       movdqu -0x10(%rbx,%rdx,4),%xmm1
  0.71 │       pshufd $0x1b,%xmm1,%xmm1
  0.60 │       pshufd $0x1b,%xmm0,%xmm0
  3.68 │       movdqu %xmm1,-0x30(%rcx)
 14.36 │       movdqu %xmm0,-0x20(%rcx)
 13.88 │       movdqu -0x40(%rbx,%rdx,4),%xmm0
  6.64 │       movdqu -0x30(%rbx,%rdx,4),%xmm1
  0.76 │       pshufd $0x1b,%xmm1,%xmm1
  0.77 │       pshufd $0x1b,%xmm0,%xmm0
  1.87 │       movdqu %xmm1,-0x10(%rcx)
 13.01 │       movdqu %xmm0,(%rcx)
 38.81 │       add    $0x40,%rcx
  0.92 │       add    $0xfffffffffffffff0,%rdx
  1.22 │     ↑ jne    110
```
And the slow one:
```asm
  0.42 │9a880:   movdqa     %xmm2,%xmm1
  4.03 │9a884:   movq       -0x8(%rbx,%rsi,4),%xmm4
  8.49 │9a88a:   pshufd     $0xe1,%xmm4,%xmm4
  2.58 │9a88f:   movq       -0x10(%rbx,%rsi,4),%xmm5
  7.02 │9a895:   pshufd     $0xe1,%xmm5,%xmm5
  4.79 │9a89a:   punpcklqdq %xmm5,%xmm4
  5.77 │9a89e:   movdqu     %xmm4,-0x18(%rdx)
 15.74 │9a8a3:   movq       -0x18(%rbx,%rsi,4),%xmm4
  3.91 │9a8a9:   pshufd     $0xe1,%xmm4,%xmm4
  5.04 │9a8ae:   movq       -0x20(%rbx,%rsi,4),%xmm5
  5.29 │9a8b4:   pshufd     $0xe1,%xmm5,%xmm5
  4.60 │9a8b9:   punpcklqdq %xmm5,%xmm4
  9.81 │9a8bd:   movdqu     %xmm4,-0x8(%rdx)
 11.05 │9a8c2:   paddq      %xmm3,%xmm0
  0.86 │9a8c6:   paddq      %xmm3,%xmm2
  5.89 │9a8ca:   add        $0x20,%rdx
  0.12 │9a8ce:   add        $0xfffffffffffffff8,%rsi
  1.16 │9a8d2:   add        $0x2,%rdi
  2.96 │9a8d6: → jne        9a880 <<alloc::vec::Vec<T,A> as core::iter::traits::collect::Extend<&T>>::extend+0xd0>
```
@rylev (Member) commented Apr 1, 2021:

@saethlin @dtolnay This seems to have caused some performance issues in compilation. Can we talk through the cost/benefit of this change? It seemed like the results of benchmarking were inconclusive. I'm wondering if we're sure that this change has an overall positive impact.

@saethlin (Member, Author) commented Apr 1, 2021:

Yes, I think this change should probably be backed out in favor of the changes I was thinking of in this comment: #83357 (comment)

I'm not quite sure why this was merged. Maybe I should have kept it as a draft?

saethlin added a commit to saethlin/rust that referenced this pull request Apr 2, 2021
@saethlin deleted the vec-reserve-inlining branch May 16, 2022 04:36
@dtolnay assigned dtolnay and unassigned kennytm Mar 24, 2024