-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
slice::rotate_left/right is not simplified for compile-time-constant small rotates #89714
Comments
#92967 should make it better in some cases (if |
@paolobarbolini you might be interested to know that this naive implementation outperforms (at least for u32) both the current std and the std fixed version: pub fn rorate_left_with_reverse(s: &mut [u32], mid: usize) {
assert!(mid <= s.len());
let _ = s[..mid].reverse();
let _ = s[mid..].reverse();
let _ = s.reverse();
} Benchmark with
x86_64
|
I know what's going on. My naive implementation doesn't use ASM of rorate_left_with_reverse (naive implementation)example::rorate_left_with_reverse:
push rbx
mov r8, rsi
sub r8, rdx
jb .LBB1_26
cmp rdx, 2
jb .LBB1_9
mov r9, rdx
shr r9
cmp r9, 1
jne .LBB1_4
xor ecx, ecx
test dl, 2
jne .LBB1_8
jmp .LBB1_9
.LBB1_4:
lea rax, [rdi + 4]
lea r10, [rdi + 4*rdx]
add r10, -4
and r9, -2
neg r9
xor ecx, ecx
.LBB1_5:
mov r11d, dword ptr [rax - 4]
mov ebx, dword ptr [r10 + 4*rcx]
mov dword ptr [rax - 4], ebx
mov dword ptr [r10 + 4*rcx], r11d
mov r11d, dword ptr [rax]
mov ebx, dword ptr [r10 + 4*rcx - 4]
mov dword ptr [rax], ebx
mov dword ptr [r10 + 4*rcx - 4], r11d
add rax, 8
add rcx, -2
cmp r9, rcx
jne .LBB1_5
neg rcx
test dl, 2
je .LBB1_9
.LBB1_8:
mov rax, rcx
not rax
add rax, rdx
mov r9d, dword ptr [rdi + 4*rcx]
mov ebx, dword ptr [rdi + 4*rax]
mov dword ptr [rdi + 4*rcx], ebx
mov dword ptr [rdi + 4*rax], r9d
.LBB1_9:
cmp r8, 2
jb .LBB1_17
mov r9, r8
shr r9
cmp r9, 1
jne .LBB1_12
xor ecx, ecx
test r8b, 2
jne .LBB1_16
jmp .LBB1_17
.LBB1_12:
lea rax, [rdi + 4*rdx]
add rax, 4
lea r10, [rdi + 4*rsi]
add r10, -4
and r9, -2
neg r9
xor ecx, ecx
.LBB1_13:
mov r11d, dword ptr [rax - 4]
mov ebx, dword ptr [r10 + 4*rcx]
mov dword ptr [rax - 4], ebx
mov dword ptr [r10 + 4*rcx], r11d
mov r11d, dword ptr [rax]
mov ebx, dword ptr [r10 + 4*rcx - 4]
mov dword ptr [rax], ebx
mov dword ptr [r10 + 4*rcx - 4], r11d
add rax, 8
add rcx, -2
cmp r9, rcx
jne .LBB1_13
neg rcx
test r8b, 2
je .LBB1_17
.LBB1_16:
lea rax, [rdi + 4*rdx]
mov rdx, rcx
not rdx
add r8, rdx
mov edx, dword ptr [rax + 4*rcx]
mov ebx, dword ptr [rax + 4*r8]
mov dword ptr [rax + 4*rcx], ebx
mov dword ptr [rax + 4*r8], edx
.LBB1_17:
cmp rsi, 2
jb .LBB1_25
mov r8, rsi
shr r8
cmp r8, 1
jne .LBB1_20
xor eax, eax
test sil, 2
jne .LBB1_24
jmp .LBB1_25
.LBB1_20:
lea rdx, [rdi + 4]
lea rcx, [rdi + 4*rsi]
add rcx, -4
and r8, -2
neg r8
xor eax, eax
.LBB1_21:
mov r9d, dword ptr [rdx - 4]
mov ebx, dword ptr [rcx + 4*rax]
mov dword ptr [rdx - 4], ebx
mov dword ptr [rcx + 4*rax], r9d
mov r9d, dword ptr [rdx]
mov ebx, dword ptr [rcx + 4*rax - 4]
mov dword ptr [rdx], ebx
mov dword ptr [rcx + 4*rax - 4], r9d
add rdx, 8
add rax, -2
cmp r8, rax
jne .LBB1_21
neg rax
test sil, 2
je .LBB1_25
.LBB1_24:
mov rcx, rax
not rcx
add rcx, rsi
mov edx, dword ptr [rdi + 4*rax]
mov esi, dword ptr [rdi + 4*rcx]
mov dword ptr [rdi + 4*rax], esi
mov dword ptr [rdi + 4*rcx], edx
.LBB1_25:
pop rbx
ret
.LBB1_26:
lea rdi, [rip + .L__unnamed_1]
lea rdx, [rip + .L__unnamed_2]
mov esi, 32
call qword ptr [rip + core::panicking::panic@GOTPCREL]
ud2 ASM of rotate_u8_right1 (fixed std version)example::rotate_u8_right1:
push rbp
push r14
push rbx
test rsi, rsi
je .LBB0_11
mov rdx, rsi
add rdx, -1
je .LBB0_10
mov r14, rdi
cmp rsi, 23
ja .LBB0_8
mov bpl, byte ptr [r14]
mov eax, 1
mov ecx, 1
sub rcx, rsi
xor esi, esi
mov edi, 1
jmp .LBB0_4
.LBB0_5:
add rdi, 1
.LBB0_4:
mov ebx, ebp
movzx ebp, byte ptr [r14 + rdi]
mov byte ptr [r14 + rdi], bl
cmp rdi, rdx
jb .LBB0_5
add rdi, rcx
je .LBB0_9
cmp rdi, rax
cmovb rax, rsi
cmovb rdi, rsi
jmp .LBB0_4
.LBB0_8:
lea rdi, [r14 + 1]
mov bpl, byte ptr [r14 + rdx]
mov rsi, r14
call qword ptr [rip + memmove@GOTPCREL]
.LBB0_9:
mov byte ptr [r14], bpl
.LBB0_10:
pop rbx
pop r14
pop rbp
ret
.LBB0_11:
lea rdi, [rip + .L__unnamed_1]
lea rdx, [rip + .L__unnamed_2]
mov esi, 34
call qword ptr [rip + core::panicking::panic@GOTPCREL]
ud2 |
I have a crate where the number of items to /// Same as `array[..=usize::from(n)].rotate_right(1);`
pub fn rotate_right1_fastest(array: &mut [u8; 256], n: u8) {
debug_assert!(n < 6);
let n = usize::from(n);
let b = array[n];
array[5 - usize::from(n < 5)] = array[4];
array[4 - usize::from(n < 4)] = array[3];
array[3 - usize::from(n < 3)] = array[2];
array[2 - usize::from(n < 2)] = array[1];
array[1 - usize::from(n < 1)] = array[0];
array[0] = b;
} ASMexample::rotate_right1_fastest:
movzx eax, sil
mov al, byte ptr [rdi + rax]
mov cl, byte ptr [rdi + 4]
cmp sil, 5
mov rdx, rdi
sbb rdx, 0
mov byte ptr [rdx + 5], cl
mov cl, byte ptr [rdi + 3]
cmp sil, 4
mov rdx, rdi
sbb rdx, 0
mov byte ptr [rdx + 4], cl
mov cl, byte ptr [rdi + 2]
cmp sil, 3
mov rdx, rdi
sbb rdx, 0
mov byte ptr [rdx + 3], cl
cmp sil, 2
mov rcx, rdi
sbb rcx, 0
mov dl, byte ptr [rdi + 1]
mov byte ptr [rcx + 2], dl
mov cl, byte ptr [rdi]
cmp sil, 1
mov rdx, rdi
sbb rdx, -1
mov byte ptr [rdx], cl
mov byte ptr [rdi], al
ret |
@Urgau Note that That said, I did recently rewrite Actually, maybe the |
@scottmcm Effectively, I didn't think of that. Current std:
Nearly original (ManuallyDrop + nits):
Nearly original + with original ptr_swap_n:
Naive implementation:
Conclusion, unsurprisingly the current implementation seems to be overall better. |
@rustbot label +I-slow |
There is another problem when the size of the type is big like for example struct BallastType {
ballast: [u64; 32], // the compiler keeps this, even if unused
value: usize,
} The algorithm will switch immediately to the I think the idea to use the I also noticed that when the I think, the |
The current (2021-10-08 nightly) and recent implementations of
slice::rotate_right
produce 15x more instructions than a manual special-case implementation for a small rotate by a compile-time constant offset. This results in a 5-50% performance difference, depending on the length of the slice in question. Godbolt demonstration: https://godbolt.org/z/a4YdqP1rn
I think the current implementation doesn't optimize well because it is a
loop {
surrounding three rotate algorithms with their own predicates, the first of which is almost never known at compile time:and it is the second algorithm that we want to run in this case:
Simply pasting a copy of the second algorithm above the loop removes the performance difference and most but not all of the code size difference.
cc @scottmcm I've finally gotten around to reporting this after months
The text was updated successfully, but these errors were encountered: