Slice equality is slow #16913

Closed
nham opened this issue Sep 1, 2014 · 13 comments
Labels
A-codegen (Area: Code generation), A-LLVM (Area: Code generation parts specific to LLVM; both correctness bugs and optimization-related issues), I-slow (Issue: Problems and improvements with respect to performance of generated code)

Comments

nham (Contributor) commented Sep 1, 2014

If I run the following code with rustc -O --test src/slice_equality_slow.rs && ./slice_equality_slow --bench:

extern crate test;

fn eq<'a, T: PartialEq>(a: &'a [T], b: &'a [T]) -> bool {
    if a.len() != b.len() {
        return false;
    }

    for i in range(0, a.len()) {
        if a[i] != b[i] {
            return false;
        }
    }

    true
}

#[cfg(test)]
mod bench {
    use test::Bencher;

    #[bench]
    fn builtin_eq(b: &mut Bencher) {
        let mut u = vec!();
        let mut v = vec!();
        for i in range(0u, 1_000_000) {
            u.push(i);
            v.push(i);
        }

        let x = u.as_slice();
        let y = v.as_slice();

        b.iter(|| {
            assert!(x == y);
        })
    }

    #[bench]
    fn custom_eq(b: &mut Bencher) {
        let mut u = vec!();
        let mut v = vec!();
        for i in range(0u, 1_000_000) {
            u.push(i);
            v.push(i);
        }

        let x = u.as_slice();
        let y = v.as_slice();

        b.iter(|| {
            assert!(super::eq(x, y));
        })
    }
}

fn main() {}

Then I get:

running 2 tests
test bench::builtin_eq ... bench:  12513842 ns/iter (+/- 64900)
test bench::custom_eq  ... bench:   7303767 ns/iter (+/- 304090)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

I ran into this because I was able to speed up the naive string matching algorithm in core::str by replacing the slice equality with a for loop and comparing each component manually.
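
For reference, a minimal sketch of that kind of change (illustrative names, current Rust syntax, not the actual core::str code): the inner comparison is an explicit per-element loop rather than a slice equality.

// Illustrative sketch only; not the core::str implementation.
fn contains_naive(haystack: &[u8], needle: &[u8]) -> bool {
    let (h, n) = (haystack.len(), needle.len());
    if n == 0 {
        return true;
    }
    if n > h {
        return false;
    }
    // Compare each candidate window element by element instead of writing
    // needle == &haystack[i..i + n].
    'outer: for i in 0..=h - n {
        for j in 0..n {
            if haystack[i + j] != needle[j] {
                continue 'outer;
            }
        }
        return true;
    }
    false
}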

@huonw huonw added the I-slow label Sep 1, 2014
ben0x539 (Contributor) commented Sep 2, 2014

The builtin_eq x == y comparison seems to take 3x as much time as memcmp, FWIW.

dotdash (Contributor) commented Sep 9, 2014

The call to eq is not even inlined at opt-level 2, only at level 3. Also, it seems that the null checks from the iterator aren't eliminated at either opt-level:

match_case.i:                                     ; preds = %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit"
  %sret_slot.sroa.0.0.i.lcssa310 = phi i8* [ %sret_slot.sroa.0.0.i, %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit" ]
  %60 = icmp eq i8* %sret_slot.sroa.0.0.i.lcssa310, null
  br i1 %60, label %.noexc76, label %then-block-191-.i.loopexit309

match_case6.i:                                    ; preds = %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit"
  %61 = icmp eq i8* %sret_slot.sroa.0.0.i, null
  br i1 %61, label %then-block-191-.i.loopexit, label %.noexc197

.noexc197:                                        ; preds = %match_case6.i
  %62 = bitcast i8* %sret_slot.sroa.0.0.i to i64*
  %63 = bitcast i8* %sret_slot.sroa.0.0.i201 to i64*
  %64 = load i64* %63, align 8
  %65 = load i64* %62, align 8
  %66 = icmp eq i64 %64, %65
  br i1 %66, label %loop_body.i, label %then-block-191-.i.loopexit

cc @zwarich -- Could the NullCheckElim pass handle that?

That aside, even the custom_eq version isn't optimal yet. Using a direct memcmp call I get:

test bench::builtin_eq ... bench:   1747797 ns/iter (+/- 22496)
test bench::custom_eq  ... bench:    964028 ns/iter (+/- 63657)
test bench::memcmp_eq  ... bench:    518912 ns/iter (+/- 2196)

AFAICT, LLVM has no optimization that turns such loops into memcmp calls yet (it's a TODO in the LoopIdiomRecognize transform), so we probably lose out because of that.

bluss (Member) commented Sep 9, 2014

The custom_eq method can be improved slightly by using raw pointers (C++-style iteration). But even then, the gap to memcmp is vast: for u8 elements it's about 10x.
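
A rough sketch of the raw-pointer variant (illustrative code, current Rust syntax; lengths checked up front as before):

// Illustrative sketch of raw-pointer ("C++ style") iteration.
fn ptr_eq<T: PartialEq>(a: &[T], b: &[T]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    unsafe {
        // Walk both slices with raw pointers instead of indexed access.
        let mut p = a.as_ptr();
        let mut q = b.as_ptr();
        let end = p.add(a.len());
        while p < end {
            if *p != *q {
                return false;
            }
            p = p.add(1);
            q = q.add(1);
        }
    }
    true
}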

dotdash (Contributor) commented Sep 10, 2014

Hm, I don't see that much of a difference for u8 elements. The (unmodified) custom_eq is about as fast as memcmp for me.

test bench::custom_eq  ... bench:    537754 ns/iter (+/- 8113)
test bench::memcmp_eq  ... bench:    516983 ns/iter (+/- 2994)

I implemented memcmp_eq like this:

fn memcmp_eq<'a, T: PartialEq>(a: &'a [T], b: &'a [T]) -> bool {
    if a.len() != b.len() {
        return false;
    }

    unsafe {
        // NOTE: memcmp takes a byte count, but a.len() is an element count,
        // so this is only equivalent to == for one-byte element types (u8).
        rlibc::memcmp(a.as_ptr() as *const _, b.as_ptr() as *const _, a.len()) == 0
    }
}

bluss (Member) commented Sep 10, 2014

Try libc's memcmp instead (rlibc's memcmp is a plain byte-at-a-time Rust implementation).
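
Something along these lines, say (a sketch assuming the libc crate; the length passed to memcmp has to be in bytes, and a byte-wise compare only matches PartialEq for types like the primitive integers):

extern crate libc;

use std::mem;

// Sketch assuming the libc crate; not a drop-in replacement for == in general.
fn libc_memcmp_eq<T>(a: &[T], b: &[T]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    unsafe {
        // memcmp compares raw bytes, so pass the length in bytes.
        libc::memcmp(
            a.as_ptr() as *const libc::c_void,
            b.as_ptr() as *const libc::c_void,
            a.len() * mem::size_of::<T>(),
        ) == 0
    }
}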

thestinger (Contributor) commented

The LLVM loop idiom pass doesn't know how to generate memcmp right now. That's where the problem needs to be fixed. Otherwise, you need to target the specific CPU directly to get good SIMD from LLVM.
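
For example, something along the lines of rustc -O -C target-cpu=native --test src/slice_equality_slow.rs && ./slice_equality_slow --bench (just an illustrative invocation of the benchmark from the original report) would let LLVM use the host CPU's vector instructions.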

@thestinger thestinger added A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Oct 10, 2014
@emberian emberian self-assigned this Mar 25, 2015
ranma42 (Contributor) commented Sep 16, 2015

Is this fixed by #26884?

bluss (Member) commented Sep 16, 2015

It seems to be fixed as far as the original report is concerned?

But slice equality still doesn't vectorize properly or compare well with glibc's memcmp (for byte slices), so improvements remain.

bluss (Member) commented Sep 16, 2015

@dotdash Not sure why it doesn't vectorize in nightly:

https://play.rust-lang.org/?gist=38c5ef4ccf66898cc261&version=nightly

pub fn compare(a: &[u8], b: &[u8]) -> bool {
    a == b
}

@emberian emberian removed their assignment Jan 5, 2016
arthurprs (Contributor) commented

As of today this is still true.

ranma42 (Contributor) commented Jan 23, 2016

Related bug in LLVM: https://llvm.org/bugs/show_bug.cgi?id=16332

Basically, the problem is that the LLVM vectoriser does not know how to optimise loops with multiple exits, or (equivalently) loops whose termination guards are not simple constraints on the index variable.
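
As an illustration of that constraint (illustrative code, not what rustc emits today): a comparison written with a single exit, i.e. no early return inside the loop, has a trip count known up front, a shape the vectoriser can in principle handle, at the cost of always scanning the whole slice.

// Illustrative sketch: a single-exit comparison loop.
pub fn eq_single_exit(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Fold every mismatch into one accumulator; the loop's only exit is
    // reaching the end of the slices.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}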

nagisa (Member) commented Mar 14, 2017

This seems to be fixed now via some specialisation (the slow comparison, that is, not the LLVM bug):

./memcmp --bench

running 2 tests
test bench::builtin_eq ... bench:     393,159 ns/iter (+/- 28,862)
test bench::custom_eq  ... bench:     581,997 ns/iter (+/- 42,720)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

aturon (Member) commented Mar 14, 2017

Thanks @nagisa, closing!

@aturon aturon closed this as completed Mar 14, 2017