Slice equality is slow #16913

Closed
nham opened this issue Sep 1, 2014 · 13 comments
Labels
A-codegen (Area: Code generation), A-LLVM (Area: Code generation parts specific to LLVM; both correctness bugs and optimization-related issues), I-slow (Issue: Problems and improvements with respect to performance of generated code)

Comments

nham (Contributor) commented Sep 1, 2014

If I run the following code with rustc -O --test src/slice_equality_slow.rs && ./slice_equality_slow --bench:

extern crate test;

fn eq<'a, T: PartialEq>(a: &'a [T], b: &'a [T]) -> bool {
    if a.len() != b.len() {
        return false;
    }

    for i in range(0, a.len()) {
        if a[i] != b[i] {
            return false;
        }
    }

    true
}

#[cfg(test)]
mod bench {
    use test::Bencher;

    #[bench]
    fn builtin_eq(b: &mut Bencher) {
        let mut u = vec!();
        let mut v = vec!();
        for i in range(0u, 1_000_000) {
            u.push(i);
            v.push(i);
        }

        let x = u.as_slice();
        let y = v.as_slice();

        b.iter(|| {
            assert!(x == y);
        })
    }

    #[bench]
    fn custom_eq(b: &mut Bencher) {
        let mut u = vec!();
        let mut v = vec!();
        for i in range(0u, 1_000_000) {
            u.push(i);
            v.push(i);
        }

        let x = u.as_slice();
        let y = v.as_slice();

        b.iter(|| {
            assert!(super::eq(x, y));
        })
    }
}

fn main() {}

Then I get:

running 2 tests
test bench::builtin_eq ... bench:  12513842 ns/iter (+/- 64900)
test bench::custom_eq  ... bench:   7303767 ns/iter (+/- 304090)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

I ran into this because I was able to speed up the naive string matching algorithm in core::str by replacing the slice equality with a for loop and comparing each component manually.
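
For reference, a minimal sketch of that kind of change (illustrative names, current Rust syntax, not the actual core::str code): the inner comparison is an explicit per-element loop rather than a slice equality.

// Illustrative sketch only; not the core::str implementation.
fn contains_naive(haystack: &[u8], needle: &[u8]) -> bool {
    let (h, n) = (haystack.len(), needle.len());
    if n == 0 {
        return true;
    }
    if n > h {
        return false;
    }
    // Compare each candidate window element by element instead of writing
    // needle == &haystack[i..i + n].
    'outer: for i in 0..=h - n {
        for j in 0..n {
            if haystack[i + j] != needle[j] {
                continue 'outer;
            }
        }
        return true;
    }
    false
}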

@huonw huonw added the I-slow label Sep 1, 2014
ben0x539 (Contributor) commented Sep 2, 2014

The builtin_eq x == y comparison seems to take 3x as much time as memcmp, FWIW.

dotdash (Contributor) commented Sep 9, 2014

The call to eq is not even inlined at opt-level 2, only at level 3. Also, it seems that the null checks from the iterator aren't eliminated at either opt-level:

match_case.i:                                     ; preds = %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit"
  %sret_slot.sroa.0.0.i.lcssa310 = phi i8* [ %sret_slot.sroa.0.0.i, %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit" ]
  %60 = icmp eq i8* %sret_slot.sroa.0.0.i.lcssa310, null
  br i1 %60, label %.noexc76, label %then-block-191-.i.loopexit309

match_case6.i:                                    ; preds = %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit"
  %61 = icmp eq i8* %sret_slot.sroa.0.0.i, null
  br i1 %61, label %then-block-191-.i.loopexit, label %.noexc197

.noexc197:                                        ; preds = %match_case6.i
  %62 = bitcast i8* %sret_slot.sroa.0.0.i to i64*
  %63 = bitcast i8* %sret_slot.sroa.0.0.i201 to i64*
  %64 = load i64* %63, align 8
  %65 = load i64* %62, align 8
  %66 = icmp eq i64 %64, %65
  br i1 %66, label %loop_body.i, label %then-block-191-.i.loopexit

cc @zwarich -- Could the NullCheckElim pass handle that?

That aside, even the custom_eq version isn't optimal yet. Using a direct memcmp call I get:

test bench::builtin_eq ... bench:   1747797 ns/iter (+/- 22496)
test bench::custom_eq  ... bench:    964028 ns/iter (+/- 63657)
test bench::memcmp_eq  ... bench:    518912 ns/iter (+/- 2196)

AFAICT, LLVM has no optimization that turns such loops into memcmp calls yet (it's a TODO in the LoopIdiomRecognize transform), so we probably lose out because of that.

bluss (Member) commented Sep 9, 2014

The custom_eq method can be improved slightly by using raw pointers (C++-style iteration). But even then, the gap to memcmp is vast: for u8 elements it's about 10x.
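
A rough sketch of the raw-pointer variant (illustrative code, current Rust syntax; lengths checked up front as before):

// Illustrative sketch of raw-pointer ("C++ style") iteration.
fn ptr_eq<T: PartialEq>(a: &[T], b: &[T]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    unsafe {
        // Walk both slices with raw pointers instead of indexed access.
        let mut p = a.as_ptr();
        let mut q = b.as_ptr();
        let end = p.add(a.len());
        while p < end {
            if *p != *q {
                return false;
            }
            p = p.add(1);
            q = q.add(1);
        }
    }
    true
}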

dotdash (Contributor) commented Sep 10, 2014

Hm, I don't see that much of a difference for u8 elements. The (unmodified) custom_eq is about as fast as memcmp for me.

test bench::custom_eq  ... bench:    537754 ns/iter (+/- 8113)
test bench::memcmp_eq  ... bench:    516983 ns/iter (+/- 2994)

I implemented memcmp_eq like this:

fn memcmp_eq<'a, T: PartialEq>(a: &'a [T], b: &'a [T]) -> bool {
    if a.len() != b.len() {
        return false;
    }

    unsafe {
        // NOTE: memcmp takes a byte count, but a.len() is an element count,
        // so this is only equivalent to == for one-byte element types (u8).
        rlibc::memcmp(a.as_ptr() as *const _, b.as_ptr() as *const _, a.len()) == 0
    }
}

bluss (Member) commented Sep 10, 2014

Try libc's memcmp instead (rlibc's memcmp is a plain byte-at-a-time Rust implementation).
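
Something along these lines, say (a sketch assuming the libc crate; the length passed to memcmp has to be in bytes, and a byte-wise compare only matches PartialEq for types like the primitive integers):

extern crate libc;

use std::mem;

// Sketch assuming the libc crate; not a drop-in replacement for == in general.
fn libc_memcmp_eq<T>(a: &[T], b: &[T]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    unsafe {
        // memcmp compares raw bytes, so pass the length in bytes.
        libc::memcmp(
            a.as_ptr() as *const libc::c_void,
            b.as_ptr() as *const libc::c_void,
            a.len() * mem::size_of::<T>(),
        ) == 0
    }
}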

thestinger (Contributor) commented

The LLVM loop idiom pass doesn't know how to generate memcmp right now. That's where the problem needs to be fixed. Otherwise, you need to target the specific CPU directly to get good SIMD from LLVM.
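
For example, something along the lines of rustc -O -C target-cpu=native --test src/slice_equality_slow.rs && ./slice_equality_slow --bench (just an illustrative invocation of the benchmark from the original report) would let LLVM use the host CPU's vector instructions.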

@thestinger thestinger added A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Oct 10, 2014
@emberian emberian self-assigned this Mar 25, 2015
ranma42 (Contributor) commented Sep 16, 2015

Is this fixed by #26884?

bluss (Member) commented Sep 16, 2015

It seems to be fixed as far as the original report is concerned?

But slice equality still doesn't vectorize properly or compare well with glibc's memcmp (for byte slices), so improvements remain.

bluss (Member) commented Sep 16, 2015

@dotdash Not sure why it doesn't vectorize in nightly:

https://play.rust-lang.org/?gist=38c5ef4ccf66898cc261&version=nightly

pub fn compare(a: &[u8], b: &[u8]) -> bool {
    a == b
}

@emberian emberian removed their assignment Jan 5, 2016
arthurprs (Contributor) commented

As of today this is still true.

ranma42 (Contributor) commented Jan 23, 2016

Related bug in LLVM: https://llvm.org/bugs/show_bug.cgi?id=16332

Basically, the problem is that the LLVM vectoriser does not know how to optimise loops with multiple exits, or (equivalently) loops whose termination guards are not simple constraints on the index variable.
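
As an illustration of that constraint (illustrative code, not what rustc emits today): a comparison written with a single exit, i.e. no early return inside the loop, has a trip count known up front, a shape the vectoriser can in principle handle, at the cost of always scanning the whole slice.

// Illustrative sketch: a single-exit comparison loop.
pub fn eq_single_exit(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Fold every mismatch into one accumulator; the loop's only exit is
    // reaching the end of the slices.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}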

nagisa (Member) commented Mar 14, 2017

This seems to be fixed now via some specialisation (the slow comparison, that is, not the LLVM bug):

./memcmp --bench

running 2 tests
test bench::builtin_eq ... bench:     393,159 ns/iter (+/- 28,862)
test bench::custom_eq  ... bench:     581,997 ns/iter (+/- 42,720)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

aturon (Member) commented Mar 14, 2017

Thanks @nagisa, closing!

@aturon aturon closed this as completed Mar 14, 2017