x86_64 SSE2 fast-path for str.contains(&str) and short needles #103779
Conversation
r? @m-ou-se (rustbot has picked a reviewer for you, use r? to override)
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit f23374740b495d91f8a1efa767129fe5f34a521a with merge 00a98801b5deff4884bae13cf558c2f7e3beb9ca...
☀️ Try build successful - checks-actions
Queued 00a98801b5deff4884bae13cf558c2f7e3beb9ca with parent f42b6fa, future comparison URL.
Finished benchmarking commit (00a98801b5deff4884bae13cf558c2f7e3beb9ca): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.
Hmm, is this the first libcore function with architecture-specific SIMD intrinsics? At one point I know this was kinda discouraged. I think part of the reason was Miri (which we could check) and the non-LLVM codegen backends. This could be ported to `core::simd`. On the bright side, this would remove most of the unsafe!
I had kind of a hard time following some parts of this (mainly the index arithmetic). To understand it, I wrote out what the algorithm is doing. I think this has slightly less handwaving than the linked page, but that could be personal. For my own reference and others', I've put it here. (If you notice any cases where I have misunderstood something important about it, let me know; there are a few parts where I've made intentional simplifications to avoid getting into some tedious details, though.) The SIMD-optimized code is used for the part of the algorithm that applies to needles between 2..=8 bytes (0 bytes is impossible, 1 byte uses memchr), provided the haystack is sufficiently big.

I think some parts of this are clear in the code, but others aren't. The index arithmetic in particular isn't really (IMO), nor is the relationship between the byte indices in the search vectors and the offset in the chunks (which is important to understand why this works). The index arithmetic seems to be avoidable with iterators, which would clear a lot of this up (and also possibly might help perf by avoiding bounds checks?). (Along those lines, I also kind of find some of the terminology confusing.)

(I think I'm invested enough to want to take this review at this point...) r? @thomcc
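To make the walkthrough above concrete, here is a minimal scalar model of the probe-then-verify idea (the `contains_ref` name is hypothetical; this is a sketch of the idea, not the actual libcore code). The SIMD version performs the same two probe comparisons for 16 candidate offsets at once:

```rust
// Hypothetical scalar reference model of the algorithm discussed above.
fn contains_ref(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }
    let last = needle.len() - 1;
    for i in 0..=haystack.len() - needle.len() {
        // Probe: compare only the first and last needle bytes. The SIMD
        // version does this for 16 consecutive values of `i` per load
        // and turns the lane-wise result into a bitmask of candidates.
        if haystack[i] == needle[0] && haystack[i + last] == needle[last] {
            // Verify the whole window only on a probe hit.
            if &haystack[i..=i + last] == needle {
                return true;
            }
        }
    }
    false
}
```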
This is great (and has great numbers), but I'd like to see this get a couple (big-ish) changes. #103779 (comment) has some rationale for these.
- Use `core::simd` instead of manual `core::arch`. I think this is needed for Miri and non-LLVM codegen backends, but even if it is not, getting rid of the need for safety comments like `// SAFETY: no preconditions other than sse2 being available` on trivial stuff is kinda worth it IMO. (It's also arguably easier to maintain, and safe.) I'd probably say still keep the `cfg(all(target_arch = "x86_64", target_feature = "sse2"))` for now -- we probably don't want to start using `core::simd` on every target without careful thought, and I suspect even `aarch64` is likely to have sub-par codegen for this under `core::simd` (even without the LLVM issue, the algorithm would need some tweaks on targets without a `movemsk`-style instruction). A sketch of what the port might look like follows after this comment.
- I think using iterators would make the index arithmetic and iteration logic a lot more clear. As mentioned, I had to spend a bit of time figuring out what it was doing, and this would have helped some (maybe a lot) of that.
- I think I'd also like some (inline) comments to explain what it's doing at a high level (no need to explain how SIMD works ofc, but the actual algorithm is kinda galaxybrain IMO, so some notes would be nice). Feel free to punt on this til after the others, since they might make it so that a link is enough.
Feel free to push back on these if you disagree (for example, iterators are def not the only way to make this clearer)
CC @BurntSushi (who has talked about this from time to time before)
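As a concrete illustration of the first suggestion, here is a rough `core::simd` sketch of the probe step (nightly-only `portable_simd`; the `candidate_mask` name and signature are hypothetical, and the exact `std::simd` import paths have shifted between nightlies):

```rust
#![feature(portable_simd)]
use std::simd::{cmp::SimdPartialEq, u8x16};

// Compare 16 candidate positions at once: one chunk read at the window
// start, one at the window start plus the last-byte offset. The returned
// bitmask has one set bit per lane where both the first and the last
// needle byte matched; `to_bitmask()` plays the role of SSE2's
// `_mm_movemask_epi8`, and no `unsafe` is needed.
fn candidate_mask(chunk_first: [u8; 16], chunk_last: [u8; 16], first: u8, last: u8) -> u64 {
    let eq_first = u8x16::from_array(chunk_first).simd_eq(u8x16::splat(first));
    let eq_last = u8x16::from_array(chunk_last).simd_eq(u8x16::splat(last));
    (eq_first & eq_last).to_bitmask()
}
```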
That's not just you. This went through several rounds of fixing out-of-bounds and integer-overflow errors and lots of debug-printing to get the pieces aligned 😅
Seems like it is. At least Miri doesn't implement the vast majority of them, and so far that has not caused errors from stdlib functions, to my knowledge.
Not sure if you have seen it, but we already maintain a crate for this algorithm: https://github.com/cloudflare/sliceslice-rs
Thanks, I wasn't aware. I may steal some tweaks from there. Although the AVX parts wouldn't be useful in core unless we start shipping some x86-64-v3 toolchain.
If that makes it easier for rustc, we could split the SSE2 / SSE4.2 versions out of the AVX2 one.
@thomcc cg_clif should implement all intrinsics used in this PR; however, they are implemented using scalar emulation, with the exception of …
The main issue is that libcore is almost never recompiled and as such would never benefit from the SIMD at all. However, SSE2 is a mandatory feature for x86_64, so using SSE2 SIMD on x86_64 should be fine, I think.
In practical terms, I think this is a great addition and will make a lot of common cases faster. So 👍 from me. I do agree with @thomcc, though, that using `core::simd` would be preferable.

One concern I do have here is whether this is a correct optimization or not. My understanding was that even though SSE2 is part of the x86_64 baseline, targets can still choose to disable it.

Bigger picture, I'd like to point out that external crates such as `memchr` already maintain SIMD substring search routines. Now whether it makes sense to depend on an external crate for such a core part of std, I'm not sure about that. We could bring an implementation in-tree, but that has its own costs.

In terms of the algorithm choice, another option would be to use the SIMD routine as a prefilter for Two-Way in the case of longer needles. But that is a more invasive change, I think. (It is what the `memchr` crate does.)
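To visualize how these algorithm choices fit together, a hedged dispatch sketch (function names, the haystack threshold, and the placeholder bodies are illustrative stand-ins, not the actual libcore routines); the prefilter idea mentioned above would slot into the long-needle arm:

```rust
// Illustrative dispatch, not the actual libcore code.
fn str_contains(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    match n.len() {
        0 => true,
        1 => h.contains(&n[0]), // stands in for a memchr-style scan
        2..=8 if h.len() >= 16 => simd_contains(h, n),
        // Long needles: Two-Way, optionally with the SIMD probe as a prefilter.
        _ => two_way_contains(h, n),
    }
}

// Placeholder bodies so the sketch is self-contained; the real routines
// are the SSE2 search and core's Two-Way implementation.
fn simd_contains(h: &[u8], n: &[u8]) -> bool {
    two_way_contains(h, n)
}
fn two_way_contains(h: &[u8], n: &[u8]) -> bool {
    h.windows(n.len()).any(|w| w == n)
}
```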
Nothing can be a dependency of core, as everything depends on core. (Technically, LLVM already uses SSE2 everywhere (including for passing float function arguments, as per the x86-64 SysV ABI) without runtime checks, as the target specs for almost all x86_64 targets enable SSE2. If an OS disables SSE2, it will have to be disabled in the target spec, at which point the SSE2 fast path in this PR will simply be compiled out.)
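As a small illustration of that compile-time gating (hypothetical function names; a sketch, not the actual libcore dispatch): the `cfg` selects the implementation when the target spec enables SSE2, with no runtime feature detection needed.

```rust
// Hypothetical names; this only illustrates the cfg gating. A target
// spec that disables SSE2 compiles the fast-path variant out entirely,
// leaving only the fallback.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn contains_dispatch(haystack: &[u8], needle: &[u8]) -> bool {
    // The SSE2 fast path would be called here for short needles.
    fallback_contains(haystack, needle)
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn contains_dispatch(haystack: &[u8], needle: &[u8]) -> bool {
    fallback_contains(haystack, needle)
}

fn fallback_contains(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```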
⌛ Trying commit f29ca60f231c22fc341a90ba17ea69cf45b385ca with merge b926bd75207e52f30bfdaec8a2efdaae590de6b0... |
☀️ Try build successful - checks-actions
Finished benchmarking commit (b926bd75207e52f30bfdaec8a2efdaae590de6b0): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this. @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
@bors r=thomcc rollup=never

The slight compile-time regressions are probably unavoidable since this adds code paths that are dispatched at runtime. Doc benchmarks are still mostly green, but now below the noise threshold (compared to the initial benchmark). I think the microbenchmarks justify this change, and we can tweak things around the edges later.
@bors p=1 going to close the tree for non-nevers for a while so they can drain out
☀️ Test successful - checks-actions |
Finished benchmarking commit (9340e5c): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this. @rustbot label: +perf-regression

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
A few wins, a few losses, it roughly balances out.
@rustbot label: +perf-regression-triaged
Revert "…=thomcc": The current implementation seems to be unsound. See rust-lang#104726.
Simd contains fix

Fixes rust-lang#104726. The bug was introduced by an improvement late in the original PR (rust-lang#103779), which added backtracking when the last and first bytes of the needle were the same. That changed the meaning of the variable for the last probe offset, which I should have split into the last byte offset and the last probe offset. Not doing so led to incorrect loop conditions.
x86_64 SSE2 fast-path for str.contains(&str) and short needles

Based on Wojciech Muła's [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#sse-avx2).

The two-way algorithm is Big-O efficient, but it needs to preprocess the needle to find a "critical factorization" of it. This additional work is significant for short needles. Additionally, it mostly advances needle.len() bytes at a time. The SIMD-based approach used here can instead advance based on its vector width, which can exceed the needle length. Pathological cases are the exception, but since the fast path is limited to small needles, the worst-case blowup is also small.

Benchmarks taken on a Zen2, compiled with `-Ccodegen-units=1`:

```
OLD:
test str::bench_contains_16b_in_long      ... bench: 504 ns/iter (+/- 14) = 5061 MB/s
test str::bench_contains_2b_repeated_long ... bench: 948 ns/iter (+/- 175) = 2690 MB/s
test str::bench_contains_32b_in_long      ... bench: 445 ns/iter (+/- 6) = 5732 MB/s
test str::bench_contains_bad_naive        ... bench: 130 ns/iter (+/- 1) = 569 MB/s
test str::bench_contains_bad_simd         ... bench:  84 ns/iter (+/- 8) = 880 MB/s
test str::bench_contains_equal            ... bench: 142 ns/iter (+/- 7) = 394 MB/s
test str::bench_contains_short_long       ... bench: 677 ns/iter (+/- 25) = 3768 MB/s
test str::bench_contains_short_short      ... bench:  27 ns/iter (+/- 2) = 2074 MB/s

NEW:
test str::bench_contains_16b_in_long      ... bench:  82 ns/iter (+/- 0) = 31109 MB/s
test str::bench_contains_2b_repeated_long ... bench:  73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_32b_in_long      ... bench:  71 ns/iter (+/- 1) = 35929 MB/s
test str::bench_contains_bad_naive        ... bench:   7 ns/iter (+/- 0) = 10571 MB/s
test str::bench_contains_bad_simd         ... bench:  97 ns/iter (+/- 41) = 762 MB/s
test str::bench_contains_equal            ... bench:   4 ns/iter (+/- 0) = 14000 MB/s
test str::bench_contains_short_long       ... bench:  73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_short_short      ... bench:  12 ns/iter (+/- 0) = 4666 MB/s
```
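For readers who want to see the shape of the fast path, here is a condensed, hedged sketch of an SSE2 probe-and-verify loop in the style of Muła's algorithm (the `sse2_contains` name is illustrative; this omits the real implementation's unaligned-tail handling and the backtracking subtlety later fixed for rust-lang#104726):

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn sse2_contains(haystack: &[u8], needle: &[u8]) -> bool {
    use std::arch::x86_64::*;
    assert!(needle.len() >= 2, "1-byte needles go to memchr, 0 is trivial");
    let last = needle.len() - 1;
    let mut i = 0;
    // Process 16 candidate start positions per iteration while two full
    // 16-byte loads (at `i` and at `i + last`) stay in bounds.
    while i + last + 16 <= haystack.len() {
        // SAFETY: SSE2 is guaranteed by the cfg above, and the loop
        // condition keeps both unaligned loads inside `haystack`.
        let mut mask = unsafe {
            let first_v = _mm_set1_epi8(needle[0] as i8);
            let last_v = _mm_set1_epi8(needle[last] as i8);
            let a = _mm_loadu_si128(haystack.as_ptr().add(i).cast());
            let b = _mm_loadu_si128(haystack.as_ptr().add(i + last).cast());
            _mm_movemask_epi8(_mm_and_si128(
                _mm_cmpeq_epi8(a, first_v),
                _mm_cmpeq_epi8(b, last_v),
            )) as u32
        };
        // Each set bit marks a position where both probe bytes matched;
        // verify the full needle only at those candidates.
        while mask != 0 {
            let c = i + mask.trailing_zeros() as usize;
            if &haystack[c..=c + last] == needle {
                return true;
            }
            mask &= mask - 1; // clear the lowest set bit
        }
        i += 16;
    }
    // Scalar scan for the remaining tail (the real code instead does one
    // final unaligned, overlapping chunk).
    haystack[i..].windows(needle.len()).any(|w| w == needle)
}
```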