x86_64 SSE2 fast-path for str.contains(&str) and short needles #103779
Conversation
r? @m-ou-se (rustbot has picked a reviewer for you, use r? to override)
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit f23374740b495d91f8a1efa767129fe5f34a521a with merge 00a98801b5deff4884bae13cf558c2f7e3beb9ca...
☀️ Try build successful - checks-actions
Queued 00a98801b5deff4884bae13cf558c2f7e3beb9ca with parent f42b6fa, future comparison URL.
Finished benchmarking commit (00a98801b5deff4884bae13cf558c2f7e3beb9ca): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.
Hmm, is this the first libcore function with architecture-specific SIMD intrinsics? At one point I know this was kinda discouraged. I think part of the reason was Miri (which we could check) and the non-LLVM codegen backends. This could be ported to `core::simd`. On the bright side, this would remove most of the unsafe!
I had kind of a hard time following some parts of this (mainly the index arithmetic). To understand it, I wrote out what the algorithm is doing. I think this has slightly less handwaving than the linked page, but that could be personal. For my own reference and others', I've put it here. (If you notice any cases where I have misunderstood something important about it, let me know; there are a few parts where I've made intentional simplifications to avoid getting into some tedious details, though.) The SIMD-optimized code is used for the part of the algorithm that applies to needles between 2..=8 bytes (0 bytes is impossible, 1 byte uses memchr), provided the haystack is sufficiently big.

I think some parts of this are clear in the code, but others aren't. The index arithmetic in particular isn't really (IMO), nor is the relationship between the byte indices in the search vectors and the offset in the chunks (which is important to understand why this works). The index arithmetic seems to be avoidable with iterators, which would clear a lot of this up (and also possibly might help perf by avoiding bounds checks?). (Along those lines, I also kind of find some of the terminology confusing.)

(I think I'm invested enough to want to take this review at this point...) r? @thomcc
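To make the walkthrough above concrete, here is a minimal scalar model of the probe-then-verify idea (the `contains_ref` name is hypothetical; this is a sketch of the idea, not the actual libcore code). The SIMD version performs the same two probe comparisons for 16 candidate offsets at once:

```rust
// Hypothetical scalar reference model of the algorithm discussed above.
fn contains_ref(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }
    let last = needle.len() - 1;
    for i in 0..=haystack.len() - needle.len() {
        // Probe: compare only the first and last needle bytes. The SIMD
        // version does this for 16 consecutive values of `i` per load
        // and turns the lane-wise result into a bitmask of candidates.
        if haystack[i] == needle[0] && haystack[i + last] == needle[last] {
            // Verify the whole window only on a probe hit.
            if &haystack[i..=i + last] == needle {
                return true;
            }
        }
    }
    false
}
```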
This is great (and has great numbers), but I'd like to see this get a couple (big-ish) changes. #103779 (comment) has some rationale for these.
- Use `core::simd` instead of manual `core::arch`. I think this is needed for Miri and non-LLVM codegen backends, but even if it is not, getting rid of the need for safety comments like `// SAFETY: no preconditions other than sse2 being available` on trivial stuff is kinda worth it IMO. (It's also arguably easier to maintain, and safe.) I'd probably say still keep the `cfg(all(target_arch = "x86_64", target_feature = "sse2"))` for now -- we probably don't want to start using `core::simd` on every target without careful thought, and I suspect even `aarch64` is likely to have sub-par codegen for this under `core::simd` (even without the LLVM issue, the algorithm would need some tweaks on targets without a `movemsk`-style instruction). A sketch of what the port might look like follows after this comment.
- I think using iterators would make the index arithmetic and iteration logic a lot more clear. As mentioned, I had to spend a bit of time figuring out what it was doing, and this would have helped some (maybe a lot) of that.
- I think I'd also like some (inline) comments to explain what it's doing at a high level (no need to explain how SIMD works ofc, but the actual algorithm is kinda galaxybrain IMO, so some notes would be nice). Feel free to punt on this til after the others, since they might make it so that a link is enough.
Feel free to push back on these if you disagree (for example, iterators are def not the only way to make this clearer)
CC @BurntSushi (who has talked about this from time to time before)
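As a concrete illustration of the first suggestion, here is a rough `core::simd` sketch of the probe step (nightly-only `portable_simd`; the `candidate_mask` name and signature are hypothetical, and the exact `std::simd` import paths have shifted between nightlies):

```rust
#![feature(portable_simd)]
use std::simd::{cmp::SimdPartialEq, u8x16};

// Compare 16 candidate positions at once: one chunk read at the window
// start, one at the window start plus the last-byte offset. The returned
// bitmask has one set bit per lane where both the first and the last
// needle byte matched; `to_bitmask()` plays the role of SSE2's
// `_mm_movemask_epi8`, and no `unsafe` is needed.
fn candidate_mask(chunk_first: [u8; 16], chunk_last: [u8; 16], first: u8, last: u8) -> u64 {
    let eq_first = u8x16::from_array(chunk_first).simd_eq(u8x16::splat(first));
    let eq_last = u8x16::from_array(chunk_last).simd_eq(u8x16::splat(last));
    (eq_first & eq_last).to_bitmask()
}
```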
That's not just you. This went through several rounds of fixing out-of-bounds and integer-overflow errors and lots of debug-printing to get the pieces aligned 😅
Seems like it is. At least Miri doesn't implement the vast majority of them, and so far that has not caused errors from stdlib functions, to my knowledge.
Not sure if you have seen it, but we already maintain a crate for this algorithm: https://github.com/cloudflare/sliceslice-rs
Thanks, I wasn't aware. I may steal some tweaks from there. Although the AVX parts wouldn't be useful in core unless we start shipping some x86-64-v3 toolchain.
If that makes it easier for rustc, we could split the SSE2 / SSE4.2 versions out of the AVX2 one.
@thomcc cg_clif should implement all intrinsics used in this PR; however, they are implemented using scalar emulation, with the exception of …
The main issue is that libcore is almost never recompiled and as such would never benefit from the SIMD at all. However, SSE2 is a mandatory feature for x86_64, so using SSE2 SIMD on x86_64 should be fine, I think.
In practical terms, I think this is a great addition and will make a lot of common cases faster. So 👍 from me. I do agree with @thomcc, though, that using `core::simd` would be preferable.

One concern I do have here is whether this is a correct optimization or not. My understanding was that even though SSE2 is part of the x86_64 baseline, targets can still choose to disable it.

Bigger picture, I'd like to point out that external crates such as `memchr` already maintain SIMD substring search routines. Now whether it makes sense to depend on an external crate for such a core part of std, I'm not sure about that. We could bring an implementation in-tree, but that has its own costs.

In terms of the algorithm choice, another option would be to use the SIMD routine as a prefilter for Two-Way in the case of longer needles. But that is a more invasive change, I think. (It is what the `memchr` crate does.)
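To visualize how these algorithm choices fit together, a hedged dispatch sketch (function names, the haystack threshold, and the placeholder bodies are illustrative stand-ins, not the actual libcore routines); the prefilter idea mentioned above would slot into the long-needle arm:

```rust
// Illustrative dispatch, not the actual libcore code.
fn str_contains(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    match n.len() {
        0 => true,
        1 => h.contains(&n[0]), // stands in for a memchr-style scan
        2..=8 if h.len() >= 16 => simd_contains(h, n),
        // Long needles: Two-Way, optionally with the SIMD probe as a prefilter.
        _ => two_way_contains(h, n),
    }
}

// Placeholder bodies so the sketch is self-contained; the real routines
// are the SSE2 search and core's Two-Way implementation.
fn simd_contains(h: &[u8], n: &[u8]) -> bool {
    two_way_contains(h, n)
}
fn two_way_contains(h: &[u8], n: &[u8]) -> bool {
    h.windows(n.len()).any(|w| w == n)
}
```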
Nothing can be a dependency of core, as everything depends on core. (Technically, LLVM already uses SSE2 everywhere (including for passing float function arguments, as per the x86-64 SysV ABI) without runtime checks, as the target specs for almost all x86_64 targets enable SSE2. If an OS disables SSE2, it will have to be disabled in the target spec, at which point the SSE2 fast path in this PR will simply be compiled out.)
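As a small illustration of that compile-time gating (hypothetical function names; a sketch, not the actual libcore dispatch): the `cfg` selects the implementation when the target spec enables SSE2, with no runtime feature detection needed.

```rust
// Hypothetical names; this only illustrates the cfg gating. A target
// spec that disables SSE2 compiles the fast-path variant out entirely,
// leaving only the fallback.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn contains_dispatch(haystack: &[u8], needle: &[u8]) -> bool {
    // The SSE2 fast path would be called here for short needles.
    fallback_contains(haystack, needle)
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn contains_dispatch(haystack: &[u8], needle: &[u8]) -> bool {
    fallback_contains(haystack, needle)
}

fn fallback_contains(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```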
⌛ Trying commit f29ca60f231c22fc341a90ba17ea69cf45b385ca with merge b926bd75207e52f30bfdaec8a2efdaae590de6b0... |
☀️ Try build successful - checks-actions
Finished benchmarking commit (b926bd75207e52f30bfdaec8a2efdaae590de6b0): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this. @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
@bors r=thomcc rollup=never

The slight compile-time regressions are probably unavoidable since this adds code paths that are dispatched at runtime. Doc benchmarks are still mostly green, but now below the noise threshold (compared to the initial benchmark). I think the microbenchmarks justify this change, and we can tweak things around the edges later.
@bors p=1 going to close the tree for non-nevers for a while so they can drain out
☀️ Test successful - checks-actions |
Finished benchmarking commit (9340e5c): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this. @rustbot label: +perf-regression

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
A few wins, a few losses, it roughly balances out.
@rustbot label: +perf-regression-triaged
Revert "…=thomcc": The current implementation seems to be unsound. See rust-lang#104726.
Simd contains fix

Fixes rust-lang#104726. The bug was introduced by an improvement late in the original PR (rust-lang#103779), which added backtracking when the last and first bytes of the needle were the same. That changed the meaning of the variable for the last probe offset, which I should have split into the last byte offset and the last probe offset. Not doing so led to incorrect loop conditions.
x86_64 SSE2 fast-path for str.contains(&str) and short needles

Based on Wojciech Muła's [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html#sse-avx2).

The two-way algorithm is Big-O efficient, but it needs to preprocess the needle to find a "critical factorization" of it. This additional work is significant for short needles. Additionally, it mostly advances needle.len() bytes at a time. The SIMD-based approach used here can instead advance based on its vector width, which can exceed the needle length. Pathological cases are the exception, but since the fast path is limited to small needles, the worst-case blowup is also small.

Benchmarks taken on a Zen2, compiled with `-Ccodegen-units=1`:

```
OLD:
test str::bench_contains_16b_in_long      ... bench: 504 ns/iter (+/- 14) = 5061 MB/s
test str::bench_contains_2b_repeated_long ... bench: 948 ns/iter (+/- 175) = 2690 MB/s
test str::bench_contains_32b_in_long      ... bench: 445 ns/iter (+/- 6) = 5732 MB/s
test str::bench_contains_bad_naive        ... bench: 130 ns/iter (+/- 1) = 569 MB/s
test str::bench_contains_bad_simd         ... bench:  84 ns/iter (+/- 8) = 880 MB/s
test str::bench_contains_equal            ... bench: 142 ns/iter (+/- 7) = 394 MB/s
test str::bench_contains_short_long       ... bench: 677 ns/iter (+/- 25) = 3768 MB/s
test str::bench_contains_short_short      ... bench:  27 ns/iter (+/- 2) = 2074 MB/s

NEW:
test str::bench_contains_16b_in_long      ... bench:  82 ns/iter (+/- 0) = 31109 MB/s
test str::bench_contains_2b_repeated_long ... bench:  73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_32b_in_long      ... bench:  71 ns/iter (+/- 1) = 35929 MB/s
test str::bench_contains_bad_naive        ... bench:   7 ns/iter (+/- 0) = 10571 MB/s
test str::bench_contains_bad_simd         ... bench:  97 ns/iter (+/- 41) = 762 MB/s
test str::bench_contains_equal            ... bench:   4 ns/iter (+/- 0) = 14000 MB/s
test str::bench_contains_short_long       ... bench:  73 ns/iter (+/- 0) = 34945 MB/s
test str::bench_contains_short_short      ... bench:  12 ns/iter (+/- 0) = 4666 MB/s
```
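For readers who want to see the shape of the fast path, here is a condensed, hedged sketch of an SSE2 probe-and-verify loop in the style of Muła's algorithm (the `sse2_contains` name is illustrative; this omits the real implementation's unaligned-tail handling and the backtracking subtlety later fixed for rust-lang#104726):

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn sse2_contains(haystack: &[u8], needle: &[u8]) -> bool {
    use std::arch::x86_64::*;
    assert!(needle.len() >= 2, "1-byte needles go to memchr, 0 is trivial");
    let last = needle.len() - 1;
    let mut i = 0;
    // Process 16 candidate start positions per iteration while two full
    // 16-byte loads (at `i` and at `i + last`) stay in bounds.
    while i + last + 16 <= haystack.len() {
        // SAFETY: SSE2 is guaranteed by the cfg above, and the loop
        // condition keeps both unaligned loads inside `haystack`.
        let mut mask = unsafe {
            let first_v = _mm_set1_epi8(needle[0] as i8);
            let last_v = _mm_set1_epi8(needle[last] as i8);
            let a = _mm_loadu_si128(haystack.as_ptr().add(i).cast());
            let b = _mm_loadu_si128(haystack.as_ptr().add(i + last).cast());
            _mm_movemask_epi8(_mm_and_si128(
                _mm_cmpeq_epi8(a, first_v),
                _mm_cmpeq_epi8(b, last_v),
            )) as u32
        };
        // Each set bit marks a position where both probe bytes matched;
        // verify the full needle only at those candidates.
        while mask != 0 {
            let c = i + mask.trailing_zeros() as usize;
            if &haystack[c..=c + last] == needle {
                return true;
            }
            mask &= mask - 1; // clear the lowest set bit
        }
        i += 16;
    }
    // Scalar scan for the remaining tail (the real code instead does one
    // final unaligned, overlapping chunk).
    haystack[i..].windows(needle.len()).any(|w| w == needle)
}
```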