
simd: reapply and fix split cursor advancing #175

Draft
lucab wants to merge 3 commits into master from ups/simd-remove-bytes-cursor-2

Conversation

@lucab (Contributor) commented Jun 12, 2024

This reapplies #156 with an additional commit on top in order to fix #172.

@AaronO (Contributor) commented Sep 1, 2024

Note: cursory benches for this on M1/arm64 show a ~10% perf regression vs. main.

@AaronO (Contributor) commented Sep 1, 2024

@seanmonstar I can take a deeper look at this before we reconsider landing it. I haven't looked into it in depth, but this approach could potentially yield gains on avx2 SIMD because it avoids reprocessing an entire/last block on a SIMD miss, letting the SWAR or byte-wise code pick up where the SIMD pass stopped.

Since avx2 operates on 32-byte/256-bit blocks, it possibly "misses" a lot on header values, short URLs, etc.

So a variant of this is definitely worth considering, but I think it should be rigorously benchmarked, given that the numbers shared in #156 were unclear. @seanmonstar it's probably worth adding CI benches (they can be noisy, but probably not meaningfully so for these purely compute-bound flows).
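(For illustration, a minimal, self-contained sketch of the two cursor-advancing strategies being compared above. The helper names and the byte test are hypothetical stand-ins, not httparse's actual SIMD code; the point is only how much work a miss leaves for the scalar/SWAR tail.)

```rust
/// Stand-in for what a 32-byte AVX2 comparison would compute: the number of
/// leading bytes of `block` that match the wanted byte class (printable ASCII
/// here, purely for illustration).
fn simd_like_prefix_len(block: &[u8]) -> usize {
    block.iter().take_while(|&&b| b.is_ascii_graphic()).count()
}

/// Strategy A: only advance on full-block matches. A partial match leaves the
/// cursor at the block start, so the scalar loop re-reads up to 32 bytes.
fn advance_full_blocks_only(buf: &[u8]) -> usize {
    let mut pos = 0;
    while let Some(block) = buf.get(pos..pos + 32) {
        if simd_like_prefix_len(block) == 32 {
            pos += 32;
        } else {
            break; // scalar code restarts at `pos` and reprocesses this block
        }
    }
    pos
}

/// Strategy B (what this PR does): also keep the progress of a partial match,
/// so the scalar/SWAR tail only sees bytes the SIMD pass has not classified.
fn advance_partial_blocks(buf: &[u8]) -> usize {
    let mut pos = 0;
    while let Some(block) = buf.get(pos..pos + 32) {
        let n = simd_like_prefix_len(block);
        pos += n;
        if n < 32 {
            break; // scalar code resumes at the first non-matching byte
        }
    }
    pos
}

fn main() {
    // The first 32-byte block stops at the space after the comma, so strategy A
    // leaves the whole block for the scalar loop while strategy B keeps 17 bytes
    // of progress. (Values shorter than 32 bytes never enter the wide loop at all.)
    let value = b"application/json, text/plain;q=0.9, */*\r\n";
    assert_eq!(advance_full_blocks_only(value), 0);
    assert_eq!(advance_partial_blocks(value), 17);
}
```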

@AaronO (Contributor) commented Sep 1, 2024

Also, a big factor here could be whether avx2 is inlined or dispatched at runtime (when compiled generically for x64); the inlined version compiled for avx2 will obviously have less overhead.
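(For illustration, a minimal sketch of that distinction, assuming x86_64 and std's runtime feature detection; the function names are hypothetical, not httparse's.)

```rust
#[cfg(target_arch = "x86_64")]
fn match_bytes(buf: &[u8]) -> usize {
    // Generic x64 build: the avx2 path sits behind a runtime feature check,
    // so the vector body cannot be inlined into the hot loop.
    if is_x86_feature_detected!("avx2") {
        // SAFETY: only reached after confirming AVX2 support at runtime.
        unsafe { match_bytes_avx2(buf) }
    } else {
        match_bytes_scalar(buf)
    }
}

// When avx2 is instead enabled at compile time (e.g. RUSTFLAGS="-C target-feature=+avx2"),
// the feature check folds to a constant and the compiler is free to inline the
// vector body into its caller, which is the lower-overhead case described above.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn match_bytes_avx2(buf: &[u8]) -> usize {
    // Placeholder body; real code would use core::arch::x86_64 intrinsics.
    buf.iter().take_while(|&&b| b > 0x20 && b != 0x7f).count()
}

fn match_bytes_scalar(buf: &[u8]) -> usize {
    buf.iter().take_while(|&&b| b > 0x20 && b != 0x7f).count()
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    println!("matched {} bytes", match_bytes(b"keep-alive"));
}
```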

@lucab Also, are all the pub => pub(crate) changes necessary?

@AaronO (Contributor) commented Sep 1, 2024

@seanmonstar it's probably worth adding CI benches (they can be noisy, but probably not meaningfully so for these purely compute-bound flows)

Did a first pass at CI benches in #179; this PR should show a measurable delta.

@lucab (Contributor, Author) commented Sep 1, 2024

@AaronO from #156 (comment):

There is a minor cleanup bundled in this PR (marking several functions as pub(crate)) which I did in order to make sure I wasn't changing public APIs. I can split that to a dedicated PR if you prefer.

lucab force-pushed the ups/simd-remove-bytes-cursor-2 branch from 7de9ddb to 2262ff1 on September 1, 2024 14:00
@AaronO (Contributor) commented Sep 1, 2024

@AaronO from #156 (comment):

There is a minor cleanup bundled in this PR (marking several functions as pub(crate)) which I did in order to make sure I wasn't changing public APIs. I can split that to a dedicated PR if you prefer.

@lucab Yes, I think that should be split out. Especially since the original PR made substantial (AVX2) perf claims, it's best if we can focus on that and remove unrelated "noise"; the alt refactor should be judged on its own merits. The focus here should be on demonstrating and explaining clear perf benefits, and from the benchmarks (local [arm64] & CI) that appears unclear.

@AaronO (Contributor) commented Sep 1, 2024

@lucab Modern CPUs with speculative execution can counterintuitively be faster "processing the same bytes twice" if both execution paths are independent. There can also be tradeoffs impacting inline-ability if you increase register usage in the hot loop, etc.

I touched this code a while back so I'm not 100% sure, but I think I explored your change (if I'm distilling it correctly as returning partial matches) and concluded (based on local M1 benches at the time) that it wasn't a net positive.

For example, your original PR #156 explains the improvement with:

This refactors all SIMD modules in order to make the value-matching logic self-contained. Thus, all bytes-cursor manipulations are now grouped and performed once at the end, outside of SIMD logic.

I should double-check, but IIRC minimizing the number of times you update the bytes-cursor sounds like a plausible improvement, yet in practice it impacts lowering. What actually happens here is that the "stack-allocated" Bytes struct gets allocated into three registers (start, cursor, end) and we are always reading from the cursor; with your code I think it either gets lowered as base-pointer + offset or ends up requiring an extra register for that inner loop.

Long story short, we should really distill this perf improvement hypothesis to its essentials and fully test/understand that.
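(A simplified, hypothetical stand-in for that point, not httparse's actual Bytes type: the same token-skipping loop written with the cursor advanced in place versus tracked in a detached local and written back once at the end.)

```rust
struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize, // the "cursor"; start and end are carried by the slice itself
}

impl<'a> Cursor<'a> {
    /// Style 1: advance the cursor as we go. `self.pos` is the only induction
    /// variable, so it typically stays in a single register through the loop.
    fn skip_token_in_place(&mut self) {
        while let Some(&b) = self.buf.get(self.pos) {
            if b <= 0x20 || b == 0x7f {
                break;
            }
            self.pos += 1;
        }
    }

    /// Style 2 (the refactor's shape): run the matcher on a detached local
    /// offset and write the cursor back once at the end. Whether this lowers
    /// to the same registers or ends up as base-pointer+offset addressing /
    /// an extra register in the inner loop is exactly the question raised above.
    fn skip_token_batched(&mut self) {
        let mut i = self.pos;
        while let Some(&b) = self.buf.get(i) {
            if b <= 0x20 || b == 0x7f {
                break;
            }
            i += 1;
        }
        self.pos = i;
    }
}

fn main() {
    let mut a = Cursor { buf: &b"keep-alive\r\n"[..], pos: 0 };
    let mut b = Cursor { buf: &b"keep-alive\r\n"[..], pos: 0 };
    a.skip_token_in_place();
    b.skip_token_batched();
    assert_eq!((a.pos, b.pos), (10, 10)); // same result, different lowering
}
```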

@AaronO (Contributor) commented Sep 1, 2024

@lucab OK, looking at the CI benches there does appear to be a beyond-noise improvement for sse42 (not avx2, though possibly I missed something in the CI benches).

It smells like it might boil down to the inline-ability of sse42. I would need to double-check; I'll take a deeper look tomorrow.

In part this is fixing a perf bug rather than purely raising the ceiling: especially for small values, sse42 was substantially underperforming SWAR, possibly due to (non-)inlining (if sse42 is called non-inlined in the loop you pay some painful register-init costs), or it could be register allocation or something else.

lucab marked this pull request as a draft on September 1, 2024 16:05
@lucab (Contributor, Author) commented Sep 1, 2024

@AaronO Ack, I'll split out the pub cleanup as I originally offered to.
I've marked this PR as a draft, as I think we aren't going to land this as-is anyway.

For context, I wasn't really looking at perf improvements when I opened this, but it happened as a local side effect:

My goal was to (eventually) use the bytes crate in this library, but then I realized that the required SIMD-related groundwork was actually providing performance improvements on its own.

AaronO added a commit to AaronO/httparse that referenced this pull request Sep 2, 2024
This has massive implications for the default runtime perf, improving how the code is lowered/inlined. (Falling back to SSE4.2 for a handful of bytes was wasteful.)

Should supersede seanmonstar#175, seanmonstar#156
seanmonstar pushed a commit that referenced this pull request Sep 3, 2024
This has massive implications for the default runtime perf, improving how the code is lowered/inlined. (Falling back to SSE4.2 for a handful of bytes was wasteful.)

Should supersede #175, #156
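(For illustration, a hypothetical sketch of the idea in those commit messages, not the actual commit: take the wide SIMD path only while a full block remains, and hand short tails straight to the byte-wise loop instead of falling back to a narrower SIMD routine for a handful of bytes.)

```rust
// Hypothetical block size and helpers, standing in for the dispatch logic.
const WIDE_BLOCK: usize = 32;

fn is_token_byte(b: u8) -> bool {
    b > 0x20 && b != 0x7f
}

/// Stand-in for what a 32-byte AVX2 comparison would compute.
fn wide_prefix_len(block: &[u8]) -> usize {
    block.iter().take_while(|&&b| is_token_byte(b)).count()
}

fn skip_token(buf: &[u8], mut pos: usize) -> usize {
    // Take the wide path only while a full block remains.
    while buf.len() - pos >= WIDE_BLOCK {
        let n = wide_prefix_len(&buf[pos..pos + WIDE_BLOCK]);
        pos += n;
        if n < WIDE_BLOCK {
            return pos; // stopped inside the block; nothing left to rework
        }
    }
    // Short tail: go straight to the byte-wise loop instead of paying the
    // setup cost of a narrower SIMD routine for just a handful of bytes.
    while pos < buf.len() && is_token_byte(buf[pos]) {
        pos += 1;
    }
    pos
}

fn main() {
    // 15 bytes total, so the wide path is skipped entirely.
    assert_eq!(skip_token(b"gzip, deflate\r\n", 0), 5); // stops at the space
}
```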
Successfully merging this pull request may close these issues.

Weird header parsing in 1.9