Improve count vectorization: replace popcnt implementation with vector counting #4614
0x1FFF'FFFF would be strict for the SSE4.2 codepath, but 0xFFF'FFFF is strict for the AVX2 codepath.
…SE4.2 loop. It never carries information across iterations.
…ain loop and tail. This allows us to avoid subtracting 1 from `_Max_count`, making it more similar to the SSE4.2 loop.
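For readers unfamiliar with the technique named in the title, here is a minimal sketch of vector counting for 8-bit elements. It is not the code from this PR: `count_bytes_sse2` and `max_iters` are hypothetical names, it uses plain SSE2 intrinsics on x86/x64, and it only illustrates the general shape discussed above (a bounded main loop whose lane counters are flushed before they can overflow, plus a scalar tail, with nothing carried between portions).

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics (assumes an x86/x64 target)

// Hypothetical helper, not the STL's implementation: counts bytes equal to `val`.
std::size_t count_bytes_sse2(const std::uint8_t* first, std::size_t size, std::uint8_t val) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(val));
    std::size_t result   = 0;

    // Each 8-bit lane counter grows by at most 1 per iteration, so it must be
    // flushed after at most 255 iterations to avoid wrapping around.
    constexpr std::size_t max_iters = 255;

    while (size >= 16) {
        std::size_t iters = size / 16;
        if (iters > max_iters) {
            iters = max_iters;
        }

        // Counters start from zero for every portion; nothing is carried over.
        __m128i counters = _mm_setzero_si128();
        for (std::size_t i = 0; i < iters; ++i) {
            const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first));
            const __m128i mask = _mm_cmpeq_epi8(data, needle); // 0xFF (i.e. -1) where equal
            counters           = _mm_sub_epi8(counters, mask); // subtracting -1 adds 1
            first += 16;
        }
        size -= iters * 16;

        // Horizontal sum: SAD against zero adds each group of 8 bytes into a 64-bit lane.
        const __m128i sums = _mm_sad_epu8(counters, _mm_setzero_si128());
        result += static_cast<std::size_t>(_mm_cvtsi128_si32(sums))
                + static_cast<std::size_t>(_mm_cvtsi128_si32(_mm_srli_si128(sums, 8)));
    }

    // Scalar tail for the remaining 0..15 bytes.
    for (; size != 0; --size, ++first) {
        result += (*first == val);
    }
    return result;
}
```

Compared with extracting a bitmask and running popcnt on it every iteration, this keeps the per-iteration work inside vector registers and pays for the horizontal reduction only once per portion. The flush limit depends on the counter width, so wider element types can run more iterations between flushes.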
This comment was marked as resolved.
I think it is good to use |
Thanks! 😻 I pushed a commit to further clarify the comments. Things make so much more sense now! I verified that my manual 20 GB test case passes. I reran the benchmarks, and it looks like you've reversed the perf damage I did, which is great. Curiously,
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
Thanks again for continuing to refine this important algorithm, and working so hard to educate my cat-sized brain about SIMD! 🐈 🧠 💚
Somewhat better on long inputs, with some pessimization on short ones.
Before:
After: