[lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM #3139
Conversation
We found that movemask is not used properly or consumes too much CPU. This effort optimizes the movemask emulation on ARM. For levels 8-9 we saw a 3-5% improvement in overall compression speed; for level 10 we saw a 1.5% improvement.

The key idea is not to use pure movemasks but to work with groups of bits. For rowEntries == 16 and 32 we use groups of size 4 and 2 respectively, so each bit is duplicated within its group. We then AND with a constant so that only one bit is set per group, which keeps the lowest-set-bit iteration `a &= (a - 1)` working. Also, aarch64 does not have rotate instructions for 16 bits, only for 32 and 64, which is why the improvement is larger for levels 8-9. The `vshrn_n_u16` instruction is used to achieve this: it shifts every u16 lane right by 4 and narrows it to the lower 8 bits (see the picture below). It's also used in [Folly](https://github.com/facebook/folly/blob/c5702590080aa5d0e8d666d91861d64634065132/folly/container/detail/F14Table.h#L446), and it takes 2 cycles according to the Neoverse-N{1,2} optimization guidelines.

The 64-bit movemask is already well optimized. We have ongoing experiments, but we were not able to validate that other implementations work reliably faster.
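To illustrate the idea, here is a minimal sketch of the `vshrn_n_u16` trick for rowEntries == 16. The names (`getMatchMask16`, the toy `main`) are illustrative and not the actual `ZSTD_row_getMatchMask` code; it assumes AArch64 with GCC or Clang:

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of the movemask emulation for rowEntries == 16 (illustrative,
 * not the actual zstd implementation).
 *
 * `equalMask` holds 0xFF for matching tag bytes and 0x00 otherwise
 * (e.g. the result of vceqq_u8). vshrn_n_u16(x, 4) shifts each 16-bit
 * lane right by 4 and narrows it to 8 bits, so every matching byte
 * contributes a group of 4 set bits in the resulting 64-bit value.
 * ANDing with 0x1111... keeps one bit per group, which lets the usual
 * `m &= (m - 1)` loop visit each match exactly once. */
static uint64_t getMatchMask16(uint8x16_t equalMask)
{
    const uint8x8_t nibbleMask = vshrn_n_u16(vreinterpretq_u16_u8(equalMask), 4);
    const uint64_t  groups     = vget_lane_u64(vreinterpret_u64_u8(nibbleMask), 0);
    return groups & 0x1111111111111111ULL;  /* one bit per 4-bit group */
}

int main(void)
{
    /* Example row of tags: entries 2 and 9 match the searched tag 0x7A. */
    uint8_t tags[16] = {0};
    tags[2] = 0x7A; tags[9] = 0x7A;

    const uint8x16_t chunk = vld1q_u8(tags);
    const uint8x16_t eq    = vceqq_u8(chunk, vdupq_n_u8(0x7A));

    uint64_t m = getMatchMask16(eq);
    while (m != 0) {
        /* Each match occupies bit 4*i, so divide the bit index by 4. */
        const unsigned idx = (unsigned)(__builtin_ctzll(m) >> 2);
        printf("match at row entry %u\n", idx);
        m &= m - 1;  /* clear the lowest set bit */
    }
    return 0;
}
```

In this sketch the per-byte comparison result is turned into a 64-bit mask with a single shift-and-narrow plus one AND, instead of the longer instruction sequence a literal SSE2-style movemask emulation would need.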
Awesome, thanks for the patch @danlark1!
When @senhuang42 implemented this, we tested that we didn't regress on ARM, but we didn't optimize for it specifically. And I'm not super familiar with NEON, so I can totally see how this was missed. Now that I have an M1 Macbook and can easily test NEON, I'll have to familiarize myself.
I trust that you've measured carefully, but I'll double check that it doesn't regress x86-64, and measure on my M1.
I have one nit, then I just need to measure, and this will be good to go!
I see neutral performance on x86-64 with our clang 9 & 12, and gcc 9 & 11. On my 2021 Macbook Air I see a +3.5-4.5% performance improvement for levels 5 through 8.
Thanks for the PR @danlark1!