[lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM #3139

Merged: 4 commits merged into facebook:dev on May 24, 2022

Conversation

@danlark1 (Contributor) commented May 22, 2022

We found that movemask is not used optimally on ARM and consumes too much CPU. This change optimizes the movemask emulation for ARM.

For levels 8-9 we saw a 3-5% improvement in overall compression speed. For level 10 we saw a 1.5% improvement.

The key idea is not to use pure movemasks but groups of bits: for rowEntries == 16 and 32 we use groups of size 4 and 2 respectively, meaning each bit is duplicated within its group.

Then we apply an AND so that only one bit is set per group, so the iteration that clears the lowest set bit, `a &= (a - 1)`, still works.
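
For illustration, here is a minimal sketch (not the zstd code; `visit_matches` and the `visit` callback are made up, and `__builtin_ctzll` assumes GCC/Clang) of how such a grouped mask is walked, assuming groups of 4 bits as in the rowEntries == 16 case:

```c
#include <stdint.h>

/* Sketch: each matching row entry contributes exactly one set bit, and
 * bits come in groups of 4 (the rowEntries == 16 case). Because the AND
 * step leaves at most one bit per group, clearing the lowest set bit
 * with `m &= m - 1` advances straight to the next match. */
static void visit_matches(uint64_t matches, void (*visit)(uint32_t entry))
{
    while (matches) {
        uint32_t const bit   = (uint32_t)__builtin_ctzll(matches); /* lowest set bit */
        uint32_t const entry = bit / 4;   /* group width 4 -> row entry index */
        visit(entry);
        matches &= matches - 1;           /* clear the lowest set bit */
    }
}
```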

Also, aarch64 has rotate instructions only for 32-bit and 64-bit values, not for 16-bit ones, which is why we see more improvement for levels 8-9.
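
As a rough illustration of that point (these helpers are not from zstd), a 16-bit rotate has to be synthesized from shifts and an OR, whereas the 64-bit rotate idiom is typically compiled to a single `ror` on aarch64:

```c
#include <stdint.h>

/* Illustration only: there is no 16-bit rotate instruction on aarch64,
 * so this pattern becomes a shift/shift/or sequence ... */
static inline uint16_t rotr16(uint16_t v, unsigned n)
{
    return (uint16_t)((v >> (n & 15)) | (v << ((16 - n) & 15)));
}

/* ... while the 64-bit rotate idiom is typically recognized and emitted
 * as a single `ror`, so rotating a grouped 64-bit mask stays cheap. */
static inline uint64_t rotr64(uint64_t v, unsigned n)
{
    return (v >> (n & 63)) | (v << ((64 - n) & 63));
}
```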

The vshrn_n_u16 instruction is used to achieve this: it shifts every u16 lane right by 4 and narrows it to its lower 8 bits. See the picture below. The same trick is used in
[Folly](https://github.com/facebook/folly/blob/c5702590080aa5d0e8d666d91861d64634065132/folly/container/detail/F14Table.h#L446).
It takes 2 cycles according to the Neoverse-N{1,2} optimization guides.

(screenshot: diagram of vshrn_n_u16 shifting each u16 lane by 4 and narrowing)

After the AND, the result looks something like this:

(screenshot: resulting mask with a single bit set per group)
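
As a rough sketch of the shift-and-narrow emulation (assuming a row of 16 one-byte tags and GCC/Clang ARM NEON intrinsics; the function name is made up and this is not the exact zstd code):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Sketch: build a grouped match mask for 16 one-byte tags.
 * Each byte of equalMask is 0xFF for a match and 0x00 otherwise.
 * vshrn_n_u16 shifts every u16 lane right by 4 and narrows it to 8 bits,
 * so each byte's comparison result ends up as one 4-bit group (nibble)
 * of a 64-bit value. The final AND keeps a single bit per group. */
static uint64_t neon_matchMask16(const uint8_t* tagRow, uint8_t tag)
{
    uint8x16_t const chunk     = vld1q_u8(tagRow);
    uint8x16_t const equalMask = vceqq_u8(chunk, vdupq_n_u8(tag));
    uint8x8_t  const shrn      = vshrn_n_u16(vreinterpretq_u16_u8(equalMask), 4);
    uint64_t   const matches   = vget_lane_u64(vreinterpret_u64_u8(shrn), 0);
    return matches & 0x1111111111111111ULL; /* one bit per 4-bit group */
}
```

With the low bit of each nibble kept, counting trailing zeros and dividing by the group width of 4 recovers the index of the matching entry.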

The 64-bit movemask path is already well optimized. We have ongoing experiments but have not been able to validate that other implementations are reliably faster.

@danlark1 danlark1 changed the title [lazy] Optimize ZSTD_row_getMatchMask for level 8-10 [lazy] Optimize ZSTD_row_getMatchMask for level 8-10 for ARM May 22, 2022
@danlark1 danlark1 changed the title [lazy] Optimize ZSTD_row_getMatchMask for level 8-10 for ARM [lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM May 22, 2022
@terrelln (Contributor) left a comment

Awesome, thanks for the patch @danlark1!

When @senhuang42 implemented this, we tested that we didn't regress on ARM, but we didn't optimize for it specifically. And I'm not super familiar with NEON, so I can totally see how this was missed. Now that I have an M1 MacBook and can easily test NEON, I'll have to familiarize myself.

I trust that you've measured carefully, but I'll double check that it doesn't regress x86-64, and measure on my M1.

I have one nit, then I just need to measure, and this will be good to go!

lib/compress/zstd_lazy.c (review comment: outdated, resolved)
@terrelln (Contributor)

I see neutral performance on x86-64 with our clang 9 & 12, and gcc 9 & 11.

On my 2021 MacBook Air I see a +3.5-4.5% performance improvement for levels 5 through 8.

@terrelln terrelln merged commit 1c8a697 into facebook:dev May 24, 2022
@terrelln (Contributor)

Thanks for the PR @danlark1!

@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023