[lazy] Optimize ZSTD_row_getMatchMask for levels 8-10 for ARM #3139
Conversation
We found that movemask is not used properly or consumes too much CPU. This effort optimizes the movemask emulation on ARM. For levels 8-9 we saw a 3-5% improvement in overall compression speed; for level 10 we saw a 1.5% improvement.

The key idea is not to use pure movemasks but to work with groups of bits. For rowEntries == 16 and 32 we use groups of size 4 and 2 respectively, so each bit is duplicated within its group. We then AND with a constant so that only one bit is set per group, which keeps the lowest-set-bit iteration `a &= (a - 1)` working. Also, aarch64 does not have rotate instructions for 16 bits, only for 32 and 64, which is why the improvement is larger for levels 8-9. The `vshrn_n_u16` instruction is used to achieve this: it shifts every u16 lane right by 4 and narrows it to the lower 8 bits (see the picture below). It's also used in [Folly](https://github.com/facebook/folly/blob/c5702590080aa5d0e8d666d91861d64634065132/folly/container/detail/F14Table.h#L446), and it takes 2 cycles according to the Neoverse-N{1,2} optimization guidelines.

The 64-bit movemask is already well optimized. We have ongoing experiments, but we were not able to validate that other implementations work reliably faster.
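To illustrate the idea, here is a minimal sketch of the `vshrn_n_u16` trick for rowEntries == 16. The names (`getMatchMask16`, the toy `main`) are illustrative and not the actual `ZSTD_row_getMatchMask` code; it assumes AArch64 with GCC or Clang:

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of the movemask emulation for rowEntries == 16 (illustrative,
 * not the actual zstd implementation).
 *
 * `equalMask` holds 0xFF for matching tag bytes and 0x00 otherwise
 * (e.g. the result of vceqq_u8). vshrn_n_u16(x, 4) shifts each 16-bit
 * lane right by 4 and narrows it to 8 bits, so every matching byte
 * contributes a group of 4 set bits in the resulting 64-bit value.
 * ANDing with 0x1111... keeps one bit per group, which lets the usual
 * `m &= (m - 1)` loop visit each match exactly once. */
static uint64_t getMatchMask16(uint8x16_t equalMask)
{
    const uint8x8_t nibbleMask = vshrn_n_u16(vreinterpretq_u16_u8(equalMask), 4);
    const uint64_t  groups     = vget_lane_u64(vreinterpret_u64_u8(nibbleMask), 0);
    return groups & 0x1111111111111111ULL;  /* one bit per 4-bit group */
}

int main(void)
{
    /* Example row of tags: entries 2 and 9 match the searched tag 0x7A. */
    uint8_t tags[16] = {0};
    tags[2] = 0x7A; tags[9] = 0x7A;

    const uint8x16_t chunk = vld1q_u8(tags);
    const uint8x16_t eq    = vceqq_u8(chunk, vdupq_n_u8(0x7A));

    uint64_t m = getMatchMask16(eq);
    while (m != 0) {
        /* Each match occupies bit 4*i, so divide the bit index by 4. */
        const unsigned idx = (unsigned)(__builtin_ctzll(m) >> 2);
        printf("match at row entry %u\n", idx);
        m &= m - 1;  /* clear the lowest set bit */
    }
    return 0;
}
```

In this sketch the per-byte comparison result is turned into a 64-bit mask with a single shift-and-narrow plus one AND, instead of the longer instruction sequence a literal SSE2-style movemask emulation would need.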
Awesome, thanks for the patch @danlark1!
When @senhuang42 implemented this, we tested that we didn't regress on ARM, but we didn't optimize for it specifically. And I'm not super familiar with NEON, so I can totally see how this was missed. Now that I have an M1 Macbook and can easily test NEON, I'll have to familiarize myself.
I trust that you've measured carefully, but I'll double check that it doesn't regress x86-64, and measure on my M1.
I have one nit, then I just need to measure, and this will be good to go!
I see neutral performance on x86-64 with our clang 9 & 12, and gcc 9 & 11. On my 2021 Macbook Air I see a +3.5-4.5% performance improvement for levels 5 through 8.
Thanks for the PR @danlark1!