[C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38542

pitrou · 2023-11-01T16:17:02Z

Describe the enhancement requested

We have SIMD optimizations selected at compile-time for x86, but we should also improve the speed of the fallback scalar implementations of BYTE_STREAM_SPLIT encoding and decoding.

Component(s)

C++, Parquet

### Rationale for this change BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise. ### What changes are included in this PR? Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm. Benchmark numbers on the author's machine (AMD Ryzen 9 3900X): * before: ``` ------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------- BM_ByteStreamSplitDecode_Float_Scalar/1024 1374 ns 1374 ns 510396 bytes_per_second=2.77574G/s BM_ByteStreamSplitDecode_Float_Scalar/4096 5483 ns 5483 ns 127498 bytes_per_second=2.78303G/s BM_ByteStreamSplitDecode_Float_Scalar/32768 44042 ns 44035 ns 15905 bytes_per_second=2.77212G/s BM_ByteStreamSplitDecode_Float_Scalar/65536 87966 ns 87952 ns 7962 bytes_per_second=2.77583G/s BM_ByteStreamSplitDecode_Double_Scalar/1024 2583 ns 2583 ns 271436 bytes_per_second=2.95408G/s BM_ByteStreamSplitDecode_Double_Scalar/4096 10533 ns 10532 ns 65695 bytes_per_second=2.89761G/s BM_ByteStreamSplitDecode_Double_Scalar/32768 84067 ns 84053 ns 8275 bytes_per_second=2.90459G/s BM_ByteStreamSplitDecode_Double_Scalar/65536 168332 ns 168309 ns 4155 bytes_per_second=2.9011G/s BM_ByteStreamSplitEncode_Float_Scalar/1024 1435 ns 1435 ns 484278 bytes_per_second=2.65802G/s BM_ByteStreamSplitEncode_Float_Scalar/4096 5725 ns 5725 ns 121877 bytes_per_second=2.66545G/s BM_ByteStreamSplitEncode_Float_Scalar/32768 46291 ns 46283 ns 15134 bytes_per_second=2.63745G/s BM_ByteStreamSplitEncode_Float_Scalar/65536 91139 ns 91128 ns 7707 bytes_per_second=2.6791G/s BM_ByteStreamSplitEncode_Double_Scalar/1024 3093 ns 3093 ns 226198 bytes_per_second=2.46663G/s BM_ByteStreamSplitEncode_Double_Scalar/4096 12724 ns 12722 ns 54522 bytes_per_second=2.39873G/s BM_ByteStreamSplitEncode_Double_Scalar/32768 100488 ns 100475 ns 6957 bytes_per_second=2.42987G/s BM_ByteStreamSplitEncode_Double_Scalar/65536 200885 ns 200852 ns 3486 bytes_per_second=2.43105G/s ``` * after: ``` ------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------- BM_ByteStreamSplitDecode_Float_Scalar/1024 932 ns 932 ns 753352 bytes_per_second=4.09273G/s BM_ByteStreamSplitDecode_Float_Scalar/4096 3715 ns 3715 ns 188394 bytes_per_second=4.10783G/s BM_ByteStreamSplitDecode_Float_Scalar/32768 30167 ns 30162 ns 23441 bytes_per_second=4.04716G/s BM_ByteStreamSplitDecode_Float_Scalar/65536 59483 ns 59475 ns 11744 bytes_per_second=4.10496G/s BM_ByteStreamSplitDecode_Double_Scalar/1024 1862 ns 1862 ns 374715 bytes_per_second=4.09823G/s BM_ByteStreamSplitDecode_Double_Scalar/4096 7554 ns 7553 ns 91975 bytes_per_second=4.04038G/s BM_ByteStreamSplitDecode_Double_Scalar/32768 60429 ns 60421 ns 11499 bytes_per_second=4.04067G/s BM_ByteStreamSplitDecode_Double_Scalar/65536 120992 ns 120972 ns 5756 bytes_per_second=4.03631G/s BM_ByteStreamSplitEncode_Float_Scalar/1024 737 ns 737 ns 947423 bytes_per_second=5.17843G/s BM_ByteStreamSplitEncode_Float_Scalar/4096 2934 ns 2933 ns 239459 bytes_per_second=5.20175G/s BM_ByteStreamSplitEncode_Float_Scalar/32768 23730 ns 23727 ns 29243 bytes_per_second=5.14485G/s BM_ByteStreamSplitEncode_Float_Scalar/65536 47671 ns 47664 ns 14682 bytes_per_second=5.12209G/s BM_ByteStreamSplitEncode_Double_Scalar/1024 1517 ns 1517 ns 458928 bytes_per_second=5.02827G/s BM_ByteStreamSplitEncode_Double_Scalar/4096 6224 ns 6223 ns 111361 bytes_per_second=4.90407G/s BM_ByteStreamSplitEncode_Double_Scalar/32768 49719 ns 49713 ns 14059 bytes_per_second=4.91099G/s BM_ByteStreamSplitEncode_Double_Scalar/65536 99445 ns 99432 ns 7027 bytes_per_second=4.91072G/s ``` ### Are these changes tested? Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR). ### Are there any user-facing changes? No. * Closes: #38542 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>

…e#38529) ### Rationale for this change BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise. ### What changes are included in this PR? Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm. Benchmark numbers on the author's machine (AMD Ryzen 9 3900X): * before: ``` ------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------- BM_ByteStreamSplitDecode_Float_Scalar/1024 1374 ns 1374 ns 510396 bytes_per_second=2.77574G/s BM_ByteStreamSplitDecode_Float_Scalar/4096 5483 ns 5483 ns 127498 bytes_per_second=2.78303G/s BM_ByteStreamSplitDecode_Float_Scalar/32768 44042 ns 44035 ns 15905 bytes_per_second=2.77212G/s BM_ByteStreamSplitDecode_Float_Scalar/65536 87966 ns 87952 ns 7962 bytes_per_second=2.77583G/s BM_ByteStreamSplitDecode_Double_Scalar/1024 2583 ns 2583 ns 271436 bytes_per_second=2.95408G/s BM_ByteStreamSplitDecode_Double_Scalar/4096 10533 ns 10532 ns 65695 bytes_per_second=2.89761G/s BM_ByteStreamSplitDecode_Double_Scalar/32768 84067 ns 84053 ns 8275 bytes_per_second=2.90459G/s BM_ByteStreamSplitDecode_Double_Scalar/65536 168332 ns 168309 ns 4155 bytes_per_second=2.9011G/s BM_ByteStreamSplitEncode_Float_Scalar/1024 1435 ns 1435 ns 484278 bytes_per_second=2.65802G/s BM_ByteStreamSplitEncode_Float_Scalar/4096 5725 ns 5725 ns 121877 bytes_per_second=2.66545G/s BM_ByteStreamSplitEncode_Float_Scalar/32768 46291 ns 46283 ns 15134 bytes_per_second=2.63745G/s BM_ByteStreamSplitEncode_Float_Scalar/65536 91139 ns 91128 ns 7707 bytes_per_second=2.6791G/s BM_ByteStreamSplitEncode_Double_Scalar/1024 3093 ns 3093 ns 226198 bytes_per_second=2.46663G/s BM_ByteStreamSplitEncode_Double_Scalar/4096 12724 ns 12722 ns 54522 bytes_per_second=2.39873G/s BM_ByteStreamSplitEncode_Double_Scalar/32768 100488 ns 100475 ns 6957 bytes_per_second=2.42987G/s BM_ByteStreamSplitEncode_Double_Scalar/65536 200885 ns 200852 ns 3486 bytes_per_second=2.43105G/s ``` * after: ``` ------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ------------------------------------------------------------------------------------------------------- BM_ByteStreamSplitDecode_Float_Scalar/1024 932 ns 932 ns 753352 bytes_per_second=4.09273G/s BM_ByteStreamSplitDecode_Float_Scalar/4096 3715 ns 3715 ns 188394 bytes_per_second=4.10783G/s BM_ByteStreamSplitDecode_Float_Scalar/32768 30167 ns 30162 ns 23441 bytes_per_second=4.04716G/s BM_ByteStreamSplitDecode_Float_Scalar/65536 59483 ns 59475 ns 11744 bytes_per_second=4.10496G/s BM_ByteStreamSplitDecode_Double_Scalar/1024 1862 ns 1862 ns 374715 bytes_per_second=4.09823G/s BM_ByteStreamSplitDecode_Double_Scalar/4096 7554 ns 7553 ns 91975 bytes_per_second=4.04038G/s BM_ByteStreamSplitDecode_Double_Scalar/32768 60429 ns 60421 ns 11499 bytes_per_second=4.04067G/s BM_ByteStreamSplitDecode_Double_Scalar/65536 120992 ns 120972 ns 5756 bytes_per_second=4.03631G/s BM_ByteStreamSplitEncode_Float_Scalar/1024 737 ns 737 ns 947423 bytes_per_second=5.17843G/s BM_ByteStreamSplitEncode_Float_Scalar/4096 2934 ns 2933 ns 239459 bytes_per_second=5.20175G/s BM_ByteStreamSplitEncode_Float_Scalar/32768 23730 ns 23727 ns 29243 bytes_per_second=5.14485G/s BM_ByteStreamSplitEncode_Float_Scalar/65536 47671 ns 47664 ns 14682 bytes_per_second=5.12209G/s BM_ByteStreamSplitEncode_Double_Scalar/1024 1517 ns 1517 ns 458928 bytes_per_second=5.02827G/s BM_ByteStreamSplitEncode_Double_Scalar/4096 6224 ns 6223 ns 111361 bytes_per_second=4.90407G/s BM_ByteStreamSplitEncode_Double_Scalar/32768 49719 ns 49713 ns 14059 bytes_per_second=4.91099G/s BM_ByteStreamSplitEncode_Double_Scalar/65536 99445 ns 99432 ns 7027 bytes_per_second=4.91072G/s ``` ### Are these changes tested? Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR). ### Are there any user-facing changes? No. * Closes: apache#38542 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>

pitrou added the Type: enhancement label Nov 1, 2023

github-actions bot added Component: Parquet Component: C++ labels Nov 1, 2023

github-actions bot mentioned this issue Nov 1, 2023

GH-38542: [C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38529

Merged

github-actions bot assigned pitrou Nov 1, 2023

pitrou closed this as completed in #38529 Nov 2, 2023

pitrou added this to the 15.0.0 milestone Nov 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38542

[C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38542

pitrou commented Nov 1, 2023

[C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38542

[C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38542

Comments

pitrou commented Nov 1, 2023

Describe the enhancement requested

Component(s)