Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38542

Closed
pitrou opened this issue Nov 1, 2023 · 0 comments · Fixed by #38529
Closed

[C++][Parquet] Faster scalar BYTE_STREAM_SPLIT #38542

pitrou opened this issue Nov 1, 2023 · 0 comments · Fixed by #38529

Comments

@pitrou
Copy link
Member

pitrou commented Nov 1, 2023

Describe the enhancement requested

We have SIMD optimizations selected at compile-time for x86, but we should also improve the speed of the fallback scalar implementations of BYTE_STREAM_SPLIT encoding and decoding.

Component(s)

C++, Parquet

pitrou added a commit that referenced this issue Nov 2, 2023
### Rationale for this change

BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.

### What changes are included in this PR?

Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.

Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):

* before:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1374 ns         1374 ns       510396 bytes_per_second=2.77574G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         5483 ns         5483 ns       127498 bytes_per_second=2.78303G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       44042 ns        44035 ns        15905 bytes_per_second=2.77212G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       87966 ns        87952 ns         7962 bytes_per_second=2.77583G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        2583 ns         2583 ns       271436 bytes_per_second=2.95408G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       10533 ns        10532 ns        65695 bytes_per_second=2.89761G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      84067 ns        84053 ns         8275 bytes_per_second=2.90459G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     168332 ns       168309 ns         4155 bytes_per_second=2.9011G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024         1435 ns         1435 ns       484278 bytes_per_second=2.65802G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5725 ns         5725 ns       121877 bytes_per_second=2.66545G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       46291 ns        46283 ns        15134 bytes_per_second=2.63745G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       91139 ns        91128 ns         7707 bytes_per_second=2.6791G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        3093 ns         3093 ns       226198 bytes_per_second=2.46663G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       12724 ns        12722 ns        54522 bytes_per_second=2.39873G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768     100488 ns       100475 ns         6957 bytes_per_second=2.42987G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     200885 ns       200852 ns         3486 bytes_per_second=2.43105G/s
```
* after:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024          932 ns          932 ns       753352 bytes_per_second=4.09273G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3715 ns         3715 ns       188394 bytes_per_second=4.10783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       30167 ns        30162 ns        23441 bytes_per_second=4.04716G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       59483 ns        59475 ns        11744 bytes_per_second=4.10496G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        1862 ns         1862 ns       374715 bytes_per_second=4.09823G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        7554 ns         7553 ns        91975 bytes_per_second=4.04038G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      60429 ns        60421 ns        11499 bytes_per_second=4.04067G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     120992 ns       120972 ns         5756 bytes_per_second=4.03631G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024          737 ns          737 ns       947423 bytes_per_second=5.17843G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         2934 ns         2933 ns       239459 bytes_per_second=5.20175G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       23730 ns        23727 ns        29243 bytes_per_second=5.14485G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       47671 ns        47664 ns        14682 bytes_per_second=5.12209G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        1517 ns         1517 ns       458928 bytes_per_second=5.02827G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        6224 ns         6223 ns       111361 bytes_per_second=4.90407G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      49719 ns        49713 ns        14059 bytes_per_second=4.91099G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536      99445 ns        99432 ns         7027 bytes_per_second=4.91072G/s
```

### Are these changes tested?

Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).

### Are there any user-facing changes?

No.

* Closes: #38542

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 15.0.0 milestone Nov 2, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…e#38529)

### Rationale for this change

BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.

### What changes are included in this PR?

Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.

Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):

* before:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1374 ns         1374 ns       510396 bytes_per_second=2.77574G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         5483 ns         5483 ns       127498 bytes_per_second=2.78303G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       44042 ns        44035 ns        15905 bytes_per_second=2.77212G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       87966 ns        87952 ns         7962 bytes_per_second=2.77583G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        2583 ns         2583 ns       271436 bytes_per_second=2.95408G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       10533 ns        10532 ns        65695 bytes_per_second=2.89761G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      84067 ns        84053 ns         8275 bytes_per_second=2.90459G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     168332 ns       168309 ns         4155 bytes_per_second=2.9011G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024         1435 ns         1435 ns       484278 bytes_per_second=2.65802G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5725 ns         5725 ns       121877 bytes_per_second=2.66545G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       46291 ns        46283 ns        15134 bytes_per_second=2.63745G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       91139 ns        91128 ns         7707 bytes_per_second=2.6791G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        3093 ns         3093 ns       226198 bytes_per_second=2.46663G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       12724 ns        12722 ns        54522 bytes_per_second=2.39873G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768     100488 ns       100475 ns         6957 bytes_per_second=2.42987G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     200885 ns       200852 ns         3486 bytes_per_second=2.43105G/s
```
* after:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024          932 ns          932 ns       753352 bytes_per_second=4.09273G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3715 ns         3715 ns       188394 bytes_per_second=4.10783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       30167 ns        30162 ns        23441 bytes_per_second=4.04716G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       59483 ns        59475 ns        11744 bytes_per_second=4.10496G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        1862 ns         1862 ns       374715 bytes_per_second=4.09823G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        7554 ns         7553 ns        91975 bytes_per_second=4.04038G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      60429 ns        60421 ns        11499 bytes_per_second=4.04067G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     120992 ns       120972 ns         5756 bytes_per_second=4.03631G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024          737 ns          737 ns       947423 bytes_per_second=5.17843G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         2934 ns         2933 ns       239459 bytes_per_second=5.20175G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       23730 ns        23727 ns        29243 bytes_per_second=5.14485G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       47671 ns        47664 ns        14682 bytes_per_second=5.12209G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        1517 ns         1517 ns       458928 bytes_per_second=5.02827G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        6224 ns         6223 ns       111361 bytes_per_second=4.90407G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      49719 ns        49713 ns        14059 bytes_per_second=4.91099G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536      99445 ns        99432 ns         7027 bytes_per_second=4.91072G/s
```

### Are these changes tested?

Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).

### Are there any user-facing changes?

No.

* Closes: apache#38542

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…e#38529)

### Rationale for this change

BYTE_STREAM_SPLIT encoding and decoding benefit from SIMD accelerations on x86, but scalar implementations are used otherwise.

### What changes are included in this PR?

Improve the speed of scalar implementations of BYTE_STREAM_SPLIT by using a blocked algorithm.

Benchmark numbers on the author's machine (AMD Ryzen 9 3900X):

* before:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1374 ns         1374 ns       510396 bytes_per_second=2.77574G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         5483 ns         5483 ns       127498 bytes_per_second=2.78303G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       44042 ns        44035 ns        15905 bytes_per_second=2.77212G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       87966 ns        87952 ns         7962 bytes_per_second=2.77583G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        2583 ns         2583 ns       271436 bytes_per_second=2.95408G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       10533 ns        10532 ns        65695 bytes_per_second=2.89761G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      84067 ns        84053 ns         8275 bytes_per_second=2.90459G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     168332 ns       168309 ns         4155 bytes_per_second=2.9011G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024         1435 ns         1435 ns       484278 bytes_per_second=2.65802G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5725 ns         5725 ns       121877 bytes_per_second=2.66545G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       46291 ns        46283 ns        15134 bytes_per_second=2.63745G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       91139 ns        91128 ns         7707 bytes_per_second=2.6791G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        3093 ns         3093 ns       226198 bytes_per_second=2.46663G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       12724 ns        12722 ns        54522 bytes_per_second=2.39873G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768     100488 ns       100475 ns         6957 bytes_per_second=2.42987G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     200885 ns       200852 ns         3486 bytes_per_second=2.43105G/s
```
* after:
```
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitDecode_Float_Scalar/1024          932 ns          932 ns       753352 bytes_per_second=4.09273G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3715 ns         3715 ns       188394 bytes_per_second=4.10783G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       30167 ns        30162 ns        23441 bytes_per_second=4.04716G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       59483 ns        59475 ns        11744 bytes_per_second=4.10496G/s

BM_ByteStreamSplitDecode_Double_Scalar/1024        1862 ns         1862 ns       374715 bytes_per_second=4.09823G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        7554 ns         7553 ns        91975 bytes_per_second=4.04038G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      60429 ns        60421 ns        11499 bytes_per_second=4.04067G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     120992 ns       120972 ns         5756 bytes_per_second=4.03631G/s

BM_ByteStreamSplitEncode_Float_Scalar/1024          737 ns          737 ns       947423 bytes_per_second=5.17843G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         2934 ns         2933 ns       239459 bytes_per_second=5.20175G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       23730 ns        23727 ns        29243 bytes_per_second=5.14485G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       47671 ns        47664 ns        14682 bytes_per_second=5.12209G/s

BM_ByteStreamSplitEncode_Double_Scalar/1024        1517 ns         1517 ns       458928 bytes_per_second=5.02827G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        6224 ns         6223 ns       111361 bytes_per_second=4.90407G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      49719 ns        49713 ns        14059 bytes_per_second=4.91099G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536      99445 ns        99432 ns         7027 bytes_per_second=4.91072G/s
```

### Are these changes tested?

Yes, though the scalar implementations are unfortunately only exercised on non-x86 by default (see added comment in the PR).

### Are there any user-facing changes?

No.

* Closes: apache#38542

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant