Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add bench: decimal with byte array and fixed length byte array #2529

Merged
merged 3 commits into from
Aug 29, 2022

Conversation

liukun4515
Copy link
Contributor

Which issue does this PR close?

part of #2388

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 19, 2022
@@ -122,7 +122,7 @@
experimental!(mod array_reader);
pub mod arrow_reader;
pub mod arrow_writer;
mod buffer;
experimental!(mod buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid this, experimental, I really don't want buffer exposed externally

Copy link
Contributor Author

@liukun4515 liukun4515 Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after your refactor #2528, I think we can use some public api like read_array_reader to create the reader in the benchmark.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After remove #2529 (comment) from your suggestion, we can avoid using the buffer crate in benchmark code

match data_type {
DataType::Decimal128(precision, scale) => {
// read decimal data from parquet binary physical type
let convert = DecimalByteArrayConvert::new(
Copy link
Contributor

@tustvold tustvold Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI these functions are being removed... #2528

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your remind, after #2528 merged, and refactor this pr and review it.

When the #2425 can be reviewed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need reviews on the PRs it builds on, that's the only thing holding it up being ready for review

@@ -1235,7 +1241,7 @@ impl FromBytes for ByteArray {
}

impl FromBytes for FixedLenByteArray {
type Buffer = [u8; 8];
type Buffer = Vec<u8>;

fn from_le_bytes(bs: Self::Buffer) -> Self {
Self(ByteArray::from(bs.to_vec()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is performing an unnecessary clone now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

@tustvold
Copy link
Contributor

I think this will serve as a nice validation of #2528 - the performance should be significantly improved

@liukun4515
Copy link
Contributor Author

I think this will serve as a nice validation of #2528 - the performance should be significantly improved

hope so, I will review your refactor work tomorrow.

@liukun4515
Copy link
Contributor Author

I think this will serve as a nice validation of #2528 - the performance should be significantly improved

Hi @tustvold this pr #2522 will improve the performance?
I can't find the point where to improve the performance in #2522 refactor to remove the DecimalByteArrayConvert?

@liukun4515
Copy link
Contributor Author

liukun4515 commented Aug 21, 2022

It's better to remove the complex_reader from the array reader @tustvold
After your refactor, the type of parquet array reader will be matched with the parquet physical type BOOL,INT32,INT64,INT96,BYTE_ARRAY,FIXED_LENGTH_BYTE_ARRAY. ❤️

@tustvold
Copy link
Contributor

tustvold commented Aug 21, 2022

I can't find the point where to improve the performance in #2522 refactor to remove the DecimalByteArrayConvert?

CompkexObjectArrayReader reads to the row format, i.e. separately boxed ByteArray for each row. ByteArrayReader reads to a contiguous arrow buffer. It also has an optimised path for reading non-nested definition levels. It should therefore be significantly faster, not to mention correctly handling nesting

See #1082

@liukun4515
Copy link
Contributor Author

great improvement for the decimal reader using the fixed length byte array after the refactor #2528

arrow_array_reader/FIXED_LENGTH_BYTE_ARRAY/Decimal128Array/plain encoded, mandatory, no NULLs
                        time:   [453.71 us 462.25 us 472.03 us]
                        change: [-57.944% -57.150% -56.305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/FIXED_LENGTH_BYTE_ARRAY/Decimal128Array/plain encoded, optional, no NULLs
                        time:   [438.91 us 442.14 us 445.72 us]
                        change: [-51.955% -51.561% -51.191%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/FIXED_LENGTH_BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs
                        time:   [590.60 us 595.71 us 601.33 us]
                        change: [-31.035% -30.479% -29.836%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)

@tustvold

@liukun4515 liukun4515 force-pushed the bench_decimal_fixed_array_#2388 branch from a038bba to 862cf28 Compare August 29, 2022 10:37
@liukun4515
Copy link
Contributor Author

@tustvold PTAL
just add benchmark for decimal reader using byte array or fixed length byte array.

@tustvold tustvold merged commit 81f1f81 into apache:master Aug 29, 2022
@ursabot
Copy link

ursabot commented Aug 29, 2022

Benchmark runs are scheduled for baseline = c6e7680 and contender = 81f1f81. 81f1f81 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@liukun4515 liukun4515 deleted the bench_decimal_fixed_array_#2388 branch August 30, 2022 00:04
@liukun4515
Copy link
Contributor Author

BYTE_ARRAY

arrow_array_reader/BYTE_ARRAY/Decimal128Array/plain encoded, mandatory, no NULLs
                        time:   [521.98 us 527.20 us 532.82 us]
                        change: [-64.921% -64.209% -63.473%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/BYTE_ARRAY/Decimal128Array/plain encoded, optional, no NULLs
                        time:   [547.52 us 559.31 us 575.61 us]
                        change: [-60.657% -59.882% -58.990%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/BYTE_ARRAY/Decimal128Array/plain encoded, optional, half NULLs
                        time:   [637.63 us 643.18 us 649.31 us]
                        change: [-40.139% -39.311% -38.471%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants