Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet arrow-writer tests take very long to run (over 60 seconds) #3318

Closed
Tracked by #3216
alamb opened this issue Dec 9, 2022 · 5 comments · Fixed by #3319
Closed
Tracked by #3216

Parquet arrow-writer tests take very long to run (over 60 seconds) #3318

alamb opened this issue Dec 9, 2022 · 5 comments · Fixed by #3319
Labels
bug parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Dec 9, 2022

Describe the bug
Many arrow writer tests take more than 60 seconds to run "has been running for over 60 seconds"

test arrow::arrow_writer::tests::all_null_primitive_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::arrow_writer_primitive_dictionary has been running for over 60 seconds
test arrow::arrow_writer::tests::arrow_writer_string_dictionary has been running for over 60 seconds
test arrow::arrow_writer::tests::arrow_writer_string_dictionary_unsigned_index has been running for over 60 seconds
test arrow::arrow_writer::tests::binary_column_bloom_filter has been running for over 60 seconds
test arrow::arrow_writer::tests::binary_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::bool_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::date32_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::date64_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::empty_string_null_column_bloom_filter has been running for over 60 seconds
test arrow::arrow_writer::tests::f32_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::f64_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::fixed_size_binary_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::i16_single_column has been running for over 60 seconds
test arrow::arrow_writer::tests::i32_column_bloom_filter has been running for over 60 seconds
test arrow::arrow_reader::tests::test_utf8_single_column_reader_test ... ok
test arrow::arrow_writer::tests::i32_single_column has been running for over 60 seconds

To Reproduce

cargo test -p parquet --features=arrow

Expected behavior
The tests should pass quickly

Additional context
I believe it also affected the codecov jobs that now start timing out - #3302
https://github.com/apache/arrow-rs/actions/runs/3650327783/jobs/6166169908

@alamb alamb added the bug label Dec 9, 2022
@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2022

This appears to have started with the change for bloom filter writing in fa1f611 / #3284

Runtime: 63 Minutes: The parquet test job https://github.com/apache/arrow-rs/actions/runs/3650327791/jobs/6166171296 on fa1f611

Runtime: 6 Minutes: The parquet test job https://github.com/apache/arrow-rs/actions/runs/3650326441/jobs/6166167726 on 7d21397 (the previous commit )

@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2022

I think this appears to be a fairly serious performance regression bug and should be investigated prior to releasing 29.0.0

@viirya
Copy link
Member

viirya commented Dec 9, 2022

I think it is due to some test change I made in ##3284. It should be only limited to test runtime not a general performance issue in parquet writer.

I will find some time to refactor the tests and see if it helps.

@alamb
Copy link
Contributor Author

alamb commented Dec 9, 2022

I tried to run the standard release verification script for 29.0.0 but it completely saturated my CPU and I killed the test after 10 minutes and it hadn't succeeded in finishing. I suspect similar bad things would happen to others who might try to verify the release

@tustvold
Copy link
Contributor

tustvold commented Dec 9, 2022

Looking at a profile, it looks like the bloom filter writer is possibly in need of some optimisation

image

Fortunately I think we can probably just tweak the tests slightly to not enable the bloom filter except for the tests that actually test it. This will unblock the release, although we will want to optimise this in future.

I will work on this

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 9, 2022
Disable bloom filters for most tests
alamb pushed a commit that referenced this issue Dec 9, 2022
Disable bloom filters for most tests
@tustvold tustvold added the parquet Changes to the parquet crate label Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants