ARROW-12597: [C++] Enable per-row-group parallelism in async Parquet reader #10482

Closed (6 commits)

@lidavidm lidavidm commented Jun 8, 2021

This adds an OptionalParallelForAsync which lets us have per-row-group parallelism without nested parallelism in the async Parquet reader. This also uses TransferAlways, taking care of ARROW-12916. enable_parallel_column_conversion is kept as it still affects the threaded scanner.
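To illustrate the general idea of an "optionally parallel for", here is a minimal sketch in standard C++. It is not the code from this PR: `OptionalParallelForSketch` and `DecodeColumnSketch` are hypothetical names, and the sketch uses `std::async` and blocks on the results, whereas the real helper is built on Arrow's futures and executors, which are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper, NOT Arrow's implementation: applies `func(index, item)`
// to every input, either inline on the calling thread or as one task per input.
template <typename T, typename Func>
auto OptionalParallelForSketch(bool use_threads, const std::vector<T>& inputs,
                               Func func) {
  using R = decltype(func(std::size_t{0}, inputs[0]));
  std::vector<R> results;
  results.reserve(inputs.size());

  if (!use_threads) {
    // Serial path: no tasks are spawned, so a caller that is already running
    // in parallel does not add a second, nested level of parallelism.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
      results.push_back(func(i, inputs[i]));
    }
  } else {
    // Parallel path: launch one asynchronous task per input.
    std::vector<std::future<R>> futures;
    futures.reserve(inputs.size());
    for (std::size_t i = 0; i < inputs.size(); ++i) {
      futures.push_back(std::async(std::launch::async, [&func, &inputs, i] {
        return func(i, inputs[i]);
      }));
    }
    // Collect in submission order so results line up with inputs.
    for (auto& fut : futures) {
      results.push_back(fut.get());
    }
  }

  assert(results.size() == inputs.size());
  return results;
}

// Hypothetical stand-in for decoding one column of a row group.
inline int DecodeColumnSketch(std::size_t index, const int& column_id) {
  return static_cast<int>(index) * 100 + column_id;
}
```

Calling `OptionalParallelForSketch(true, columns, DecodeColumnSketch)` and `OptionalParallelForSketch(false, columns, DecodeColumnSketch)` yields the same results in the same order; only the scheduling differs, which is what lets a caller switch parallelism off at one level when another level already provides it.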

@github-actions
github-actions bot commented Jun 8, 2021

@lidavidm
Member Author

lidavidm commented Jun 8, 2021

[Figure: S3 Median Scan Time (s)]

The benchmark shows little difference; the most pronounced change is when files << cores (this was a 4-vCPU machine), which I think makes sense: with many files, file-level parallelism takes hold.

@pitrou pitrou self-requested a review June 15, 2021 13:53

@pitrou pitrou left a comment

+1, thank you very much!

@pitrou

pitrou commented Jun 15, 2021

Rebased, can merge if green.

@lidavidm lidavidm closed this in b73bcf0 Jun 15, 2021
@lidavidm lidavidm deleted the arrow-12597 branch June 15, 2021 15:22
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
…reader

Closes apache#10482 from lidavidm/arrow-12597

Authored-by: David Li
Signed-off-by: David Li