[Parquet] Implement OpenAsync for FileSource #37917

eeroel · 2023-09-27T20:28:15Z

Describe the enhancement requested

As discussed here, the Parquet reader uses a synchronous FileSource::Open when it starts reading a file:

arrow/cpp/src/arrow/dataset/file_parquet.cc

Line 482 in e038498

ARROW_ASSIGN_OR_RAISE(auto input, source.Open());

In the case of reading from S3, this Open method makes a HEAD request, which delays the start of processing of all other files when reading from a dataset.
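The effect can be illustrated with a small stdlib-only simulation. This is only a sketch and not Arrow's implementation: `open_file`, the 50 ms latency, and the file names are hypothetical stand-ins for a blocking open that issues one HEAD request per file.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed round-trip time of one simulated HEAD request (hypothetical value).
HEAD_LATENCY_S = 0.05


def open_file(name):
    """Simulated blocking open: stands in for a HEAD request to object storage."""
    time.sleep(HEAD_LATENCY_S)
    return f"handle:{name}"


def open_sequentially(names):
    """Each open blocks the caller, so the per-file latencies add up."""
    return [open_file(n) for n in names]


def open_concurrently(names, max_workers=16):
    """All opens are submitted up front, so their latencies overlap."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(open_file, names))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# 16 hypothetical file names, matching the number of files in the benchmark.
names = [f"part-{i}.parquet" for i in range(16)]
seq_handles, seq_s = timed(open_sequentially, names)
con_handles, con_s = timed(open_concurrently, names)
print(f"sequential: {seq_s:.2f}s, concurrent: {con_s:.2f}s")
```

With a blocking open, the total startup delay grows roughly linearly with the number of files, while overlapping the opens keeps it close to the latency of a single request.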

Here are two illustrative images of the performance difference when reading a FileSystemDataset of 16 files (~200K rows in total) in PyArrow, with maximum fragment_readahead and pre_buffer=True. Files are on the y-axis, time is on the x-axis, and threads are distinguished by color. Each point represents one request (HEAD or GET) to S3.

Here's the current behavior, where the first request for each file is processed on the blue thread:

And here's the behavior with a WIP implementation of OpenAsync (note the different x-axis scaling):

In both images, the first two (blue) points come from a separate request for one file to get the schema. It's still a bit of a mystery to me why, in the async case, the concurrent requests start only after the fourth request; it seems like there could be some performance to be gained there as well.

Component(s)

Parquet

### Rationale for this change

Improves performance of file reads with an expensive Open operation.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

No

* Closes: #37917

Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>


eeroel added the Type: enhancement label Sep 27, 2023

github-actions bot added the Component: Parquet label Sep 27, 2023

github-actions bot mentioned this issue Sep 27, 2023

GH-37917: [Parquet] Add OpenAsync for FileSource #37918

Merged

github-actions bot assigned eeroel Sep 27, 2023

eeroel mentioned this issue Sep 30, 2023

Support async CustomOpen in FileSource #37962

Open

bkietz closed this as completed in #37918 Oct 4, 2023

bkietz added this to the 14.0.0 milestone Oct 4, 2023
