Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet] Implement OpenAsync for FileSource #37917

Closed
eeroel opened this issue Sep 27, 2023 · 0 comments · Fixed by #37918
Closed

[Parquet] Implement OpenAsync for FileSource #37917

eeroel opened this issue Sep 27, 2023 · 0 comments · Fixed by #37918

Comments

@eeroel
Copy link
Contributor

eeroel commented Sep 27, 2023

Describe the enhancement requested

As discussed here, Parquet reader uses a synchronous FileSource::Open when starting to read a file:

ARROW_ASSIGN_OR_RAISE(auto input, source.Open());

In the case of reading from S3, this Open method makes a HEAD request, which delays the start of processing of all other files when reading from a dataset.

Here are two illustrative images of the performance difference, reading a FileSystemDataset of 16 files with ~200K rows in total, in Pyarrow, with maximum fragment_readahead, pre_buffer=True. Files are on the y-axis and time is on the x-axis, threads are colored. Each point represents one request (HEAD or GET) to S3.

Here's the current behavior, where the first request for each file is processed on the blue thread:
Screenshot 2023-09-27 at 23 05 06

And here's the behavior with a WIP implementation of OpenAsync (note different x-axis scaling)
Screenshot 2023-09-27 at 23 04 49

In both images the first two (blue) points are from a separate request for one file to get the schema. It's still a bit of a mystery to me why in the async case the concurrent requests start only after the fourth request, seems like there could be some performance to be gained there as well.

Component(s)

Parquet

bkietz pushed a commit that referenced this issue Oct 4, 2023
### Rationale for this change

Improves performance of file reads with an expensive Open operation.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?
No

* Closes: #37917

Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz bkietz added this to the 14.0.0 milestone Oct 4, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
### Rationale for this change

Improves performance of file reads with an expensive Open operation.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?
No

* Closes: apache#37917

Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
### Rationale for this change

Improves performance of file reads with an expensive Open operation.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?
No

* Closes: apache#37917

Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
### Rationale for this change

Improves performance of file reads with an expensive Open operation.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?
No

* Closes: apache#37917

Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants