### Rationale for this change
Improves performance of file reads with an expensive Open operation.
### What changes are included in this PR?
### Are these changes tested?
### Are there any user-facing changes?
No
* Closes: #37917
Authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
### Describe the enhancement requested
As discussed here, the Parquet reader uses a synchronous FileSource::Open when it starts reading a file:
arrow/cpp/src/arrow/dataset/file_parquet.cc, line 482 in e038498
When reading from S3, this Open method makes a HEAD request. Because the call is synchronous, it delays the start of processing of all other files when reading from a dataset.
Here are two illustrative images of the performance difference when reading a FileSystemDataset of 16 files (~200K rows in total) in PyArrow, with maximum fragment_readahead and pre_buffer=True. Files are on the y-axis, time is on the x-axis, and each thread has its own color. Each point represents one request (HEAD or GET) to S3.
Here's the current behavior, where the first request for each file is processed on the blue thread:
And here's the behavior with a WIP implementation of OpenAsync (note the different x-axis scaling):
In both images, the first two (blue) points are from a separate request for one file to get the schema. It's still a bit of a mystery to me why, in the async case, the concurrent requests start only after the fourth request. It seems like there could be some performance to be gained there as well.
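To make the effect easier to reason about, here is a toy model in plain Python. It is not Arrow code: `fake_open`, `HEAD_LATENCY`, and the file names are hypothetical stand-ins, and the 50 ms latency is an assumed value, not a measurement. It only sketches why opens that block one after another add up their latencies, while opens that are all in flight at once overlap them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed round-trip time of one HEAD request, in seconds (illustrative only).
HEAD_LATENCY = 0.05


def fake_open(path):
    """Hypothetical stand-in for an Open call that blocks on one HEAD request."""
    time.sleep(HEAD_LATENCY)
    return path


def open_sequentially(paths):
    """Synchronous model: each open blocks before the next file can start."""
    start = time.perf_counter()
    opened = [fake_open(p) for p in paths]
    return opened, time.perf_counter() - start


def open_concurrently(paths):
    """Async-style model: all opens are in flight at once, so latencies overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        opened = list(pool.map(fake_open, paths))
    return opened, time.perf_counter() - start


# Hypothetical file names; 16 matches the number of files in the experiment above.
paths = [f"part-{i}.parquet" for i in range(16)]
_, sync_seconds = open_sequentially(paths)
_, async_seconds = open_concurrently(paths)
print(f"sequential opens: {sync_seconds:.2f}s, concurrent opens: {async_seconds:.2f}s")
```

In this model the sequential variant takes roughly the number of files times the latency, while the concurrent variant takes roughly one latency plus thread overhead. The real reader has more moving parts (readahead, pre-buffering, the I/O thread pool), so this is only a sketch of the serialization effect, not a prediction of the actual speedup.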
### Component(s)
Parquet