# [Python] Allow passing file sizes to FileSystemDataset from Python #37857
AlenkaF changed the title from "Allow passing file sizes to FileSystemDataset from Python" to "[Python] Allow passing file sizes to FileSystemDataset from Python" on Sep 26, 2023.
pitrou pushed a commit that referenced this issue on Dec 5, 2023:
### Rationale for this change

Allow passing known file sizes to `make_fragment`, to avoid potential network requests.

### What changes are included in this PR?

### Are these changes tested?

Yes, tests with S3 verify that the file size gets used.

### Are there any user-facing changes?

Yes, new function arguments.

* Closes: #37857

Lead-authored-by: Eero Lihavainen <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
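The new arguments are an optional `file_size` parameter on `FileFormat.make_fragment`. A minimal sketch of its use — the region, bucket, path, and size below are hypothetical placeholders for values known ahead of time:

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Hypothetical S3 location; region, bucket, path, and size are
# placeholders for values known ahead of time.
s3 = fs.S3FileSystem(region="us-east-1")
fmt = ds.ParquetFileFormat()

# Passing file_size lets the reader skip the HEAD request it would
# otherwise issue to discover the object's length.
fragment = fmt.make_fragment(
    "my-bucket/data/part-0.parquet",
    filesystem=s3,
    file_size=123_456,  # size in bytes, e.g. from table metadata
)
table = fragment.to_table()
```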
dgreiss pushed a commit to dgreiss/arrow that referenced this issue on Feb 19, 2024 (same commit message as above, as apache#37868).
jorisvandenbossche added a commit that referenced this issue on Feb 27, 2024:
…nit (#40143)

### Rationale for this change

Closes #40142

I'm developing a new dask integration with the pyarrow parquet reader (see dask/dask-expr#882) and want to rely on the pyarrow `FileSystem` more. Right now, we are performing a list operation ourselves to get all touched files, and I would like to pass the retrieved `FileInfo` objects directly to the dataset constructor. This API is already exposed in C++, and this PR adds the necessary Python bindings. The benefit of this API is that it cuts the need to perform additional HEAD requests to remote storage. This came up in #38389 (comment), and there has already been related work in #37857.

### What changes are included in this PR?

Python bindings for the `DatasetFactory` constructor that accepts a list/vector of `FileInfo` objects.

### Are these changes tested?

~I slightly modified the minio test setup such that the prometheus endpoint is exposed. This can be used to assert that there haven't been any HEAD requests.~ I ended up removing this again since parsing the response is a bit brittle.

### Are there any user-facing changes?

* Closes: #40142

Lead-authored-by: fjetter <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
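A minimal sketch of what this binding enables, assuming a pyarrow version that accepts `FileInfo` objects as a `dataset()` source; the region and bucket prefix are hypothetical:

```python
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

# A single LIST operation returns FileInfo objects that already
# carry each object's size ("my-bucket/table" is a placeholder).
selector = fs.FileSelector("my-bucket/table", recursive=True)
infos = [info for info in s3.get_file_info(selector) if info.is_file]

# Handing the FileInfo objects (rather than bare paths) to the
# dataset factory reuses the known sizes instead of issuing one
# HEAD request per file.
dataset = ds.dataset(infos, format="parquet", filesystem=s3)
print(dataset.count_rows())
```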
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue on Feb 28, 2024 (same commit message as above).
thisisnic pushed a commit to thisisnic/arrow that referenced this issue on Mar 8, 2024 (same commit message as above).
### Describe the enhancement requested
When reading Parquet files from table formats such as Delta Lake, file sizes are already known from the table format's metadata. However, when building a dataset from fragments using https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment, there is no way to inform PyArrow about the file sizes, and this leads to unnecessary HEAD requests in the case of S3. Arrow already supports specifying the file size to avoid these requests to S3, but as far as I can see this is not exposed in PyArrow: #7547
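For illustration, a minimal sketch of the requested workflow, using the `file_size` argument that the linked PR later added to `make_fragment`; the paths and sizes are hypothetical stand-ins for values read from table-format metadata:

```python
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")
fmt = ds.ParquetFileFormat()

# Hypothetical (path, size-in-bytes) pairs, as a table format such
# as Delta Lake records them in its metadata.
files = [
    ("my-bucket/table/part-0.parquet", 52_428_800),
    ("my-bucket/table/part-1.parquet", 48_123_904),
]

# With a known file_size, no HEAD request is needed per fragment.
fragments = [
    fmt.make_fragment(path, filesystem=s3, file_size=size)
    for path, size in files
]

# Reading the schema from the first fragment does fetch the Parquet
# footer, but that is a ranged GET rather than a HEAD request.
schema = fragments[0].physical_schema
dataset = ds.FileSystemDataset(fragments, schema, fmt, filesystem=s3)
```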
(As a side note, it seems that those HEAD requests in `S3FileSystem` are always executed on the same thread, which leads to poor concurrency when reading multiple files. Is this a known issue?)
I can try to put together a PR with an implementation.
### Component(s)
Parquet, Python