# [Python] Allow passing file sizes to FileSystemDataset from Python #37857
AlenkaF changed the title from "Allow passing file sizes to FileSystemDataset from Python" to "[Python] Allow passing file sizes to FileSystemDataset from Python" on Sep 26, 2023.
pitrou pushed a commit that referenced this issue on Dec 5, 2023:
### Rationale for this change

Allow passing known file sizes to `make_fragment`, to avoid potential network requests.

### What changes are included in this PR?

### Are these changes tested?

Yes, tests with S3 verify that the file size gets used.

### Are there any user-facing changes?

Yes, new function arguments.

* Closes: #37857

Lead-authored-by: Eero Lihavainen <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Eero Lihavainen <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
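The new arguments are an optional `file_size` parameter on `FileFormat.make_fragment`. A minimal sketch of its use — the region, bucket, path, and size below are hypothetical placeholders for values known ahead of time:

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Hypothetical S3 location; region, bucket, path, and size are
# placeholders for values known ahead of time.
s3 = fs.S3FileSystem(region="us-east-1")
fmt = ds.ParquetFileFormat()

# Passing file_size lets the reader skip the HEAD request it would
# otherwise issue to discover the object's length.
fragment = fmt.make_fragment(
    "my-bucket/data/part-0.parquet",
    filesystem=s3,
    file_size=123_456,  # size in bytes, e.g. from table metadata
)
table = fragment.to_table()
```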
dgreiss pushed a commit to dgreiss/arrow that referenced this issue on Feb 19, 2024 (same commit message as above, as apache#37868).
jorisvandenbossche added a commit that referenced this issue on Feb 27, 2024:
…nit (#40143)

### Rationale for this change

Closes #40142

I'm developing a new dask integration with the pyarrow parquet reader (see dask/dask-expr#882) and want to rely on the pyarrow `FileSystem` more. Right now, we are performing a list operation ourselves to get all touched files, and I would like to pass the retrieved `FileInfo` objects directly to the dataset constructor. This API is already exposed in C++, and this PR adds the necessary Python bindings. The benefit of this API is that it cuts the need to perform additional HEAD requests to remote storage. This came up in #38389 (comment), and there has already been related work in #37857.

### What changes are included in this PR?

Python bindings for the `DatasetFactory` constructor that accepts a list/vector of `FileInfo` objects.

### Are these changes tested?

~I slightly modified the minio test setup such that the prometheus endpoint is exposed. This can be used to assert that there haven't been any HEAD requests.~ I ended up removing this again since parsing the response is a bit brittle.

### Are there any user-facing changes?

* Closes: #40142

Lead-authored-by: fjetter <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
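A minimal sketch of what this binding enables, assuming a pyarrow version that accepts `FileInfo` objects as a `dataset()` source; the region and bucket prefix are hypothetical:

```python
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

# A single LIST operation returns FileInfo objects that already
# carry each object's size ("my-bucket/table" is a placeholder).
selector = fs.FileSelector("my-bucket/table", recursive=True)
infos = [info for info in s3.get_file_info(selector) if info.is_file]

# Handing the FileInfo objects (rather than bare paths) to the
# dataset factory reuses the known sizes instead of issuing one
# HEAD request per file.
dataset = ds.dataset(infos, format="parquet", filesystem=s3)
print(dataset.count_rows())
```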
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue on Feb 28, 2024 (same commit message as above).
thisisnic pushed a commit to thisisnic/arrow that referenced this issue on Mar 8, 2024 (same commit message as above).
### Describe the enhancement requested
When reading Parquet files from table formats such as Delta Lake, file sizes are already known from the table format's metadata. However, when building a dataset from fragments using https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment, there is no way to inform PyArrow about the file sizes, and this leads to unnecessary HEAD requests in the case of S3. Arrow already supports specifying the file size to avoid these requests to S3, but as far as I can see this is not exposed in PyArrow: #7547
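For illustration, a minimal sketch of the requested workflow, using the `file_size` argument that the linked PR later added to `make_fragment`; the paths and sizes are hypothetical stand-ins for values read from table-format metadata:

```python
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")
fmt = ds.ParquetFileFormat()

# Hypothetical (path, size-in-bytes) pairs, as a table format such
# as Delta Lake records them in its metadata.
files = [
    ("my-bucket/table/part-0.parquet", 52_428_800),
    ("my-bucket/table/part-1.parquet", 48_123_904),
]

# With a known file_size, no HEAD request is needed per fragment.
fragments = [
    fmt.make_fragment(path, filesystem=s3, file_size=size)
    for path, size in files
]

# Reading the schema from the first fragment does fetch the Parquet
# footer, but that is a ranged GET rather than a HEAD request.
schema = fragments[0].physical_schema
dataset = ds.FileSystemDataset(fragments, schema, fmt, filesystem=s3)
```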
(As a side note, it seems that those HEAD requests in `S3FileSystem` are always executed on the same thread, which leads to poor concurrency when reading multiple files. Is this a known issue?)
I can try to put together a PR with an implementation.
### Component(s)
Parquet, Python