Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix read parquet from remote filesystems #115

Merged
merged 12 commits into from
May 9, 2023
Merged

Conversation

brl0
Copy link
Collaborator

@brl0 brl0 commented May 6, 2023

I noticed some of the path handling for parquet files would munge paths for remote filesystems and confirmed after testing with s3. This PR fixes that issue.

@brl0
Copy link
Collaborator Author

brl0 commented May 6, 2023

Added tests which fail on main with these errors:

FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_glob_parquet - pyarrow.lib.ArrowInvalid: Missing bucket name in S3 URI
FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_glob_all - pyarrow.lib.ArrowInvalid: Missing bucket name in S3 URI
FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_dir - pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'test_bucket/test_dask/part.1.parquet', which is outside base dir 's3://test_bucket/test_dask'
FAILED spatialpandas/tests/test_parquet_s3.py::TestS3ParquetDask::test_read_parquet_dask_remote_dir_slash - pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'test_bucket/test_dask/part.1.parquet', which is outside base dir 's3://test_bucket/test_dask/'

@brl0
Copy link
Collaborator Author

brl0 commented May 8, 2023

I believe this PR should be ready for review. Let me know if there are any questions or concerns, any feedback is always appreciated.

Copy link
Member

@ianthomas23 ianthomas23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brl0, this is really useful.

I have tested it manually back to pyarrow == 5.0.0 as we don't test this in CI, and all tests pass. I only have a few nitpicks and then I will be happy to merge it. I'd prefer to merge it before #113 as this will minimise rebasing.

@brl0
Copy link
Collaborator Author

brl0 commented May 9, 2023

@ianthomas23, thanks for reviewing. I have made the requested changes.

Currently, one of the tests has a failure due to a 503 from GitHub, but that can be rerun once the other tests are complete.

If compatibility with those versions of pyarrow is expected, it seems like it would be worth parameterizing just the parquet tests over the supported versions, but given the run time of the full test suite, it does seem unnecessary to do so with the full test suite.

@ianthomas23
Copy link
Member

If compatibility with those versions of pyarrow is expected, it seems like it would be worth parameterizing just the parquet tests over the supported versions, but given the run time of the full test suite, it does seem unnecessary to do so with the full test suite.

I agree, but I don't want to make any changes at the moment. There has been a long tail of support for old python and pyarrow versions, and we are starting to bring that back to something sensible, e.g. dropping python 3.7 now and 3.8 will not be too far behind, and there will be corresponding dropping of pyarrow version support. We are also working on an overhaul of the project build and CI scripts on all holoviz repositories, and I expect we will end up with a "minimal versions" CI run to test that CI passes using all of the important minimum versions of dependencies.

Copy link
Member

@ianthomas23 ianthomas23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tests passed so I will merge. Thanks @brl0!

@ianthomas23 ianthomas23 merged commit 3524892 into holoviz:main May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants