Support remote string paths for h5netcdf engine #8423

Closed
jrbourbeau opened this issue Nov 7, 2023 · 5 comments

Comments

@jrbourbeau
Contributor

Is your feature request related to a problem?

Currently the h5netcdf engine supports opening remote files, but only as already-open file-like objects (e.g. s3fs.open(...)), not as string paths like s3://.... There are situations where I'd like to use string paths instead of open file-like objects:

  • Opening files can sometimes be slow (xref fsspec/s3fs#816, "Opening lots of files can be slow")
  • When using parallel=True for opening lots of files, serializing open file-like objects back and forth from a remote cluster can be slow
  • Some systems (e.g. NASA Earthdata) only hand out credentials that are valid when run in the same region as the data. Being able to use parallel=True + storage_options would be convenient/performant in that case.
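
For reference, the file-like-object workaround that works today looks roughly like the sketch below. The helper name open_as_file_objects is made up for illustration, and the commented usage assumes s3fs, xarray, and h5netcdf are installed and that the bucket allows anonymous access.

```python
def open_as_file_objects(paths, fs):
    """Open every remote path up front and return the file-like objects.

    `fs` is any fsspec-style filesystem (e.g. s3fs.S3FileSystem()).
    `open_as_file_objects` is a hypothetical helper, not part of xarray.
    """
    return [fs.open(path, mode="rb") for path in paths]


# Usage sketch (not executed here; needs network access and real objects):
#
#   import s3fs
#   import xarray as xr
#
#   fs = s3fs.S3FileSystem(anon=True)  # assumes a public bucket
#   file_objs = open_as_file_objects(files, fs)
#   ds = xr.open_mfdataset(file_objs, engine="h5netcdf")
```

Every file is opened on the client before xarray sees it, which is where the costs listed above come from.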

Describe the solution you'd like

It would be nice if I could do something like the following:

ds = xr.open_mfdataset(
    files,    # A bunch of files like `s3://bucket/file`
    engine="h5netcdf",
    ...
    parallel=True,
    storage_options={...},    # fsspec-compatible options
)

and have my files opened prior to handing off to h5netcdf. storage_options is already supported for Zarr, so hopefully extending to h5netcdf feels natural.
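
To make "opened prior to handing off to h5netcdf" a bit more concrete, here is a minimal sketch of the kind of dispatch I have in mind. It is not xarray's implementation: is_remote_path, open_remote_paths, and the opener argument are hypothetical names, and the default opener assumes fsspec is installed.

```python
import re

# Matches a leading URL scheme such as "s3://" or "https://".
_URL_SCHEME = re.compile(r"^([a-z][a-z0-9+.\-]*)://", re.IGNORECASE)


def is_remote_path(path):
    """Return True for string paths with a non-file URL scheme."""
    if not isinstance(path, str):
        return False
    match = _URL_SCHEME.match(path)
    return match is not None and match.group(1).lower() != "file"


def _fsspec_opener(path, **storage_options):
    # Assumption: fsspec is installed. Extra keyword arguments to
    # fsspec.open are forwarded to the filesystem as storage options.
    import fsspec

    return fsspec.open(path, mode="rb", **storage_options).open()


def open_remote_paths(paths, storage_options=None, opener=_fsspec_opener):
    """Replace remote string paths with open file-like objects.

    Local paths and already-open objects are passed through unchanged.
    """
    options = storage_options or {}
    return [
        opener(path, **options) if is_remote_path(path) else path
        for path in paths
    ]
```

For parallel=True, a step like this would presumably need to run inside the backend on each worker, so that only the string path and storage_options are serialized rather than the open file.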

Describe alternatives you've considered

No response

Additional context

No response

@kmuehlbauer
Contributor

@jrbourbeau At h5netcdf we've recently made the driver kwarg available (not yet released) to enable loading remote files via h5py/hdf5.

See h5netcdf/h5netcdf#220 and #8360.

Would this already help with your use-case as a first step?

@jrbourbeau
Contributor Author

Thanks for pointing me to that @kmuehlbauer!

Based on the linked PRs, driver= definitely seems related, but I'm wondering how it compares to fsspec. fsspec handles local files, S3, GCSFS, HTTPS, etc. and allows users to forward authentication as well (e.g. AWS key and secret in the case of reading from S3). Can I do this with the new driver= functionality?

@kmuehlbauer
Contributor

@jrbourbeau I can't say much to that, unfortunately, since my use-cases are usually local only. So my expertise with cloud access is rather limited.

But you should be able to use authentication, and different sources as well. Maybe @zequihg50 can chime in here with some additional context?

But this will only be possible after an h5netcdf release and some changes to xarray to allow for the additional kwargs.

@kmuehlbauer
Contributor

@jrbourbeau It might be good to merge #8360 first and add your changes on top. So this might take a little time.

@jrbourbeau
Contributor Author

Closed via #9797
