Allow fsspec URLs in open_(mf)dataset #4823
Conversation
Docs to be added.
Question: should HTTP URLs be passed through unprocessed, as before? I think that might be required by some of the netCDF engines, but we probably don't test this.
Looking over the changes to …
Co-authored-by: keewis <[email protected]>
Next open question: aside from zarr, few of the other backends will know what to do with fsspec's dict-like mappers. Should we prevent them from passing through? Should we attempt to distinguish between directories and files, and make fsspec file-like objects? We could just allow the backends to fail later on incorrect input.
@martindurant it is OK to let every backend raise errors for unsupported input, so no need to add any additional logic here IMO.
@martindurant with respect to the backend API (old and new) this looks good to me. I don't know …
(please definitely do not merge until I've added documentation) |
     )
-    paths = sorted(glob(_normalize_path(paths)))
+    paths = fs.glob(fs._strip_protocol(paths))
+    paths = [fs.get_mapper(path) for path in paths]
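The globbing path under review can be exercised with fsspec's in-memory filesystem, which needs no network access; the store names below are made up for illustration:

```python
import fsspec

# Fake "stores" on fsspec's in-memory filesystem (illustrative names).
fs = fsspec.filesystem("memory")
fs.pipe("/data/a.zarr/.zgroup", b"{}")
fs.pipe("/data/b.zarr/.zgroup", b"{}")

# Mirror the lines under review: strip the protocol, glob the pattern,
# then wrap each match in the dict-like mapper the zarr backend consumes.
paths = sorted(fs.glob(fs._strip_protocol("memory://data/*.zarr")))
mappers = [fs.get_mapper(p) for p in paths]
print(len(mappers))
```

Each mapper behaves like a dict of key-to-bytes, which is the store interface zarr expects.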
This is a bit tricky. This assumes the backend wants a mapper object (as the zarr backend does). But what if the glob returns a list of netCDF files? Wouldn't we want a list of file objects?
Right, this is my comment about "should we actually special-case zarr". It could make file objects; for now it would just error. We don't have tests for this, but now might be the time to start.
Now tracking with the comments above, I think we have two options:
1. inject some backend-specific logic into the API here to decide what sort of object to return (if engine == 'zarr', return a mapper; else return a file object)
2. only support globbing remote zarr stores

(1) seems to be the more reasonable thing to do here, but it is slightly less principled, as we've been working to cleanly separate the API from the backends.
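A minimal sketch of option (1); the helper name `_expand_remote_paths` is hypothetical, not xarray's actual code, and the in-memory filesystem and file names are stand-ins:

```python
import fsspec

def _expand_remote_paths(fs, pattern, engine):
    # Hypothetical helper: glob a remote pattern and return what each
    # backend expects: mappers for zarr, file-like objects otherwise.
    paths = sorted(fs.glob(fs._strip_protocol(pattern)))
    if engine == "zarr":
        return [fs.get_mapper(p) for p in paths]
    return [fs.open(p, mode="rb") for p in paths]

# Exercise it against the in-memory filesystem with a made-up file.
fs = fsspec.filesystem("memory")
fs.pipe("/remote/x.nc", b"fake netcdf bytes")
files = _expand_remote_paths(fs, "memory://remote/*.nc", engine="netcdf4")
data = files[0].read()
```

The branch on `engine` is exactly the API/backend coupling the comment above calls "less principled".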
I suppose a third alternative might be to pass the paths through, and create mappers in the zarr backend (will re-instantiate the FS, but that's fine) and add the opening of remote files into each of the other backends that can handle it.
I have essentially done (1), but excluded HTTP from the non-zarr path, because it has a special place for some backends (dap...). In any case, I don't suppose anyone is using globbing with HTTP, since it's generally unreliable.
I am marking this PR as ready, but please ask me for specific test cases that might be relevant and should be included.
@martindurant Should be fixed by #4845. Probably just needs a rebase.
Thanks, @kmuehlbauer |
xarray/backends/api.py (outdated)
"{!r}. Instead, supply paths as an explicit list of strings.".format(
    paths
)
from fsspec import open_files
Do you need to handle the case when fsspec is not installed?
As it is, this would raise ImportError. Would it be better to have a try/except ImportError with raise ... from, plus an extra sentence?
"To open fsspec-compatible [remote?] URLs, you must install fsspec"
Note that you might still get an error in this block if, for example, it's an s3 URL, but s3fs is not installed.
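The try/except being discussed could look like this sketch (the wrapper name and message are illustrative, not the code merged in the PR):

```python
def _import_fsspec_open_files():
    # Lazily import fsspec.open_files, re-raising with a friendlier hint.
    try:
        from fsspec import open_files
    except ImportError as err:
        raise ImportError(
            "Opening fsspec-compatible remote URLs requires fsspec; "
            "you may also need the driver for your protocol, "
            "e.g. s3fs for s3:// URLs."
        ) from err
    return open_files

# Returns the callable when fsspec is installed, otherwise raises the
# enriched ImportError (the original traceback is chained via "from").
try:
    open_files = _import_fsspec_open_files()
    fsspec_status = "available"
except ImportError:
    fsspec_status = "missing"
print(fsspec_status)
```

As noted above, this only helps with a missing fsspec itself; a missing protocol driver such as s3fs would still surface its own error later.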
I thought it was fine because L893 cannot be reached without fsspec.
(I don't really mind; since you mention backward compatibility, I am also fine to leave it as is.)
I have decided, on reflection, to reduce the scope here and only implement this for zarr for now, since, frankly, I am confused about what should happen for the other backends, and they are not tested. Yes, some of them are happy to accept file-like objects, but others either don't do that at all or want the URL passed through. My code would have changed how things were handled depending on whether it went through open_dataset or open_mfdataset. Best would be to set up a set of expectations as tests.
I think my question on SO is related to this PR: https://stackoverflow.com/questions/66145459/open-mfdataset-on-remote-zarr-store-giving-zarr-errors-groupnotfounderror. I was looking at reading a remote zarr store. @martindurant suggested putting the single "file" (mapping) in a list, which works, but I also wanted to test the other suggestion.
On current xarray master I get: …
On this branch it works.
@raybellwaves, might I paraphrase to "this PR is useful, please merge!"?
some minor comments, but otherwise looks good to me.
with raises_regex(ValueError, "wild-card"):
    open_mfdataset("http://some/remote/uri")

@requires_fsspec
def test_open_mfdataset_no_files(self):
    pytest.importorskip("aiobotocore")
This means the test is always skipped. Should aiobotocore be added to some of the CI environments?
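For context on why the test above is silently skipped: pytest.importorskip returns the module when it can be imported, and otherwise raises pytest's Skipped outcome so the test body never runs. A small illustration (the module names are arbitrary):

```python
import pytest

# An importable (stdlib) module: importorskip simply returns it.
json_mod = pytest.importorskip("json")

# A nonexistent module: importorskip raises pytest's Skipped exception,
# which derives from BaseException, so the rest of the test never runs.
try:
    pytest.importorskip("definitely_not_installed_xyz")
    outcome = "imported"
except BaseException as exc:
    outcome = type(exc).__name__
print(outcome)
```

So unless aiobotocore is present in a CI environment, the test reports as skipped rather than failing.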
could you add …
Can someone please explain the minimum version policy check that is failing?
because it pins other deps too tightly, which makes the solver panic
if you wait two days the minimum version policy check should pass with …
There are 3 approvals and @martindurant has been quite patient! Thanks, merging. We can update min_deps_check later.
Thank you, @dcherian |
* upstream/master:
  - FIX: h5py>=3 string decoding (pydata#4893)
  - Update matplotlib's canonical (pydata#4919)
  - Adding vectorized indexing docs (pydata#4711)
  - Allow fsspec URLs in open_(mf)dataset (pydata#4823)
  - Fix typos in example notebooks (pydata#4908)
  - pre-commit autoupdate CI (pydata#4906)
  - replace the ci-trigger action with a external one (pydata#4905)
  - Update area_weighted_temperature.ipynb (pydata#4903)
  - hide the decorator from the test traceback (pydata#4900)
  - Sort backends (pydata#4886)
  - Compatibility with dask 2021.02.0 (pydata#4884)
pre-commit run --all-files
whats-new.rst
api.rst