multiple zarr files + fsspec.get_mapper #286

Closed
Mikejmnez opened this issue Apr 24, 2020 · 2 comments

Comments

@Mikejmnez

I have a sequence of zarr files distributed across different nodes that I want to read in parallel, while only providing a string (glob-like) path.

The behavior I want to emulate:
For netcdf-files, we can do this using

url = fsspec.open_local(paths)

where paths is given by

paths= '/directoryA/*/subdirectoryB/*.nc'

such that
len(glob(paths)) == len(url)
e.g. 5 (five nc-files distributed across different directories). url is then passed as an argument to xarray.open_mfdataset.

The problem
zarr stores are opened through a mapper (url = fsspec.get_mapper(paths), with url passed as an argument to xarray.open_zarr), and a glob-like path does not work as compactly as it does with fsspec.open_local() and nc-files. That is, given

paths= '/directoryA/*/subdirectoryB/*'

(where the zarr stores appear as directories) we get

len(fsspec.get_mapper(paths)) == 0

That is, if you try it, len(glob(paths)) > 0 (the stores exist), while the length of the mapper is zero.
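The mismatch is consistent with the mapper treating the string as a literal root directory instead of expanding the wildcards. The stdlib-only sketch below shows that difference under that assumption: `os.walk` stands in for the mapper's key listing, and the directory names and the `count_files_under` helper are made up for illustration.

```python
import glob
import os
import tempfile


def count_files_under(root):
    """Count files below a literal root directory (no wildcard expansion)."""
    return sum(len(files) for _, _, files in os.walk(root))


with tempfile.TemporaryDirectory() as base:
    # Two zarr-like stores (directories holding a .zgroup file) on two "nodes".
    for node in ("node1", "node2"):
        store = os.path.join(base, node, "subdirectoryB", "data.zarr")
        os.makedirs(store)
        open(os.path.join(store, ".zgroup"), "w").close()

    pattern = os.path.join(base, "*", "subdirectoryB", "*")
    # Expanding the pattern finds both store directories.
    n_matches = len(glob.glob(pattern))
    # Using the pattern as a literal directory name finds nothing,
    # because no directory is literally called "*".
    n_literal = count_files_under(pattern)
```

Here `n_matches` counts the stores found by glob expansion, while `n_literal` counts what is found when the same string is used as a plain directory path.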

A solution to the problem is to pass the glob-like path directly to _open_zarr (with suitable modifications to the _open_zarr function, much like xarray.open_mfdataset). I am just wondering whether fsspec.get_mapper(paths) can take a glob-like path string and I just haven't figured out how yet...

@martindurant
Member

This falls between some concepts:

  • zarr has a very well-defined spec, and would not, I think, be interested in changing its open functions to allow for multiple mappers
  • fsspec could take a glob in get_mapper and produce a set of mappers, but it's not clear how they would be collected into one output; for zarr to read this, it would also need a "virtual" .zgroup file

So indeed, intake-xarray could do this (glob -> list of mappers -> list of xarrays to be joined), or xarray itself could do this, like mfdataset. Note that since zarr may lean more on fsspec in the future ( zarr-developers/zarr-python#546 ), it may make sense to discuss this with them and/or xarray.
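A minimal sketch of the glob -> list of mappers -> list of xarrays pipeline, assuming local zarr (v2) stores that all share the dimension to be joined along. `find_zarr_stores` and `open_mzarr_sketch` are hypothetical helper names, not part of fsspec, xarray or intake-xarray; the `.zgroup`/`.zarray`/`.zmetadata` check is just one way to tell a store from an ordinary directory.

```python
import glob
import os


def find_zarr_stores(pattern):
    """Expand a glob and keep only directories that look like zarr (v2) stores."""
    markers = (".zgroup", ".zarray", ".zmetadata")
    stores = []
    for path in sorted(glob.glob(pattern)):
        if os.path.isdir(path) and any(
            os.path.exists(os.path.join(path, marker)) for marker in markers
        ):
            stores.append(path)
    return stores


def open_mzarr_sketch(pattern, concat_dim, **open_kwargs):
    """Hypothetical helper: one mapper per store, one dataset per mapper, then concat."""
    # Imported lazily so the glob helper above works without these packages.
    import fsspec
    import xarray as xr

    stores = find_zarr_stores(pattern)
    if not stores:
        raise FileNotFoundError(f"no zarr stores match {pattern!r}")
    mappers = [fsspec.get_mapper(path) for path in stores]
    datasets = [xr.open_zarr(mapper, **open_kwargs) for mapper in mappers]
    return xr.concat(datasets, dim=concat_dim)
```

With the layout from the issue, a call might look like `open_mzarr_sketch('/directoryA/*/subdirectoryB/*', concat_dim='time')`, where `'time'` is only a placeholder dimension name. This version expands the glob with the local `glob` module, so it would not cover remote filesystems.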

@Mikejmnez
Author

Mikejmnez commented Apr 24, 2020

Thanks @martindurant , this is very helpful. I agree that it would be nice to follow up with zarr developers.

One thing:
When calling xarray.open_mfdataset through intake-xarray, my understanding after going through the code is that the glob path is interpreted at the intake-xarray level. Even though xarray.open_mfdataset can accept a glob path or a list of paths directly, it is in intake.netcdf.py that the glob path is turned into a list before being passed to xarray. This happens in the definition of _open_dataset, lines 50-64:

url = fsspec.open_local(self.url_path, **self.storage_options)

If url is originally a glob, fsspec.open_local returns a list which is then passed to xarray.open_mfdataset.

I wrote an xarray.open_mzarr emulating the behavior of xarray.open_mfdataset, which can also read multiple files from a glob path or a list of paths. That is how I stumbled upon this issue. Note that fsspec.open_local does not work with zarr files, since that function interprets them as directories.

Finally
It is possible, as you say, to leave the interpretation of the glob path to xarray. This could be done in intake.xzarr.py, within the definition of _open_dataset, with something like

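    # Globs and lists of paths are passed to xarray untouched;
    # a single path is wrapped in an fsspec mapper as before.
    if isinstance(self.urlpath, list) or "*" in self.urlpath:
        self._mapper = self.urlpath
    else:
        self._mapper = fsspec.get_mapper(self.urlpath, **self.storage_options)
    self._ds = _open_zarr(self._mapper, chunks=self.chunks, **kwargs)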

with self._mapper being passed to xarray. If I do this, there is no problem: xarray.open_mzarr creates the dataset as intended (much like xarray.open_mfdataset), but I wonder if there is something I am missing by doing this...
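A minimal sketch of that dispatch as a standalone function, which makes it easier to see what each branch does. `is_multi_store` and `resolve_target` are hypothetical names; treating `?` and `[` as glob characters in addition to `*` is an assumption beyond the snippet above. One thing the sketch makes visible: on the glob/list branch the storage options are never used, so that branch probably only behaves as intended for paths the downstream reader can resolve on its own (e.g. local ones).

```python
GLOB_CHARS = "*?["


def is_multi_store(urlpath):
    """True for a list/tuple of paths or a string containing a glob character."""
    if isinstance(urlpath, (list, tuple)):
        return True
    return any(ch in urlpath for ch in GLOB_CHARS)


def resolve_target(urlpath, storage_options=None, get_mapper=None):
    """Return urlpath unchanged for globs/lists, else a mapper for the single store."""
    if is_multi_store(urlpath):
        # storage_options are not applied here; the caller (e.g. an
        # open_mzarr-style function) has to expand and open the paths itself.
        return urlpath
    if get_mapper is None:
        # Imported lazily so the pure-Python branch works without fsspec.
        import fsspec

        get_mapper = fsspec.get_mapper
    return get_mapper(urlpath, **(storage_options or {}))
```

The `get_mapper` parameter exists only so the mapper factory can be swapped out (for instance in tests); by default the function falls back to `fsspec.get_mapper`.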
