
Include filename or path in open_mfdataset #2550

Closed · jsignell opened this issue Nov 8, 2018 · 19 comments

@jsignell (Contributor) commented Nov 8, 2018

When reading from multiple files, there is sometimes information encoded in the filename. For example, these GRIB files encode the time: ./ST4.2018092500.01h, ./ST4.2018092501.01h. It seems like a generally useful feature would be to allow passing a kwarg (such as path_as_coord or something) that would define a coord with one entry for the data from each file.

I think the code change would be small:

if path_as_coord:
    ds = ds.assign_coords(path=file_name)

In use it would be like:

>>> xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], engine='pynio', concat_dim='path')
<xarray.Dataset>
Dimensions:  (x: 881, y: 1121, path: 2)
Coordinates:
    lat      (x, y) float32 23.116999 ... 45.618984
    lon      (x, y) float32 -119.023 ... -59.954613
  * path     (path) <U20 './ST4.2018092500.01h' './ST4.2018092501.01h'
Dimensions without coordinates: x, y
Data variables:
    var_0    (path, x, y) float32 dask.array<shape=(2, 881, 1121), chunksize=(1, 881, 1121)>
    var_1    (path, x, y) float32 dask.array<shape=(2, 881, 1121), chunksize=(1, 881, 1121)>

For context, I have implemented something similar in dask: dask/dask#3908.
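For illustration, the same effect can be had today at the user level by opening each file separately and concatenating; a minimal sketch, assuming the pynio engine and the paths above:

import xarray as xr

paths = ['./ST4.2018092500.01h', './ST4.2018092501.01h']
# Tag each per-file dataset with its source path as a scalar coord,
# then concatenate along the new 'path' dimension.
datasets = [xr.open_dataset(p, engine='pynio').assign_coords(path=p)
            for p in paths]
ds = xr.concat(datasets, dim='path')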

@dcherian (Contributor) commented Nov 8, 2018

There is a preprocess argument. You provide a function and it is run on every file.

@jsignell (Contributor, Author) commented Nov 8, 2018

> There is a preprocess argument. You provide a function and it is run on every file.

Yes, but the input to that function is just the ds; I couldn't figure out a way to get the filename from within a preprocess function. This is what I was doing to poke around in there:

def func(ds):
    import pdb; pdb.set_trace()

xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], 
                   engine='pynio', concat_dim='path', preprocess=func)

@jhamman (Member) commented Nov 8, 2018

@jsignell - perhaps not a very pretty solution, but we do save the source of each variable in the encoding dictionary.

ds['varname'].encoding['source']

Presumably, you could unpack this via a preprocess step.

@jsignell (Contributor, Author) commented Nov 8, 2018

@jhamman that looks pretty good, but I'm not seeing the source in the encoding dict. Is this what you were expecting?

def func(ds):
    var = next(var for var in ds)
    return ds.assign(path=ds[var].encoding['source'])

xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], 
                   engine='pynio', concat_dim='path', preprocess=func)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-49-184da62ce353> in <module>()
----> 1 ds = xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], engine='pynio', concat_dim='path', preprocess=func)

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, autoclose, parallel, **kwargs)
    612     file_objs = [getattr_(ds, '_file_obj') for ds in datasets]
    613     if preprocess is not None:
--> 614         datasets = [preprocess(ds) for ds in datasets]
    615 
    616     if parallel:

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in <listcomp>(.0)
    612     file_objs = [getattr_(ds, '_file_obj') for ds in datasets]
    613     if preprocess is not None:
--> 614         datasets = [preprocess(ds) for ds in datasets]
    615 
    616     if parallel:

<ipython-input-48-fd450fa1393a> in func(ds)
      1 def func(ds):
      2     var = next(var for var in ds)
----> 3     return ds.assign(path=ds[var].encoding['source'])

KeyError: 'source'

xarray version: '0.11.0+1.g575e97ae'

@shoyer (Member) commented Nov 8, 2018

Hmm. It really seems like the preprocess function should receive the filename along with the dataset.

@jhamman (Member) commented Nov 9, 2018

@shoyer and @jsignell - I'd also be happy to see this added to the preprocess function. Ideally the function signature would look like:

def preprocess(ds, filename=None):
    ...
    return ds

This would avoid a breaking change and allow us to add additional kwargs at a later date if need be.
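With that signature, a user could then attach the filename directly; a hypothetical sketch of the call site, assuming open_mfdataset passes the filename:

def func(ds, filename=None):
    # Attach the source path as a scalar coordinate so the
    # per-file datasets can be concatenated along 'path'.
    return ds.assign_coords(path=filename)

xr.open_mfdataset(paths, concat_dim='path', preprocess=func)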

@shoyer (Member) commented Nov 9, 2018

@jhamman The problem is that xarray needs a way to figure out what arguments it can safely pass to preprocess, i.e., it needs to inspect the preprocess function and see if it can handle a filename argument. It's not obvious what the best way to do this in a backwards-compatible way is...

@dcherian (Contributor) commented Nov 9, 2018

Hmm... Sorry @jsignell. I thought preprocess passed the filename too.

@jsignell (Contributor, Author) commented Nov 9, 2018

Maybe we can inspect the preprocess function like this:

>>> preprocess = lambda a, b: print(a, b)
>>> preprocess.__code__.co_varnames
('a', 'b')

The result is ordered, so the first name can always be ds regardless of what it is called, and then we can look for special names (like filename) in the rest.

From this answer: https://stackoverflow.com/a/4051447/4021797
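Note that co_varnames also lists local variable names, not just arguments, so inspect.signature may be a more robust way to do the dispatch. A minimal sketch of the check (the helper name is hypothetical):

import inspect

def _apply_preprocess(preprocess, ds, filename):
    # Pass the filename only if the user's function declares a
    # parameter for it; otherwise keep the old one-argument call.
    if 'filename' in inspect.signature(preprocess).parameters:
        return preprocess(ds, filename=filename)
    return preprocess(ds)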

@shoyer (Member) commented Nov 9, 2018 via email

@dcherian (Contributor) commented Nov 9, 2018

A dirty fix would be to add an attribute to each dataset.

@jsignell (Contributor, Author) commented Nov 9, 2018

> A dirty fix would be to add an attribute to each dataset.

I thought @jhamman was suggesting that this already exists, but I couldn't find it: #2550 (comment)

@dcherian (Contributor) commented Nov 9, 2018

True, maybe we should track down why that isn't happening with your dataset.

@jsignell (Contributor, Author) commented

Ah, I don't think I understood that adding source to encoding was a new addition. On latest master (0.11.0+3.g70e9eb8) this works fine:

def func(ds):
    var = next(var for var in ds)
    return ds.assign(path=ds[var].encoding['source'])

ds = xr.open_mfdataset(['./air_1.nc', './air_2.nc'], concat_dim='path', preprocess=func)

I do think it is misleading, though, that after you've concatenated the data, encoding['source'] on a concatenated variable seems to be the first path.

>>> ds['air'].encoding['source'] 
'~/air_1.nc'

I'll close this one though since there is a clear way to access the filename. Thanks for the tip @jhamman!

@shoyer (Member) commented Nov 19, 2018

I'm not sure .encoding['source'] should really be relied upon -- it wasn't really an intentional API decision. But I guess it's harmless enough to include it...

@jsignell (Contributor, Author) commented

Should I add a test that expects .encoding['source'] to ensure its continued presence?
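Such a test might look roughly like this; a sketch assuming the netCDF4 backend keeps populating each variable's encoding['source'] (names and pytest fixture are illustrative):

import xarray as xr

def test_encoding_source(tmp_path):
    # Round-trip a tiny dataset and check that the source path
    # is recorded on the variable's encoding.
    path = str(tmp_path / 'test.nc')
    xr.Dataset({'x': ('t', [1, 2, 3])}).to_netcdf(path)
    with xr.open_dataset(path) as ds:
        assert ds['x'].encoding['source'] == path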

@shoyer (Member) commented Nov 19, 2018 via email

@jsignell (Contributor, Author) commented

Having started writing a test, I now think that encoding['source'] is backend-specific. Here it is implemented in netCDF4:

encoding['source'] = self._filename

but I don't see it for pynio, for instance:

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k in self.ds.dimensions if self.ds.unlimited(k)])
    return encoding

Is this something that we want to mandate that backends provide?

@shoyer (Member) commented Nov 19, 2018

> Is this something that we want to mandate that backends provide?

I think it would be better to do this systematically, e.g., inside xarray.open_dataset(). We would need to verify that filename_or_obj is provided as a string, but if so, we could add it into encoding on the Dataset object.
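A minimal sketch of that approach (the helper is hypothetical and would be called near the end of open_dataset on the decoded Dataset):

def _record_source(ds, filename_or_obj):
    # Record the source on the Dataset itself rather than relying
    # on each backend to set it per variable.
    if isinstance(filename_or_obj, str):
        ds.encoding['source'] = filename_or_obj
    return ds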
