
Include filename or path in open_mfdataset #2550

Closed · jsignell opened this issue Nov 8, 2018 · 19 comments

@jsignell (Contributor) commented Nov 8, 2018

When reading from multiple files, there is sometimes information encoded in the filename. For example, these GRIB files encode the time: ./ST4.2018092500.01h, ./ST4.2018092501.01h. It seems like a generally useful feature would be to allow passing a kwarg (such as path_as_coord or something) that would define a coord with one entry for the data from each file.

I think the code change would be small:

if path_as_coord:
    ds = ds.assign_coords(path=file_name)

In use it would be like:

>>> xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], engine='pynio', concat_dim='path')
<xarray.Dataset>
Dimensions:  (x: 881, y: 1121, path: 2)
Coordinates:
    lat      (x, y) float32 23.116999 ... 45.618984
    lon      (x, y) float32 -119.023 ... -59.954613
  * path     (path) <U20 './ST4.2018092500.01h' './ST4.2018092501.01h'
Dimensions without coordinates: x, y
Data variables:
    var_0    (path, x, y) float32 dask.array<shape=(2, 881, 1121), chunksize=(1, 881, 1121)>
    var_1    (path, x, y) float32 dask.array<shape=(2, 881, 1121), chunksize=(1, 881, 1121)>

For context, I have implemented something similar in dask: dask/dask#3908.
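For illustration, the same effect can be had today at the user level by opening each file separately and concatenating; a minimal sketch, assuming the pynio engine and the paths above:

import xarray as xr

paths = ['./ST4.2018092500.01h', './ST4.2018092501.01h']
# Tag each per-file dataset with its source path as a scalar coord,
# then concatenate along the new 'path' dimension.
datasets = [xr.open_dataset(p, engine='pynio').assign_coords(path=p)
            for p in paths]
ds = xr.concat(datasets, dim='path')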

@dcherian (Contributor) commented Nov 8, 2018

There is a preprocess argument. You provide a function and it is run on every file.

@jsignell (Contributor, Author) commented Nov 8, 2018

> There is a preprocess argument. You provide a function and it is run on every file.

Yes, but the input to that function is just the ds; I couldn't figure out a way to get the filename from within a preprocess function. This is what I was doing to poke around in there:

def func(ds):
    import pdb; pdb.set_trace()

xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], 
                   engine='pynio', concat_dim='path', preprocess=func)

@jhamman (Member) commented Nov 8, 2018

@jsignell - perhaps not a very pretty solution, but we do save the source of each variable in the encoding dictionary.

ds['varname'].encoding['source']

Presumably, you could unpack this via a preprocess step.

@jsignell (Contributor, Author) commented Nov 8, 2018

@jhamman that looks pretty good, but I'm not seeing the source in the encoding dict. Is this what you were expecting?

def func(ds):
    var = next(var for var in ds)
    return ds.assign(path=ds[var].encoding['source'])

xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], 
                   engine='pynio', concat_dim='path', preprocess=func)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-49-184da62ce353> in <module>()
----> 1 ds = xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], engine='pynio', concat_dim='path', preprocess=func)

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, autoclose, parallel, **kwargs)
    612     file_objs = [getattr_(ds, '_file_obj') for ds in datasets]
    613     if preprocess is not None:
--> 614         datasets = [preprocess(ds) for ds in datasets]
    615 
    616     if parallel:

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in <listcomp>(.0)
    612     file_objs = [getattr_(ds, '_file_obj') for ds in datasets]
    613     if preprocess is not None:
--> 614         datasets = [preprocess(ds) for ds in datasets]
    615 
    616     if parallel:

<ipython-input-48-fd450fa1393a> in func(ds)
      1 def func(ds):
      2     var = next(var for var in ds)
----> 3     return ds.assign(path=ds[var].encoding['source'])

KeyError: 'source'

xarray version: '0.11.0+1.g575e97ae'

@shoyer (Member) commented Nov 8, 2018

Hmm. It really seems like the preprocess function should receive the filename along with the dataset.

@jhamman (Member) commented Nov 9, 2018

@shoyer and @jsignell - I'd also be happy to see this added to the preprocess function. Ideally the function signature would look like:

def preprocess(ds, filename=None):
    ...
    return ds

This would avoid a breaking change and allow us to add additional kwargs at a later date if need be.
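With that signature, a user could then attach the filename directly; a hypothetical sketch of the call site, assuming open_mfdataset passes the filename:

def func(ds, filename=None):
    # Attach the source path as a scalar coordinate so the
    # per-file datasets can be concatenated along 'path'.
    return ds.assign_coords(path=filename)

xr.open_mfdataset(paths, concat_dim='path', preprocess=func)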

@shoyer (Member) commented Nov 9, 2018

@jhamman The problem is that xarray needs a way to figure out what arguments it can safely pass to preprocess, i.e., it needs to inspect the preprocess function and see if it can handle a filename argument. It's not obvious what the best way to do this in a backwards-compatible way is...

@dcherian (Contributor) commented Nov 9, 2018

Hmm... Sorry @jsignell. I thought preprocess passed the filename too.

@jsignell (Contributor, Author) commented Nov 9, 2018

Maybe we can inspect the preprocess function like this:

>>> preprocess = lambda a, b: print(a, b)
>>> preprocess.__code__.co_varnames
('a', 'b')

The result is ordered, so the first name can always be ds regardless of what it is called, and then we can look for special names (like filename) in the rest.

From this answer: https://stackoverflow.com/a/4051447/4021797
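Note that co_varnames also lists local variable names, not just arguments, so inspect.signature may be a more robust way to do the dispatch. A minimal sketch of the check (the helper name is hypothetical):

import inspect

def _apply_preprocess(preprocess, ds, filename):
    # Pass the filename only if the user's function declares a
    # parameter for it; otherwise keep the old one-argument call.
    if 'filename' in inspect.signature(preprocess).parameters:
        return preprocess(ds, filename=filename)
    return preprocess(ds)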

@shoyer (Member) commented Nov 9, 2018 via email

@dcherian (Contributor) commented Nov 9, 2018

A dirty fix would be to add an attribute to each dataset.

@jsignell (Contributor, Author) commented Nov 9, 2018

> A dirty fix would be to add an attribute to each dataset.

I thought @jhamman was suggesting that this already exists, but I couldn't find it: #2550 (comment)

@dcherian (Contributor) commented Nov 9, 2018

True, maybe we should track down why that isn't happening with your dataset.

@jsignell (Contributor, Author) commented

Ah, I don't think I understood that adding source to encoding was a new addition. On latest master (0.11.0+3.g70e9eb8) this works fine:

def func(ds):
    var = next(var for var in ds)
    return ds.assign(path=ds[var].encoding['source'])

ds = xr.open_mfdataset(['./air_1.nc', './air_2.nc'], concat_dim='path', preprocess=func)

I do think it is misleading, though, that after you've concatenated the data, encoding['source'] on a concatenated variable seems to be the first path.

>>> ds['air'].encoding['source'] 
'~/air_1.nc'

I'll close this one though since there is a clear way to access the filename. Thanks for the tip @jhamman!

@shoyer (Member) commented Nov 19, 2018

I'm not sure .encoding['source'] should really be relied upon -- it wasn't really an intentional API decision. But I guess it's harmless enough to include it...

@jsignell (Contributor, Author) commented

Should I add a test that expects .encoding['source'] to ensure its continued presence?
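Such a test might look roughly like this; a sketch assuming the netCDF4 backend keeps populating each variable's encoding['source'] (names and pytest fixture are illustrative):

import xarray as xr

def test_encoding_source(tmp_path):
    # Round-trip a tiny dataset and check that the source path
    # is recorded on the variable's encoding.
    path = str(tmp_path / 'test.nc')
    xr.Dataset({'x': ('t', [1, 2, 3])}).to_netcdf(path)
    with xr.open_dataset(path) as ds:
        assert ds['x'].encoding['source'] == path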

@shoyer (Member) commented Nov 19, 2018 via email

@jsignell (Contributor, Author) commented

Having started writing a test, I now think that encoding['source'] is backend-specific. Here it is implemented in netCDF4:

encoding['source'] = self._filename

but I don't see it for pynio, for instance:

def get_encoding(self):
    encoding = {}
    encoding['unlimited_dims'] = set(
        [k for k in self.ds.dimensions if self.ds.unlimited(k)])
    return encoding

Is this something that we want to mandate that backends provide?

@shoyer (Member) commented Nov 19, 2018

> Is this something that we want to mandate that backends provide?

I think it would be better to do this systematically, e.g., inside xarray.open_dataset(). We would need to verify that filename_or_obj is provided as a string, but if so, we could add it into encoding on the Dataset object.
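A minimal sketch of that approach (the helper is hypothetical and would be called near the end of open_dataset on the decoded Dataset):

def _record_source(ds, filename_or_obj):
    # Record the source on the Dataset itself rather than relying
    # on each backend to set it per variable.
    if isinstance(filename_or_obj, str):
        ds.encoding['source'] = filename_or_obj
    return ds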
