Include filename or path in open_mfdataset #2550
There is a `preprocess` argument. You provide a function and it is run on every file.
Yes, but the input to that function is just the `ds`; I couldn't figure out a way to get the filename from within a `preprocess` function. This is what I was doing to poke around in there:

```python
def func(ds):
    import pdb; pdb.set_trace()

xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'],
                  engine='pynio', concat_dim='path', preprocess=func)
```
@jsignell - perhaps not a very pretty solution, but we do save the source of each variable in the encoding dictionary: `ds['varname'].encoding['source']`. Presumably, you could unpack this via a `preprocess` step.
@jhamman that looks pretty good, but I'm not seeing the source in the encoding dict. Is this what you were expecting?

```python
def func(ds):
    var = next(var for var in ds)
    return ds.assign(path=ds[var].encoding['source'])

xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'],
                  engine='pynio', concat_dim='path', preprocess=func)
```

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-49-184da62ce353> in <module>()
----> 1 ds = xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'], engine='pynio', concat_dim='path', preprocess=func)

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, autoclose, parallel, **kwargs)
    612     file_objs = [getattr_(ds, '_file_obj') for ds in datasets]
    613     if preprocess is not None:
--> 614         datasets = [preprocess(ds) for ds in datasets]
    615
    616     if parallel:

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in <listcomp>(.0)
    612     file_objs = [getattr_(ds, '_file_obj') for ds in datasets]
    613     if preprocess is not None:
--> 614         datasets = [preprocess(ds) for ds in datasets]
    615
    616     if parallel:

<ipython-input-48-fd450fa1393a> in func(ds)
      1 def func(ds):
      2     var = next(var for var in ds)
----> 3     return ds.assign(path=ds[var].encoding['source'])

KeyError: 'source'
```

xarray version: `'0.11.0+1.g575e97ae'`
Hmm. It really seems like the `preprocess` function should be able to see the filename.
@jhamman The problem is that xarray needs a way to figure out what arguments it can safely pass to `preprocess`.
Hmm... Sorry @jsignell. I thought `preprocess` passed the filename too.
Maybe we can inspect the `preprocess` function like this:

```python
>>> preprocess = lambda a, b: print(a, b)
>>> preprocess.__code__.co_varnames
('a', 'b')
```

This response is ordered, so the first one can always be `ds` regardless of its name, and then we can look for special names (like `filename`) in the rest.

From this answer: https://stackoverflow.com/a/4051447/4021797
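A minimal sketch of how that dispatch could work; `apply_preprocess` is a hypothetical helper for illustration, not xarray's actual internals:

```python
def apply_preprocess(preprocess, ds, path):
    # co_varnames also lists locals, so slice to the declared positional args
    code = preprocess.__code__
    params = code.co_varnames[:code.co_argcount]
    if len(params) >= 2:       # e.g. def func(ds, filename): ...
        return preprocess(ds, path)
    return preprocess(ds)      # the existing one-argument convention
```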
The danger with inspecting user-provided functions is that it's pretty fragile, e.g., it fails if you provide a signature like `*args, **kwargs` (which can happen pretty easily with decorators). Probably the best option is to come up with a new keyword argument to replace `preprocess` and to deprecate the current `preprocess` (if we can think of another good name). We could also do a deprecation cycle with `FutureWarning`, but that's pretty painful.
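A small demo of the failure mode described above (not from the thread): a decorated function hides its real parameters from `co_varnames`.

```python
import functools

def log_calls(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print("calling", fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

@log_calls
def func(ds, filename):
    return ds

# The wrapper's code object is inspected, not the wrapped function's,
# so the real (ds, filename) signature is invisible:
print(func.__code__.co_varnames)  # ('args', 'kwargs')
```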
A dirty fix would be to add an attribute to each dataset.
I thought @jhamman was suggesting that already exists, but I couldn't find it: #2550 (comment)
True, maybe we should track down why that isn't happening with your dataset.
Ah, I don't think I understood that this depends on the engine; it does work with regular netCDF files:

```python
def func(ds):
    var = next(var for var in ds)
    return ds.assign(path=ds[var].encoding['source'])

ds = xr.open_mfdataset(['./air_1.nc', './air_2.nc'], concat_dim='path', preprocess=func)
```

I do think it is misleading, though, that after you've concatenated the data, the `source` only refers to the first file:

```python
>>> ds['air'].encoding['source']
'~/air_1.nc'
```

I'll close this one though since there is a clear way to access the filename. Thanks for the tip @jhamman!
I'm not sure.
Should I add a test that expects `.encoding['source']` to ensure its continued presence?
Yes, that sounds great! Potentially this would be a good opportunity for a
doc update, too.
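A sketch of what such a regression test might look like, using the variable-level `encoding['source']` shown earlier in the thread (hypothetical test name; not the test that actually landed):

```python
import xarray as xr

def test_open_dataset_sets_source(tmp_path):
    # pytest's tmp_path fixture provides a temporary directory
    path = str(tmp_path / "test.nc")
    xr.Dataset({"x": ("t", [1, 2, 3])}).to_netcdf(path)
    with xr.open_dataset(path) as ds:
        assert ds["x"].encoding["source"] == path
```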
Having started writing a test, I now think that setting `source` is up to each backend: the netCDF4 backend does it (xarray/backends/netCDF4_.py, line 386 at 70e9eb8), but the pynio backend does not (xarray/backends/pynio_.py, lines 77 to 81 at 70e9eb8). Is this something that we want to mandate that backends provide?
I think it would be better to do this systematically, e.g., inside `open_dataset`.
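A rough sketch of what handling this in one shared place could look like, written here as a standalone wrapper for illustration rather than xarray's actual internals:

```python
import xarray as xr

def open_dataset_with_source(filename_or_obj, **kwargs):
    # set 'source' once for every backend, instead of per-backend
    ds = xr.open_dataset(filename_or_obj, **kwargs)
    if isinstance(filename_or_obj, str):
        for var in ds.variables.values():
            var.encoding.setdefault('source', filename_or_obj)
    return ds
```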
When reading from multiple files, sometimes there is information encoded in the filename, for example the time in these grib files: `./ST4.2018092500.01h`, `./ST4.2018092501.01h`. It seems like a generally useful thing would be to allow passing a kwarg (such as `path_as_coord` or something) that would define a set of coords, with one for the data from each file. I think the code change would be small, and in use it would be like the sketch below. For context, I have implemented something similar in dask: dask/dask#3908
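A hedged sketch of the proposed usage; `path_as_coord` is only the name floated above and does not exist in xarray:

```python
import xarray as xr

# hypothetical keyword: each file's path becomes a value along the
# concatenation dimension, so per-file metadata survives the merge
ds = xr.open_mfdataset(['./ST4.2018092500.01h', './ST4.2018092501.01h'],
                       engine='pynio', concat_dim='path',
                       path_as_coord=True)
print(ds['path'].values)  # one path per input file
```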