-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
concat_dim getting added to *all* variables of multifile datasets #2064
Comments
cc @pwolfram |
What happens if you open multiple files with |
@shoyer, in that case as well, |
👍 This is a persistent problem for me as well. I often find myself writing a preprocessor function like this def process_coords(ds, concat_dim='time', drop=True):
coord_vars = [v for v in ds.data_vars if concat_dim not in ds[v].dims]
if drop:
return ds.drop(coord_vars)
else:
return ds.set_coords(coord_vars)
ds = xr.open_mfdataset('*.nc', preprocess=process_coords) The reason to drop the coordinates is to avoid the comparison that happens when you concatenate coords. |
@shoyer, I stand corrected. in 0.10.1, I also see the |
@rabernat, so you're fooling xarray into not including the |
Exactly. They are coordinates. Those variables are usually related to grid geometry or constants, as I presume is |
Yes, true. I'm trying to think if there are any examples where the fixed-in-time variables would not be coordinates. So far, none come to mind. Thanks for the tip. |
But this issue raises an important basic point: we might want different behavior for variables in which |
OK, in that case I think #2048 was still the right change/bug-fix, making multi-file and single-file behavior consistent. But you certainly have exposed a real issue here.
Yes, we shouldn't implicitly add a new dimensions to variables in the case where the dimension already exists in the dataset. We only need the heuristics/comparisons when an entirely new dimension is being added. |
Yes, I think that's exactly right. |
@rabernat, your suggestion above has worked perfectly to get our unit tests working again in MPAS-Analysis so that will tide us over until this issue can be addressed directly in xarray. Thanks! |
I'm glad! FWIW, I think this is a relatively simple fix within xarray. @xylar, if you are game, we would love to see a PR from you. Could be a good opportunity to learn more about xarray internals. |
Hmm, I agree that it shouldn't be hard but I don't really have time to do this right now. If no one has had a chance to look into it by mid May I might be able to take it on then. |
I recently ran into a similar issue and found a potential solution. The functionality, as far as I understand, is already in the I think this is a very non-intrusive solution since it only affects the This is my first time attempting to contribute, does this sound like a good idea? I can try to make a pull request but would very much value some input first. |
The comment from henrica above gives a solution to @xylar 's issue. Here is the original example where I added : data_vars='minimal'. #!/usr/bin/env python3
import xarray
ds = xarray.open_mfdataset('example_jan.nc', concat_dim='Time', data_vars='minimal')
print(ds)
|
I'm a new developer at the SciPy 2019 Sprints, and I'm interested in making a pull request for this issue as a learning step. @henrikca Would this be useful? Or would this conflict with your work? I went ahead and made a pull request because it looks like a very small change, potentially. |
So there are some units tests that assert the behavior for open_mfdataset() is identical to the behavior for concat(). This implies that if we change the default data_vars value from "all" to "minimal" for one function, we need to change it for both functions. @shoyer I think you suggested that concat() default behavior should change in #2145, in the same way it will change for open_mfdataset. So I am going to add this change to the pull request. UPDATE: There is a problem with changing concat() away from having data_ vars='all'. This breaks many unit tests that check for compatibility with Pandas. What I've been told is that Pandas concat() will include all unique variables from each dataframe. This is what data_vars='all' will also do. By changing to data_vars='minimal', only data variables with the specified concatenation dimension will be included. So it seems that in order to stay compatible with Pandas, we need to include all data variables, but not add the concatenation dimension to data variables that do not already have that dimension. The problem, however, is what to do when both datasets have a variable Please, could someone confirm that I have understood the problem correctly? Thank you in advance. |
@bonnland I don't think you want to change the default
|
@dcherian . I believe you are correct in principle, but there is a logical problem that is expensive to evaluate. The difficult case is when two datasets have a variable with the same name, and that variable does not include the concatenation dimension. In order to align the datasets for concatenation, both variables would need to be identical. The resulting dataset would just have one (unchanged) instance of that variable, say from the first dataset. I think someone along the way decided this operation was too expensive. This is from concat.py, lines 302-307:
So I think some consensus needs to be reached, about whether it is a good idea to load these variables into memory to check for identical-ness between them. Or another possibility is that we leave "unique" variables alone: if a variable exists only once across all datasets being concatenated, we do not add the concatenation dimension to it. This might solve @xylar original poster's issue when opening a single dataset. |
The logic for determining which variables to concatenate is in the Line 146 in 539fb4a
Only Right now we also have
Recall that
Here's my notebook testing this out: https://gist.github.com/shoyer/f44300eddda4f7c476c61f76d1df938b So I'm thinking that we probably want to combine "all" and "minimal" into a single mode to use as the default, and remove the other behavior, which is either useless or broken. Maybe it would make sense to come up with a new name for this mode, and to make both |
@shoyer Your explanation makes sense, but there are unit tests that expect the default concat() behavior to be the same as default behavior for Pandas concat(), which tries to perform an "outer" join between dataframes. Therefore, from my limited understanding, the default behavior for xarray concat() should be to preserve all variables. If this default behavior changes, then it may break code making these expectations. Can we get a perspective from the author of concat.py, @TomNicholas ? Thanks. Specifically, what should the default behavior of concat() be, when both datasets include a variable that does not include the concatenation dimension? Currently, the concat dimension is added, and the result is a "stacked" version of the variable. Others have argued that this variable should not be included in the concat() result by default, but this appears to break compatibility with Pandas concat(). Another possibility could be to include the first instance of the variable in the result set, throwing away any other instances of the same variable, so a "stacking" dimension is not needed. This would potentially lose information if the variable instances are not identical, however. |
@shoyer I'm sorry I didn't look at your examples more closely at first. I see now that your first example of using data_vars='minimal' is already preserving one instance of the variable x, and I was suggesting earlier that this variable was not being included in the concatenation. So I am not clear on why so many unit tests fail when I switch the default value for data_vars to 'minimal'. The output from your examples seems compatible with Pandas concat, though I don't understand Pandas very well yet. I wonder if the unit tests that fail are written correctly. I have to add that I spent an entire day trying to understand the code in concat.py, by stepping through it for several unit tests. I found the code quite difficult to understand. |
Can you give a specific example of the behavior in question? |
Here is the most specific thing I can say: If I switch the default value for data_vars to 'minimal' for concat() and open_mfdataset(), then I get a lot of failing unit tests (when running "pytest xarray -n 4". I may be wrong about why they are failing. The unit tests have comments in them, like "Check pandas compatibility"; see for example, line 370 in test_duck_array_ops.py for an example instruction that raises a ValueError exception. Many failures appear to be caused by a ValueError exception being raised, like in the final example you have in your notebook. I hope this is specific enough; I realize that I'm not deeply comprehending what the unit tests are actually supposed to be testing. UPDATE: @shoyer it could be that unit tests are failing because, as your final example shows, you get an error for data_vars='minimal' if any variables have different values across datasets, when adding a new concatentation dimension. If this is the reason so many unit tests are failing, then the failures are a red herring and should probably be ignored/rewritten. |
This seems very likely to me. The existing behavior of Xarray's unit test suite is definitely a good "smoke test" for understanding the impact of changes to The tests we should feel free to rewrite are cases where we set |
I'm in favour of this. What should we name this mode? One comment on "existing dimensions" mode:
For variables without the dimension, this will still raise a |
I have a draft solution in #3239. It adds a new mode called "sensible" that acts like "all" when the concat dimension doesn't exist in the dataset and acts like "minimal" when the dimension is present. We can decide whether this is the right way i.e. add a new mode but the more fundamental problem is below. The issue is dealing with variables that should not be concatentated in "minimal" mode (e.g. time-invariant non dim coords when concatenating in time). In this case, we want to skip the equality checks in I thought the clean way to do this would be to add the However, So do we want to support all the other @shoyer What do you think? |
I have tried to understand why the xarray developers decided to provide their own options for concatenation. I am not an experienced user of xarray, but I can't find any discussion of how the current options for concatenation were derived. The pandas concat() function uses the option join = {'inner', 'outer', 'left', 'right'} in order to mimic logical database join operations. If there is a reason that xarray cannot do the same, it is not obvious to me. I think the pandas options have the advantage of logical simplicity and traditional usage within database systems. Perhaps the reason is that xarray is modeling collections of variables, rather than a single dataframe, as with pandas. But even then, it seems like the pandas rules can be applied on a per-variable basis. |
Thanks for your input @bonnland.
We do have a What's under discussion here is what to do about variables duplicated across datasets or indeed, how do we know that these variables are duplicated across datasets when concatenating other variables. |
#3239 has been merged. Now What's left is to change defaults to implement @shoyer's comment
|
Code Sample
Using the following example data set:
example_jan.nc
The result from xarray 0.10.2 (and all previous various xarray versions we've worked with):
The results with xarray 0.10.3:
Problem description
The expected behavior for us was that
refBottomDepth
should not haveTime
as a dimension. It does not vary with time and does not have aTime
dimension in the input data set.It seems like #1988 and #2048 were intended to address cases where the
concat_dim
was not yet present in the input files. But in cases whereconcat_dim
is already in the input files, it seems like only those fields that include this dimensions should be concatenated and other fields should remain free ofconcat_dim
. Part of the problem for us is that the number of dimensions of some of our variables change depending on which xarray version is being used.Expected Output
That for 0.10.2 (see above)
Output of
xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.3
pandas: 0.22.0
numpy: 1.14.2
scipy: 1.0.1
netCDF4: 1.3.1
h5netcdf: 0.5.1
h5py: 2.7.1
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.2
distributed: 1.21.6
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: None
setuptools: 39.0.1
pip: 9.0.3
conda: None
pytest: 3.5.0
IPython: None
sphinx: 1.7.2
The text was updated successfully, but these errors were encountered: