xr.combine_nested() fails when passed nested DataSets #3315

Open
friedrichknuth opened this issue Sep 17, 2019 · 8 comments
Labels: topic-combine combine/concat/merge

@friedrichknuth

xr.__version__
'0.13.0'

xr.combine_nested() works when passed a nested list of DataArray objects.

da1 = xr.DataArray(name="a", data=[[0]], dims=["x", "y"])
da2 = xr.DataArray(name="b", data=[[1]], dims=["x", "y"])
da3 = xr.DataArray(name="a", data=[[2]], dims=["x", "y"])
da4 = xr.DataArray(name="b", data=[[3]], dims=["x", "y"])
xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=["x", "y"])

returns

<xarray.DataArray 'a' (x: 2, y: 2)>
array([[0, 1],
       [2, 3]])
Dimensions without coordinates: x, y

but fails if passed a nested list of Dataset objects.

ds1 = da1.to_dataset()
ds2 = da2.to_dataset()
ds3 = da3.to_dataset()
ds4 = da4.to_dataset()
xr.combine_nested([[ds1, ds2], [ds3, ds4]], concat_dim=["x", "y"])

returns

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-c0035883fc68> in <module>
      3 ds3 = da3.to_dataset()
      4 ds4 = da4.to_dataset()
----> 5 xr.combine_nested([[ds1, ds2], [ds3, ds4]], concat_dim=["x", "y"])

~/repos/contribute/xarray/xarray/core/combine.py in combine_nested(datasets, concat_dim, compat, data_vars, coords, fill_value, join)
    462         ids=False,
    463         fill_value=fill_value,
--> 464         join=join,
    465     )
    466 

~/repos/contribute/xarray/xarray/core/combine.py in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join)
    305         coords=coords,
    306         fill_value=fill_value,
--> 307         join=join,
    308     )
    309     return combined

~/repos/contribute/xarray/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join)
    196             compat=compat,
    197             fill_value=fill_value,
--> 198             join=join,
    199         )
    200     (combined_ds,) = combined_ids.values()

~/repos/contribute/xarray/xarray/core/combine.py in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join)
    218         datasets = combined_ids.values()
    219         new_combined_ids[new_id] = _combine_1d(
--> 220             datasets, dim, compat, data_vars, coords, fill_value, join
    221         )
    222     return new_combined_ids

~/repos/contribute/xarray/xarray/core/combine.py in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join)
    246                 compat=compat,
    247                 fill_value=fill_value,
--> 248                 join=join,
    249             )
    250         except ValueError as err:

~/repos/contribute/xarray/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    131             "objects, got %s" % type(first_obj)
    132         )
--> 133     return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    134 
    135 

~/repos/contribute/xarray/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join)
    363     for k in datasets[0].variables:
    364         if k in concat_over:
--> 365             vars = ensure_common_dims([ds.variables[k] for ds in datasets])
    366             combined = concat_vars(vars, dim, positions)
    367             assert isinstance(combined, Variable)

~/repos/contribute/xarray/xarray/core/concat.py in <listcomp>(.0)
    363     for k in datasets[0].variables:
    364         if k in concat_over:
--> 365             vars = ensure_common_dims([ds.variables[k] for ds in datasets])
    366             combined = concat_vars(vars, dim, positions)
    367             assert isinstance(combined, Variable)

~/repos/contribute/xarray/xarray/core/utils.py in __getitem__(self, key)
    383 
    384     def __getitem__(self, key: K) -> V:
--> 385         return self.mapping[key]
    386 
    387     def __iter__(self) -> Iterator[K]:

KeyError: 'a'
@dcherian
Contributor

This honestly makes no sense to me.

da1 = xr.DataArray(name="a", data=[[0]], dims=["x", "y"])
da2 = xr.DataArray(name="b", data=[[1]], dims=["x", "y"])
da3 = xr.DataArray(name="a", data=[[2]], dims=["x", "y"])
da4 = xr.DataArray(name="b", data=[[3]], dims=["x", "y"])
xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=["x", "y"])

These are DataArrays with two different names. Why is this the expected result?

<xarray.DataArray 'a' (x: 2, y: 2)>
array([[0, 1],
       [2, 3]])
Dimensions without coordinates: x, y

That error arises because it's trying to concatenate data_vars a and b, but some of the datasets don't have a. If you set those DataArrays to have the same name, this will work.

da1 = xr.DataArray(name="a", data=[[0]], dims=["x", "y"])
da2 = xr.DataArray(name="a", data=[[1]], dims=["x", "y"])
da3 = xr.DataArray(name="a", data=[[2]], dims=["x", "y"])
da4 = xr.DataArray(name="a", data=[[3]], dims=["x", "y"])

ds1 = da1.to_dataset()
ds2 = da2.to_dataset()
ds3 = da3.to_dataset()
ds4 = da4.to_dataset()
xr.combine_nested([[ds1, ds2], [ds3, ds4]], concat_dim=["x", "y"])
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Dimensions without coordinates: x, y
Data variables:
    a        (x, y) int64 0 1 2 3

ping @TomNicholas

@TomNicholas
Member

TomNicholas commented Sep 18, 2019 via email

@dcherian
Contributor

Yes:

def test_concat_name_symmetry(self):
    """Inspired by the discussion on GH issue #2777"""
    da1 = DataArray(name="a", data=[[0]], dims=["x", "y"])
    da2 = DataArray(name="b", data=[[1]], dims=["x", "y"])
    da3 = DataArray(name="a", data=[[2]], dims=["x", "y"])
    da4 = DataArray(name="b", data=[[3]], dims=["x", "y"])
    x_first = combine_nested([[da1, da2], [da3, da4]], concat_dim=["x", "y"])
    y_first = combine_nested([[da1, da3], [da2, da4]], concat_dim=["y", "x"])
    assert_identical(x_first, y_first)

@TomNicholas
Member

TomNicholas commented Sep 18, 2019 via email

@TomNicholas
Member

Okay something has definitely gone wrong here.

My intention with that test was to check that the order of operations doesn't matter, but you're right that the test as written makes no sense. It would probably be a good idea to remove this test and check that property correctly by adding a second assert to the (poorly-named) test_auto_combine_2d:

# Prove it works symmetrically
# (`ds(n)` and `expected` are already defined in the existing test_auto_combine_2d)
datasets = [[ds(0), ds(3)],
            [ds(1), ds(4)],
            [ds(2), ds(5)]]
result = combine_nested(datasets, concat_dim=["dim2", "dim1"])
assert_equal(result, expected)

(This passes fine)

However, that still leaves the question of why this nonsensical test is passing.

I think it's because concat is not failing when it should - that test boils down to calling concat on those DataArrays (called from _combine_1d internally). Surely concat should fail when you ask it to do this, because how can you concatenate two different variables?

da1 = DataArray(name="a", data=[[0]], dims=["x", "y"])
da2 = DataArray(name="b", data=[[1]], dims=["x", "y"])

result = concat([da1, da2], dim="x")

However, it doesn't fail; instead it gives this:

<xarray.DataArray 'a' (x: 2, y: 1)>
array([[0],
       [1]])
Dimensions without coordinates: x, y

Where has 'b' gone?! This is the reason that test_concat_name_symmetry gives such a weird result.

@dcherian
Contributor

concat ignores DataArray.name. I don't know if we should consider it a bug or a feature :)
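
A quick check of what "ignores" means in practice (my own sketch against the 0.13 behaviour, not from the thread): the result simply keeps whichever name comes first.

import xarray as xr

# concat drops all but the first DataArray's name, so reversing the order
# changes the name of the result.
da_a = xr.DataArray([0], dims="x", name="a")
da_b = xr.DataArray([1], dims="x", name="b")

print(xr.concat([da_a, da_b], dim="x").name)  # 'a'
print(xr.concat([da_b, da_a], dim="x").name)  # 'b'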

@TomNicholas
Member

TomNicholas commented Sep 25, 2019

Really? Okay, so that means that currently we don't treat a named DataArray and a single-variable Dataset as if they are the same. For example, I would have expected these two operations to give the same result:

objs = [DataArray([0], dims='x', name='a'),
        DataArray([0], dims='x', name='b')]
concat(objs, dim='x')

<xarray.DataArray 'a' (x: 2)>
array([0, 0])
Dimensions without coordinates: x

objs = [Dataset({'a': ('x', [0])}),
        Dataset({'b': ('x', [0])})]
concat(objs, dim='x')

self = Frozen(OrderedDict([('b', <xarray.Variable (x: 1)>
array([0]))])), key = 'a'

    def __getitem__(self, key: K) -> V:
>       return self.mapping[key]
E       KeyError: 'a'

xarray/core/utils.py:385: KeyError

Is this what we want to do? Surely the first one should also fail, else this is counter-intuitive. I think of a named DataArray and a single-variable Dataset as being the same thing, just a single physical variable? @shoyer am I misunderstanding xarray's data model here?
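
(For reference, the equivalence being appealed to can be seen in a simple round trip; a minimal illustration, not from the thread.)

import xarray as xr

# A named DataArray promotes to a single-variable Dataset, and indexing that
# variable recovers the original DataArray.
da = xr.DataArray([0], dims="x", name="a")
ds = da.to_dataset()           # Dataset with one data variable, "a"
assert da.identical(ds["a"])   # pulling "a" back out gives back the same DataArray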

@friedrichknuth
Author

A few observations after looking at the default flags for concat:

xr.concat(
    objs,
    dim,
    data_vars='all',
    coords='different',
    compat='equals',
    positions=None,
    fill_value=<NA>,
    join='outer',
)

The description of compat='equals' indicates that combining DataArrays with different names should fail: 'equals': all values and dimensions must be the same. (Though I am not entirely sure what is meant by values... I assume this perhaps generically means keys?)

Another option is compat='identical' which is described as: 'identical': all values, dimensions and attributes must be the same. Using this flag will cause the operation to fail, as one would expect from the description...

objs = [xr.DataArray([0], 
                     dims='x', 
                     name='a'),
        xr.DataArray([1], 
                     dims='x', 
                     name='b')]

xr.concat(objs, dim='x', compat='identical')
ValueError: array names not identical

...and this is also the case for concat on Datasets, as previously shown by @TomNicholas:

objs = [xr.Dataset({'a': ('x', [0])}),
        xr.Dataset({'b': ('x', [0])})]

xr.concat(objs, dim='x')
ValueError: 'a' is not present in all datasets.

However, 'identical': all values, dimensions and **attributes** must be the same doesn't quite seem to be the case for DataArrays, as

objs = [xr.DataArray([0], 
                     dims='x', 
                     name='a', 
                     attrs={'foo':1}),
        xr.DataArray([1], 
                     dims='x', 
                     name='a', 
                     attrs={'bar':2})]

xr.concat(objs, dim='x', compat='identical')

succeeds with

<xarray.DataArray 'a' (x: 2)>
array([0, 1])
Dimensions without coordinates: x
Attributes:
    foo:      1

but again fails on Datasets, as one would expect from the description.

ds1 = xr.Dataset({'a': ('x', [0])})
ds1.attrs['foo'] = 'example attribute'

ds2 = xr.Dataset({'a': ('x', [1])})
ds2.attrs['bar'] = 'example attribute'

objs = [ds1,ds2]
xr.concat(objs, dim='x',compat='identical')
ValueError: Dataset global attributes not equal.

I also had a look at compat='override', which will override an attrs inconsistency but not a naming one when applied to Datasets. It works as expected on DataArrays. It is described as 'override': skip comparing and pick variable from first dataset.
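
A minimal sketch of that compat='override' behaviour (my own example, following the observations above rather than the thread's code):

import xarray as xr

# compat="override" skips the attrs comparison and keeps the first dataset's
# attrs, but (per the observation above) a variable-name mismatch still fails.
ds1 = xr.Dataset({"a": ("x", [0])}, attrs={"foo": 1})
ds2 = xr.Dataset({"a": ("x", [1])}, attrs={"bar": 2})
print(xr.concat([ds1, ds2], dim="x", compat="override"))  # succeeds, attrs taken from ds1

ds3 = xr.Dataset({"b": ("x", [1])})
# xr.concat([ds1, ds3], dim="x", compat="override")  # still errors on the name mismatch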

Potential resolutions:

  1. 'identical' should raise an error when attributes are not the same for DataArrays

  2. 'equals' should raise an error when DataArray names are not identical (unless one is None, which works with Datasets and seems fine to be replaced)

  3. 'override' should override naming inconsistencies when combining Datasets.

Final thought: perhaps promoting a DataArray to a Dataset whenever it meets the requirements to be treated as one might simplify keeping operations and checks consistent?
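
A rough sketch of what that promotion could look like from user code (illustration only, not a proposed API):

import xarray as xr

# Promote named DataArrays to single-variable Datasets before concatenating,
# so a name mismatch surfaces the same way it does for Datasets instead of
# being silently dropped.
objs = [
    xr.DataArray([0], dims="x", name="a"),
    xr.DataArray([1], dims="x", name="b"),
]

promoted = [obj.to_dataset() for obj in objs]  # each becomes a Dataset with one data variable
xr.concat(promoted, dim="x")                   # raises, just like the Dataset examples above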

@dcherian added the topic-combine combine/concat/merge label on Jul 8, 2021