Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rename_vars followed by swap_dims and merge causes swapped dim to reappear #8646

Open
5 tasks done
brendan-m-murphy opened this issue Jan 23, 2024 · 16 comments
Open
5 tasks done

Comments

@brendan-m-murphy
Copy link

brendan-m-murphy commented Jan 23, 2024

What happened?

I wanted to rename a dimension coordinate for two datasets before merging: ds = ds.rename_vars(y="z").swap_dims(y="z"), and the same for the second data set. After merging the datasets, the merged result has the dimension "y" in addition to "z".

Swapping the order of rename_vars and swap_dims before merging works in that "y" does not reappear, but then "z" is listed as a non-dimension coordinate.

Doing rename_vars followed by swap_dims /after/ merging gives the result I wanted, but if I merge again, the same issue occurs.

My current solution is to only rename dimension coordinates before saving to netCDF.

What did you expect to happen?

Merging two datasets with the same coordinates and dimensions (but different data variables) should result in a single dataset with all of the data variables from the two datasets and exactly the same coordinates and dimensions.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
from xarray.core.utils import Frozen


A = np.arange(4).reshape((2, 2))
B = np.arange(4).reshape((2, 2)) + 4
ds1 = xr.Dataset({"A": (["x", "y"], A), "B": (["x", "y"], B)}, coords={"x": ("x", [1, 2]), "y": ("y", [1, 2])})
ds2 = xr.Dataset({"C": (["x", "y"], A), "D": (["x", "y"], B)}, coords={"x": ("x", [1, 2]), "y": ("y", [1, 2])})

assert ds1.dims == Frozen({"x": 2, "y": 2})
assert ds2.dims == Frozen({"x": 2, "y": 2})

ds1_swap = ds1.rename_vars(y="z").swap_dims(y="z")
ds2_swap = ds2.rename_vars(y="z").swap_dims(y="z")

assert ds1_swap.dims == Frozen({"x": 2, "z": 2})
assert ds2_swap.dims == Frozen({"x": 2, "z": 2})

# merging makes the dimension "y" reappear (I would expect this assertion to fail):
assert xr.merge([ds1_swap, ds2_swap]).dims == Frozen({"x": 2, "z": 2, "y": 2})

# renaming and swapping after the merge causes issues later:
ds12 = xr.merge([ds1, ds2]).rename_vars(y="z").swap_dims(y="z")
ds3 = xr.Dataset({"E": (["x", "z"], A), "F": (["x", "z"], B)}, coords={"x": ("x", [1, 2]), "z": ("z", [1, 2])})

# ds12 and ds3 have the same dimensions:
assert ds12.dims == Frozen({"x": 2, "z": 2})
assert ds3.dims == Frozen({"x": 2, "z": 2})

# but merging brings back "y"
ds123 = xr.merge([ds12, ds3])
assert ds123.dims == Frozen({"x": 2, "z": 2, "y": 2})

# as do other operations:
ds12_as = ds12.assign_coords(x=(ds12.x + 1))
assert ds12_as.sizes == Frozen({"x": 2, "z": 2, "y": 2})

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

The MVCE works in all venvs I've tried including:

INSTALLED VERSIONS

commit: None
python: 3.10.13 (main, Nov 10 2023, 15:02:19) [GCC 11.4.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-14-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2023.11.0
pandas: 1.5.3
numpy: 1.26.2
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: None
dask: 2023.12.0
distributed: None
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: 0.13.0
numbagg: None
fsspec: 2023.12.1
cupy: None
pint: None
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 69.0.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.18.1
sphinx: None
/home/brendan/Documents/inversions/.pymc_venv/lib/python3.10/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.81.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.1.0
pandas: 2.2.0
numpy: 1.26.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: 8.18.1
sphinx: None

@brendan-m-murphy brendan-m-murphy added bug needs triage Issue that has not been reviewed by xarray team member labels Jan 23, 2024
Copy link

welcome bot commented Jan 23, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@max-sixty
Copy link
Collaborator

Thanks for the excellent issue @brendan-m-murphy .

This is really surprising. If we look at these two objects, there's no y to be found:

[ins] In [3]: ds1_swap
Out[3]:
<xarray.Dataset>
Dimensions:  (x: 2, z: 2)
Coordinates:
  * x        (x) int64 1 2
  * z        (z) int64 1 2
Data variables:
    A        (x, z) int64 0 1 2 3
    B        (x, z) int64 4 5 6 7

[ins] In [4]: ds2_swap
Out[4]:
<xarray.Dataset>
Dimensions:  (x: 2, z: 2)
Coordinates:
  * x        (x) int64 1 2
  * z        (z) int64 1 2
Data variables:
    C        (x, z) int64 0 1 2 3
    D        (x, z) int64 4 5 6 7

But then if we merge them, a y dimension appears:

[ins] In [2]: xr.merge([ds1_swap, ds2_swap])
Out[2]:
<xarray.Dataset>
Dimensions:  (x: 2, z: 2, y: 2)  # where does this come from?
Coordinates:
  * x        (x) int64 1 2
  * z        (y) int64 1 2
Dimensions without coordinates: y
Data variables:
    A        (x, z) int64 0 1 2 3
    B        (x, z) int64 4 5 6 7
    C        (x, z) int64 0 1 2 3
    D        (x, z) int64 4 5 6 7

Where does the y come from? Even if I look through the entries of ds1_swap.__slots__, I can't find it (though I haven't looked exhaustively)

I think we should be quite worried about this — one of the wonderful things about xarray is that it's not doing surprising things — the repr of the object is effectively a full representation of it. But this violates that — somewhere there's a y lurking! :)

IIRC @benbovy has helped fix similar things in the past, hope it's OK to tag you in case you have any ideas.

@max-sixty max-sixty added topic-indexing and removed needs triage Issue that has not been reviewed by xarray team member labels Jan 23, 2024
@benbovy
Copy link
Member

benbovy commented Jan 23, 2024

It might be worth looking at the ._indexes property of the datasets obtained after rename_vars and/or swap_dims. If it returns a dict with a "y" item that could be the culprit (it shouldn't be there) causing merge to recreate a coordinate from it.

I suspect a bug in swap_dims in this (very?) specific case where for the old dimension ("y") there is an indexed, non-dimension coordinate that has the same name than the new dimension ("z"). Not sure, though. I'll investigate more later.

@max-sixty
Copy link
Collaborator

It might be worth looking at the ._indexes property of the datasets obtained after rename_vars and/or swap_dims.

Thanks for the response + the idea. It's indeed on the dim:

[nav] In [32]: ds2_swap._indexes['z'].dim
Out[32]: 'y'

Should the z & y there ever be different? Unless we can define them in a single place, should we have a runtime check that they don't differ?

@benbovy
Copy link
Member

benbovy commented Jan 25, 2024

Good catch. A possible fix would be

class Dataset(...):

    def swap_dims(self, ...):
        ...

        for current_name, current_variable in self.variables.items():
            ...
            if current_name in result_dims:
                ...
                if current_name in self._indexes:
                    indexes[current_name] = self._indexes[current_name]
                    indexes[current_name].dim = dims_dict[current_name]     # update the index dim here
                    variables[current_name] = var
                else:
                    ...
            else:
                ...

However, this fix works only in the case of a single PandasIndex. I suspect that swap_dims might also not work correctly for other edge cases involving more advanced (custom) indexes, so it might be a bit trickier to address properly.

Maybe it's time to deprecate swap_dims? It has been suggested as TODO for a while... There are now working alternatives that are easier to understand (e.g., using non-dimension indexed coordinate and/or using rename_dims or rename_vars that preserve the index).

@keewis
Copy link
Collaborator

keewis commented Jan 25, 2024

Is swap_dims just a composition of rename_dims and the new-ish set_xindex and drop_index, or does it do something else on top? If not, I agree we should deprecate it, or at least its current implementation.

@benbovy
Copy link
Member

benbovy commented Jan 25, 2024

To my knowledge it doesn't allow doing more than what we can do now with those other methods, it only adds some confusion :).

@keewis
Copy link
Collaborator

keewis commented Jan 25, 2024

if these two:

ds = xr.Dataset(coords={"x": ("x", [0, 1]), "y": ("x", [-1, 1])})

ds.swap_dims({"x": "y"})
ds.rename_dims({"x": "y"}).set_xindex("y").drop_indexes(["x"])

do the same thing, could we just make the former use the latter as implementation (obviously adapted to allow renaming multiple dimensions at the same time)? I don't think we'd even need a deprecation cycle for that.

@benbovy
Copy link
Member

benbovy commented Jan 25, 2024

Ah yes probably, although not sure that rename_dims should be always chained with set_xindex and/or drop_indexes depending on the cases. I'd be in favor to just drop swap_dims (after deprecating it).

@keewis
Copy link
Collaborator

keewis commented Jan 25, 2024

If you're not chaining set_xindex and drop_indexes then it would be better to do rename_dims directly, no?

In any case, I'm just saying that if we are to drop swap_dims we should be able to point to something that does exactly the same thing. set_xindex is a bit limited in that it only allow creating an index for one coordinate per call, so as far as I can tell we'd need to point to something using assign_coords and xr.Coordinates. However, with that my worry is that that would be a lot more to write.

@brendan-m-murphy
Copy link
Author

In my case, I'm using swap_dims instead of rename_dims because of the rename_vars call before (coord y to z precludes renaming dim y to z). Also, I think the "reappearing dimension" problem doesn't happen if I only call swap_dims, e.g. if there is a coordinate lat with dimension latitude and I swap the dimension latitude to lat, this seems to work fine.

Anyway, it doesn't seem like you'd be able to replace swap_dims with rename_dims plus set_xindex and drop_indexes in this case.

@keewis
Copy link
Collaborator

keewis commented Jan 25, 2024

interestingly, rename_dimsrename_vars works, but not the other way around (it suggests to use swap_dims instead). This, again, is a relic of how xarray used to work in earlier versions, and we should probably remove that error.

@dcherian
Copy link
Contributor

Anecdotally, I've seen swap_dims used a bunch so I support reimplementing it in terms of more atomic operations

@brendan-m-murphy
Copy link
Author

brendan-m-murphy commented Jan 25, 2024

interestingly, rename_dimsrename_vars works, but not the other way around (it suggests to use swap_dims instead). This, again, is a relic of how xarray used to work in earlier versions, and we should probably remove that error.

Ah okay, I probably tried rename_vars then rename_dims and followed the suggestion to use swap_dims.

As you say, for

ds = xr.Dataset(coords={"x": ("x", [0, 1])})
ds1 = ds.rename_dims({"x": "y"}).rename_vars({"x": "y"})
ds2 = ds.rename_vars({"x": "y"}).swap_dims({"x": "y"})

I get the expected result when I merge ds and ds1

In [11]: xr.merge([ds, ds1])
Out[11]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 0 1
  * y        (y) int64 0 1
Data variables:
    *empty*

but the dimension of y is reverted to x if I merge ds and ds2:

In [10]: xr.merge([ds, ds2])
Out[10]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 0 1
  * y        (x) int64 0 1
Data variables:
    *empty*

(Also I just realised ds.rename({"x": "y"}) would work as well. I'm not sure why I didn't try that.)

@benbovy
Copy link
Member

benbovy commented Jan 25, 2024

I might miss the high-level picture here.

I thought that swap_dims was mostly used as a workaround to (temporarily?) perform label-based operations (selection, alignment, etc) using another existing coordinate than the dimension coordinate.

This workaround is not needed anymore since we support indexes for non-dimension coordinates. And for renaming a dimension coordinate there is indeed .rename() that is more explicit.

@dcherian did you see swap_dims l used for other applications?

dcherian added a commit to dcherian/xarray that referenced this issue Apr 1, 2024
dcherian added a commit to dcherian/xarray that referenced this issue Apr 1, 2024
dcherian added a commit that referenced this issue Apr 3, 2024
* Stateful tests with Dataset

* Disable check_default_indexes when needed

* Add Zarr roundtrip

* Randomize dimension choice

* Fix a bug

* Add reset_index

* Add stack, unstack

* [revert] Disable Zarr till we control attrs strategy

* Try making unique names

* Share names strategy to ensure uniques?

* cleanup

* Try sharing strategies better

* Fix endianness

* Better swap_dims

* More improvements

* WIP

* Drop duplicates before unstacking

* Add reset_index

* Better duplicate assumption

* Move

* Fix reset_index

* Skip if hypothesis not installed

* Better precondition around reset_index

* Note

* Try a bundle

* Use unique_subset_of

* Use Bundles more

* Add index_variables strategy

* Small improvement

* fix

* Use st.shared

* Revert "Use st.shared"

This reverts commit 50f6030.

* fix unstacking

* cleanup

* WIP

* Remove bundles

* Fixes

* Add hypothesis cache to CI

* Prevent index variables with NaNs, infs

* [revert]

* Always save hypothesis cache

* Expand dtypes

* Add invariant check for #8646

* Add drop_dims

* Add create_index to stack

* Generalize a bit

* limit number of indexes  to stack

* Fix endianness?

* uniquify drop_dims

* Avoid NaTs in index vars

HypothesisWorks/hypothesis#3943

* Guard swap_dims

* Revert "Add invariant check for #8646"

This reverts commit 4a958dc.

* Add drop_indexes

* Add assign_coords

* Fix max_period for pandas timedelta

* Add xfailed test

* Add notes

* Skip timedelta indexes

* to_zarr

* small tweaks

* Remove NaT assume

* Revert "[revert]"

This reverts commit 6a38e27.

* Add hypothesis workflow

* Swtich out

* fix

* Use st.builds

* cleanup

* Add initialize

* review feedback
dcherian added a commit to dcherian/xarray that referenced this issue Apr 4, 2024
@benbovy
Copy link
Member

benbovy commented Apr 5, 2024

I've tried re-implementing swap_dims it in terms of more atomic operations in #8911, but unfortunately it cannot handle the pandas multi-index special case. I think this cannot be easily done until we fully drop the implicit creation of coordinates from a pd.MultiIndex (#8140).

Regarding this issue, the other possible fix in #8646 (comment) is limited to pandas single indexes.

As an alternative to those fixes and deprecating swap_dims, I wonder if we shouldn't limit it to the case of dimension coordinates, i.e.,

ds = xr.Dataset(coords={"x": ("x", [0, 1])})
renamed = ds.rename_vars({"x": "y"})

renamed.swap_dims({"x": "y"})
# ValueError: swap_dims only works with dimension coordinates but found coordinate "y" with dimension "x"

This will not fix the issue here but at least it will provide a clear error message instead of weird issues (old dimension appearing from nowhere) later on. Workarounds exist like .rename() and we should also allow chaining rename_vars -> rename_dims (implemented in #8911).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants