Inconsistent behavior between DatasetRolling.construct and DataArrayRolling.construct with stride > 1. #7021

Closed
p4perf4ce opened this issue Sep 12, 2022 · 3 comments · Fixed by p4perf4ce/xarray#1 or #7578

Comments

@p4perf4ce
Contributor

What is your issue?

INSTALLED VERSIONS

commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-73-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2022.6.0
pandas: 1.4.2
numpy: 1.19.5
scipy: 1.7.0
netCDF4: 1.6.0
pydap: None
h5netcdf: 1.0.2
h5py: 3.1.0
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.4
dask: 2021.06.2
distributed: 2021.06.2
matplotlib: 3.5.3
cartopy: None
seaborn: 0.12.0
numbagg: None
fsspec: 2021.07.0
cupy: 9.2.0
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 49.6.0.post20210108
pip: 21.1.3
conda: 4.10.3
pytest: 6.2.4
IPython: 7.24.1
sphinx: None

Reproducing the problem

I have an xarray Dataset with a single dimension, as specified below (the same happens with any arbitrary xarray Dataset).

> Dimensions:
> time: 11058688

When I apply a rolling operation with non-overlapping windows to a DataArray, it works as one would expect.

dataset.var_a.rolling(time=256).construct('k', stride=256)

11058688 / 256 = 43198

> Dimensions:
> time: 43198, k:  256   # 43198 windows

However, when applying the same operation to the Dataset:

dataset.rolling(time=256).construct('k', stride=256)
> Dimensions:
> time: 169, k:  256   # How did we even arrive at 169 windows?

I don't see any reason why rolling on a Dataset and on a DataArray should behave differently. Shouldn't rolling on a Dataset just repeat the DataArray rolling on every data variable?
This difference in behavior is not mentioned in the documentation either.
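
A side note on the 169: the number is consistent with the stride being applied twice, once when the windows are built and once more on the result. This is only an arithmetic observation under that assumption, not a confirmed diagnosis; the helper name `n_windows` is made up for illustration.

```python
def n_windows(length: int, stride: int) -> int:
    """Number of positions kept when taking every `stride`-th element."""
    return -(-length // stride)  # integer ceiling division

n_time = 11058688
once = n_windows(n_time, 256)   # stride applied once
twice = n_windows(once, 256)    # stride applied a second time (assumption)

print(once)   # 43198
print(twice)  # 169
```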

@p4perf4ce p4perf4ce added the needs triage Issue that has not been reviewed by xarray team member label Sep 12, 2022
@mathause
Collaborator

Thanks for the report. I agree that this should lead to the same result, but the code paths are indeed different. I have not looked into the actual root cause. It could be that this is also not super thoroughly tested (and used!):

def construct(

def construct(

B.t.w. a copy-pastable example would be appreciated.

@p4perf4ce
Contributor Author

p4perf4ce commented Sep 12, 2022

Thanks for the response, here is a straightforward example.

import xarray as xr
dummy = list(range(100))
x, y, z = [xr.DataArray(dummy, dims=['t']) for _ in range(3)]
ds = xr.Dataset(
    {'x': x, 'y': y, 'z': z}
)
print(x.rolling(t=4).construct('w', stride=4).shape)
print(ds.rolling(t=4).construct('w', stride=4).x.shape)

Results:

> (25, 4)
> (7, 4)

I had a hunch that the problem comes from this part. I'm not quite sure what self._mapping_to_list does here; I haven't looked it up yet.

strides = self._mapping_to_list(stride, default=1)
dataset = {}
for key, da in self.obj.data_vars.items():
    # keeps rollings only for the dataset depending on self.dim
    dims = [d for d in self.dim if d in da.dims]
    if dims:
        wi = {d: window_dims[i] for i, d in enumerate(self.dim) if d in da.dims}
        st = {d: strides[i] for i, d in enumerate(self.dim) if d in da.dims}

Since I only had one dimension to deal with, removing this loop solves the problem for me.
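
On what `self._mapping_to_list` does: as far as I can tell, it normalizes an argument that may be either a single value or a per-dimension mapping into a list ordered like `self.dim`. The following is a simplified standalone sketch of that idea, not xarray's actual code; the function name and signature are hypothetical.

```python
from collections.abc import Mapping

def mapping_to_list_sketch(arg, dims, default=None):
    """Return one value per dimension in `dims`.

    Hypothetical stand-in for the private helper, written for illustration only.
    """
    if isinstance(arg, Mapping):
        # Per-dimension mapping: missing dimensions fall back to `default`.
        return [arg.get(d, default) for d in dims]
    # Single value: use it for every rolling dimension.
    return [arg] * len(dims)

print(mapping_to_list_sketch(4, ["t"], default=1))              # [4]
print(mapping_to_list_sketch({"a": 2}, ["a", "b"], default=1))  # [2, 1]
```

If that reading is right, `strides` is simply `[4]` in the example above, so the helper itself does not look like the source of the discrepancy.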

@headtr1ck headtr1ck added topic-rolling bug and removed needs triage Issue that has not been reviewed by xarray team member labels Oct 1, 2022
@p4perf4ce
Contributor Author

It has been half a year, and I find myself stuck on this inconsistent behavior again. Another problem I found but haven't mentioned yet is that DatasetRolling.construct swaps the rolling dimension name with window_dim, whereas DataArrayRolling.construct doesn't.

This time, I've actually identified the cause of this problem:

return Dataset(dataset, coords=self.obj.coords, attrs=attrs).isel(
    {d: slice(None, None, s) for d, s in zip(self.dim, strides)}
)

.isel({d: slice(None, None, s) for d, s in zip(self.dim, strides)}) 

I still can't figure out what the original intention of the .isel call was, since it causes so many problems without any apparent benefit. Note that this can blow up memory usage if the xr.Dataset is reasonably large (it blew up a 3-channel PPG recording, 135 Hz, 6 hours, from a mere 300 MB to 20-40 GB or more), so I think this is critical.

Solution

Removing the .isel part fixed everything.
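
To illustrate why the extra `.isel` shrinks the result, here is a pure-Python sketch, assuming the per-variable `construct` call has already applied the stride. `strided_windows` is a made-up helper that mimics trailing windows with left padding (using `None` where xarray would use NaN); it is not xarray code.

```python
def strided_windows(values, window, stride):
    """Trailing windows ending at positions 0, stride, 2*stride, ..."""
    padded = [None] * (window - 1) + list(values)
    return [padded[i:i + window] for i in range(0, len(values), stride)]

data = list(range(100))

# Windows after the stride has been applied once, as in the DataArray case.
per_variable = strided_windows(data, window=4, stride=4)

# Slicing the already-strided result again, like the trailing .isel does.
strided_again = per_variable[::4]

print(len(per_variable), len(strided_again))  # 25 7
```

The two lengths match the `(25, 4)` and `(7, 4)` shapes from the earlier example, which is what one would expect if the stride is applied a second time.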

Test case

import numpy as np
import xarray as xr
tr = xr.DataArray(np.arange(8).reshape(2, 4), dims=('a', 'b'))  # Borrowed from `DataArray.__doc__`'s example.
trd = xr.Dataset(data_vars={i: tr for i in range(3)})  # three identical variables named 0, 1, 2

DataArray

tr.rolling(b=2).construct('window_dim', stride=2)

>>> <xarray.DataArray (a: 2, b: 2, window_dim: 2)>
array([[[nan,  0.],
        [ 1.,  2.]],

       [[nan,  4.],
        [ 5.,  6.]]])
Dimensions without coordinates: a, b, window_dim

Dataset

trd.rolling(b=2).construct('window_dim', stride=2)

>>> <xarray.Dataset>
Dimensions:  (a: 2, b: 2, window_dim: 2)
Dimensions without coordinates: a, b, window_dim
Data variables:
    0        (a, b, window_dim) float64 nan 0.0 1.0 2.0 nan 4.0 5.0 6.0
    1        (a, b, window_dim) float64 nan 0.0 1.0 2.0 nan 4.0 5.0 6.0
    2        (a, b, window_dim) float64 nan 0.0 1.0 2.0 nan 4.0 5.0 6.0

trd.rolling(b=2).construct('window_dim', stride=2)[0]

>>> <xarray.DataArray 0 (a: 2, b: 2, window_dim: 2)>
array([[[nan,  0.],
        [ 1.,  2.]],

       [[nan,  4.],
        [ 5.,  6.]]])
Dimensions without coordinates: a, b, window_dim
