map_blocks fails with lazy loaded dask array #9504

Open
eni-awowale opened this issue Sep 16, 2024 · 0 comments
Labels
topic-dask topic-DataTree Related to the implementation of a DataTree class

Comments

@eni-awowale
Collaborator

What is your issue?

Copied from xarray-contrib/datatree#152

Issue

Hi,

I'm very excited about this package and am familiarising myself with it to see where it fits my use cases. I followed the example in the documentation to apply a groupby to the datatree, but used dask because my dataset is too large to fit into memory. I realised that my group_by function is not being applied to lazily loaded dask arrays.

Minimal example

import datatree
import xarray as xr
import pandas as pd
import dask.array  # "import dask" alone does not guarantee that dask.array is loaded
import numpy as np

def group_by(da, groupby_type="time.floor('1D')"):
    gb = da.groupby(groupby_type)
    mean = gb.mean()
    return mean

times = pd.date_range("2022-09-01","2022-09-03", freq="6H")

a=xr.Dataset({'x': ('time', np.random.randint(0,10,len(times)))}, coords={'time':times})
b=xr.Dataset({'x': ('time', dask.array.random.randint(0,10,len(times)))}, coords={'time':times})
dt=datatree.DataTree.from_dict({'first':a, 'second':b})

dt.map_blocks(group_by, kwargs={"groupby_type": "time.day"}, template=dt)

Please compare the results for the eagerly loaded (a) and lazily loaded (b) datasets below:

DataTree('None', parent=None)
├── DataTree('first')
│       Dimensions:  (day: 3)
│       Coordinates:
│         * day      (day) int64 1 2 3
│       Data variables:
│           x        (day) float64 5.75 7.75 6.0
└── DataTree('second')
        Dimensions:  (time: 9)
        Coordinates:
          * time     (time) datetime64[ns] 2022-09-01 2022-09-01T06:00:00 ... 2022-09-03
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>

Any ideas what is going wrong?

This likely generalises to any function passed to map_blocks:

def func(da):
    return da.mean('time')

b=xr.Dataset({'x': ('time', dask.array.random.randint(0,10,len(times)))})
dt=datatree.DataTree.from_dict({'second':b})

dt.map_blocks(func, template=dt)

DataTree('None', parent=None)
└── DataTree('second')
        Dimensions:  (time: 9)
        Dimensions without coordinates: time
        Data variables:
            x        (time) int64 dask.array<chunksize=(9,), meta=np.ndarray>
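
To help narrow this down, here is a minimal sketch that uses plain xarray instead of datatree (an assumption made for isolation, not something the original report tried). It calls Dataset.map_blocks on a single-chunk dask-backed dataset twice: once without a template, so xarray infers the output structure itself, and once with the input passed as the template, as in the calls above. The name `lazy` is introduced here for illustration only. If the second call keeps the `time` dimension, that would suggest the template argument, rather than lazy loading as such, shapes the displayed result; this is a hypothesis to check, not a confirmed diagnosis.

```python
import warnings

import dask.array
import xarray as xr

# Dask-backed dataset in a single chunk, mirroring `b` from the second example.
lazy = xr.Dataset({"x": ("time", dask.array.random.randint(0, 10, 9, chunks=9))})


def func(ds):
    return ds.mean("time")


with warnings.catch_warnings():
    # Template inference calls func on zero-sized stand-in data, which can
    # trigger a "mean of empty slice" RuntimeWarning; silence it here.
    warnings.simplefilter("ignore", RuntimeWarning)
    # Template omitted: xarray infers the output structure from func.
    inferred = lazy.map_blocks(func)

# Input passed as the template, as in the report: the lazy result is built
# to look like the template, whatever func would actually return.
with_template = lazy.map_blocks(func, template=lazy)

print(inferred["x"].dims)
print(with_template["x"].dims)
```

Both results are still lazy at this point; nothing has been computed. Calling `.compute()` on the templated result is where a mismatch between the template and the actual output of `func` would be expected to surface.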

Versions

xarray: 2022.6.0
datatree: 0.0.9
