Pangeo Climate Anomalies example #223
I think this is happening because Xarray's default groupby strategy indexes out each group as a new array and then applies the operation to each one. To make this work efficiently we'd use Flox's map-reduce strategy, which is basically the same as Cubed's reduction implementation. This is tracked in xarray-contrib/flox#224.
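A rough sketch of the difference between the two strategies (illustrative only; these helper functions are not xarray, flox, or Cubed internals):

```python
import numpy as np

# Default xarray-style strategy: index out each group as its own array,
# reduce it, then stack the per-group results. Every group is materialized
# separately, which is expensive for a lazy, chunked backend like Cubed.
def groupby_mean_default(values, labels):
    groups = np.unique(labels)
    return np.array([values[labels == g].mean() for g in groups])

# Flox-style map-reduce strategy: accumulate per-chunk partial sums and
# counts for every group, then combine them. No group is ever indexed out
# as a separate array, so it maps naturally onto Cubed's reduction.
def groupby_mean_mapreduce(value_chunks, label_chunks, ngroups):
    sums = np.zeros(ngroups)
    counts = np.zeros(ngroups)
    for vals, labs in zip(value_chunks, label_chunks):
        np.add.at(sums, labs, vals)  # partial sums per group within this chunk
        np.add.at(counts, labs, 1)   # partial counts per group within this chunk
    return sums / counts
```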
Even once I build a plausible groupby graph (see xarray-contrib/flox#224 (comment)), the subtraction of the mean presents a problem.

```python
def climate_anomaly(ds):
    mean = ds.groupby("time.dayofyear").mean(skipna=False, method='map-reduce')  # skipna=False to avoid eager load, see xarray issue #7243
    anomaly = ds - mean.sel(dayofyear=ds.time.dt.dayofyear)
    return anomaly
```

For the 1.5GB dataset it raises:

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[17], line 1
----> 1 result = climate_anomaly(datasets['1.5GB'])
2 result
Cell In[14], line 6, in climate_anomaly(ds)
2 mean = ds.groupby("time.dayofyear").mean(skipna=False, method='map-reduce') # skipna=False to avoid eager load, see xarray issue #7243
4 print(mean)
----> 6 anomaly = ds - mean.sel(dayofyear=ds.time.dt.dayofyear)
7 return anomaly
File ~/Documents/Work/Code/xarray/xarray/core/_typed_ops.py:19, in DatasetOpsMixin.__sub__(self, other)
18 def __sub__(self, other):
---> 19 return self._binary_op(other, operator.sub)
File ~/Documents/Work/Code/xarray/xarray/core/dataset.py:6693, in Dataset._binary_op(self, other, f, reflexive, join)
6691 self, other = align(self, other, join=align_type, copy=False) # type: ignore[assignment]
6692 g = f if not reflexive else lambda x, y: f(y, x)
-> 6693 ds = self._calculate_binary_op(g, other, join=align_type)
6694 keep_attrs = _get_keep_attrs(default=False)
6695 if keep_attrs:
File ~/Documents/Work/Code/xarray/xarray/core/dataset.py:6757, in Dataset._calculate_binary_op(self, f, other, join, inplace)
6754 ds = self.coords.merge(other_coords)
6756 if isinstance(other, Dataset):
-> 6757 new_vars = apply_over_both(
6758 self.data_vars, other.data_vars, self.variables, other.variables
6759 )
6760 else:
6761 other_variable = getattr(other, "variable", other)
File ~/Documents/Work/Code/xarray/xarray/core/dataset.py:6737, in Dataset._calculate_binary_op.<locals>.apply_over_both(lhs_data_vars, rhs_data_vars, lhs_vars, rhs_vars)
6735 for k in lhs_data_vars:
6736 if k in rhs_data_vars:
-> 6737 dest_vars[k] = f(lhs_vars[k], rhs_vars[k])
6738 elif join in ["left", "outer"]:
6739 dest_vars[k] = f(lhs_vars[k], np.nan)
File ~/Documents/Work/Code/xarray/xarray/core/_typed_ops.py:431, in VariableOpsMixin.__sub__(self, other)
430 def __sub__(self, other):
--> 431 return self._binary_op(other, operator.sub)
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2705, in Variable._binary_op(self, other, f, reflexive)
2702 attrs = self._attrs if keep_attrs else None
2703 with np.errstate(all="ignore"):
2704 new_data = (
-> 2705 f(self_data, other_data) if not reflexive else f(other_data, self_data)
2706 )
2707 result = Variable(dims, new_data, attrs=attrs)
2708 return result
File ~/Documents/Work/Code/cubed/cubed/array_api/array_object.py:132, in Array.__sub__(self, other)
130 if other is NotImplemented:
131 return other
--> 132 return elemwise(np.subtract, self, other, dtype=result_type(self, other))
File ~/Documents/Work/Code/cubed/cubed/core/ops.py:302, in elemwise(func, dtype, *args)
300 if dtype is None:
301 raise ValueError("dtype must be specified for elemwise")
--> 302 return blockwise(
303 func,
304 expr_inds,
305 *concat((a, tuple(range(a.ndim)[::-1])) for a in args),
306 dtype=dtype,
307 )
File ~/Documents/Work/Code/cubed/cubed/core/ops.py:206, in blockwise(func, out_ind, dtype, adjust_chunks, new_axes, align_arrays, target_store, *args, **kwargs)
203 raise ValueError("Unknown dimension", new)
205 if align_arrays:
--> 206 chunkss, arrays = unify_chunks(*args)
207 else:
208 inds = args[1::2]
File ~/Documents/Work/Code/cubed/cubed/core/ops.py:845, in unify_chunks(*args, **kwargs)
835 chunks = tuple(
836 chunkss[j]
837 if a.shape[n] > 1
(...)
841 for n, j in enumerate(i)
842 )
843 if chunks != a.chunks and all(a.chunks):
844 # this will raise if chunks are not regular
--> 845 chunksize = to_chunksize(chunks)
846 arrays.append(rechunk(a, chunksize))
847 else:
File ~/Documents/Work/Code/cubed/cubed/utils.py:104, in to_chunksize(chunkset)
91 """Convert a chunkset to a chunk size for Zarr.
92
93 Parameters
(...)
100 Tuple of ints
101 """
103 if not _check_regular_chunks(chunkset):
--> 104 raise ValueError("Array must have regular chunks")
106 return tuple(c[0] for c in chunkset)
ValueError: Array must have regular chunks
```

For the 15GB dataset it doesn't raise (how??) but creates this graph. For the 150GB dataset it raises the same error again. Should probably not read too much into this, but I thought it might be interesting to see for now.
It would be useful to see what the irregular chunking actually is so we can see how it came about - can you change the exception to print it?
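For reference, a minimal sketch of that change to `to_chunksize` in `cubed/utils.py` (context taken from the traceback above; the exact message wording is just a suggestion):

```python
if not _check_regular_chunks(chunkset):
    # Include the offending chunking in the message to make the failure easier to diagnose.
    raise ValueError(f"Array must have regular chunks, but found chunks={chunkset}")
```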
Is this a chunk for every group?
Yes, blockwise needs irregular chunks in general. It rechunks so that there is an integer number of groups per block.
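As an illustration of what "regular chunks" means here (a simplified check, not Cubed's actual implementation): all chunks along a dimension must be the same size except possibly the last, because a Zarr array is described by a single chunk size per dimension.

```python
def is_regular(chunks):
    # All chunks equal, except the last one which may be smaller.
    if len(chunks) <= 1:
        return True
    return all(c == chunks[0] for c in chunks[:-1]) and chunks[-1] <= chunks[0]

print(is_regular((4, 4, 4, 2)))  # True: expressible as a single Zarr chunk size of 4
print(is_regular((3, 4, 3)))     # False: e.g. one chunk per group when group sizes differ
```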
Zarr doesn't support irregular chunks yet, right? Seems to be tracked in ZEP003. So does that mean
👍 It'll only work for the special case of equal-sized groups.
That's right. Note that if there were an early implementation of irregular chunks then we could use that, since it's an implementation detail of Cubed (it's intermediate data). See also #73
Do
Fixed in #476
I tried to set up the problem of calculating the anomaly with respect to the group mean from pangeo-data/distributed-array-examples#4. I used fake data with the same chunking scheme instead of real data to make it easier to test.
I was not expecting this plan though 😬
No idea what's going on here, especially as #145 also has a groupby in it.
I set up the problem like this:
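A minimal sketch of what such a setup might look like; the array sizes, chunk sizes, and memory budget below are illustrative assumptions rather than the values from the linked example, and it assumes cubed-xarray is installed so that xarray wraps the Cubed arrays:

```python
import pandas as pd
import xarray as xr

import cubed
import cubed.random

# Assumed sizes: a few years of 6-hourly fake data on a 1-degree global grid.
time = pd.date_range("1980-01-01", periods=4 * 365 * 4, freq="6h")
shape = (len(time), 180, 360)
chunks = (120, 180, 360)  # assumed: contiguous in space, chunked along time

spec = cubed.Spec(allowed_mem="2GB")  # assumed memory budget
data = cubed.random.random(shape, chunks=chunks, spec=spec)

ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), data)},
    coords={"time": time},
)

result = climate_anomaly(ds)  # the function defined in the comment above
```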