[Refactor]: Consider using flox and xr.resample() to improve temporal averaging grouping logic #217

Open
tomvothecoder opened this issue Apr 6, 2022 · 3 comments
Labels
type: enhancement New enhancement request

Comments

@tomvothecoder
Collaborator

tomvothecoder commented Apr 6, 2022

Is your feature request related to a problem?

Currently, Xarray's GroupBy operations are limited to a single variable. Grouping by multiple coordinates (e.g., time.year and time.season) requires creating a new set of coordinates before grouping, due to the xarray limitations described below (source).

xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.

Related code in xcdat for temporal grouping:

xcdat/xcdat/temporal.py

Lines 1266 to 1322 in c9bcbcd

def _label_time_coords(self, time_coords: xr.DataArray) -> xr.DataArray:
    """Labels time coordinates with a group for grouping.

    This methods labels time coordinates for grouping by first extracting
    specific xarray datetime components from time coordinates and storing
    them in a pandas DataFrame. After processing (if necessary) is performed
    on the DataFrame, it is converted to a numpy array of datetime
    objects. This numpy serves as the data source for the final
    DataArray of labeled time coordinates.

    Parameters
    ----------
    time_coords : xr.DataArray
        The time coordinates.

    Returns
    -------
    xr.DataArray
        The DataArray of labeled time coordinates for grouping.

    Examples
    --------

    Original daily time coordinates:

    >>> <xarray.DataArray 'time' (time: 4)>
    >>> array(['2000-01-01T12:00:00.000000000',
    >>>        '2000-01-31T21:00:00.000000000',
    >>>        '2000-03-01T21:00:00.000000000',
    >>>        '2000-04-01T03:00:00.000000000'],
    >>>       dtype='datetime64[ns]')
    >>> Coordinates:
    >>> * time     (time) datetime64[ns] 2000-01-01T12:00:00 ... 2000-04-01T03:00:00

    Daily time coordinates labeled by year and month:

    >>> <xarray.DataArray 'time' (time: 3)>
    >>> array(['2000-01-01T00:00:00.000000000',
    >>>        '2000-03-01T00:00:00.000000000',
    >>>        '2000-04-01T00:00:00.000000000'],
    >>>       dtype='datetime64[ns]')
    >>> Coordinates:
    >>> * time     (time) datetime64[ns] 2000-01-01T00:00:00 ... 2000-04-01T00:00:00
    """
    df_dt_components: pd.DataFrame = self._get_df_dt_components(time_coords)
    dt_objects = self._convert_df_to_dt(df_dt_components)

    time_grouped = xr.DataArray(
        name="_".join(df_dt_components.columns),
        data=dt_objects,
        coords={self.dim: time_coords[self.dim]},
        dims=[self.dim],
        attrs=time_coords[self.dim].attrs,
    )
    time_grouped.encoding = time_coords[self.dim].encoding

    return time_grouped

Current temporal averaging logic (workaround for multi-variable grouping):

  1. Preprocess time coordinates (e.g., drop leap days, subset based on reference climatology)
  2. Transform time coordinates from an xarray.DataArray to a pandas.DataFrame:
    a. Keep only the DataFrame columns needed for grouping (e.g., "year" and "season" for seasonal group averages), essentially "labeling" coordinates with their groups
    b. Process the DataFrame including:
  3. Convert DataFrame to cftime objects to represent new time coordinates
  4. Replace existing time coordinates in the DataArray with new time coordinates
  5. Group DataArray with new time coordinates for the mean
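
The labeling workaround listed above can be illustrated with a minimal sketch. This is not the xcdat implementation: it skips the preprocessing in step 1, groups by year and month only, and uses np.datetime64 labels with the day fixed to 1 instead of cftime objects. The variable names (da, df, labels, labeled, monthly_mean) are made up for the illustration.

```python
import pandas as pd
import xarray as xr

# Four timestamps spanning three months (the values used in the docstring above).
time = pd.to_datetime(
    ["2000-01-01T12:00", "2000-01-31T21:00", "2000-03-01T21:00", "2000-04-01T03:00"]
)
da = xr.DataArray([1.0, 3.0, 5.0, 7.0], coords={"time": time}, dims=["time"], name="ts")

# Step 2: keep only the datetime components needed for grouping in a DataFrame.
df = pd.DataFrame(
    {"year": da["time"].dt.year.values, "month": da["time"].dt.month.values}
)

# Step 3: collapse each (year, month) pair into one timestamp label.
# Assumption for this sketch: the label is the first day of the month.
labels = pd.to_datetime(df.assign(day=1))

# Step 4: replace the original time coordinates with the labels.
labeled = da.assign_coords(time=("time", labels.values))

# Step 5: group on the single labeled coordinate and take the mean.
monthly_mean = labeled.groupby("time").mean()
print(monthly_mean)
```

The two January timestamps share the label 2000-01-01, so they fall into one group. The extra DataFrame round trip in steps 2 to 4 is the part that multi-variable grouping would make unnecessary.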

Describe the solution you'd like

It would be simpler and possibly more performant to leverage Xarray's newly added support for grouping by multiple variables (e.g., .groupby(["time.year", "time.season"])) instead of using pandas to store and manipulate datetime components. This would remove much of the internal complexity in the temporal averaging API.

Describe alternatives you've considered

Multi-variable grouping was originally done using pd.MultiIndex, but we shifted away from this approach because this object cannot be written out to netcdf4. Also, pd.MultiIndex is not a standard object type for representing time coordinates in xarray; the standard object types are np.datetime64 and cftime.

Additional context

Future solution through xarray + flox:

@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Apr 6, 2022
@tomvothecoder tomvothecoder self-assigned this Apr 6, 2022
@tomvothecoder tomvothecoder changed the title [FEATURE]: Improve temporal averaging grouping without the use of pandas MultiIndex [FEATURE]: Improve temporal averaging grouping logic Apr 6, 2022
@tomvothecoder tomvothecoder changed the title [FEATURE]: Improve temporal averaging grouping logic [Refactor]: Improve temporal averaging grouping logic Nov 9, 2022
@tomvothecoder tomvothecoder changed the title [Refactor]: Improve temporal averaging grouping logic [Refactor]: Consider using flox to improve temporal averaging grouping logic Apr 14, 2023
@dcherian

I saw the ping at pydata/xarray#6610. Let me know if you run into issues or have questions.

@tomvothecoder
Collaborator Author

Thanks @dcherian! I'm looking forward to trying out flox.

@tomvothecoder tomvothecoder removed this from the FY24Q2 (01/01/24 - 03/31/24) milestone Dec 19, 2023
@tomvothecoder tomvothecoder changed the title [Refactor]: Consider using flox to improve temporal averaging grouping logic [Refactor]: Consider using flox and xr.resample() to improve temporal averaging grouping logic Sep 5, 2024
@tomvothecoder
Collaborator Author

tomvothecoder commented Oct 14, 2024

xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.

Example:

import xarray as xr
import numpy as np
import pandas as pd

# Create time coordinates
time = pd.date_range("2000-01-01", "2003-12-31", freq="D")

# Create lat and lon coordinates
lat = [10, 20]
lon = [30, 40]

# Create dummy air temperature data
data = np.random.rand(len(time), len(lat), len(lon))

# Create the Dataset
ds = xr.Dataset(
    {"air_temperature": (["time", "lat", "lon"], data)},
    coords={"time": time, "lat": lat, "lon": lon},
)

print(ds)

ds_gb = ds.groupby(["time.year", "time.month"]).mean()
