[Refactor]: Consider using flox and xr.resample() to improve temporal averaging grouping logic #217
Comments
tomvothecoder
changed the title
[FEATURE]: Improve temporal averaging grouping without the use of pandas MultiIndex
[FEATURE]: Improve temporal averaging grouping logic
Apr 6, 2022
tomvothecoder
changed the title
[FEATURE]: Improve temporal averaging grouping logic
[Refactor]: Improve temporal averaging grouping logic
Nov 9, 2022
tomvothecoder
changed the title
[Refactor]: Improve temporal averaging grouping logic
[Refactor]: Consider using flox to improve temporal averaging grouping logic
Apr 14, 2023
I saw the ping at pydata/xarray#6610. Let me know if you run into issues or have questions.
Thanks @dcherian! I'm looking forward to trying out
tomvothecoder
changed the title
[Refactor]: Consider using flox to improve temporal averaging grouping logic
[Refactor]: Consider using flox and xr.resample() to improve temporal averaging grouping logic
Sep 5, 2024
tomvothecoder
modified the milestones:
FY24Q4 (07/01/24 - 09/30/24),
FY24 Items for Dev Day
Sep 12, 2024
tomvothecoder
modified the milestones:
FY24 Items for Dev Day,
FY25Q1 (10/01/24 - 12/31/24)
Sep 25, 2024
Example:
import xarray as xr
import numpy as np
import pandas as pd

# Create time coordinates
time = pd.date_range("2000-01-01", "2003-12-31", freq="D")

# Create lat and lon coordinates
lat = [10, 20]
lon = [30, 40]

# Create dummy air temperature data
data = np.random.rand(len(time), len(lat), len(lon))

# Create the Dataset
ds = xr.Dataset(
    {"air_temperature": (["time", "lat", "lon"], data)},
    coords={"time": time, "lat": lat, "lon": lon},
)
print(ds)

# Group by year and month (multi-variable grouping needs xarray >= 2024.09.0)
ds_gb = ds.groupby(["time.year", "time.month"]).mean()
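Since the issue title also mentions xr.resample(), here is a minimal sketch of the same year-by-month average written with resample instead of a multi-variable groupby. It rebuilds the dummy dataset from the example above with a seeded random generator (an assumption made only so the result is reproducible) and is not xcdat's implementation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same dummy dataset as in the example above, but seeded for reproducibility
time = pd.date_range("2000-01-01", "2003-12-31", freq="D")
lat = [10, 20]
lon = [30, 40]
rng = np.random.default_rng(0)
data = rng.random((len(time), len(lat), len(lon)))

ds = xr.Dataset(
    {"air_temperature": (["time", "lat", "lon"], data)},
    coords={"time": time, "lat": lat, "lon": lon},
)

# "MS" is the month-start frequency: one bin per calendar month of each year
ds_monthly = ds.resample(time="MS").mean()
print(ds_monthly.sizes)
```

Here the resampled result keeps a single datetime-valued time dimension (one entry per month of each year), whereas the grouped result is indexed by year and month.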
Is your feature request related to a problem?
Before xarray 2024.09.0, GroupBy operations were limited to a single variable. Grouping by multiple coordinates (e.g., time.year and time.season) required creating a new set of coordinates before grouping due to the xarray limitations described below (source).
xarray >= 2024.09.0 now supports grouping by multiple variables: https://xarray.dev/blog/multiple-groupers and https://docs.xarray.dev/en/stable/user-guide/groupby.html#grouping-by-multiple-variables.
Related code in xcdat for temporal grouping: xcdat/xcdat/temporal.py, lines 1266 to 1322 in c9bcbcd
Current temporal averaging logic (workaround for multi-variable grouping):
... xarray.DataArray to a pandas.DataFrame,
a. Keep only the DataFrame columns needed for grouping (e.g., "year" and "season" for seasonal group averages), essentially "labeling" coordinates with their groups
b. Process the DataFrame including:
   - Mapping of months to custom seasons for custom seasonal grouping. Now done with Xarray/NumPy via "Add support for custom seasons spanning calendar years" (#423).
   - Correction of "DJF" seasons by shifting Decembers over to the next year. Now done with Xarray/NumPy via "Add support for custom seasons spanning calendar years" (#423).
   - ... cftime coordinates (season strings aren't supported in cftime/datetime objects)
... cftime objects to represent new time coordinates
Describe the solution you'd like
It would be simpler and possibly more performant to leverage Xarray's newly added support for grouping by multiple variables (e.g., .groupby(["time.year", "time.season"])) instead of using Pandas to store and manipulate datetime components. This solution would remove much of the internal complexity of the temporal averaging API.
Describe alternatives you've considered
Multi-variable grouping was originally done using pd.MultiIndex, but we shifted away from this approach because this object cannot be written out to netcdf4. Also, pd.MultiIndex is not the standard object type for representing time coordinates in xarray. The standard object types are np.datetime64 and cftime.
Additional context
Future solution through xarray + flox:
- ... xarray version in "Update GroupBy constructor for grouping by multiple variables, dask arrays" (pydata/xarray#6610), we should be able to do this.
- ... flox in GroupBy and resample (pydata/xarray#5734) is now merged, which improves .groupby() performance significantly.
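Whichever grouping backend is used, seasonal averages still depend on the "DJF" correction listed in the workaround steps earlier: each December has to be counted toward the following year's season. A minimal standard-library sketch of that labeling step is below. The helper name season_year_label, the month-to-season mapping, and the dummy daily values are illustrative assumptions, not xcdat's API or data.

```python
from collections import defaultdict
from datetime import date, timedelta

# Assumed default month -> season mapping (illustrative only)
MONTH_TO_SEASON = {
    12: "DJF", 1: "DJF", 2: "DJF",
    3: "MAM", 4: "MAM", 5: "MAM",
    6: "JJA", 7: "JJA", 8: "JJA",
    9: "SON", 10: "SON", 11: "SON",
}


def season_year_label(day: date) -> tuple:
    """Return a (year, season) label; Decembers shift to the next year's DJF."""
    year = day.year + 1 if day.month == 12 else day.year
    return year, MONTH_TO_SEASON[day.month]


# Dummy daily series covering 2000-01-01 through 2001-12-31
start = date(2000, 1, 1)
days = [start + timedelta(days=i) for i in range(366 + 365)]
values = [float(i) for i in range(len(days))]

# Collect values under their (year, season) label, then average each group
groups = defaultdict(list)
for day, value in zip(days, values):
    groups[season_year_label(day)].append(value)

means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
```

With this labeling, December 2000 lands in the same group as January and February 2001, so each DJF group is contiguous in time.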