
How should multi-model statistics handle daily data on different calendars? #1210

Open
zklaus opened this issue Jul 2, 2021 · 12 comments

@zklaus

zklaus commented Jul 2, 2021

The recent, more comprehensive metadata handling in the multi-model statistics turned up a conceptual issue for multi-model statistics on daily data with different calendars.

The issue is that in this case there are days for which only a subset of models provides data. This usually happens with datasets that contain leap days (gregorian or standard calendar, all_leap) vs. those that don't (noleap), or with more unusual calendars (360_day, with 30 days in every month); see CF conventions 1.7, Sect. 4.4.1 for the full list of supported calendars.
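For illustration, the mismatch in day counts can be sketched in a few lines of plain Python (the `n_days` helper and the mapping are made up for this sketch, not part of any library):

```python
# Days per year under the CF calendars mentioned above (illustrative only).
DAYS_PER_YEAR = {
    "noleap": 365,    # never a Feb 29
    "all_leap": 366,  # always a Feb 29
    "360_day": 360,   # twelve 30-day months
}


def n_days(year: int, calendar: str) -> int:
    """Return the number of days in `year` for the given CF calendar."""
    if calendar == "standard":
        # Gregorian leap-year rule.
        is_leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        return 366 if is_leap else 365
    return DAYS_PER_YEAR[calendar]


print(n_days(2000, "standard"))  # 366 (leap year)
print(n_days(2000, "noleap"))    # 365
print(n_days(2000, "360_day"))   # 360
```

A daily multi-model statistic over these calendars therefore has to decide what to do on the days where the counts disagree.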

The old version of the multi-model statistics has two modes (selected with the span parameter). In overlap mode, it discards all days that don't appear in all datasets, producing a result that leaves out those days. In full mode, [missing until I have done a run in full mode].
However, the documentation states that

As the number of days in a year may vary between calendars, (sub-)daily data with different calendars are not supported.

The new version of the multi-model statistics follows the documentation by throwing an exception, albeit a cryptic one.

Alternative strategies could be

  • Follow the existing overlap/full strategies
  • Refuse daily data
  • Introduce a calendar-aligning preprocessor. It is fairly common to deal with this problem by repeating certain days to fill in the missing information, or by leaving out superfluous days, for example from the all_leap calendar.
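The "leave out superfluous days" half of the third alternative can be sketched with plain datetime objects (the `to_noleap` name is hypothetical, not an existing preprocessor):

```python
from datetime import date, timedelta


def to_noleap(dates):
    """Drop Feb 29 entries, converting a standard-calendar daily series
    to a noleap one -- one of the alignment strategies suggested above."""
    return [d for d in dates if not (d.month == 2 and d.day == 29)]


start = date(2000, 2, 27)
days = [start + timedelta(days=i) for i in range(4)]  # Feb 27 .. Mar 1
print([d.isoformat() for d in to_noleap(days)])
# ['2000-02-27', '2000-02-28', '2000-03-01']
```

The opposite direction (filling missing days by repeating a neighbour) would be the matching strategy for going from noleap to a leap-day calendar.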

Some aspects of this have been discussed in #1201 and #1198.
This issue has previously been mentioned in #937, #744.
It is also connected to #781.

@zklaus zklaus added this to the v2.3.1 milestone Jul 2, 2021
@zklaus
Author

zklaus commented Jul 2, 2021

My own suggestion would be to implement the first alternative here, i.e. an overlap mode and a full mode that follow the current implementation's behavior as closely as is sensible for now (the bugfix release).

But I am also interested in a discussion of the best long-term scientific strategy. If that differs from the bugfix strategy, we can tackle it after the 2.3.0 release of the tool.

In any case, the documentation should be updated if we support daily data.

@Peter9192
Contributor

We had an implementation at some point where we did something much closer to the original behaviour:

from functools import reduce

import iris
import numpy as np


def _subset(cube, time_points):
    """Subset cube to a given time range."""
    begin = cube.coord('time').units.num2date(time_points[0])
    end = cube.coord('time').units.num2date(time_points[-1])
    constraint = iris.Constraint(time=lambda cell: begin <= cell.point <= end)
    return cube.extract(constraint)


def _extend(cube, time_points):
    """Extend cube to a specified time range."""
    time_points = cube.coord('time').units.num2date(time_points)
    sample_points = [('time', time_points)]
    scheme = iris.analysis.Nearest(extrapolation_mode='mask')
    return cube.interpolate(sample_points, scheme)


def _align(cubes, span):
    """Expand or subset cubes so they share a common time span."""
    _unify_time_coordinates(cubes)
    if _time_coords_are_aligned(cubes):
        return cubes
    all_time_arrays = [cube.coord('time').points for cube in cubes]
    if span == 'overlap':
        common_time_points = reduce(np.intersect1d, all_time_arrays)
        new_cubes = [_subset(cube, common_time_points) for cube in cubes]
    elif span == 'full':
        all_time_points = reduce(np.union1d, all_time_arrays)
        new_cubes = [_extend(cube, all_time_points) for cube in cubes]
    else:
        raise ValueError(f"Invalid argument for span: {span!r}. "
                         "Must be one of 'overlap', 'full'.")
    return new_cubes

I thought the implementation was quite elegant from a readability point of view, but we weren't happy that we had to use an interpolation method for iris to extend a cube (it doesn't actually interpolate, but just masks missing values). We abandoned it because it didn't play well with our lazy aspirations. However, this might be something we could use as a "bugfix" for now.

@zklaus
Author

zklaus commented Jul 2, 2021

And the problem with the lazy aspirations is the lack of a da.intersect1d and da.union1d?

@Peter9192
Contributor

Don't really remember... Realizing only the values of the time arrays shouldn't be a problem, I suppose. I think it had to do with the interpolation.

@zklaus
Author

zklaus commented Jul 2, 2021

Ok, sounds good to me. Could you go ahead with this approach?

@Peter9192
Contributor

I can give it a try, but I'm still not quite sure whether we have agreed on the desired behaviour. Effectively, what this will do is:

  • span="overlap": discard days that occur in some calendars but not in others
  • span="full": keep all days, and for each day compute the statistics over those datasets for which that specific day is available. So if you have 10 datasets, of which 8 use a no-leap calendar and 2 a standard calendar, then the multi-model result will contain leap days, but the statistics for those days are based only on the 2 datasets for which they are available.
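A minimal NumPy sketch of that per-day behaviour (plain masked arrays, not the actual preprocessor code): two "datasets" over three days, where the second lacks the middle day, e.g. a leap day.

```python
import numpy as np

# Dataset a provides all three days; dataset b is missing the middle one.
a = np.ma.masked_invalid([1.0, 2.0, 3.0])
b = np.ma.masked_invalid([2.0, np.nan, 5.0])

# Per-day mean over whichever datasets provide each day:
# masked entries are simply left out of the statistic.
mean = np.ma.stack([a, b]).mean(axis=0)
print(mean.tolist())  # [1.5, 2.0, 4.0]
```

The middle value (2.0) comes from dataset a alone, matching the description above: the day is kept, but its statistic is based on fewer datasets.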

@zklaus
Author

zklaus commented Jul 5, 2021

I think the goal for now is to emulate the behavior of the old code. For overlap, that is what you describe; for full, I think so too, but I have not been able to confirm that yet.

@stefsmeets
Contributor

And the problem with the lazy aspirations is the lack of a da.intersect1d and da.union1d?

Yes, we worked around this, because it is not lazy.

@zklaus
Author

zklaus commented Jul 5, 2021

Ok, most importantly, since this is really urgent, let's do as discussed above with the previous code, even though it is not lazy.

Long-term, it is understandable that there are no general, lazy dask intersect1d and union1d routines, since both need to take all of the data into account; they are in some sense fancy sorting routines, and dask doesn't do sorting. However, I think we can exploit the pre-sorted nature of the time axis to build a custom lazy thing around that.
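A sketch of such a custom routine, assuming sorted, duplicate-free time points (the `sorted_intersect` name is made up; a chunked dask version would wrap the same single-pass merge logic, and a union would be analogous):

```python
def sorted_intersect(xs, ys):
    """Intersection of two sorted, duplicate-free sequences via one
    merge pass -- the pre-sorted time axis makes the full sort that
    np.intersect1d would need unnecessary."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


print(sorted_intersect([1, 2, 3, 5], [2, 3, 4, 5]))  # [2, 3, 5]
```

Because each input is only scanned forward once, this could in principle be applied chunk by chunk without materializing the whole time axis.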

@Peter9192
Contributor

Peter9192 commented Jul 5, 2021

Ok, sounds good to me. Could you go ahead with this approach?

see #1212

* `span="full"`: keep all days, and for each day compute the statistics over those datasets for which that specific day is available. So if you have 10 datasets of which 8 no-leap calendars and 2 standard calendars, then the multimodel will contain leap days, but the statistics for these days are based only on those 2 datasets for which they were available.

I was a bit too quick there. Actually, for a leap day, #1212 uses nearest-neighbour lookup to fill the missing data. Masking only happens outside the original date range. ATM I don't see an easy way to mask the missing days in the interior (okay perhaps through xarray...).
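To illustrate that nearest-neighbour behaviour with plain NumPy (not the iris call itself): an interior missing day gets the value of its closest available neighbour instead of a mask. Here day-of-year 59 (a Feb 29) is absent from the source.

```python
import numpy as np

src_days = np.array([58.0, 60.0])        # day-of-year; day 59 is missing
src_vals = np.array([10.0, 20.0])
target = np.array([58.0, 59.0, 60.0])

# Index of the nearest source day for each target day
# (exact ties resolve to the earlier point here).
idx = np.abs(src_days[None, :] - target[:, None]).argmin(axis=1)
print(src_vals[idx].tolist())  # [10.0, 10.0, 20.0]
```

The missing interior day is filled (10.0) rather than masked, which is exactly the behaviour described above: masking would only occur for target days outside the source range.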

@zklaus
Author

zklaus commented Jul 7, 2021

From the discussions in #1212 it seems a bit more work is needed, and probably #744 as well. Furthermore, the only recipe that had been using multi-model statistics on daily data no longer does so. Since our documentation says that (sub-)daily data isn't supported, and a warning to that effect is issued as well, we will bump this to 2.4.0.

@zklaus zklaus modified the milestones: v2.3.1, v2.4.0 Jul 7, 2021
@zklaus zklaus modified the milestones: v2.4.0, v2.5.0 Oct 8, 2021
@schlunma
Contributor

schlunma commented Feb 4, 2022

Moving this to v2.6 since there is no open PR yet.

@schlunma schlunma modified the milestones: v2.5.0, v2.6.0 Feb 4, 2022
@sloosvel sloosvel removed this from the v2.6.0 milestone Jun 7, 2022