
How should multi-model statistics handle daily data on different calendars? #1210

Open
zklaus opened this issue Jul 2, 2021 · 12 comments

@zklaus

zklaus commented Jul 2, 2021

The recent, more comprehensive metadata handling in the multi-model statistics turned up a conceptual issue for multi-model statistics on daily data with different calendars.

The issue is that in this case there are days for which only a subset of models provides data. This usually happens with datasets that contain leap days (gregorian or standard calendar, all_leap) vs. those that don't (noleap), or with more unusual calendars (360_day, with 30 days in every month); see CF conventions 1.7, Sect. 4.4.1 for the full list of supported calendars.
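For illustration, the mismatch in day counts can be sketched in a few lines of plain Python (the `n_days` helper and the mapping are made up for this sketch, not part of any library):

```python
# Days per year under the CF calendars mentioned above (illustrative only).
DAYS_PER_YEAR = {
    "noleap": 365,    # never a Feb 29
    "all_leap": 366,  # always a Feb 29
    "360_day": 360,   # twelve 30-day months
}


def n_days(year: int, calendar: str) -> int:
    """Return the number of days in `year` for the given CF calendar."""
    if calendar == "standard":
        # Gregorian leap-year rule.
        is_leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
        return 366 if is_leap else 365
    return DAYS_PER_YEAR[calendar]


print(n_days(2000, "standard"))  # 366 (leap year)
print(n_days(2000, "noleap"))    # 365
print(n_days(2000, "360_day"))   # 360
```

A daily multi-model statistic over these calendars therefore has to decide what to do on the days where the counts disagree.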

The old version of the multi-model statistics has two modes (selected with the span parameter). In overlap mode, it discards all days that don't appear in all datasets, producing a result that leaves out those days. In full mode, [missing until I have done a run in full mode].
However, the documentation states that

As the number of days in a year may vary between calendars, (sub-)daily data with different calendars are not supported.

The new version of the multi-model statistics follows the documentation by throwing an exception, albeit a cryptic one.

Alternative strategies could be

  • Follow the existing overlap/full strategies
  • Refuse daily data
  • Introduce a calendar-aligning preprocessor. It is fairly common to deal with this problem by repeating certain days to fill in the missing information, or by leaving out superfluous days, for example from the all_leap calendar.
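The "leave out superfluous days" half of the third alternative can be sketched with plain datetime objects (the `to_noleap` name is hypothetical, not an existing preprocessor):

```python
from datetime import date, timedelta


def to_noleap(dates):
    """Drop Feb 29 entries, converting a standard-calendar daily series
    to a noleap one -- one of the alignment strategies suggested above."""
    return [d for d in dates if not (d.month == 2 and d.day == 29)]


start = date(2000, 2, 27)
days = [start + timedelta(days=i) for i in range(4)]  # Feb 27 .. Mar 1
print([d.isoformat() for d in to_noleap(days)])
# ['2000-02-27', '2000-02-28', '2000-03-01']
```

The opposite direction (filling missing days by repeating a neighbour) would be the matching strategy for going from noleap to a leap-day calendar.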

Some aspects of this have been discussed in #1201 and #1198.
This issue has previously been mentioned in #937, #744.
It is also connected to #781.

@zklaus zklaus added this to the v2.3.1 milestone Jul 2, 2021
@zklaus
Author

zklaus commented Jul 2, 2021

My own suggestion would be to implement the first alternative here, i.e. an overlap mode and a full mode that follow the current implementation's behavior as closely as is sensible for now (the bugfix release).

But I am also interested in a discussion of the best long-term scientific strategy. If that differs from the bugfix strategy, we can tackle it after the 2.3.0 release of the tool.

In any case, the documentation should be updated if we support daily data.

@Peter9192
Contributor

We had an implementation at some point where we did something much closer to the original behaviour:

from functools import reduce

import iris
import numpy as np


def _subset(cube, time_points):
    """Subset cube to a given time range."""
    begin = cube.coord('time').units.num2date(time_points[0])
    end = cube.coord('time').units.num2date(time_points[-1])
    constraint = iris.Constraint(time=lambda cell: begin <= cell.point <= end)
    return cube.extract(constraint)


def _extend(cube, time_points):
    """Extend cube to a specified time range."""
    time_points = cube.coord('time').units.num2date(time_points)
    sample_points = [('time', time_points)]
    scheme = iris.analysis.Nearest(extrapolation_mode='mask')
    return cube.interpolate(sample_points, scheme)


def _align(cubes, span):
    """Expand or subset cubes so they share a common time span."""
    _unify_time_coordinates(cubes)
    if _time_coords_are_aligned(cubes):
        return cubes
    all_time_arrays = [cube.coord('time').points for cube in cubes]
    if span == 'overlap':
        common_time_points = reduce(np.intersect1d, all_time_arrays)
        new_cubes = [_subset(cube, common_time_points) for cube in cubes]
    elif span == 'full':
        all_time_points = reduce(np.union1d, all_time_arrays)
        new_cubes = [_extend(cube, all_time_points) for cube in cubes]
    else:
        raise ValueError(f"Invalid argument for span: {span!r}. "
                         "Must be one of 'overlap', 'full'.")
    return new_cubes

I thought the implementation was quite elegant from a readability point of view, but we weren't happy that we had to use an interpolation method for iris to extend a cube (it doesn't actually interpolate, but just masks missing values). We abandoned it because it didn't play well with our lazy aspirations. However, this might be something we could use as a "bugfix" for now.

@zklaus
Author

zklaus commented Jul 2, 2021

And the problem with the lazy aspirations is the lack of a da.intersect1d and da.union1d?

@Peter9192
Contributor

Don't really remember... Realizing only the values of the time arrays shouldn't be a problem, I suppose. I think it had to do with the interpolation.

@zklaus
Author

zklaus commented Jul 2, 2021

Ok, sounds good to me. Could you go ahead with this approach?

@Peter9192
Contributor

I can give it a try, but I'm still not quite sure whether we have agreed on the desired behaviour. Effectively, what this will do is:

  • span="overlap": discard days that occur in some calendars but not in others
  • span="full": keep all days, and for each day compute the statistics over those datasets for which that specific day is available. So if you have 10 datasets, of which 8 use a no-leap calendar and 2 a standard calendar, then the multi-model result will contain leap days, but the statistics for those days are based only on the 2 datasets for which they are available.
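A minimal NumPy sketch of that per-day behaviour (plain masked arrays, not the actual preprocessor code): two "datasets" over three days, where the second lacks the middle day, e.g. a leap day.

```python
import numpy as np

# Dataset a provides all three days; dataset b is missing the middle one.
a = np.ma.masked_invalid([1.0, 2.0, 3.0])
b = np.ma.masked_invalid([2.0, np.nan, 5.0])

# Per-day mean over whichever datasets provide each day:
# masked entries are simply left out of the statistic.
mean = np.ma.stack([a, b]).mean(axis=0)
print(mean.tolist())  # [1.5, 2.0, 4.0]
```

The middle value (2.0) comes from dataset a alone, matching the description above: the day is kept, but its statistic is based on fewer datasets.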

@zklaus
Author

zklaus commented Jul 5, 2021

I think the goal for now is to emulate the behavior of the old code. For overlap, that is what you describe; for full, I think so too, but I have not been able to confirm that yet.

@stefsmeets
Contributor

And the problem with the lazy aspirations is the lack of a da.intersect1d and da.union1d?

Yes, we worked around this, because it is not lazy.

@zklaus
Author

zklaus commented Jul 5, 2021

Ok, most importantly, since this is really urgent, let's do as discussed above with the previous code, even though it is not lazy.

Long-term, it is understandable that there are no general, lazy dask intersect1d and union1d routines, since both need to take all of the data into account; they are in some sense fancy sorting routines, and dask doesn't do sorting. However, I think we can exploit the pre-sorted nature of the time axis to build a custom lazy thing around that.
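A sketch of such a custom routine, assuming sorted, duplicate-free time points (the `sorted_intersect` name is made up; a chunked dask version would wrap the same single-pass merge logic, and a union would be analogous):

```python
def sorted_intersect(xs, ys):
    """Intersection of two sorted, duplicate-free sequences via one
    merge pass -- the pre-sorted time axis makes the full sort that
    np.intersect1d would need unnecessary."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


print(sorted_intersect([1, 2, 3, 5], [2, 3, 4, 5]))  # [2, 3, 5]
```

Because each input is only scanned forward once, this could in principle be applied chunk by chunk without materializing the whole time axis.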

@Peter9192
Contributor

Peter9192 commented Jul 5, 2021

Ok, sounds good to me. Could you go ahead with this approach?

see #1212

* `span="full"`: keep all days, and for each day compute the statistics over those datasets for which that specific day is available. So if you have 10 datasets of which 8 no-leap calendars and 2 standard calendars, then the multimodel will contain leap days, but the statistics for these days are based only on those 2 datasets for which they were available.

I was a bit too quick there. Actually, for a leap day, #1212 uses nearest-neighbour lookup to fill the missing data. Masking only happens outside the original date range. ATM I don't see an easy way to mask the missing days in the interior (okay perhaps through xarray...).
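To illustrate that nearest-neighbour behaviour with plain NumPy (not the iris call itself): an interior missing day gets the value of its closest available neighbour instead of a mask. Here day-of-year 59 (a Feb 29) is absent from the source.

```python
import numpy as np

src_days = np.array([58.0, 60.0])        # day-of-year; day 59 is missing
src_vals = np.array([10.0, 20.0])
target = np.array([58.0, 59.0, 60.0])

# Index of the nearest source day for each target day
# (exact ties resolve to the earlier point here).
idx = np.abs(src_days[None, :] - target[:, None]).argmin(axis=1)
print(src_vals[idx].tolist())  # [10.0, 10.0, 20.0]
```

The missing interior day is filled (10.0) rather than masked, which is exactly the behaviour described above: masking would only occur for target days outside the source range.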

@zklaus
Author

zklaus commented Jul 7, 2021

From the discussions in #1212 it seems a bit more work is needed, and probably #744 as well. Furthermore, the only recipe that had been using multi-model statistics on daily data no longer does so. Since our documentation says that (sub-)daily data isn't supported, and a warning to that effect is issued as well, we will bump this to 2.4.0.

@zklaus zklaus modified the milestones: v2.3.1, v2.4.0 Jul 7, 2021
@zklaus zklaus modified the milestones: v2.4.0, v2.5.0 Oct 8, 2021
@schlunma
Contributor

schlunma commented Feb 4, 2022

Moving this to v2.6 since there is no open PR yet.

@schlunma schlunma modified the milestones: v2.5.0, v2.6.0 Feb 4, 2022
@sloosvel sloosvel removed this from the v2.6.0 milestone Jun 7, 2022