Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fx_data not preserved as 'cell_measures' after iris aggregated_by and 'extract_levels' processor #1189

Closed
tomaslovato opened this issue Jun 18, 2021 · 20 comments · Fixed by #1269
Assignees
Labels
bug Something isn't working iris Related to the Iris package preprocessor Related to the preprocessor
Milestone

Comments

@tomaslovato
Copy link
Contributor

tomaslovato commented Jun 18, 2021

Running a recipe with time aggregation (any using cube.aggregated_by()) followed by area statistics crashes as the latter doesn't find the 'cell_measures' variable in the cube. Note that this extends also to cube.collapsed() operation.

I've been already discussing this with @schlunma in #1096 and after digging a bit of testing I realized that the example we looked in there (#1096 (comment)) was not compliant with the cell_measures assigned by the code to cubes, as time coord was missing.

As cell_measures depend also on the time coordinate, by extending @schlunma example with time also in the area variable, the problem finally come up (see details)

>>> time_coord = iris.coords.DimCoord([0, 1], var_name='time', units='day')
>>> lat_coord = iris.coords.DimCoord([0.0], var_name='lat', units='rad')
>>> year_coord = iris.coords.AuxCoord([1900, 1900], var_name='year')
>>> x = iris.cube.Cube([[1.0], [2.0]], var_name='x', dim_coords_and_dims=[(time_coord, 0), (lat_coord, 1)])
>>> x.add_aux_coord(year_coord, 0)
>>> area =  iris.coords.CellMeasure([[50.], [100.0]], var_name='areacella')
>>> x.add_cell_measure(area, [0,1])
>>> print(x)
x / (unknown)                       (time: 2; lat: 1)
    Dimension coordinates:
         time                           x       -
         lat                            -       x
    Auxiliary coordinates:
         year                           x       -
    Cell measures:
         areacella                      x       x
>>> new_x = x.aggregated_by('year', iris.analysis.MEAN)
>>> print(new_x)
x / (unknown)                       (time: 1; lat: 1)
    Dimension coordinates:
         time                           x       -
         lat                            -       x
    Auxiliary coordinates:
         year                           x       -
    Cell methods:
         mean: year
>>> col_x = x.collapsed('time', iris.analysis.MEAN)
>>> print(col_x)
x / (unknown)                       (lat: 1)
    Dimension coordinates:
         lat                           x
    Scalar coordinates:
         time: 0 day, bound=(0, 1) day
         year: 1900, bound=(1900, 1900)
    Cell methods:
         mean: time
>>>

This relates to the choice made in iris to discard cell_measures when these are time dependent as I guess it will be very tricky to handle it in the correct way (or maybe it is simply a bug!).

Second point, but for a different reason, cell_measures are lost also in the extract_levels processor, where a new cube is generated by scratch and this property is not propagated (a practical example is I want to compute global average of a specific layer from a 3D variable, e.g. seawater oxygen at 500m).
In this case the 2D cell_measures is the area and should be associated to the cube, while in the case of a 3D cell_measures it should be coherently extracted along with the variable data.

main_log_debug.txt

@tomaslovato tomaslovato added bug Something isn't working iris Related to the Iris package preprocessor Related to the preprocessor labels Jun 18, 2021
@zklaus
Copy link

zklaus commented Jun 18, 2021

@sloosvel, could you please have a look at this?

@schlunma
Copy link
Contributor

This relates to the choice made in iris to discard cell_measures when these are time dependent as I guess it will be very tricky to handle it in the correct way (or maybe it is simply a bug!).

As I commented here, I'm fairly sure that this is not a bug of iris. Performing statistical operations over a coordinate is not straightforward. As shown in the comment, one way to tackle this (for true fx variables) is to not broadcast the the fx variables before adding them to the cube and only contain the dimensions that are actually covered by the fx variables here:

cube.add_cell_measure(measure, range(0, measure.ndim))

However, the would involve quite some changes to other preprocessors, since they assume identical shapes for the data and the fx variables right now. It is also not a solution for time-dependent "fx" variables.

@tomaslovato
Copy link
Contributor Author

tomaslovato commented Jun 18, 2021

@schlunma I agree with you about the choice made in iris to neglect such a sneaky problem of handling time dependent metrics (sorry but I was a bit slow in getting the whole thing).

(I amended the remaining part of this comment as it was redundant and not improving the discussion)

@tomaslovato
Copy link
Contributor Author

tomaslovato commented Jun 21, 2021

This issue comes with a number of technical constraints, so maybe a recap from the whole discussion might be useful to see if there is a viable way through.

At the moment time preprocessor do not preserve grid metrics and spatial aggregation preprocessors cannot be executed afterwards. I'm wondering if this applies to both CMIP5 and CMIP6? (but I guess it will be probably just the latter)

There are two main constraints, clearly stated by @schlunma (here and in #1096 (comment)):

  • changes made on fx variables handling of both time-dependent and time-invariant is likely hard to be reversed as a number of preprocessors "assume identical shapes for the data and the fx variables right now".
  • iris is intentionally not providing in aggregated_by and collapsed functions an outcome for cell_measures when these also accounts for time coord as "performing statistical operations over a coordinate is not straightforward"

Is it ok to keep this broken sequence in preprocessors and add a note somewhere in the documentation?

Besides, I made a search within the current main and the aggregated_by function is used in _time.py & _cycles.py preprocessors.

Asiris will probably not support a "native" solution to this issue, could it be an alternative to create an "in-house" solution to handle grid metrics aggregation, like e.g. creating a wrapper function dedicated to perform this operation?
here a previous comment from @sloosvel with some ideas on this ...

I think that for most of the time related aggregation providing the MEAN value of coordinates should be ok (also because other statistical operators makes a little sense to be applied to grid metrics).

@zklaus zklaus added this to the v2.4.0 milestone Jun 21, 2021
@sloosvel
Copy link
Contributor

I don't have much of an opinion in the time_statistics issue but if computing the mean for the fx vars would work for you, I think it's doable.

And in the case of extract_levels, I think that the only function that needs to be expanded to take into account the presence of cell measures is _vertical_interpolation. If the level extraction is done performing cube.extract, the cell measures get preserved. That's why some models lose them but others keep them.

@tomaslovato
Copy link
Contributor Author

tomaslovato commented Jun 28, 2021

@sloosvel @schlunma I'm not sure if this relates to some other on-going development branch, so before creating a new branch/PR, I put here below a first rough,working version of this 'in house' function to handle the issue.
(I didn't found a better way to get back the fx_cube from cell_measures but someone more experienced than me could improve this aspect!)

diff --git a/esmvalcore/preprocessor/_time.py b/esmvalcore/preprocessor/_time.py
index 3cf1d5f4d..2c6d3380b 100644
--- a/esmvalcore/preprocessor/_time.py
+++ b/esmvalcore/preprocessor/_time.py
@@ -558,11 +558,11 @@ def climate_statistics(cube,

     clim_coord = _get_period_coord(cube, period, seasons)
     operator = get_iris_analysis_operation(operator)
-    clim_cube = cube.aggregated_by(clim_coord, operator)
-    clim_cube.remove_coord('time')
-    if clim_cube.coord(clim_coord.name()).is_monotonic():
-        iris.util.promote_aux_coord_to_dim_coord(clim_cube, clim_coord.name())
-    else:
+    clim_cube = aggregated_by(cube, clim_coord, operator)
+    #clim_cube.remove_coord('time')
+    if not clim_cube.coord(clim_coord.name()).is_monotonic():
+    #    iris.util.promote_aux_coord_to_dim_coord(clim_cube, clim_coord.name())
+    #else:
         clim_cube = iris.cube.CubeList(clim_cube.slices_over(
             clim_coord.name())).merge_cube()
     cube.remove_coord(clim_coord)
@@ -970,3 +970,50 @@ def resample_time(cube, month=None, day=None, hour=None):
         return True

     return cube.extract(iris.Constraint(time=compare))
+
+
+def aggregated_by(cube, coords, operator):
+    """Compute iris aggregation over time preserving cell_measures (#1189).
+
+    Parameters
+    ----------
+    cube: iris.cube.Cube
+        input cube.
+
+    coords: list of coord names
+        Coordinate(s) over which group aggregation is to be performed.
+
+    operator: str
+        Select operator to apply.
+
+    Returns
+    -------
+    iris.cube.Cube
+        Cube aggregated upon operator
+
+    """
+    from ._ancillary_vars import add_cell_measure
+    if cube.cell_measures():
+        # cell_measure into temporary cube
+        measure = cube.cell_measure().measure
+        fx_cube = cube.copy()
+        fx_cube.data = cube.cell_measure().data
+        fx_cube.var_name = cube.cell_measure().var_name
+        fx_cube.standard_name = cube.cell_measure().standard_name
+        fx_cube.units = cube.cell_measure().units
+        # compute aggregation
+        cube = cube.aggregated_by(coords, operator)
+        fx_cube = fx_cube.aggregated_by(coords, iris.analysis.MEAN)
+        # add back cell_measure
+        measure = iris.coords.CellMeasure(
+            fx_cube.data,
+            standard_name=fx_cube.standard_name,
+            units=fx_cube.units,
+            measure=measure,
+            var_name=fx_cube.var_name,
+            attributes=fx_cube.attributes)
+        cube.add_cell_measure(measure, range(0, measure.ndim))
+    else:
+        cube = cube.aggregated_by(coords, operator)
+
+    return cube

@sloosvel
Copy link
Contributor

You could acces the fx cube by looping over the cube.cell_measures() so there would not be a need to copy again the variable (a bit like it's being done in the remove_fx_variables step). And then import add_cell_measure from the ancillary_variables to add the new fx cube. That way you could spare having to define again the measure.

@schlunma
Copy link
Contributor

Nice @tomaslovato! Some comments that might be relevant for the actual implementation:

  • @ESMValGroup/esmvaltool-coreteam any objections on calculating the mean over fx variables for cube aggregation? If the fx variables are not dependent on the dimension that is collapsed, this is perfectly valid; if not, then I'm still not 100% sure that this is (in general) scientifically valid.
  • Apart from all cell_measures, this should also be applied to all anciliary_variables.
  • We should make sure to use this modified iris.cube.Cube.aggregated_by in every preprocessor function for consistency. For this it might be useful to add it to this module.
  • We should also make sure to provide a modified iris.cube.Cube.collapsed that preserves cell_measures and anciliary_variables (of course only if the native function does not do that, that needs to be tested).

@tomaslovato
Copy link
Contributor Author

Thanks @sloosvel for the suggestions !!

I tried to import the add_cell_measure at the top of _time.py module but I got swamped into a circular dependency error

File "/users_home/oda/tl28319/GIT/ESMValCore/esmvalcore/preprocessor/_time.py", line 23, in <module>
    from ._ancillary_vars import add_cell_measure
ImportError: cannot import name 'add_cell_measure' from partially initialized module 'esmvalcore.preprocessor._ancillary_vars' (most likely due to a circular import)

so, for testing purposes, I imported add_cell_measure within the function and it works perfectly, but I'm not sure this way would comply with ESMValCore general coding structure.

I had a look at remove_fx_variables as you suggested, but looping over cube.cell_measures() return an item of iris.coords.CellMeasure and it doesn't allow for .aggregated_by as for a standard cube.

I actually realized that my solution is not enough general, as I made a copy of the original cube and then overwrite the data from cell_measure, but this won't work if the cube is 3D and the measure is 2D (e.g. area).
It is definitely better to create a cube from scratch ... or even implement a function in ancilliary_variables to do it (e.g. an appealing name could be get_cell_measure_cube)

@tomaslovato
Copy link
Contributor Author

  • @ESMValGroup/esmvaltool-coreteam any objections on calculating the mean over fx variables for cube aggregation? If the fx variables are not dependent on the dimension that is collapsed, this is perfectly valid; if not, then I'm still not 100% sure that this is (in general) scientifically valid.
  • Apart from all cell_measures, this should also be applied to all anciliary_variables.

I think the mean is a reasonable for cell_measures such as grid area or volume, but compiling a list of both cell_measures and ancillary variables handled by ESMValCore may help to better tailor their treatment

  • We should make sure to use this modified iris.cube.Cube.aggregated_by in every preprocessor function for consistency. For this it might be useful to add it to this module.

@schlunma I was more oriented toward the creation of a python function contained in _time.py (just to make clear it applies within this preprocessor) ... I'm not sure it might be particularly safe at this stage to start by creating a general, shared iris.cube function

  • We should also make sure to provide a modified iris.cube.Cube.collapsed that preserves cell_measures and anciliary_variables (of course only if the native function does not do that, that needs to be tested).

I guess this should go through a separate issue/development to not pile up too many things in here ...

@schlunma
Copy link
Contributor

I think the mean is a reasonable for cell_measures such as grid area or volume, but compiling a list of both cell_measures and ancillary variables handled by ESMValCore may help to better tailor their treatment

Sounds good!

@schlunma I was more oriented toward the creation of a python function contained in _time.py (just to make clear it applies within this preprocessor) ... I'm not sure it might be particularly safe at this stage to start by creating a general, shared iris.cube function

Yes, but I think it's fairly easy to write a general function with the signature

def aggregated_by(cube, coords, operator, cell_measures_operator=None, ancilliary_variables_operator=None, **kwargs)

that applies the operator cell_measures_operator to all cell measures, the operator ancilliary_variables_operator to all ancilliary variables and finally performs cube.aggregated_by(coords, operator, **kwargs). In _time.py you can use this function with cell_measures_operator=iris.analysis.MEAN for example.

I agree that it might not make sense to apply this change to all other preprocessors yet without further considerations.

I guess this should go through a separate issue/development to not pile up too many things in here ...

Agreed!

@tomaslovato
Copy link
Contributor Author

@schlunma @sloosvel
I spend a bit of time trying to map which variables are handled as cell_measures and ancilliary_variables to better frame the application of appropriate statistics.

areacella, areacello, volcello are the only fields accepted as cell_measures see here

Looking among available recipes I found the following that would be accounted as ancilliary_variables:
orog: Surface Altitude
sftlf: land area fraction
sftgif: land ice area fraction

.. are you aware of other variables that are used as ancilliary_variables?

@sloosvel
Copy link
Contributor

sloosvel commented Jul 6, 2021

Used in recipes, there is also sftof. But I guess that any other variable in CMIP6_fx could also be added as an ancillary one in a cube.

@schlunma
Copy link
Contributor

schlunma commented Jul 6, 2021

I thought about this a little bit more and I think that the statistical operation that is necessary for the fx files also depends on the dimension that it's applied to. For example, cell areas like areacella and areacello should be averaged over time, but summed if the statistical operation is performed over the latitude and longitude. Other variables like the land area fraction sftlf should be averaged in both cases.

I think for the time axis (which we consider here) we could simply use iris.analysis.MEAN regardless of the variable. I briefly looked at the CMIP6 tables and couldn't find any obvious exceptions.

@tomaslovato
Copy link
Contributor Author

tomaslovato commented Jul 7, 2021

I think for the time axis (which we consider here) we could simply use iris.analysis.MEAN regardless of the variable. I briefly looked at the CMIP6 tables and couldn't find any obvious exceptions.

Did you refer to the CMIP6_fx table only or also to some specific ones that contains variables which can potentially become ancillary?

I agree with you that time average should be the right operation, at least for CMIP6_fx as these are defined as time-invariant variables (time coord is added by _ancillary_vars.py)

@schlunma
Copy link
Contributor

schlunma commented Jul 7, 2021

I was mainly referring to CMIP6_fx, but it could also make sense for time-dependent variables like volcello from Omon.

@bouweandela
Copy link
Member

bouweandela commented Jul 9, 2021

We should make sure to use this modified iris.cube.Cube.aggregated_by in every preprocessor function for consistency. For this it might be useful to add it to this module. - comment by @schlunma

I think it would be good to take this up with the iris developers, as our general strategy is to make generic code available through iris instead of implementing it in ESMValCore. Are there any existing issues or discussions in the iris repository?

As shown in the comment, one way to tackle this (for true fx variables) is to not broadcast the fx variables before adding them to the cube - comment by @schlunma

And in the case of extract_levels, I think that the only function that needs to be expanded to take into account the presence of cell measures is _vertical_interpolation. - comment by @sloosvel

It may be better not to broadcast on load (even though I agree it is convenient), because I expect this is going to cause really slow runs as well as problems and slowness while trying to compute temporal statistics over this kind of time independent ancillary variables. An example is extract_levels where vertical interpolation is needed. If you need to vertically interpolate a land/sea mask or height independent cell areas, this is going to be a waste of time.

@schlunma
Copy link
Contributor

schlunma commented Jul 9, 2021

It may be better not to broadcast on load (even though I agree it is convenient)

For "true" fx variables (not time-dependent) I think this is by far the safest way to go. Calculating statistics over these variables is non-trivial and dependent on the type of variable and dimension you're aggregating, so I don't think that is something that will be in iris.

@valeriupredoi
Copy link
Contributor

I agree with @bouweandela and @schlunma - most of the cases I heard from @ledm meant fx vars need to be used as masksand then all sorts of temporal statistics are being calculated, after they've been used as masks - btw Lee is a good person to have contribute here, lots of experience with ocean cell measures; also, not to do stuff that could be done in iris is also a very good point, they stick to CMOR and data standards and if they don't have a certain calculation inside iris there's a reason for that - but do let them know about a potential enhancement, they might have just missed it. About your import error @tomaslovato - it's currently impossible to import any function from _ancilliary_vars partial module into _time because that will run into a circular import via _io that is used by _ancilliary_vars that in turn uses an import from _time - not the most robust import structure but we can deal with this later on, for now you can use functions from _ancilliary_vars into _time if you do a full module import like import esmvalcore.preprocessor._ancillary_vars as av (add this in _time, and then you can just call eg av.add_fx_variables() etc)

@tomaslovato
Copy link
Contributor Author

tomaslovato commented Jul 13, 2021

Thanks for the tips @valeriupredoi !

Summarizing latest comments from @bouweandela and @schlunma I kind of realize that
there is a consensus in not broadcasting "true" fx fields (namely from CMIP6_fx table) on data load, right?

If so, this part of the code (and some other) has to be modified to remove broadcasting:

try:
fx_data = da.broadcast_to(fx_cube.core_data(), cube.shape)
except ValueError:
logger.debug("Dimensions of %s and %s cubes do not match. "
"Cannot broadcast cubes.",
cube.var_name, fx_cube.var_name)
return
measure = iris.coords.CellMeasure(
fx_data,
standard_name=fx_cube.standard_name,
units=fx_cube.units,
measure=measure,
var_name=fx_cube.var_name,
attributes=fx_cube.attributes)

and then cell_measures will be correctly handled by iris and propagated in cubes.

However, the present issue will be still valid for those model having time-varying metric fields, but according to the good practices this issue should be reported to iris community and discussed in there, before providing in-house solutions, right @bouweandela?

If it may be of any help for further testing, I add a list of models providing e.g.volcello for either Omon or Oyr tables:
ACCESS-CM2, ACCESS-ESM1-5, E3SM-1-0, E3SM-1-1, E3SM-1-1-ECA, EC-Earth3, EC-Earth3-CC, GFDL-CM4, GFDL-ESM4, GFDL-OM4p5B, HadGEM3-GC31-LL, HadGEM3-GC31-MM, NorESM2-LM, NorESM2-MM, UKESM1-0-LL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working iris Related to the Iris package preprocessor Related to the preprocessor
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants