Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessor chain is not being applied to cell measures #436

Closed
ledm opened this issue Jan 17, 2020 · 38 comments · Fixed by #999
Closed

Preprocessor chain is not being applied to cell measures #436

ledm opened this issue Jan 17, 2020 · 38 comments · Fixed by #999
Assignees
Labels
preprocessor Related to the preprocessor

Comments

@ledm
Copy link
Contributor

ledm commented Jan 17, 2020

Basically, if your preprocessor changes the shape of the diagnostic data cube (for instance a spatial or temporal slice, such as the annual_statistics or extract_region preprocesors), before using the area_statistics, zonal_meanandvolume_statistics` preprocessors, then the resulting cube will no longer have the same shape as the FX data.

This means that we can only use the mean operation in area_statistics, zonal_meanandvolume_statistics` can only be applied on th global scale. It is not currently possible to use them on a regional scale.

Many of the preprocessors need to be applied to both the diagnostic dataset, but also the FX data, in the cases when the mean operator is used in area_statistics, zonal_meanandvolume_statistics`.

This issue is related to the recent fixes merged by @mattiarighi, @schlunma and @valeriupredoi. The PR's #429 and #433 have closed some FX related issues recently, but I don't want this issue to get lost with the other closed issues and PR.

@ledm
Copy link
Contributor Author

ledm commented Jan 17, 2020

I think that this issue had been raised elsewhere, but never had a dedicate issue number.

@ledm
Copy link
Contributor Author

ledm commented Jan 20, 2020

@schlunma @mattiarighi @valeriupredoi @zklaus, do any of you have any thoughts on how to implement this fix?

I don't mind trying to do it myself, but it would be good to have a discussion on how to do it first. Cheers!

@ledm
Copy link
Contributor Author

ledm commented Jan 20, 2020

This fix needs:

  • Apply the same preprocessor functions in the same order to the fx data.

However, there are lots of caveats, because nothing is ever easy:

  • only apply when the fx files are used by a preprocessor function (typically in a *_statistics function). Remember that not all statistics operator need the fx files.
  • Only apply when a preprocessor before the statistics preprocessor changes the shape of the diagnostic cube.
  • Does not apply any of the preocessor functions that would occur after the _statistics preprocessor that uses the FX file.
  • Also applies any preprocessor to the FX file that would make the FX and the diagnostic data incompatible (even if the operation maintains the same shape). For instance, regrid_time.
  • Keep in mind that some FX files are time-dependent and some are time-independent. This means that should only apply time preprocessing functions to fx data with a time component.
  • Presumably some preprocessors should never need to be applied to the FX (mask_fillvalues, convert_units, perhaps?)

Feel free to edit the comment for clarity of if we encounter additional requirements/caveats.

@valeriupredoi
Copy link
Contributor

cool cheers @ledm - will have a stab at this later on today, first fixing tests 🔢

@mattiarighi
Copy link
Contributor

Thank you @ledm for listing the caveats.
One question: can you give me an example of time-dependent fx? I only have experience with the masks and the area files.

@ledm
Copy link
Contributor Author

ledm commented Jan 20, 2020

A time dependent FX file is fairly common in the Ofx volcello. The ocean grid varies with time (for instance, due to tides, storm air pressure changes or sea level rise) and so the cell volume needs to change.

The CMIP6 models: GFDL-CM4, UKESM1 and HadGEM3 all have time varying volcellos. Here's a recipe that shows this problem:

preprocessors:
    prep:
        custom_order: true
        extract_volume:
            z_min: 0
            z_max: 100
        annual_statistics:
            operator: mean
        volume_statistics:
            operator: mean
            fx_files: [volcello, ]

diagnostics:
  diag:
      variables:
        thetao700_Omon:
          short_name: thetao
          preprocessor: prep
          additional_datasets:
            - {dataset: GFDL-CM4,        project: CMIP6, mip: Omon, exp: historical, ensemble: r1i1p1f1, grid: gn, start_year: 1960, end_year: 2014}
            - {dataset: HadGEM3-GC31-LL, project: CMIP6, mip: Omon, exp: historical, ensemble: r1i1p1f3, grid: gn, start_year: 1960, end_year: 2014}
            - {dataset: UKESM1-0-LL,     project: CMIP6, mip: Omon, exp: historical, ensemble: r2i1p1f2, grid: gn, start_year: 1960, end_year: 2014}

In this case, we'd need to apply both the extract_volume and the annual_statistics operators to the volcello FX data, before volume_statistics can correctly use the volcello data to calculate the volume-weighted mean temperature.

@zklaus
Copy link

zklaus commented Jan 20, 2020

@ledm is right about the changing nature of volcello. Other affected variables are thkcello and masscello. However, note that if they are time-dependent they live in Omon (and Odec), not Ofx.

@valeriupredoi
Copy link
Contributor

started the work in #439 - nearly there but not ready yet, I'll let you guys know when done, most prob in a couple hours 🍺

@ledm
Copy link
Contributor Author

ledm commented Jan 22, 2020

Just noticed that zonal_statistics and meridional_statistics don't even have the fx_files argument at the moment (but they probably should!) I'm making an issue about it now.

@valeriupredoi
Copy link
Contributor

Another thing I need to fix in #439 is to retrieve and use only the fx files with the same mip as the master variable since we dont want to use the eg mip=Ofx ones, just realized that

@zklaus
Copy link

zklaus commented Jan 22, 2020

What do you mean by that? Whether we must use eg (Omon, volcello) or (Ofx, volcello) only depends on the model, after all.

@ledm
Copy link
Contributor Author

ledm commented Jan 22, 2020

I think he means this: #440.

@valeriupredoi
Copy link
Contributor

Yes but Ofx-volcello is not time-dependent and that can not he used for area or volume statistics where time ops are needed. @ledm are time-ops always needed for those preprocessors?

@valeriupredoi
Copy link
Contributor

I think he means this: #440.

No I mean 439 where fx vars are preprocessed - if they need to be time-processed then Ofx won't cut it

@ledm
Copy link
Contributor Author

ledm commented Jan 22, 2020

volcello can be in either Ofx or Omon/Oyr, depending on the model. If it's has a time component it will be in Omon, if it is time-independent it will be in Ofx.

@valeriupredoi
Copy link
Contributor

volcello can be in either Ofx or Omon/Oyr, depending on the model. If it's has a time component it will be in Omon, if it is time-independent it will be in Ofx.

yeah thx Sherlock 😁 But the question is: the code finds one of them first (if there is Ofx then it'll find Ofx and return it, if there is Omon/yr/dec it'll find that and return it) but if there's a time preprocessor needed to be applied on the fx data then if there's Ofx available then that preprocessor will fail miserably -> that's what I need to fix in #439

@zklaus
Copy link

zklaus commented Jan 22, 2020

This really all depends on the individual preprocessor, doesn't it?

Some examples:

  • extract_region makes sense on "fx" files, no problem
  • extract_time makes sense on time-dependent "fx" files, should pass time-independent "fx" vars unchanged
  • temporal statistics almost certainly makes no sense in either case, better to raise an exception
  • regridding (regular or irregular) also almost certainly makes no sense, better raise exception
  • spatial statistics also make no sense

Of course, all of this is for the case where "fx" files are passed as additional information for other variables. When they are actually treated as variables in their own right normal processing should be applied.

@mattiarighi mattiarighi added the preprocessor Related to the preprocessor label Jan 22, 2020
@bouweandela
Copy link
Member

bouweandela commented Feb 14, 2020

I experimented a bit with iris.coord.CellMeasure and it seems to work as expected (though from the iris repo it looks like they are in the process of completely redesigning it):

    cube = iris.load_cube("/home/bandela/esmvaltool_input/CMIP5/ta_Amon_CanESM2_historical_r1i1p1_185001-200512.nc")                                         
    print(cube.shape)
    area = iris.load_cube("/home/bandela/esmvaltool_input/CMIP5/areacella_fx_CanESM2_historical_r0i0p0.nc")
    print(area.shape)                                                       
    measure = iris.coords.CellMeasure(area.core_data(), standard_name=area.standard_name, units=area.units, measure='area')                          
    cube.add_cell_measure(measure, (2, 3))
    cube = extract_time(cube, 2000, 1, 1, 2001, 1, 1)
    cube = extract_region(cube, 0, 10, 0, 20)
    print(cube.shape)
    print(cube.cell_measure('cell_area').shape)
                                                                                                                     
    cube = iris.load_cube("/home/bandela/esmvaltool_input/CMIP6/thetao_Omon_GFDL-CM4_historical_r1i1p1f1_gn_201001-201412.nc")                                           
    print(cube.shape)
    volume = iris.load_cube("/home/bandela/esmvaltool_input/CMIP6/volcello_Omon_GFDL-CM4_historical_r1i1p1f1_gn_201001-201412.nc")                                                    
    print(volume.shape)
    coord = iris.coords.CellMeasure(volume.core_data(), standard_name=volume.standard_name, units=volume.units, measure='volume')   
    cube.add_cell_measure(coord, (0, 1, 2, 3))
    cube = extract_time(cube, 2010, 1, 1, 2011, 1, 1)
    cube = extract_region(cube, 0, 10, 0, 20)    
    print(cube.shape)
    print(cube.cell_measure('ocean_volume').shape)

prints

/home/bandela/conda/envs/esmvaltool/lib/python3.7/site-packages/iris/fileformats/cf.py:803: UserWarning: Missing CF-netCDF measure variable 'areacella', referenced by netCDF variable 'ta'
  warnings.warn(message % (variable_name, nc_var_name))
(1872, 22, 64, 128)
(64, 128)
(12, 22, 7, 4)
(7, 4)
/home/bandela/conda/envs/esmvaltool/lib/python3.7/site-packages/iris/fileformats/cf.py:803: UserWarning: Missing CF-netCDF measure variable 'areacello', referenced by netCDF variable 'thetao'
  warnings.warn(message % (variable_name, nc_var_name))
/home/bandela/conda/envs/esmvaltool/lib/python3.7/site-packages/iris/fileformats/cf.py:803: UserWarning: Missing CF-netCDF measure variable 'volcello', referenced by netCDF variable 'thetao'
  warnings.warn(message % (variable_name, nc_var_name))
(60, 35, 1080, 1440)
/home/bandela/conda/envs/esmvaltool/lib/python3.7/site-packages/iris/fileformats/cf.py:803: UserWarning: Missing CF-netCDF measure variable 'areacello', referenced by netCDF variable 'volcello'
  warnings.warn(message % (variable_name, nc_var_name))
(60, 35, 1080, 1440)
(12, 35, 1080, 1440)
(12, 35, 1080, 1440)

If iris cell measures work as expected, I would strongly prefer to add them to the cubes at load time (if needed for the preprocessor steps defined in the recipe) and take them through the preprocessing chain in that way.

@valeriupredoi
Copy link
Contributor

indeed! But let's get #439 first in since that has been tested in anger and production by @ledm while we are still experimenting with iris new features. Please don't delay that because I'd rather we have a working donkey than a theoretical race horse 🐴

@valeriupredoi
Copy link
Contributor

also note that the fx variables are not exactly the same as cell_measure since as seen in #439 a variety of time-varying or FIXED area/volume variables are needed by the scientist - probably what the iris guys have in mind for the overhaul?

@zklaus
Copy link

zklaus commented Feb 17, 2020

also note that the fx variables are not exactly the same as cell_measure

Indeed, this issue really is only about cell_measures. Somebody wanna fix the title?

Regarding iris support, @bouweandela is right that a lot is working already. In some ways, the situation is even better, because CMIP6 data often already contains the correct cell_measures attribute. The problem is that generally the cell measure variable is not stored in the same file but referred to with the global :external_variables attribute. Ideally, iris would be able to sort that out if you pass in a list and/or glob of files to load_cube containing the relevant cell measure among them.

@bouweandela bouweandela changed the title Preprocessor chain is not being applied to fx files Preprocessor chain is not being applied to cell measures Feb 20, 2020
@bouweandela
Copy link
Member

bouweandela commented Feb 20, 2020

Ideally, iris would be able to sort that out if you pass in a list and/or glob of files to load_cube containing the relevant cell measure among them.

Indeed, but until it does, I think we can just modify our load/fix_metadata functions so the cell measures get attached as shown in the example above #436 (comment).

But let's get #439 first in since that has been tested in anger and production by @ledm while we are still experimenting with iris new features. Please don't delay that because I'd rather we have a working donkey than a theoretical race horse horse

That pull request makes a lot of changes to the way tasks are constructed from the recipe and in general is very large. I'm afraid we will introduce all kinds of bugs if we merge it like this and I think it's also not needed. We could just pass the filenames of the fx files to the load preprocessor function and adjust it so things work as expected. Once iris fully supports loading cell measures from external files, we can remove our custom loading code.

@valeriupredoi
Copy link
Contributor

@bouweandela - I'll let @ledm explain to you in what many ways we need that PR - mate, it's really not as simple as load the fx files and change them, we need to preprocess them as per the user's needs. Do note that I have removed the changes on tasks apart the one that actually cleans up identical ancestors within a single task, that is needed imho. The PR is long but that's not a reason to cold refuse it 🍺

@valeriupredoi
Copy link
Contributor

also that PR contains a lot of very useful changes @schlunma and myself have done to the fx vars handling in general, not just for the purpose of preprocessing them for the area_statistics or volume_statistics. Note that @LisaBock and @ledm and a bunch of other IPCC authors (inc. myself) need this functionality yesterday

@valeriupredoi
Copy link
Contributor

we can review it together, over a beer, not a worry about doing it alone 🍺

@valeriupredoi
Copy link
Contributor

on top of it all - @ledm has already produced preliminary IPCC results so delaying the integration will affect the scientists -> and am done lobbying 😁

@zklaus
Copy link

zklaus commented Feb 21, 2020

If it's so important you will be eager to prepare the changes to facilitate easier, ie quicker review.
Some things you can do:

  • Rebase on the current master instead of merging, so it is clear in the history what change actually is from this pull request.
  • Have meaningful commits, ie every commit is exactly one meaningful unit of change
  • Have meaningful commit messages, ie a one-liner summary that describes the commit and, if necessary, an extended commit message that gives crucial details

This might require some history rewriting.

And most important of all: Have one PR do one thing. As you say yourself

that PR contains a lot of [...] changes [...] to the fx vars handling in general, not just for [...] preprocessing [...] for the area_statistics or volume_statistics.

In other words, it contains a lot of things that have nothing to do with the ostensible intent of the PR.

This leaves you in the position where the reviewer has to wonder at every hunk of the large diff what might be the intent of this change. Thanks to the convoluted history the unclear commit messages are also of little help. So here we have a PR with 77 commits, making substantial changes to 8 core files of the project with the only information about the intent being that it contains, but is very much not limited to, changes for area and volume statistics.

For me that certainly is reason enough to request breaking up the PR and replacing the current monster with a few more manageable and clearer ones.

@valeriupredoi
Copy link
Contributor

K-man @zklaus you are correct albeit if we were at a pub right now I'd refuse to buy you a pint 😁 🍺 But, here's the deal - how about I don't perform a rebase since I am not a big fan of rebasing (I love history!) but rather break #439 into two or three smaller PR's - how's that sound?

@zklaus
Copy link

zklaus commented Feb 21, 2020

That sounds great! And let's discuss the rebase over a beer. I'll buy. 😄 🍻

@valeriupredoi
Copy link
Contributor

OK guys I have now broken down PR #439 into four bits (and @zklaus was quick and and on the ball to approve the first bit so that one is now in master). So here they are, in the order in which we need to examine, change and eventually merge them:

I will now add relevant info into each PR description linking it to this comment 🍺

@bouweandela
Copy link
Member

it's really not as simple as load the fx files and change them, we need to preprocess them as per the user's needs.

Maybe I'm missing something here, but to me it looks like it is. If you have a careful look at the example I posted earlier (#436 (comment)), you will see that once you have attached the cell measures to the cube, iris automatically takes care of applying the preprocessor functions to the cube's cell measures as well as to the cube's data.

@zklaus
Copy link

zklaus commented Feb 25, 2020

I like this simplified approach! There are a couple of things to consider, though:

As touched upon before, not all fx files are cell measures. According to CF (that released version 1.8 just a few days ago), only area and volume are supported at the moment, so masks and fractions, for example, are not considered cell measures. Maybe that is something to take up with CF here?

The other thing is that some preprocessors will deal better with this than others. Where we use directly iris functionality (as in using cube.intersection for extract_region) we will be fine, but where we do things more manually (as in weighting_landsea_fraction, or possibly regridding) this might be more involved.

Still, we would save some and the stuff we would have left to do would need to be done anyway, so I am all for it.

@valeriupredoi
Copy link
Contributor

right, so @bouweandela and myself had a chat on the phone yesterday and concluded we should approach things this way rather than the proletarian way in #439 (and subsequent PR children) but this approach is still presenting us with caveats or thinhs to test nonetheless, so in the meantime, we should keep #439 and its children alive (that works well, tested by @ledm quite heavily) so that the people producing results for likes of IPCC and stuff can use its functionality. So your review on #511 is absolutely awesome and to be used, @zklaus - but we should start thinking about plugging in this approach there 🍺

@bouweandela
Copy link
Member

As touched upon before, not all fx files are cell measures. According to CF (that released version 1.8 just a few days ago), only area and volume are supported at the moment, so masks and fractions, for example, are not considered cell measures. Maybe that is something to take up with CF here?

Maybe these can be added as ancillary variables? http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch03s04.html
It looks like iris 3 will support those: https://github.com/SciTools/iris/blob/06e9e2b5eae6e903e486ce12162c242d25ee4e5d/lib/iris/coords.py#L684 and cell measures will be a special case of ancillary variables in iris 3: https://github.com/SciTools/iris/blob/master/lib/iris/coords.py#L777

@bouweandela
Copy link
Member

The other thing is that some preprocessors will deal better with this than others. Where we use directly iris functionality (as in using cube.intersection for extract_region) we will be fine, but where we do things more manually (as in weighting_landsea_fraction, or possibly regridding) this might be more involved.

Our design strategy is to rely as much on iris as possible. Therefore we may want to consider trying to get functionality that we need to implement outside of iris into iris. For example support 2D horizontal coordinates.

@mattiarighi mattiarighi added this to the v2.0.0 milestone Jun 5, 2020
@bouweandela bouweandela removed this from the v2.0.0 milestone Jun 8, 2020
@bouweandela
Copy link
Member

There will be much better support for ancillary variables in iris 3. Removed this from the v2.0.0 milestone, since the release of iris 3 is scheduled after the ESMValCore v2.0.0 release.

@sloosvel
Copy link
Contributor

sloosvel commented Feb 9, 2021

If iris cell measures work as expected, I would strongly prefer to add them to the cubes at load time (if needed for the preprocessor steps defined in the recipe) and take them through the preprocessing chain in that way.

I have worked a bit following this approach. Should I open a new pull request or push everything to #439 ? There are quite some changes though.

@bouweandela
Copy link
Member

Should I open a new pull request

Yes, please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment