Generalize derivation of variables #667

Merged
merged 25 commits into from
Oct 30, 2018

Conversation

schlunma
Contributor

@schlunma schlunma commented Oct 17, 2018

As suggested in #643, this PR simplifies variable derivation by moving the derivation functions into a dedicated directory with one Python file per variable.

I moved all existing derived variables into the new recipe_preprocessor_derive_test.yml recipe and compared the preproc files produced by the old and new derivation schemes; all files are identical.

Closes #643 and closes #685.

@schlunma
Contributor Author

Fixing tests...

@schlunma
Contributor Author

The remaining Codacy issues cannot be resolved. Is there a way to ignore some lines? # noqa is not working.

@schlunma
Contributor Author

I've added the possibility to access the variable dictionary and config-user.yml for variable derivation (I will need that later).

Member

@bouweandela bouweandela left a comment

I agree that it is a good idea to organize the variable derivation module better, but it should not add extra complication. This PR adds 500 lines of code, almost doubling the size of the module, without adding new functionality. Can you try to reduce the amount of boilerplate/duplicated code and documentation so we get back to a number of lines of code similar to what we had before?

Please put everything in the _derive module (i.e. make a directory _derive and put all files in there) instead of spreading derive functionality over two modules _derive and derived_variables.

Wouldn't it make more sense to organize the derived variables in the same way as the cmor tables, i.e. in a cmor_table/mip structure?

One file per derived variable seems very fine grained; some derivation functions really are just one cube minus another one, and that's just four lines of code.

For use in programs or Jupyter notebooks and for documentation purposes, it would be good if there was a way of asking the _derive module which variables it can derive, instead of trying some short name and hoping you're lucky.

esmvaltool/_recipe.py (outdated)
        self.variable = {}
    else:
        self.variable = dict(variable)
    if 'short_name' not in self.variable:
Member

This seems very unlikely to ever happen, because the only way to obtain a DerivedVariable subclass object is by knowing the 'short_name'.

Contributor

yes, this is the case indeed (and we hit it not long ago) but then the damn netCDF file wouldn't load at all 😁

from ._derived_variable import DerivedVariable


class clhmtisccp(DerivedVariable): # noqa
Member

Please use class names that conform with PEP8.

Contributor

no, we need to name the class the same way as the variable - this is what is in fixes as well

Contributor Author

I can rename the base class to DerivedVariableBase and name every child class DerivedVariable; the name of the variable is already the name of the file/module. Is this an option?

@schlunma
Contributor Author

This PR adds 500 lines of code, almost doubling the size of the module, without adding new functionality. Can you try to reduce the amount of boilerplate/duplicated code and documentation so we get back to a number of lines of code similar to what we had before?

I hardly added any code (I did not change the actual derive functions), so those 500 extra lines are mostly docstrings.

Wouldn't it make more sense to organize the derived variables in the same way as the cmor tables, i.e. in a cmor_table/mip structure?

One file per derived variable seems very fine grained; some derivation functions really are just one cube minus another one, and that's just four lines of code.

But which variables should be combined in one file? All of them? I thought we wanted to separate them?

@valeriupredoi
Contributor

bunching the variables in mip files is a good idea but it will lead to confusion when introducing new variables and lead to large scripts back again - I thought the point of this PR was to shrink scripts that one needs to scroll down forever and ever

@valeriupredoi
Contributor

bunching them in mip directories - now that'd be a good idea 😁

@mattiarighi
Contributor

bunching the variables in mip files is a good idea but it will lead to confusion when introducing new variables and lead to large scripts back again - I thought the point of this PR was to shrink scripts that one needs to scroll down forever and ever

and there may be also cases of derived variables resulting from the combination of variables from different mips.

@schlunma
Contributor Author

I fixed everything except for the fx files, but that won't be a big problem. In addition, I added the function

def get_all_derived_variables():

to get all derived variables. However, Codacy still complains about Similar lines in 3 files and I don't really know how to fix this...it refers to the headers of three different derived variables.

Regarding the variable files...should I order them in mip directories? Right now, almost all derived variables are Amon, so this file would again be very large 😄

@mattiarighi
Contributor

Regarding the variable files...should I order them in mip directories?

I would not do that; as I said, there may be cases of "inter-mip" derived variables.

@bettina-gier
Contributor

To cut down on code for all the variables that are only differences: would it be possible to write general derive_sum / derive_difference functions that take only the variables as input and then use the CMOR tables to supply the rest of the information?
The structure could then be a bit more similar to the original approach, using a dictionary in one file to list all the variables with derivations and their derive functions, or even putting this into a yml file similar to config_developer for ease of reading. Separate files would then only be required for derive functions that are not covered by basic sums/differences/multiplications. Those are usually the ones that take more space.
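
A minimal sketch of the table-driven idea above, assuming the inputs support the usual arithmetic operators (iris cubes do). The names `SIMPLE_DERIVATIONS`, `get_required_simple` and `derive_simple` are hypothetical, and the `lwcre` and `swcre` entries are illustrative assumptions rather than code from this PR. Plain numbers stand in for cubes so that the snippet runs without iris.

```python
"""Table-driven derivation for variables that are simple binary operations."""
import operator

# Hypothetical table: derived short_name -> (operation, first input, second input).
# The entries are illustrative assumptions, not the PR's actual contents.
SIMPLE_DERIVATIONS = {
    'lwcre': (operator.sub, 'rlutcs', 'rlut'),
    'swcre': (operator.sub, 'rsutcs', 'rsut'),
}


def get_required_simple(short_name):
    """Return the input short names needed to derive `short_name`."""
    _, first, second = SIMPLE_DERIVATIONS[short_name]
    return (first, second)


def derive_simple(short_name, inputs):
    """Apply the registered operation to the two inputs.

    `inputs` maps input short names to objects that support arithmetic.
    """
    operation, first, second = SIMPLE_DERIVATIONS[short_name]
    return operation(inputs[first], inputs[second])
```

In such a design the metadata of the result (standard name, units) would still have to be filled in separately, for example from the CMOR tables as suggested in the comment.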

@schlunma
Contributor Author

Are further changes needed for this PR to get merged? I also added a derived variable nbp_grid as an example of how to use fx files for variable derivation.

The ~800 lines I added in this PR include ~350 lines of new features (get all derived variables, include fx_files, a more complex example recipe, ...) and about 450 lines of docstrings, so it's not really possible to shrink the number of lines in this PR.

@bouweandela
Member

Are further changes needed for this PR to get merged?

Yes, I think you need to reflect on the design a bit to make it more readable. The PR contains a lot of duplicated documentation and code.

  • Docstrings are meant to be read by people, but if you copy and paste the same docstring over and over again, who is going to read it? I think the derive module only needs two functions with (numpy style) docstrings, one for the derive function and one for the get_required function and the rest can be removed because no-one will ever read that (unless it contains a good description of how/why the variable is derived of course). We may consider adding a way to describe how a variable is derived, but I'm not sure we want to go into that level of detail. The docstrings should then be presented on readthedocs, like this entry.

  • The get_required method of all derived variables is the same for almost all variables, it just uses different data. You could also write this method just once and make it return values from a dict that is different for each variable, e.g. on the DerivedVariableBase you could implement it as:

def get_required(self, frequency):
    return tuple((
        var['short_name'],
        var['field_prefix'] + frequency,
        var.get('fx_files'),
    ) for var in self._input)

and set the class property _input accordingly on the derived classes.
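
As an illustration of this suggestion, here is a minimal sketch of a base class together with one derived class that only supplies data. `ExampleVariable` and its `_input` entries are hypothetical, and the frequency string `'Ms'` passed in the usage note is an assumption about the caller.

```python
class DerivedVariableBase:
    """Base class holding the shared `get_required` logic."""

    # Subclasses override this with one dict per input variable.
    _input = ()

    def get_required(self, frequency):
        """Return (short_name, field, fx_files) for each input variable."""
        return tuple((
            var['short_name'],
            var['field_prefix'] + frequency,
            var.get('fx_files'),
        ) for var in self._input)


class ExampleVariable(DerivedVariableBase):
    """Hypothetical derived variable that needs rlut and rlutcs."""

    _input = (
        {'short_name': 'rlut', 'field_prefix': 'T2'},
        {'short_name': 'rlutcs', 'field_prefix': 'T2'},
    )
```

Calling `ExampleVariable().get_required('Ms')` then yields one tuple per input, with `None` in the third position because no `fx_files` entry is set.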

But which variables should be combined in one file?

Variables that use (almost) the same code for derivation. This avoids code duplication, which leads to easier maintainability, easier testing and fewer bugs. For example:

  • You could put all variables that use a simple +, -, *, / operation on two input cubes in one file. Then write one function that extracts the right cubes and applies the simple math operation as suggested by @bettina-gier, though I think that making a yaml configuration file may be a bit over the top.
  • It looks like all variables ending with isccp use more or less the same code, so put them in the same file and write one function with a few parameters that does the derivation.
  • Anything that doesn't look like any of the other derivations gets its own file.

"""Automatically derive variables."""


import importlib
Member

Maybe you can just use the variable short name (with the first letter in uppercase of course) as the class name and import everything here with a normal import statement? That would be much easier to read than all this importlib usage.

Contributor Author

I still use importlib to dynamically search the _derive directory for variables, otherwise I would need to add the import statements manually which is not really convenient in my opinion.
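
The dynamic discovery described here could look roughly like the following sketch. It is not the PR's code: the function names `discover_modules` and `get_available_names` are hypothetical, and the rule that modules starting with an underscore are private helpers is an assumption.

```python
import importlib
import pkgutil


def discover_modules(package_name):
    """Import every public submodule of a package and return them by name."""
    package = importlib.import_module(package_name)
    modules = {}
    for module_info in pkgutil.iter_modules(package.__path__):
        # Skip private helper modules (assumed to start with an underscore).
        if module_info.name.startswith('_'):
            continue
        modules[module_info.name] = importlib.import_module(
            '{}.{}'.format(package_name, module_info.name))
    return modules


def get_available_names(package_name):
    """Return the sorted names of all public submodules."""
    return sorted(discover_modules(package_name))
```

With one module per derived variable, the keys of the returned dict double as the list of derivable short names, so a new variable only requires dropping a file into the directory. The trade-off raised in the review still applies: explicit import statements are easier to read and to follow with static analysis tools.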

@valeriupredoi
Contributor

ok guys, I have tested this PR with a recipe that involves var derivation and: it works and it produces identical results (previous derivation mechanism vs this derivation mechanism). So far so gut 😁 I do, however, have a question: do we need fx_files in variable derivations?

@valeriupredoi
Contributor

Are further changes needed for this PR to get merged?

Yes, I think you need to reflect on the design a bit to make it more readable. The PR contains a lot of duplicated documentation and code.

* Docstrings are meant to be read by people, but if you copy and paste the same docstring over and over again, who is going to read it? I think the derive module only needs two functions with (numpy style) docstrings, one for the `derive` function and one for the `get_required` function and the rest can be removed because no-one will ever read that (unless it contains a good description of how/why the variable is derived of course). We may consider adding a way to describe how a variable is derived, but I'm not sure we want to go into that level of detail. The docstrings should then be presented on readthedocs, like [this](https://esmvaltool.readthedocs.io/en/version2_development/codedoc2/esmvaltool.preprocessor.html#esmvaltool.preprocessor.derive) entry.

agreed!

* The `get_required` method of all derived variables is the same for almost all variables, it just uses different data. You could also write this method just once and make it return values from a dict that is different for each variable, e.g. on the `DerivedVariableBase` you could implement it as:
def get_required(self, frequency):
    return tuple((
        var['short_name'],
        var['field_prefix'] + frequency,
        var.get('fx_files'),
    ) for var in self._input)

and set the class property _input accordingly on the derived classes.

But which variables should be combined in one file?

Variables that use (almost) the same code for derivation, this avoids code duplication which leads to easier maintainability, testing and fewer bugs. For example:

* You could put all variables that use a simple `+, -, *, /` operation on two input cubes in one file. Then write one function, that extracts the right cubes and applies the simple math operation as suggested by @bettina-gier, though I think that making a yaml configuration file may be a bit over the top.

* It looks like all variables ending with `isccp` use more or less the same code, so put them in the same file and write one function with a few parameters that does the derivation.

* Anything that doesn't look like any of the other derivations gets its own file.

a big NO here, herr @bouweandela - we don't want to trade clarity and variable independence for less code duplication; this thing needs to be as clear as possible so that n00bs (scientists) can add their own variable derivations. Don't assume that because you are a specialist and know exactly what you are doing, the others (scientific devs) will as well; plus, there is the problem of variables from completely different mips sitting in the same file just because they share the same derivation mechanism - that will be confusing.

I personally like the PR and find it very useful, but as Bouwe says, maybe reduce the docstringing. Aye, brewski time now 🍺

@schlunma
Contributor Author

@bouweandela Thanks for your comments, I will address them now.

So far so gut 😁 I do, however, have a question: do we need fx_files in variable derivations?

@valeriupredoi Yes, I need sftlf (see the derived variable nbp_grid)! 😄

Contributor

@valeriupredoi valeriupredoi left a comment

I iz happy with it, if @bouweandela has more comments - up to you guys to sort them out 😁

@schlunma
Contributor Author

schlunma commented Oct 29, 2018

@bouweandela

Docstrings are meant to be read by people, but if you copy and paste the same docstring over and over again, who is going to read it? I think the derive module only needs two functions with (numpy style) docstrings, one for the derive function and one for the get_required function and the rest can be removed because no-one will ever read that (unless it contains a good description of how/why the variable is derived of course). We may consider adding a way to describe how a variable is derived, but I'm not sure we want to go into that level of detail. The docstrings should then be presented on readthedocs, like this entry.

Done!

The get_required method of all derived variables is the same for almost all variables, it just uses different data. You could also write this method just once and make it return values from a dict that is different for each variable, e.g. on the DerivedVariableBase you could implement it as ... and set the class property _input accordingly on the derived classes.

Done, every derived class now contains the class member _required_variables, e.g.

_required_variables = {'vars': [('rlut', 'T2{frequency}s'),
                                ('rlutcs', 'T2{frequency}s')]}
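
To show how such a class member could be consumed, here is a minimal sketch that fills the `{frequency}` placeholder with `str.format`. The base class body and `ExampleVariable` are assumptions made for illustration; the PR's actual `get_required` may differ, for instance in how it handles fx files.

```python
class DerivedVariableBase:
    """Base class that expands the per-variable `_required_variables`."""

    # Subclasses override this; the 'vars' key follows the example above.
    _required_variables = {}

    def get_required(self, frequency):
        """Return (short_name, field) pairs with the frequency filled in."""
        required = self._required_variables.get('vars', [])
        return [(short_name, field.format(frequency=frequency))
                for short_name, field in required]


class ExampleVariable(DerivedVariableBase):
    """Hypothetical derived variable using the data shown above."""

    _required_variables = {'vars': [('rlut', 'T2{frequency}s'),
                                    ('rlutcs', 'T2{frequency}s')]}
```

Here `ExampleVariable().get_required('M')` expands both templates to the same field string, so each derived class contributes only data and no logic.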

But which variables should be combined in one file?

variables that use (almost) the same code for derivation, this avoids code duplication which leads to easier maintainability, testing and fewer bugs.

Like @valeriupredoi I also think that for the sake of clarity we should put every derived variable in its own file. In my opinion it's not really useful to write a separate simple math function which is called for many variables if the variables are not connected in any way (in addition, I don't think this would reduce the amount of code since it's always one line...). However, for more complex functions shared among different (similar) derived variables I created the file _shared.py: For example, now all the *isccp variable code is located in there:

"""Auxiliary derivation functions used by multiple variables."""
import iris


def cloud_area_fraction(cubes, tau_constraint, plev_constraint):
    """Calculate cloud area fraction for different parameters."""
    clisccp_cube = cubes.extract_strict(
        iris.Constraint(name='isccp_cloud_area_fraction'))
    new_cube = clisccp_cube
    new_cube = new_cube.extract(tau_constraint & plev_constraint)
    coord_names = [coord.standard_name for coord in new_cube.coords()
                   if len(coord.points) > 1]
    if 'atmosphere_optical_thickness_due_to_cloud' in coord_names:
        new_cube = new_cube.collapsed(
            'atmosphere_optical_thickness_due_to_cloud', iris.analysis.SUM)
    if 'air_pressure' in coord_names:
        new_cube = new_cube.collapsed('air_pressure', iris.analysis.SUM)
    return new_cube

fx files can now simply be accessed via the cubes parameter, see e.g. here:

sftlf_cube = cubes.extract_strict(
    Constraint(name='land_area_fraction'))

I hope everything is fine now. I removed ~500 lines of code compared to the last commit, tested all derived variables again, and got no differences from the old version.

@schlunma
Contributor Author

The remaining Codacy issue is a known issue of pylint.

Member

@bouweandela bouweandela left a comment

Thank you for cleaning up, it looks much more readable now.

@mattiarighi mattiarighi merged commit 43598ee into version2_development Oct 30, 2018
@mattiarighi mattiarighi deleted the version2_generalize_derive branch October 30, 2018 16:48
@schlunma schlunma mentioned this pull request Nov 27, 2018
Successfully merging this pull request may close these issues.

Variable derivation for multiple domains
Move variables derivation functions to individual files
5 participants