Generalize derivation of variables #667

schlunma · 2018-10-17T14:32:10Z

As suggested in #643, this PR simplifies variable derivation by moving the derived variables into a dedicated directory with one Python file per variable.

I moved all existing derived variables into the new recipe_preprocessor_derive_test.yml recipe and compared the preproc files produced by the old and new derivation schemes; all files are identical.

Closes #643 and closes #685.


schlunma · 2018-10-17T14:35:40Z

Fixing tests...

schlunma · 2018-10-18T08:27:18Z

The remaining Codacy issues cannot be resolved. Is there a way to ignore some lines? # noqa is not working.


schlunma · 2018-10-18T11:11:08Z

I've added the possibility to access the variable dictionary and config-user.yml for variable derivation (I will need that later).


bouweandela

I agree that it is a good idea to organize the variable derivation module better, but it should not add extra complication. This PR adds 500 lines of code, almost doubling the size of the module, without adding new functionality. Can you try to reduce the amount of boilerplate/duplicated code and documentation so we get back to a number of lines of code similar to what we had before?

Please put everything in the _derive module (i.e. make a directory _derive and put all files in there) instead of spreading derive functionality over two modules _derive and derived_variables.

Wouldn't it make more sense to organize the derived variables in the same way as the cmor tables, i.e. in a cmor_table/mip structure?

One file per derived variable seems very fine grained; some derivation functions really are just one cube minus another one, and that's just four lines of code.

For use in programs or Jupyter notebooks and for documentation purposes, it would be good if there was a way of asking the _derive module which variables it can derive, instead of trying some short name and hoping you're lucky.

esmvaltool/_recipe.py

bouweandela · 2018-10-19T07:37:56Z

esmvaltool/preprocessor/derived_variables/_derived_variable.py

+            self.variable = {}
+        else:
+            self.variable = dict(variable)
+        if 'short_name' not in self.variable:


This seems very unlikely to ever happen, because the only way to obtain a DerivedVariable subclass object is by knowing the 'short_name'.

yes, this is the case indeed (and we hit it not long ago) but then the damn netCDF file wouldn't load at all 😁

bouweandela · 2018-10-19T07:38:31Z

esmvaltool/preprocessor/derived_variables/clhmtisccp.py

+from ._derived_variable import DerivedVariable
+
+
+class clhmtisccp(DerivedVariable):  # noqa


Please use class names that conform with PEP8.

no, we need to name the class the same way as the variable - this is what is in fixes as well

I can rename the base class to DerivedVariableBase and every child class DerivedVariable; the name of variable is already the name of the file/module. Is this an option?

schlunma · 2018-10-19T09:57:22Z

This PR adds 500 lines of code, almost doubling the size of the module, without adding new functionality. Can you try to reduce the amount of boilerplate/duplicated code and documentation so we get back to a number of lines of code similar to what we had before?

I hardly added any code (I did not change the actual derive functions), so those 500 lines more are mostly docstrings.

Wouldn't it make more sense to organize the derived variables in the same way as the cmor tables, i.e. in a cmor_table/mip structure?

One file per derived variable seems very fine grained; some derivation functions really are just one cube minus another one, and that's just four lines of code.

But which variables should be combined in one file? All of them? I thought we wanted to separate them?

valeriupredoi · 2018-10-19T10:06:17Z

bunching the variables in mip files is a good idea but it will lead to confusion when introducing new variables and lead to large scripts back again - I thought the point of this PR was to shrink scripts that one needs to scroll down forever and ever

valeriupredoi · 2018-10-19T10:07:25Z

bunching them in mip directories - now that'd be a good idea 😁

mattiarighi · 2018-10-19T10:07:40Z

bunching the variables in mip files is a good idea but it will lead to confusion when introducing new variables and lead to large scripts back again - I thought the point of this PR was to shrink scripts that one needs to scroll down forever and ever

and there may be also cases of derived variables resulting from the combination of variables from different mips.


schlunma · 2018-10-19T16:43:37Z

I fixed all things except for the fx files, but that won't be a big problem. In addition, I added the function

ESMValTool/esmvaltool/preprocessor/_derive/__init__.py

Line 87 in 09ead44

def get_all_derived_variables():

to get all derived variables. However, Codacy still complains about Similar lines in 3 files and I don't really know how to fix this...it refers to the headers of three different derived variables.

Regarding the variable files...should I order them in mip directories? Right now, almost all derived variables are Amon, so this file would again be very large 😄

mattiarighi · 2018-10-20T10:15:49Z

Regarding the variable files...should I order them in mip directories?

I would not do that, as I said there may be cases of "inter-mips" derived variables.


bettina-gier · 2018-10-22T10:01:00Z

To cut down on code for all the variables that are only differences: would it be possible to write general derive_sum / derive_difference functions which only take the variables as input and then use the CMOR tables to supply the rest of the information?
The structure could then be a bit more similar to the original approach, using a dictionary in one file to list all the variables with derivations and their derive functions, or even putting this into a yml file similar to config_developer for ease of reading. Separate files would then only be required for derive functions that are not covered by basic sums/differences/multiplications. Those are usually the ones that take more space.
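The dictionary-based approach suggested here might look roughly like the sketch below. It is an illustration under stated assumptions, not code from the PR: plain floats stand in for iris cubes, and the table SIMPLE_DERIVATIONS, the function derive_simple and the table entries are all hypothetical.

```python
"""Table-driven derivation sketch; plain floats stand in for iris cubes."""
import operator

# Hypothetical table: derived short name -> (operation, input short names).
SIMPLE_DERIVATIONS = {
    'lwcre': (operator.sub, ('rlutcs', 'rlut')),
    'swcre': (operator.sub, ('rsutcs', 'rsut')),
}


def derive_simple(short_name, inputs):
    """Apply the tabulated binary operation to the two named inputs."""
    func, (first, second) = SIMPLE_DERIVATIONS[short_name]
    return func(inputs[first], inputs[second])


# Made-up numbers, for illustration only.
print(derive_simple('lwcre', {'rlutcs': 260.0, 'rlut': 235.0}))
```

With this layout, adding another simple difference means adding one table entry, while anything more involved would still need its own function.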

schlunma · 2018-10-25T12:29:35Z

Are further changes needed for this PR to get merged? I also added a derived variable nbp_grid as an example on how to use fx files for variable derivation.

The ~800 lines I added in this PR include ~350 lines of new features (get all derived variables, include fx_files, more complex example recipe, ...) and about 450 lines of docstrings, so it's not really possible to shrink the number of lines in this PR.

bouweandela · 2018-10-26T11:49:45Z

Are further changes needed for this PR to get merged?

Yes, I think you need to reflect on the design a bit to make it more readable. The PR contains a lot of duplicated documentation and code.

Docstrings are meant to be read by people, but if you copy and paste the same docstring over and over again, who is going to read it? I think the derive module only needs two functions with (numpy style) docstrings, one for the derive function and one for the get_required function and the rest can be removed because no-one will ever read that (unless it contains a good description of how/why the variable is derived of course). We may consider adding a way to describe how a variable is derived, but I'm not sure we want to go into that level of detail. The docstrings should then be presented on readthedocs, like this entry.
The get_required method of all derived variables is the same for almost all variables, it just uses different data. You could also write this method just once and make it return values from a dict that is different for each variable, e.g. on the DerivedVariableBase you could implement it as:

def get_required(self, frequency):
    return tuple((
        var['short_name'],
        var['field_prefix'] + frequency,
        var.get('fx_files'),
    ) for var in self._input)

and set the class property _input accordingly on the derived classes.
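To make the suggestion concrete, here is a runnable sketch that places the get_required method from above on a base class and sets _input on two subclasses. The subclass names, their inputs and the frequency string 'Ms' are assumptions chosen for illustration; they are not taken from the PR.

```python
"""Sketch of a shared get_required; subclass data below is hypothetical."""


class DerivedVariableBase:
    """Assumed base class holding the shared get_required logic."""

    _input = ()  # overridden by each derived variable

    def get_required(self, frequency):
        """Return (short_name, field, fx_files) for every input variable."""
        return tuple((
            var['short_name'],
            var['field_prefix'] + frequency,
            var.get('fx_files'),
        ) for var in self._input)


class Lwcre(DerivedVariableBase):
    """Only the data differs between derived variables."""

    _input = (
        {'short_name': 'rlut', 'field_prefix': 'T2'},
        {'short_name': 'rlutcs', 'field_prefix': 'T2'},
    )


class NbpGrid(DerivedVariableBase):
    """Assumed inputs, showing the optional fx_files entry."""

    _input = (
        {'short_name': 'nbp', 'field_prefix': 'T2', 'fx_files': ['sftlf']},
    )


print(Lwcre().get_required('Ms'))
print(NbpGrid().get_required('Ms'))
```

Because var.get('fx_files') returns None when the key is missing, variables without fx files need no extra entry.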

But which variables should be combined in one file?

Variables that use (almost) the same code for derivation, this avoids code duplication which leads to easier maintainability, testing and fewer bugs. For example:

You could put all variables that use a simple +, -, *, / operation on two input cubes in one file. Then write one function, that extracts the right cubes and applies the simple math operation as suggested by @bettina-gier, though I think that making a yaml configuration file may be a bit over the top.
It looks like all variables ending with isccp use more or less the same code, so put them in the same file and write one function with a few parameters that does the derivation.
Anything that doesn't look like any of the other derivations gets its own file.

esmvaltool/preprocessor/_derive/__init__.py

bouweandela · 2018-10-26T11:57:00Z

esmvaltool/preprocessor/_derive/__init__.py

+"""Automatically derive variables."""
+
+
+import importlib


Maybe you can just use the variable short name (with the first letter in uppercase of course) as the class name and import everything here with a normal import statement? That would be much easier to read than all this importlib usage.

I still use importlib to dynamically search the _derive directory for variables, otherwise I would need to add the import statements manually which is not really convenient in my opinion.
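For readers unfamiliar with the trade-off, dynamic discovery of this kind can be sketched with importlib and pkgutil from the standard library. The example builds a throwaway package in a temporary directory so that it runs on its own; the package name demo_derive, the module contents and the helper names are hypothetical and do not reproduce the code in this PR.

```python
"""Sketch of dynamic discovery with importlib; all names are hypothetical."""
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path

MODULE_SOURCE = (
    "class DerivedVariable:\n"
    "    short_name = {name!r}\n"
)


def build_demo_package(root, names):
    """Write a throwaway package with one module per variable."""
    package_dir = Path(root) / 'demo_derive'
    package_dir.mkdir()
    (package_dir / '__init__.py').write_text('')
    (package_dir / '_shared.py').write_text('# helpers, no variable here\n')
    for name in names:
        (package_dir / f'{name}.py').write_text(
            MODULE_SOURCE.format(name=name))
    return package_dir


def discover(package_name):
    """Import every public module in the package and collect its class."""
    package = importlib.import_module(package_name)
    found = {}
    for info in pkgutil.iter_modules(package.__path__):
        if info.name.startswith('_'):
            continue  # skip private helpers such as _shared
        module = importlib.import_module(f'{package_name}.{info.name}')
        found[info.name] = module.DerivedVariable
    return found


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(tmp, ['lwcre', 'toz'])
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        variables = discover('demo_derive')
    finally:
        sys.path.remove(tmp)

print(sorted(variables))
```

The benefit is that a new variable file is picked up without touching any import list; the cost, as noted in the review, is that the imports are no longer visible when reading the module.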

esmvaltool/preprocessor/_derive/clhmtisccp.py

valeriupredoi · 2018-10-26T15:42:37Z

ok guys, I have tested this PR with a recipe that involves var derivation and: it works and it produces identical results (previous derivation mechanism vs this derivation mechanism). So far so gut 😁 I do, however, have a question: do we need fx_files in variable derivations?

valeriupredoi · 2018-10-26T15:50:25Z

Are further changes needed for this PR to get merged?

Yes, I think you need to reflect on the design a bit to make it more readable. The PR contains a lot of duplicated documentation and code.

* Docstrings are meant to be read by people, but if you copy and paste the same docstring over and over again, who is going to read it? I think the derive module only needs two functions with (numpy style) docstrings, one for the `derive` function and one for the `get_required` function and the rest can be removed because no-one will ever read that (unless it contains a good description of how/why the variable is derived of course). We may consider adding a way to describe how a variable is derived, but I'm not sure we want to go into that level of detail. The docstrings should then be presented on readthedocs, like [this](https://esmvaltool.readthedocs.io/en/version2_development/codedoc2/esmvaltool.preprocessor.html#esmvaltool.preprocessor.derive) entry.

agreed!

* The `get_required` method of all derived variables is the same for almost all variables, it just uses different data. You could also write this method just once and make it return values from a dict that is different for each variable, e.g. on the `DerivedVariableBase` you could implement it as:

def get_required(self, frequency):
    return tuple((
        var['short_name'],
        var['field_prefix'] + frequency,
        var.get('fx_files'),
    ) for var in self._input)

and set the class property _input accordingly on the derived classes.

But which variables should be combined in one file?

Variables that use (almost) the same code for derivation, this avoids code duplication which leads to easier maintainability, testing and fewer bugs. For example:

* You could put all variables that use a simple `+, -, *, /` operation on two input cubes in one file. Then write one function, that extracts the right cubes and applies the simple math operation as suggested by @bettina-gier, though I think that making a yaml configuration file may be a bit over the top.

* It looks like all variables ending with `isccp` use more or less the same code, so put them in the same file and write one function with a few parameters that does the derivation.

* Anything that doesn't look like any of the other derivations gets its own file.

a big NO here, herr @bouweandela - we don't want to trade clarity and variable independence for less code duplication; this thing needs to be as clear as possible so that n00bs (scientists) can put in their own variable derivations. Don't trust that, because you are a specialist and know exactly what you are doing, the others (scientific devs) will as well; plus, there's the problem of variables from completely different mips that sit in the same file just because they share the same derivation mechanism - that will be confusing.

I personally like the PR and find it very useful, but as Bouwe says, maybe reduce the docstringing. Aye, brewski time now 🍺

schlunma · 2018-10-29T09:04:49Z

@bouweandela Thanks for your comments, I will address them now.

So far so gut 😁 I do, however, have a question: do we need fx_files in variable derivations?

@valeriupredoi Yes, I need sftlf (see the derived variable nbp_grid)! 😄

valeriupredoi

I iz happy with it, if @bouweandela has more comments - up to you guys to sort them out 😁

schlunma · 2018-10-29T14:19:38Z

@bouweandela

Docstrings are meant to be read by people, but if you copy and paste the same docstring over and over again, who is going to read it? I think the derive module only needs two functions with (numpy style) docstrings, one for the derive function and one for the get_required function and the rest can be removed because no-one will ever read that (unless it contains a good description of how/why the variable is derived of course). We may consider adding a way to describe how a variable is derived, but I'm not sure we want to go into that level of detail. The docstrings should then be presented on readthedocs, like this entry.

Done!

The get_required method of all derived variables is the same for almost all variables, it just uses different data. You could also write this method just once and make it return values from a dict that is different for each variable, e.g. on the DerivedVariableBase you could implement it as ... and set the class property _input accordingly on the derived classes.

Done, every derived class now contains the class member _required_variables, e.g.

ESMValTool/esmvaltool/preprocessor/_derive/lwcre.py

Lines 13 to 14 in 73c1d31

    
    _required_variables = {'vars': [('rlut', 'T2{frequency}s'),
                                    ('rlutcs', 'T2{frequency}s')]}

But which variables should be combined in one file?

variables that use (almost) the same code for derivation, this avoids code duplication which leads to easier maintainability, testing and fewer bugs.

Like @valeriupredoi I also think that for the sake of clarity we should put every derived variable in its own file. In my opinion it's not really useful to write a separate simple math function which is called for many variables if the variables are not connected in any way (in addition, I don't think this would reduce the amount of code since it's always one line...). However, for more complex functions shared among different (similar) derived variables I created the file _shared.py: For example, now all the *isccp variable code is located in there:

ESMValTool/esmvaltool/preprocessor/_derive/_shared.py

Lines 1 to 21 in 73c1d31

    """Auxiliary derivation functions used by multiple variables."""

    import iris


    def cloud_area_fraction(cubes, tau_constraint, plev_constraint):
        """Calculate cloud area fraction for different parameters."""
        clisccp_cube = cubes.extract_strict(
            iris.Constraint(name='isccp_cloud_area_fraction'))
        new_cube = clisccp_cube
        new_cube = new_cube.extract(tau_constraint & plev_constraint)
        coord_names = [coord.standard_name for coord in new_cube.coords()
                       if len(coord.points) > 1]
        if 'atmosphere_optical_thickness_due_to_cloud' in coord_names:
            new_cube = new_cube.collapsed(
                'atmosphere_optical_thickness_due_to_cloud', iris.analysis.SUM)
        if 'air_pressure' in coord_names:
            new_cube = new_cube.collapsed('air_pressure', iris.analysis.SUM)
        return new_cube

fx files can now be accessed directly via the cubes parameter, see e.g. here:

ESMValTool/esmvaltool/preprocessor/_derive/nbp_grid.py

Lines 32 to 33 in 73c1d31

    
    sftlf_cube = cubes.extract_strict(
        Constraint(name='land_area_fraction'))

I hope everything is fine now. I removed ~500 lines of code compared to the last commit, tested all derived variables again, and found no differences from the old version.

schlunma · 2018-10-29T15:07:04Z

The remaining Codacy issue is a known issue of pylint.

bouweandela

Thank you for cleaning up, it looks much more readable now.

schlunma added 9 commits October 16, 2018 16:33

Wrote new base class for derived variables

1a32861

Merge remote-tracking branch 'public/version2_development' into version2_generalize_derive

1ddac0b

Added derivation of toz

729565c

Added lwcre derivation

4ac098a

Added remaining variables

ad30a0d

Merge remote-tracking branch 'public/version2_development_getInstitutes' into version2_generalize_derive

1d32c14

Fixed bug and adapted recipe

83543d8

Fixed bug in derivation of toz and added general derive test recipe

ea2b72a

Made DerivedVariable file private

e5503f6

schlunma added the enhancement label Oct 17, 2018

schlunma self-assigned this Oct 17, 2018

schlunma requested review from mattiarighi and valeriupredoi October 17, 2018 14:32

mattiarighi requested a review from bouweandela October 17, 2018 14:50

Fixed CI and Codacy issues

b0da3f3

Added variable dictionary to DerivedVariable class (e.g. to get access to fx files)

a54c5b1

mattiarighi approved these changes Oct 18, 2018

Added function to access config-user.yml (e.g. to access fx files for variable derivation)

41c1552

Merge remote-tracking branch 'public/version2_development' into version2_generalize_derive

b84a9fb

bouweandela requested changes Oct 19, 2018


schlunma added 3 commits October 19, 2018 15:00

Simplified call signature of get_required, fixed PEP8 issues and renamed _derive directory

cc36781

Fixed p_level_widths test

e5c323b

Updated docstrings

5a899e9

Moved functions get_all_derived_variables to _derive module

09ead44

Fixed PEP8 issue

81f3014

schlunma added 2 commits October 20, 2018 15:18

Added possibility to include fx files in variable derivation

d85bb5b

Merge remote-tracking branch 'public/version2_development' into version2_generalize_derive

10d0c5e

bettina-gier mentioned this pull request Oct 22, 2018

Move variables derivation functions to individual files #643

Closed

mattiarighi mentioned this pull request Oct 23, 2018

Monthly sums of precipitation #639

Closed

schlunma added 2 commits October 25, 2018 12:31

Fixed documentation

2cf77da

Fixed docstring

4938c88

bouweandela reviewed Oct 26, 2018

esmvaltool/preprocessor/_derive/__init__.py (outdated)

bouweandela reviewed Oct 26, 2018

esmvaltool/preprocessor/_derive/clhmtisccp.py (outdated)

valeriupredoi approved these changes Oct 29, 2018


Simplified whole derivation process and cleaned code

73c1d31

schlunma added 2 commits October 29, 2018 15:47

Fixed codacy issues

fe0215f

Fixed CI issues

e08d5b0

schlunma mentioned this pull request Oct 30, 2018

Variable derivation for multiple domains #685

Closed

bouweandela approved these changes Oct 30, 2018


mattiarighi merged commit 43598ee into version2_development Oct 30, 2018

mattiarighi deleted the version2_generalize_derive branch October 30, 2018 16:48

schlunma mentioned this pull request Nov 27, 2018

Partly fixes #725 #728

Closed
