Allow for a variable aliasing scheme (for use in model development) #1083

Closed
senesis opened this issue Apr 26, 2021 · 10 comments
Labels
enhancement New feature or request


@senesis
Contributor

senesis commented Apr 26, 2021

Hi, @jvegasbsc , @rswamina , @mattiarighi , @bouweandela , @bsolino

I am working with IPSL, in the context of IS-ENES3, for testing the feasibility of ESMValTool use in model development. I was impressed both by the clear design of the code and the extensive documentation. Congratulations !

For the goal above, one needs more flexibility in the data_finder, so that {short_name} can be replaced by a variable name left to the user's or configurer's choice. This is useful when, e.g., the model outputs are largely consistent with some CMIP project's table set, and the departures can be addressed by the fix_metadata and fix_data features.

This case is quite different from the 'variable_alt_names' scheme described in this other issue, which, if I understood correctly, is devoted to the case where the same physical variable has different names in different 'standard' projects (e.g. 'si' in CMIP == 'siconc' in CMIP6). The difference is that here we would like to avoid creating a table set specific to the model, and instead use an existing table set.

Digging into the code, I found that, in function _find_input_files, variable['short_name'] is changed before calling _find_input_dirs and _get_filenames_glob, in order to use an alternate variable name in file naming. So I tested setting it with a short function, which queries the config for a new project entry named 'aliases' (see code below).

It works, and allows further exploration of the overall goal.

Could I go forward that way, toward a PR?

And, by the way, where should I include commits that only improve esmvalcore docstrings and logged message text?

def _get_variable_alias(variable, short_name):
    """Return the alias for short_name in the variable's project.

    Return None if the project defines no alias for it.
    """
    cfg = get_project_config(variable['project'])
    aliases = cfg.get('aliases', {})
    return aliases.get(short_name)
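
For illustration, here is a minimal, self-contained sketch of how that lookup could behave. The `PROJECT_CONFIG` dict and the local `get_project_config` are stand-ins for the parsed config-developer.yml and the esmvalcore helper; the `aliases` entry and the project name `IPSLCM` are assumptions made for this example, not existing esmvalcore features.

```python
# Stand-in for the parsed config-developer.yml (hypothetical content).
PROJECT_CONFIG = {
    "IPSLCM": {"aliases": {"tas": "t2m"}},  # hypothetical project with aliases
    "CMIP6": {},                            # no 'aliases' entry at all
}


def get_project_config(project):
    """Stand-in for the esmvalcore project config lookup."""
    return PROJECT_CONFIG[project]


def get_variable_alias(variable, short_name):
    """Return the project-specific alias for short_name, or None."""
    aliases = get_project_config(variable["project"]).get("aliases", {})
    return aliases.get(short_name)


alias_ipsl = get_variable_alias({"project": "IPSLCM"}, "tas")
alias_cmip = get_variable_alias({"project": "CMIP6"}, "tas")
print(alias_ipsl, alias_cmip)
```

Projects without an `aliases` entry fall through to `None`, so existing projects would keep their current behaviour.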

@senesis senesis added the enhancement New feature or request label Apr 26, 2021
@valeriupredoi
Contributor

valeriupredoi commented Apr 26, 2021

@senesis cheers for the issue! I would encourage you to provide us with a concrete use case, since this issue has several facets: a variable may or may not be mappable/aliasable to a CMOR variable. I see three cases:

  • A variable produced by some OBS model is the exact equivalent of pr, but it is called precipitation-something just because the model devs decided to name it that way rather than adhere to CMOR standards. Aliasing will work perfectly in this case.
  • A pr-like variable needs a constant fudge factor. In this case a custom derived variable needs to be constructed.
  • Things really go bazooka and you have (as in the UM case) a STASH code representing a variable that has no real CMOR equivalent to map it to, and derivation will not work well either. In this case it is just a very custom variable that will have a custom table.

@senesis
Contributor Author

senesis commented Apr 26, 2021

OK. The use case, for now, is: model IPSL-CM6 has two native output formats, both NetCDF-based. Format 'TS' consists of single-variable files, named e.g.:

CM61-LR-hist-03.1950_18500101_18591231_1M_t2m.nc

which includes a NetCDF variable 't2m', which is actually the exact equivalent of the 'tas' variable.

So, it matches your first case:

a variable could be mapped/aliased to a CMOR variable

There may be some issues with CMOR conformance, but I intend to address them either in a model-specific fix (e.g. here, adding a height2m scalar coordinate) or through built-in CMOR fixes (e.g. renaming the variable).

I will certainly also have some variable derivation issues to address (and I am not sure how best to reconstruct a CMOR standard variable by combining non-standard variables), but that is another story.

@sloosvel
Contributor

If you work with the config-developer projects to set the proper tags and set the project for the data to native6, you should be able to find the files without having to modify the code. This is similar to what is done to read ERA5 or ORAS4 from the original files.

Or you can even create a new project in the config-developer.yml (which is what we do to work with our model data for monitoring purposes).

@senesis
Contributor Author

senesis commented Apr 28, 2021

Thanks. I was able to create a new project quite successfully for 'my' model output.

But my goal goes beyond finding data by indicating in the recipe a project-specific variable attribute such as 'era5_name'; I want to be able to apply all existing recipes (which together form the actual treasure of ESMValTool) to a mix of CMIP data and data whose filenames are formed using non-standard variable names.

Put differently: for any recipe requesting 'tas', the _data_finder should, for a dataset of my newly defined project, translate 'tas' to 't2m' in order to find the file named "CM61-LR-hist-03.1950_18500101_18591231_1M_t2m.nc".

I see no other way than the code change described above.
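
As a rough illustration of that translation step, the sketch below builds a filename glob from the requested name and matches it against a list of file names. The `ALIASES` mapping, the `INPUT_FILE_TEMPLATE` pattern and the second file name (with `precip`) are hypothetical, chosen only to fit the example file above; real file discovery in esmvalcore goes through the DRS templates in config-developer.yml.

```python
import fnmatch

# Hypothetical per-project alias table: CMOR short_name -> name used in files.
ALIASES = {"tas": "t2m"}

# Hypothetical filename pattern for the 'TS' output format.
INPUT_FILE_TEMPLATE = "*_{short_name}.nc"


def filename_glob(short_name, aliases=ALIASES):
    """Build the glob for short_name, using its alias if one is defined."""
    name_in_file = aliases.get(short_name, short_name)
    return INPUT_FILE_TEMPLATE.format(short_name=name_in_file)


available = [
    "CM61-LR-hist-03.1950_18500101_18591231_1M_t2m.nc",
    "CM61-LR-hist-03.1950_18500101_18591231_1M_precip.nc",  # made-up name
]

pattern = filename_glob("tas")
matches = fnmatch.filter(available, pattern)
print(pattern, matches)
```

The recipe keeps requesting 'tas'; only the name used to build the glob changes.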

@senesis
Contributor Author

senesis commented Apr 28, 2021

And _find_input_files would just have to be slightly changed :

def _find_input_files(variable, rootpath, drs):
    short_name = variable['short_name']
    variable['short_name'] = variable['original_short_name']
    # Use the project-specific alias, if any
    alias = _get_variable_alias(variable, short_name)
    if alias is not None:
        variable['short_name'] = alias

    input_dirs = _find_input_dirs(variable, rootpath, drs)
    filenames_glob = _get_filenames_glob(variable, drs)
    files = find_files(input_dirs, filenames_glob)
    variable['short_name'] = short_name
    return (files, input_dirs, filenames_glob)

@bouweandela
Member

The idea we had on how to achieve this is described very shortly in #309, i.e. have a yaml file (path configurable per project in config-developer.yml) containing a mapping from CMIP6 variables to extra key-value pairs. Those extra key-value pairs should then be added to the dict containing the variable-dataset description, for example right after this line: esmvalcore/_recipe.py#L1099.

These extra keys could then be used to find the data using the directory structure defined in config-developer.yml without any modifications needed to the functions for finding input data.
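
A minimal sketch of that idea, under stated assumptions: `EXTRA_FACETS` stands for the already-parsed yaml mapping, `ipsl_varname` is a made-up extra key, and `INPUT_FILE` is a made-up template in the brace style used in config-developer.yml. Plain `str.format` is used here as a simplification of the actual template handling.

```python
# Hypothetical content of the per-project yaml mapping, already parsed:
# CMIP6 short_name -> extra key-value pairs.
EXTRA_FACETS = {
    "tas": {"ipsl_varname": "t2m"},
    "pr": {"ipsl_varname": "precip"},  # made-up native name
}

# Hypothetical input_file template referring to the extra key.
INPUT_FILE = "*_{freq}_{ipsl_varname}.nc"


def add_extra_facets(variable, extra_facets=EXTRA_FACETS):
    """Return a copy of variable with the extra key-value pairs added.

    Keys already present in the variable (e.g. set in the recipe) are kept.
    """
    merged = dict(variable)
    for key, value in extra_facets.get(variable["short_name"], {}).items():
        merged.setdefault(key, value)
    return merged


variable = add_extra_facets(
    {"short_name": "tas", "project": "IPSLCM", "freq": "1M"}
)
file_glob = INPUT_FILE.format(**variable)
print(variable["short_name"], file_glob)
```

Here `short_name` stays 'tas' throughout, so CMOR checks and fixes would still see the standard name, while the file lookup uses the extra key.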

@bouweandela
Member

On a related note: having a separate 'project' per supported model would probably be OK as these are not so many, but we also had the idea of making the DRS definition in config-developer.yml a bit more flexible #970 (comment), because we would not like to have a separate project for every supported observational/reanalysis dataset as that would just be too many.

@senesis
Contributor Author

senesis commented May 6, 2021

The idea we had on how to achieve this is described very shortly in #309, i.e. have a yaml file (path configurable per project in config-developer.yml) containing a mapping from CMIP6 variables to extra key-value pairs.

That sounds great.

Can we safely assume that such keys can be either project-specific keys (such as 'label_for_variable_in_filename') or standard keys (such as 'dataset', which would drive the choice of a fix module)?

Also: there is no description there of the specific 'recipe' mechanics that would allow applying Python code for deriving variables. And I do not see how such code would be provided with the necessary input variables, while the 'standard' derived variable scheme allows nicely for that.

@senesis
Contributor Author

senesis commented May 6, 2021

On a related note: having a separate 'project' per supported model would probably be OK as these are not so many, but we also had the idea of making the DRS definition in config-developer.yml a bit more flexible #970 (comment), because we would not like to have a separate project for every supported observational/reanalysis dataset as that would just be too many.

The #970 (comment) introduces the concept of 'center', which is new for ESMValTool. And I think it is worth thinking twice about what it would mean or drive.

It should then drive the choice of:

  • the DRS
  • the cmor fixes, maybe at the level of choosing the fixes directory (instead of using the project for that)
  • and ?

@bsolino
Contributor

bsolino commented May 10, 2021

The #970 (comment) introduces the concept of 'center' , which is new for ESMValTool. And I think it is worth thinking twice at what it would mean or drive.

Sorry for the confusion, I am not used to the ESMValTool vocabulary and it seems I chose the word poorly.

What I was calling "center" (for "datacenter") is not a new concept, as far as I know. I'm not sure what they are usually called, but it appears here under the name "key machines": https://docs.esmvaltool.org/projects/esmvalcore/en/latest/quickstart/configure.html#developer-configuration-file

I will edit the comment to avoid further confusion.

@senesis senesis closed this as completed Jun 4, 2021

6 participants