Using wildcards in recipes #1138

Closed
bouweandela opened this issue May 19, 2021 · 6 comments · Fixed by #1609
Labels
enhancement New feature or request

Comments

@bouweandela
Member

The philosophy behind recipes is that they are as explicit as possible, so e.g. all datasets should be fully written out so it is clear what the recipe actually computes (and with which datasets it actually works).

However, this is not very user-friendly for people writing recipes and can become difficult to read too. See e.g. the requests for the use of wildcards '*' for specifying datasets in ESMValGroup/ESMValTool#671, #589, and #1082. A 'recipe filler' tool was also developed to deal with this.

Maybe we could compromise by allowing this kind of feature, but only accepting recipes without wildcards into the ESMValTool repository? For easy reproducibility, we could write a copy of the recipe, with all wildcards replaced by the actual datasets, to the output directory. Any opinions @ESMValGroup/esmvaltool-coreteam?
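To make the idea of replacing wildcards with actual datasets concrete, here is a minimal sketch in plain Python. It is not ESMValCore code: the helper `expand_wildcards` and the `AVAILABLE` inventory are hypothetical names invented for illustration, and the matching uses shell-style patterns from the standard library `fnmatch` module.

```python
from fnmatch import fnmatchcase


def expand_wildcards(entry, available):
    """Return every fully specified facet dict in `available` matching `entry`.

    Facet values in `entry` may contain shell-style wildcards such as '*'.
    (Hypothetical helper, not part of ESMValCore.)
    """
    return [
        dict(candidate)
        for candidate in available
        if all(
            key in candidate and fnmatchcase(str(candidate[key]), str(pattern))
            for key, pattern in entry.items()
        )
    ]


# Hypothetical inventory of the datasets that were found on disk.
AVAILABLE = [
    {"project": "CMIP5", "dataset": "CanESM2", "ensemble": "r1i1p1"},
    {"project": "CMIP5", "dataset": "CESM1-BGC", "ensemble": "r1i1p1"},
    {"project": "CMIP6", "dataset": "CanESM5", "ensemble": "r1i1p1f1"},
]

# A wildcard entry as a user might write it in a recipe.
explicit = expand_wildcards({"project": "CMIP5", "dataset": "*"}, AVAILABLE)

# These explicit entries are what a "filled" copy of the recipe would list.
for facets in explicit:
    print(facets)
```

The explicit entries could then be written back into a copy of the recipe in the output directory, so the run stays reproducible even though the submitted recipe used wildcards.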

Maybe this is something for discussion at the next monthly meeting ESMValGroup/ESMValTool#2173 or one of the upcoming @ESMValGroup/technical-lead-development-team or @ESMValGroup/scientific-lead-development-team meetings?

@thomascrocker
Contributor

I can't comment on the other use cases mentioned. But in the case of finding FX variables for preprocessor activities (#1082) the use of wildcards is necessary in order for ESMValTool to find the necessary files for the preprocessor. This is because for projects outside of CMIP5 (CORDEX and CMIP6 are the examples I'm working with) the location of fx files varies by institute. Some replicate them across all ensembles and experiments, whereas others do not, and in those cases the single ensemble and experiment the files are stored in is not consistent. Wildcards in the recipe, and then choosing the first file found if there are multiple, is one way around this.

I am hoping to publish a recipe soon that replicates (and expands on) some published research looking at differences between multimodel ensembles of GCM and RCM data. I ultimately am working with 30+ model datasets from each project, but don't want to have to write separate preprocessors in my recipe for each dataset to deal with the different locations of fx files. If we're having a strict "no wildcards" policy then I suppose my recipe would fall foul of this using the fix I developed in #1082.
I would argue this case is slightly different to matching multiple datasets using wildcards, since in my case the wildcard is used to facilitate searching, but only one FX file is used.
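The "search with a wildcard, then take the first file found" approach described above can be sketched as follows. This is only an illustration under stated assumptions: `first_fx_match` is a hypothetical helper, the paths in `PATHS` are made up and do not reflect any real directory layout, and "first" is taken to mean first in sorted order so the choice is deterministic.

```python
from fnmatch import fnmatchcase


def first_fx_match(pattern, paths):
    """Return the first path (in sorted order) matching `pattern`, or None.

    Hypothetical helper, not part of ESMValCore.
    """
    matches = sorted(p for p in paths if fnmatchcase(p, pattern))
    return matches[0] if matches else None


# Made-up file listing: the same fx variable is stored under different
# experiments and ensemble members, alongside an unrelated monthly file.
PATHS = [
    "CMIP6/InstA/ModelA/piControl/r1i1p1f1/fx/sftlf/sftlf_fx_ModelA_piControl_r1i1p1f1.nc",
    "CMIP6/InstA/ModelA/historical/r2i1p1f1/fx/sftlf/sftlf_fx_ModelA_historical_r2i1p1f1.nc",
    "CMIP6/InstA/ModelA/historical/r1i1p1f1/Amon/tas/tas_Amon_ModelA_historical_r1i1p1f1.nc",
]

# Wildcards stand in for the institute, experiment and ensemble member.
chosen = first_fx_match("CMIP6/*/ModelA/*/*/fx/sftlf/*.nc", PATHS)
print(chosen)
```

Sorting before picking keeps the result stable between runs, which matters if the chosen file is later written into a fully explicit copy of the recipe.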

Alternatively, if we wanted to discourage wildcards completely, then a different approach to solving #1082 would be needed. This might require coding specific rules for each project regarding how to search for fx files. However, this would conflict with the current functionality that allows users to explicitly choose specific fx files in the preprocessor. (As an aside, there is already a slight conflict here: if a user specifies an FX ensemble member other than r0i0p0 for CMIP5 in their recipe, ESMValCore will override it back to r0i0p0.)

@bouweandela
Member Author

If it is necessary to specify cell measures and ancillary variables per dataset, we should probably enable that with special keys in the dataset section (instead of in the preprocessor section), called cell_measures and ancillary_variables. These would specify the required fx variables and override specific facets from the main variable/dataset that they belong to. Would this be a possibility?

diagnostics:
  example_diagnostic:
    description: This is an example
    variables:
      fgco2:
        short_name: fgco2
        preprocessor: global_ocean
        project: CMIP5
        mip: Omon
        exp: esmHistorical
        ensemble: r1i1p1
        start_year: 1960
        end_year: 2005
        additional_datasets:
          - {dataset: CanESM2, start_year: 1960, end_year: 2005,
             cell_measures: [{short_name: areacello, ensemble: r0i0p0}],
             ancillary_variables: [{short_name: sftof, ensemble: r0i0p0}]
            }
          - {dataset: CESM1-BGC, start_year: 1960, end_year: 2005,
             cell_measures: [{short_name: areacello, ensemble: r1i2p3}],
             ancillary_variables: [{short_name: sftof}]
            }

We could then also allow automatic lookup using wildcards for convenience, but save the resulting recipe with the exact datasets used in the output directory for easy reproducibility.
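As a rough illustration of how the proposed cell_measures and ancillary_variables keys might resolve, the sketch below merges the facets of the main variable and dataset and then applies the per-entry overrides. It only mirrors the proposal in this comment, not an existing implementation: `resolve_fx_facets` is a hypothetical helper, and inheriting every remaining facet unchanged (including `mip`) is a simplifying assumption.

```python
SPECIAL_KEYS = ("cell_measures", "ancillary_variables")


def resolve_fx_facets(variable, dataset_entry, key):
    """Build the full facets for each fx variable listed under `key`.

    Each fx entry inherits the facets of the main variable/dataset and
    overrides only the facets it names explicitly.
    (Hypothetical helper, not part of ESMValCore.)
    """
    merged = {**variable, **dataset_entry}
    base = {k: v for k, v in merged.items() if k not in SPECIAL_KEYS}
    return [{**base, **override} for override in dataset_entry.get(key, [])]


# Facets of the main variable, taken from the recipe snippet above.
variable = {
    "short_name": "fgco2",
    "project": "CMIP5",
    "mip": "Omon",
    "exp": "esmHistorical",
    "ensemble": "r1i1p1",
}

# One entry of additional_datasets with per-dataset fx overrides.
dataset_entry = {
    "dataset": "CESM1-BGC",
    "cell_measures": [{"short_name": "areacello", "ensemble": "r1i2p3"}],
    "ancillary_variables": [{"short_name": "sftof"}],
}

cell = resolve_fx_facets(variable, dataset_entry, "cell_measures")
anc = resolve_fx_facets(variable, dataset_entry, "ancillary_variables")
print(cell)
print(anc)
```

Here areacello picks up the overridden ensemble r1i2p3, while sftof falls back to the ensemble of the main variable because its entry names no other facet.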

@thomascrocker
Contributor

Ahh. That looks like it might be a solution. I didn't realise that it was possible to specify explicit ancillary variables etc. as part of a dataset definition. I think keeping the wildcards functionality is still very useful (since otherwise the user needs to trawl the file structure to determine the exact criteria for the ancillary files for each dataset) but this provides a good way to also save explicitly the exact datasets to be used in a recipe.

@bouweandela
Member Author

> I didn't realise that it was possible to specify explicit ancillary variables etc. as part of a dataset definition.

It is not possible yet, but it could be implemented if someone has the time to do it. The above was just a proposal to do this in a way where we preserve both precise recipes and enable convenient recipe development.

@bouweandela
Member Author

@thomascrocker @ledm We finally managed to implement this feature and it will be available in the upcoming v2.8 release. My apologies that it took a while. Documentation is available here:

Please have a look and let us know if it works/does not work for you.

@thomascrocker
Contributor

@bouweandela thanks for the heads up! Moving into a new role recently means I've had a lot less usage of ESMValTool lately, but it's still very much relevant to a number of projects I may be involved in in the future, so it's really great to hear that this capability now exists. Cheers :)

2 participants