Using wildcards in recipes #1138

bouweandela · 2021-05-19T10:24:07Z

The philosophy behind recipes is that they are as explicit as possible, so e.g. all datasets should be fully written out so it is clear what the recipe actually computes (and with which datasets it actually works).

However, this is not very user-friendly for people writing recipes and can become difficult to read too. See e.g. the requests for the use of wildcards '*' for specifying datasets in ESMValGroup/ESMValTool#671, #589, and #1082. A 'recipe filler' tool was also developed to deal with this.

Maybe we could compromise by allowing this kind of feature, but only accepting recipes that do not contain wildcards in the ESMValTool repository? For easy reproducibility, we could write out a copy of the recipe with all wildcards replaced by actual datasets in the output directory? Any opinions @ESMValGroup/esmvaltool-coreteam?
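To make the compromise concrete, here is a minimal sketch of filling wildcards and producing the explicit entries that could be written to the output directory. It is an illustration under stated assumptions, not ESMValCore code: the `AVAILABLE` catalogue and the `expand_wildcards` helper are hypothetical, and a real implementation would query the data directories instead.

```python
from fnmatch import fnmatchcase

# Hypothetical catalogue of datasets available locally (an assumption).
AVAILABLE = [
    {"project": "CMIP6", "dataset": "CanESM5", "ensemble": "r1i1p1f1"},
    {"project": "CMIP6", "dataset": "CanESM5", "ensemble": "r2i1p1f1"},
    {"project": "CMIP6", "dataset": "UKESM1-0-LL", "ensemble": "r1i1p1f2"},
    {"project": "CMIP5", "dataset": "CanESM2", "ensemble": "r1i1p1"},
]


def expand_wildcards(template, available):
    """Return one explicit entry per available dataset matching the template."""
    expanded = []
    for candidate in available:
        if all(
            facet in candidate
            and fnmatchcase(str(candidate[facet]), str(pattern))
            for facet, pattern in template.items()
        ):
            # Copy the matched values so the result contains no wildcards.
            expanded.append({facet: candidate[facet] for facet in template})
    return expanded


# A recipe entry as a user might write it, with wildcards.
recipe_entry = {"project": "CMIP6", "dataset": "*", "ensemble": "r1i1p1f*"}

# The explicit entries that would go into the saved copy of the recipe.
explicit_entries = expand_wildcards(recipe_entry, AVAILABLE)
for entry in explicit_entries:
    print(entry)
```

The saved copy would then list only the explicit entries, so a rerun would not depend on what happens to be on disk at the time.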

Maybe this is something for discussion at the next monthly meeting ESMValGroup/ESMValTool#2173 or one of the upcoming @ESMValGroup/technical-lead-development-team or @ESMValGroup/scientific-lead-development-team meetings?

thomascrocker · 2021-05-19T16:48:02Z

I can't comment on the other use cases mentioned. But in the case of finding FX variables for preprocessor activities (#1082) the use of wildcards is necessary in order for ESMValTool to find the necessary files for the preprocessor. This is because for projects outside of CMIP5 (CORDEX and CMIP6 are the examples I'm working with) the location of fx files varies by institute. Some replicate them across all ensembles and experiments, whereas others do not, and in those cases the single ensemble and experiment the files are stored in is not consistent. Wildcards in the recipe, and then choosing the first file found if there are multiple, is one way around this.

I am hoping to publish a recipe soon that replicates (and expands on) some published research looking at differences between multimodel ensembles of GCM and RCM data. I ultimately am working with 30+ model datasets from each project, but don't want to have to write separate preprocessors in my recipe for each dataset to deal with the different locations of fx files. If we're having a strict "no wildcards" policy then I suppose my recipe would fall foul of this using the fix I developed in #1082.
I would argue this case is slightly different to matching multiple datasets using wildcards, since in my case the wildcard is used to facilitate searching, but only one FX file is used.
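As a rough sketch of the "search with wildcards, then take the first file found" idea described above: the file listing, the directory layout, and the `find_fx_file` helper are all hypothetical and only serve to illustrate the approach, not the fix in #1082 itself.

```python
from fnmatch import fnmatchcase

# Hypothetical file listing; a real search would walk the data directory.
FILES = [
    "CMIP6/CMIP/MOHC/UKESM1-0-LL/piControl/r1i1p1f2/fx/sftlf/gn/sftlf.nc",
    "CMIP6/CMIP/CCCma/CanESM5/historical/r2i1p1f1/fx/sftlf/gn/sftlf.nc",
    "CMIP6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/fx/sftlf/gn/sftlf.nc",
]


def find_fx_file(files, dataset, short_name, exp="*", ensemble="*"):
    """Return the first fx file matching the facets, or None if none match.

    `exp` and `ensemble` default to wildcards because the location of fx
    files is not consistent between institutes.
    """
    pattern = f"*/{dataset}/{exp}/{ensemble}/fx/{short_name}/*"
    # Sort so that "first file found" does not depend on listing order.
    matches = sorted(f for f in files if fnmatchcase(f, pattern))
    return matches[0] if matches else None


print(find_fx_file(FILES, "CanESM5", "sftlf"))
print(find_fx_file(FILES, "UKESM1-0-LL", "sftlf"))
```

Sorting the matches is a design choice in this sketch: it keeps the selected file stable between runs, which matters if the recipe is meant to be reproducible.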

Alternatively, if we wanted to discourage wildcards completely, then a different approach to solving #1082 would be needed. This might require coding specific rules for each project regarding how to search for fx files. However, this would conflict with the current functionality that allows users to explicitly choose specific fx files in the preprocessor. (As an aside, there is already a slight conflict here as if a user specifies an FX ensemble member other than r0i0p0 for CMIP5 in their recipe, ESMValCore will override it back to r0i0p0.)

bouweandela · 2021-05-20T15:05:30Z

If it is necessary to specify cell measures and ancillary variables per dataset, we should probably enable that with special keys in the dataset section (instead of in the preprocessor section), called cell_measures and ancillary_variables, which specify the required fx variables and override specific facets of the main variable/dataset that they belong to. Would this be a possibility?

diagnostics:
  example_diagnostic:
    description: This is an example
    variables:
      fgco2:
        short_name: fgco2
        preprocessor: global_ocean
        project: CMIP5
        mip: Omon
        exp: esmHistorical
        ensemble: r1i1p1
        start_year: 1960
        end_year: 2005
        additional_datasets:
          - {dataset: CanESM2, start_year: 1960, end_year: 2005,
             cell_measures: [{short_name: areacello, ensemble: r0i0p0}],
             ancillary_variables: [{short_name: sftof, ensemble: r0i0p0}]
            }
          - {dataset: CESM1-BGC, start_year: 1960, end_year: 2005,
             cell_measures: [{short_name: areacello, ensemble: r1i2p3}],
             ancillary_variables: [{short_name: sftof}]
            }

We could then also allow automatic lookup using wildcards for convenience, but save the resulting recipe with the exact datasets used in the output directory for easy reproducibility.
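The "override specific facets" part of the proposal could work roughly as follows. This is only a sketch of the idea: `resolve_fx_facets` is a hypothetical helper, and the proposal does not say how facets such as mip would be handled for fx variables, so here they are simply inherited unless overridden.

```python
SPECIAL_KEYS = ("cell_measures", "ancillary_variables")


def resolve_fx_facets(dataset, key):
    """Build the full facets for each fx variable listed under `key`.

    Each fx variable starts from the facets of the main dataset and then
    overrides only the facets it specifies itself.
    """
    # Facets of the main variable, minus the special keys themselves.
    base = {k: v for k, v in dataset.items() if k not in SPECIAL_KEYS}
    return [{**base, **override} for override in dataset.get(key, [])]


# The CanESM2 entry from the example recipe, merged with the variable facets.
canesm2 = {
    "dataset": "CanESM2",
    "project": "CMIP5",
    "mip": "Omon",
    "exp": "esmHistorical",
    "ensemble": "r1i1p1",
    "short_name": "fgco2",
    "cell_measures": [{"short_name": "areacello", "ensemble": "r0i0p0"}],
    "ancillary_variables": [{"short_name": "sftof", "ensemble": "r0i0p0"}],
}

cell_measures = resolve_fx_facets(canesm2, "cell_measures")
print(cell_measures)
```

A dataset that lists an fx variable without an ensemble, like sftof for CESM1-BGC in the example, would inherit the ensemble of the main variable under this scheme.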

thomascrocker · 2021-05-20T15:15:54Z

Ahh. That looks like it might be a solution. I didn't realise that it was possible to specify explicit ancillary variables etc. as part of a dataset definition. I think keeping the wildcards functionality is still very useful (since otherwise the user needs to trawl the file structure to determine the exact criteria for the ancillary files for each dataset) but this provides a good way to also save explicitly the exact datasets to be used in a recipe.

bouweandela · 2021-05-26T14:33:28Z

> I didn't realise that it was possible to specify explicit ancillary variables etc. as part of a dataset definition.

It is not possible yet, but it could be implemented if someone has the time to do it. The above was just a proposal to do this in a way where we preserve both precise recipes and enable convenient recipe development.

bouweandela · 2023-02-27T09:49:23Z

@thomascrocker @ledm We finally managed to implement this feature and it will be available in the upcoming v2.8 release. My apologies that it took a while. Documentation is available here:

Please have a look and let us know if it works/does not work for you.

thomascrocker · 2023-02-27T11:42:52Z

@bouweandela thanks for the heads up! Moving into a new role recently means I've had a lot less usage of ESMValTool lately, but it's still very much relevant to a number of projects I may be involved in in the future, so it's really great to hear that this capability now exists. Cheers :)

bouweandela mentioned this issue May 19, 2021

Add option to load selected or all years available in an experiment #1120

Closed

bouweandela added the enhancement New feature or request label May 19, 2021

zklaus mentioned this issue May 21, 2021

Allow wildcard searches when specifying fx variables in preprocessor #1082

Closed

10 tasks

bouweandela mentioned this issue Jun 17, 2021

Store CMIP model version in the copy of the recipe that is written to the output directory #1185

Closed

bouweandela mentioned this issue Sep 14, 2021

Optional flag that stops the run if any of data files not found #1282

Open

zklaus mentioned this issue Sep 16, 2021

Fx files finding: CMIP6: look in experiment=piControl as a last measure if fx data is not found #1318

Closed

7 tasks

bouweandela mentioned this issue Oct 28, 2021

Prepare release 2.4.0 ESMValGroup/ESMValTool#2354

Closed

bouweandela mentioned this issue May 31, 2022

Support wildcards in the recipe and improve support for ancillary variables and dataset versioning #1609

Merged

9 tasks

remi-kazeroni closed this as completed in #1609 Feb 24, 2023
