Support for different dataset versions #3452

axel-lauer · 2023-12-01T11:30:59Z

As new versions of observational datasets become available, new science studies might want to use the latest ones. Older versions remain of interest, however, for reproducing published work or for comparing different versions of a dataset.

This issue is meant to collect ideas on how to add support for different versions of a dataset to the CMORizers (config file, downloader, formatter) and to come up with a strategy for the technical implementation.

@ESMValGroup/esmvaltool-coreteam please add your thoughts here.

lukruh · 2023-12-04T09:33:27Z

I thought about this as well while updating an existing cmorizer and agree that supporting different versions of datasets would be a nice enhancement for the ESMValTool. Because the version is already a parameter in filenames and recipes, diagnostics can already support multiple versions.

In #3430 I suggested adding versioning to downloaders and formatters via a --versions command-line argument. Only the settings that differ between versions would need to be kept, in an optional versions subsection in the config files. It would just be an additional feature to make it easier to download and format older datasets without changing the current behaviour.
I created a draft PR #3454 to show how this could be implemented.
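
To illustrate the idea of an optional versions subsection, here is a minimal sketch, not the code from #3454. It assumes the config has already been parsed into a dict and that the subsection is a mapping from version string to the settings that differ. The function name `resolve_version_config` and the example keys are hypothetical.

```python
from copy import deepcopy


def resolve_version_config(config, version=None):
    """Return the settings to use for one dataset version.

    `config` is an already-parsed CMOR config (a dict). Its optional
    "versions" subsection (assumed layout) maps a version string to the
    top-level settings that differ for that version.
    """
    # Start from the default settings, without the versions subsection.
    resolved = deepcopy({k: v for k, v in config.items() if k != "versions"})
    if version is None:
        # No version requested: behave exactly as without versioning.
        return resolved
    overrides = config.get("versions", {})
    if version not in overrides:
        raise ValueError(
            f"Unknown version {version!r}; known versions: {sorted(overrides)}"
        )
    # Shallow override: a version entry replaces whole top-level keys.
    resolved.update(deepcopy(overrides[version]))
    return resolved


# Hypothetical config content, for illustration only.
example_config = {
    "filename": "data_v4.2.nc",
    "attributes": {"version": "4.2", "tier": 2},
    "versions": {
        "4.1": {
            "filename": "data_v4.1.nc",
            "attributes": {"version": "4.1", "tier": 2},
        },
    },
}

default_settings = resolve_version_config(example_config)
older_settings = resolve_version_config(example_config, "4.1")
```

Because the override is shallow, a version entry has to repeat a whole top-level key such as `attributes` even if only one value inside it changes; a recursive merge would be an alternative.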

However, I'm not sure if we should keep all references and details in the documentation up to date for every version?

bouweandela · 2024-02-20T07:36:18Z

On ESGF, for example in the obs4MIPs project, multiple versions are supported by making the version part of the dataset name. For example, versions 1.0, 2.0, and 2.1 of the AIRS dataset are represented as:

AIRS-1-0
AIRS-2-0
AIRS-2-1

and the actual version used on ESGF is usually a date (maybe our CMORizers do not need to provide this, we could leave that to ESGF).

See here for an overview of the available dataset names and versions.

Since we are hoping that the CMORized data can be published on ESGF at some point, we may want to align how we name our datasets with this practice. It would be good to check with the obs4MIPs / CREATE-IP folks if they are happy with this way of versioning data? Any input on this @gleckler1 @glpotter?

I would highly recommend that we sanitize the dataset and source version names for unwanted characters though (as opposed to how it's done in some cases now). We should only allow upper and lower case characters A to Z, numbers 0 to 9, and the hyphen (-). In particular, the characters _, ., / should be avoided, as these are used as separators between facets.
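
As a minimal sketch of such a check, assuming the allowed set is exactly the one listed above (the helper name `is_valid_facet` is hypothetical and not part of ESMValTool):

```python
import re

# Letters A-Z and a-z, digits 0-9 and the hyphen; nothing else.
ALLOWED_FACET = re.compile(r"[A-Za-z0-9-]+")


def is_valid_facet(value):
    """Return True if `value` only uses the allowed characters."""
    return ALLOWED_FACET.fullmatch(value) is not None


checks = {name: is_valid_facet(name) for name in ("AIRS-2-1", "v2.2", "ESACCI_SST")}
```

Here only `AIRS-2-1` passes; `v2.2` and `ESACCI_SST` are rejected because of the `.` and the `_`.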

gleckler1 · 2024-02-20T23:24:07Z

@bouweandela thanks for checking in. It makes sense to me that you want to improve how dataset versions are defined for ESMValTool. Following CMIP, obs4MIPs constructs a source_id as below (please see draft ODS2.5 document due to be finalized at the end of this month). It would be great for the community if over time ESMValTool and obs4MIPs datasets could be further aligned.

source_id = <source_label>-<source_version_number>, with "-" substituted for certain forbidden characters (including ".", "_", "(", ")", "/", and " ")
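
A rough sketch of this construction rule, assuming the forbidden characters are only the six listed above (the rule says "including", so the actual list in the ODS document may be longer). The function name `make_source_id` is hypothetical.

```python
# Characters replaced by "-": assumed to be just the ones quoted above.
FORBIDDEN_CHARACTERS = "._()/ "


def make_source_id(source_label, source_version_number):
    """Join label and version with "-" and replace forbidden characters."""
    raw = f"{source_label}-{source_version_number}"
    return "".join("-" if char in FORBIDDEN_CHARACTERS else char for char in raw)


source_id = make_source_id("AIRS", "2.1")
```

With the label AIRS and version 2.1 this gives AIRS-2-1, matching the naming shown earlier in the thread.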

axel-lauer · 2024-03-07T13:35:19Z

After several discussions with @schlunma and @bouweandela, I propose the following strategy to add support for different dataset versions to our downloading and formatting scripts for observational datasets. If there are no strong opinions or vetoes, we could try to start implementing this soon. From my point of view, an important aim is to remain as backward compatible as possible.

Underscores in the facet "version" in the filenames of affected observational datasets (not many datasets actually) will be replaced with a hyphen (-) (see #3051, "OBS datasets use underscores in the version number"). Affected recipes will be updated accordingly.
Updating DataCommand to support a --version switch as proposed in #3430 ("Proper version support for CMORizers").
Updating read_cmor_config so each dataset version can have its own CMOR config-file with the version added to the filename as <dataset>_<version>.yml.
If no version is supplied at the command line, the config-file without a version (i.e. <dataset>.yml) will be used if present. If not, all version strings found in the filenames of the cmor config-files for the given dataset will be sorted and the "largest" string will be used to define the version to be processed.
Downloading and formatting scripts can optionally handle specific versions individually (e.g. by calling different functions). The version will be supplied as an additional argument to the downloading and formatting scripts, depending on which cmor config-file has been used.
All keys for a dataset in datasets.yml will be grouped under a version key, e.g.

ESACCI-SST:
  version_2.2:
    tier: 2
    source: http://surftemp.net/regridding/index.html
    last_access: 2020-12-04
    info: |
      Download the data from:
        regridding service (http://surftemp.net/regridding/index.html)
      Put all files under a single directory (no subdirectories with years).
  version_L4-GHRSST-SSTdepth-OSTIA-GLOB:
    tier: 2
    source: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/sst/data/
    last_access: 2019-02-01
    info: |
      Download the data from:
        lt/Analysis/L4/v01.1/
      Put all files under a single directory (no subdirectories with years).
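
The config-file selection rule from the list above (explicit version, else the unversioned file, else the "largest" version string) could be sketched as follows. This is not the actual read_cmor_config; the function name `select_cmor_config` and the flat directory layout are assumptions.

```python
from pathlib import Path


def select_cmor_config(config_dir, dataset, version=None):
    """Return (config path, version) following the proposed fallback rule.

    Assumes all config files of a dataset sit in one directory and are
    named <dataset>.yml or <dataset>_<version>.yml.
    """
    config_dir = Path(config_dir)
    if version is not None:
        # Explicit version requested on the command line.
        return config_dir / f"{dataset}_{version}.yml", version
    unversioned = config_dir / f"{dataset}.yml"
    if unversioned.is_file():
        # Backward-compatible case: config file without a version.
        return unversioned, None
    prefix = f"{dataset}_"
    versions = sorted(
        path.stem[len(prefix):] for path in config_dir.glob(f"{prefix}*.yml")
    )
    if not versions:
        raise FileNotFoundError(f"No CMOR config found for {dataset}")
    latest = versions[-1]  # "largest" string in plain lexicographic order
    return config_dir / f"{dataset}_{latest}.yml", latest
```

Note that a plain string sort ranks "10" before "9", so the "largest" string is not always the newest version; a numeric-aware sort key might be needed if such version strings occur.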

In order to keep track of potential changes to the raw input data or the formatting script, a "last access date" for the raw input data and a "last changes date" for the formatting script are added as new global attributes to the reformatted dataset.
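
As a small sketch of the two proposed global attributes: the attribute names `raw_data_last_access` and `formatter_last_changes` are placeholders, since no names have been agreed on in this thread, and the dates below are example values.

```python
from datetime import date


def add_provenance_attributes(attributes, last_access, last_changes):
    """Return a copy of the global attributes with the two dates added."""
    updated = dict(attributes)
    # Hypothetical attribute names; dates stored as ISO 8601 strings.
    updated["raw_data_last_access"] = last_access.isoformat()
    updated["formatter_last_changes"] = last_changes.isoformat()
    return updated


global_attributes = add_provenance_attributes(
    {"dataset_id": "ESACCI-SST", "tier": 2},
    last_access=date(2020, 12, 4),
    last_changes=date(2024, 3, 7),
)
```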

axel-lauer added enhancement observations labels Dec 1, 2023

katjaweigel self-assigned this Dec 1, 2023

axel-lauer mentioned this issue Dec 13, 2023

Update CMORizer CERES-EBAF to v4.2 #3360

Open

10 tasks

bettina-gier mentioned this issue Sep 9, 2024

Add CarboScope cmorizer #3745

Open

11 tasks
