Support for different dataset versions #3452

Open
axel-lauer opened this issue Dec 1, 2023 · 4 comments

@axel-lauer
Contributor

With new versions of observational datasets becoming available, new science studies might want to use the latest versions. However, older versions remain of interest as well, both to reproduce published work and to compare different versions of a dataset.

This issue is meant to collect ideas on how to add support for different versions of a dataset to the CMORizers (config file, downloader, formatter) and to come up with a strategy for technical implementation.

@ESMValGroup/esmvaltool-coreteam please add your thoughts here.

@lukruh
Contributor

lukruh commented Dec 4, 2023

I thought about this as well while updating an existing cmorizer and agree that supporting different versions of datasets would be a nice enhancement for the ESMValTool. Because filenames and recipes already carry a version parameter, diagnostics can already support multiple versions.

In #3430 I suggested adding versioning to downloaders and formatters via a --versions command-line argument. Only the settings that change between versions would need to be kept in an optional versions subsection in the config files. It would just be an additional feature to make it easier to download and format older datasets without changing the current behaviour.
I created a draft PR #3454 to show an approach of how this could be implemented.
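A minimal sketch of how such an optional versions subsection could be resolved, not the implementation from the draft PR. The config layout (top-level defaults plus a `versions` mapping of overrides), the function name `resolve_version_settings`, and the example URLs are all assumptions made for illustration.

```python
from copy import deepcopy


def resolve_version_settings(config, version=None):
    """Return the settings that apply to one dataset version.

    Hypothetical layout: top-level keys are the defaults, and an optional
    ``versions`` mapping holds only the keys that differ per version.
    """
    # Start from the defaults, leaving out the overrides section itself.
    settings = {
        key: deepcopy(value)
        for key, value in config.items()
        if key != "versions"
    }
    if version is None:
        # No version requested: current behaviour, defaults only.
        return settings
    overrides = config.get("versions", {})
    if version not in overrides:
        raise ValueError(
            f"Unknown version {version!r}; known versions: {sorted(overrides)}"
        )
    # Shallow update: keys given for this version replace the defaults.
    settings.update(deepcopy(overrides[version]))
    return settings


# Illustrative config; the values are placeholders, not real dataset settings.
example_config = {
    "tier": 2,
    "source": "https://example.org/current",
    "versions": {
        "1-0": {"source": "https://example.org/archive/1-0"},
    },
}
```

With this layout, calling the function without a version returns the defaults unchanged, so existing configs without a `versions` subsection would behave as before.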

However, I'm not sure if we should keep all references and details in the documentation up to date for every version?

@bouweandela
Member

bouweandela commented Feb 20, 2024

On ESGF, for example in the obs4MIPs project, multiple versions are supported by making the version part of the dataset name. For example, version 1.0, 2.0, and 2.1 of the AIRS dataset are represented as:

  • AIRS-1-0
  • AIRS-2-0
  • AIRS-2-1

and the actual version used on ESGF is usually a date (maybe our CMORizers do not need to provide this, we could leave that to ESGF).

See here for an overview of the available dataset names and versions.

Since we are hoping that the CMORized data can be published on ESGF at some point, we may want to align how we name our datasets with this practice. It would be good to check with the obs4MIPs / CREATE-IP folks whether they are happy with this way of versioning data. Any input on this @gleckler1 @glpotter?

I would highly recommend that we sanitize the dataset and source version names for unwanted characters though (as opposed to how it's done in some cases now). We should only allow upper and lower case letters A to Z, digits 0 to 9, and a hyphen (-). In particular, the characters _, ., / should be avoided, as these are used as separators between facets.
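The character rule above can be checked with a short regular expression. This is a sketch only; the function name `is_valid_facet` is hypothetical and not part of ESMValTool.

```python
import re

# Only ASCII letters, digits and the hyphen are allowed; at least one character.
_ALLOWED_FACET = re.compile(r"[A-Za-z0-9-]+")


def is_valid_facet(value):
    """Return True if ``value`` uses only the allowed characters."""
    return _ALLOWED_FACET.fullmatch(value) is not None
```

For example, `is_valid_facet("AIRS-2-1")` is true, while `is_valid_facet("AIRS_2.1")` is false because of the underscore and the period.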

@gleckler1

@bouweandela thanks for checking in. It makes sense to me that you want to improve how dataset versions are defined for ESMValTool. Following CMIP, obs4MIPs constructs a source_id as below (please see draft ODS2.5 document due to be finalized at the end of this month). It would be great for the community if over time ESMValTool and obs4MIPs datasets could be further aligned.

source_id = <source_label>-<source_version_number>, but substituting "-" for certain forbidden characters (including ".", "_", "(", ")", "/", and " ")
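As a rough illustration of the rule quoted above, the substitution could look like this in Python. The function name `make_source_id` is hypothetical, and the set of forbidden characters contains only the ones listed in this comment; the ODS document may define more.

```python
# Characters named in the comment above; assumed incomplete ("including ...").
FORBIDDEN_CHARACTERS = "._()/ "

# Translation table mapping every forbidden character to a hyphen.
_TO_HYPHEN = str.maketrans({char: "-" for char in FORBIDDEN_CHARACTERS})


def make_source_id(source_label, source_version_number):
    """Join label and version with "-" and replace forbidden characters."""
    raw = f"{source_label}-{source_version_number}"
    return raw.translate(_TO_HYPHEN)
```

With label `AIRS` and version `2.1`, this yields `AIRS-2-1`, which matches the ESGF naming shown earlier in the thread.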

@axel-lauer
Contributor Author

After several discussions with @schlunma and @bouweandela, I propose the following strategy to add support for different dataset versions to our downloading and formatting scripts for observational datasets. If there are no strong opinions or vetoes, we could try to start implementing this soon. From my point of view, an important aim is to remain as backward compatible as possible.

  • Underscores in the facet "version" in the filenames of affected observational datasets (not many datasets actually) will be replaced with a hyphen (-) (see OBS datasets use underscores in the version number #3051). Affected recipes will be updated accordingly.
  • Updating DataCommand to support a --version switch as proposed in Proper version support for CMORizers #3430.
  • Updating read_cmor_config so each dataset version can have its own CMOR config-file with the version added to the filename as <dataset>_<version>.yml.
  • If no version is supplied at the command line, the config-file without a version (i.e. <dataset>.yml) will be used if present. If not, all version strings added to the filenames of the cmor config-files for a given dataset will be sorted and the "largest" string will be used to define the version to be processed.
  • Downloading and formatting scripts can optionally handle specific versions individually (e.g. by calling different functions). The version will be supplied as an additional argument to the downloading and formatting scripts depending on which cmor config-file has been used.
  • All keys for a dataset in datasets.yml will be grouped under a version key, e.g.
ESACCI-SST:
  version_2.2:
    tier: 2
    source: http://surftemp.net/regridding/index.html
    last_access: 2020-12-04
    info: |
      Download the data from:
        regridding service (http://surftemp.net/regridding/index.html)
      Put all files under a single directory (no subdirectories with years).
  version_L4-GHRSST-SSTdepth-OSTIA-GLOB:
    tier: 2
    source: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/sst/data/
    last_access: 2019-02-01
    info: |
      Download the data from:
        lt/Analysis/L4/v01.1/
      Put all files under a single directory (no subdirectories with years).
  • In order to keep track of potential changes to the raw input data or the formatting script, a "last access date" for the raw input data and a "last changes date" for the formatting script are added as new global attributes to the reformatted dataset.
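The config-file selection described in the list above could be sketched as follows. This is not the actual `read_cmor_config` code; the function name `select_cmor_config`, its arguments and its return value are assumptions, and "largest" is read here as a plain string sort.

```python
from pathlib import Path


def select_cmor_config(config_dir, dataset, version=None):
    """Return ``(config_path, version)`` for a dataset.

    ``version`` in the result is None when the unversioned
    ``<dataset>.yml`` file is used.
    """
    config_dir = Path(config_dir)

    # 1. A version given on the command line selects <dataset>_<version>.yml.
    if version is not None:
        path = config_dir / f"{dataset}_{version}.yml"
        if not path.is_file():
            raise FileNotFoundError(path)
        return path, version

    # 2. Without a version, prefer the unversioned <dataset>.yml if present.
    default = config_dir / f"{dataset}.yml"
    if default.is_file():
        return default, None

    # 3. Otherwise collect the version strings from the filenames, sort them
    #    and take the "largest" one (plain string comparison).
    prefix = f"{dataset}_"
    versions = sorted(
        path.stem[len(prefix):] for path in config_dir.glob(f"{prefix}*.yml")
    )
    if not versions:
        raise FileNotFoundError(f"No CMOR config found for {dataset}")
    latest = versions[-1]
    return config_dir / f"{dataset}_{latest}.yml", latest
```

Two caveats with this sketch: a plain string sort places "10" before "9", so multi-digit version parts would need a different ordering, and the `<dataset>_` prefix match assumes that dataset names themselves contain no underscore.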
