Instructions to support the cmorization of several versions of a dataset #2341

Closed
remi-kazeroni opened this issue Oct 11, 2021 · 7 comments · Fixed by #2730

Comments

@remi-kazeroni
Contributor

Is your feature request related to a problem? Please describe.
When a cmorizer script is updated to account for a newer version of a dataset, the cmorization of the previous version is usually no longer possible because the code and information are deleted (access to source files, cmorization script, options, ...). We discussed some time ago that we would prefer to extend cmorizers to support a newer data version rather than deleting the previous cmorizer. This would ensure reproducibility and ease comparison of results. We could support the cmorization of several versions of a dataset as long as the data are available. This may also help in some cases when updating recipes with the newest version of a dataset is not straightforward.

To this end, we need to update our instructions to explain how best to handle several versions of a cmorizer. My suggestion to support dataset_v1 and dataset_v2 would be:

  • if possible, use a single cmorizer script (cmorizer_obs_dataset.py) and a single config file (DATASET.yml). This should work fine if the structure of the data is not too different between versions. If the cmorization involves data fixing that is version dependent, one could apply different fixes based on the version key defined in the config file (version: v1 or version: v2). To cmorize one or the other version of the data, one could simply comment and uncomment the relevant parts of the config file.
  • if the newer version of the data differs significantly from the previous one, it could be easier to use one cmorizer script and one config file per data version (e.g. cmorizer_obs_dataset_v1.py and DATASET_V1.yml)
  • if the previous version is no longer needed (data not available any more, data issue fixed in the newer dataset, ...) the author of the PR is welcome to comment on that when opening an issue related to the cmorizer update.
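
As an illustration of the first option, here is a minimal sketch of a single script that picks its fixes from the version key. It is not taken from any existing cmorizer: the function names (`_fix_v1`, `_fix_v2`, `cmorize`), the fixes themselves, the dict standing in for the data, and the `attributes`/`version` nesting of the config are all assumptions made for the example.

```python
"""Sketch: one cmorizer script handling two versions of a dataset."""


def _fix_v1(record):
    """Hypothetical v1-only fix: convert temperature from degC to K."""
    fixed = dict(record)
    fixed["tas"] = record["tas"] + 273.15
    fixed["units"] = "K"
    return fixed


def _fix_v2(record):
    """Hypothetical v2-only fix: rename a variable that is already in K."""
    fixed = dict(record)
    fixed["tas"] = fixed.pop("temperature")
    return fixed


# Map each supported version string to its fix function.
VERSION_FIXES = {"v1": _fix_v1, "v2": _fix_v2}


def cmorize(record, cfg):
    """Apply the fixes that match the version given in the config.

    `cfg` stands in for the parsed config file (DATASET.yml); the
    `attributes` -> `version` layout is assumed for this sketch.
    """
    version = cfg["attributes"]["version"]
    if version not in VERSION_FIXES:
        supported = ", ".join(sorted(VERSION_FIXES))
        raise ValueError(
            f"Unsupported version '{version}', expected one of: {supported}"
        )
    return VERSION_FIXES[version](record)
```

Switching between versions then only requires changing the version entry in the config, for example `cmorize(data, {"attributes": {"version": "v2"}})`, and an unknown version fails early instead of silently applying the wrong fixes.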

Would you be able to help out?
Would you have the time and skills to implement the solution yourself? Yes, I'm planning to do so.

@zklaus

zklaus commented Oct 11, 2021

Great idea! I suggest getting #1657 in first, but after that, this will be a very nice addition!

@remi-kazeroni
Contributor Author

This is rather independent from #1657. The documentation I plan to add would be marginally affected by #1657. I agree finishing #1657 is high priority but it is too big to be merged shortly before a release and should rather be merged after a release to allow ample testing by the community as I wrote in my review.

@zklaus

zklaus commented Oct 12, 2021

Completely agree with everything you said. It's just a matter of where to invest the available time and #1657 must get in asap.

@schlunma
Contributor

schlunma commented Feb 9, 2022

@remi-kazeroni Release v2.5 is approaching quickly (the feature freeze will be on 2022-02-21). Since this issue is marked for v2.5, could you briefly comment if you further intend to include this in the new release and/or if you are facing any problems with it? Thanks!!

@remi-kazeroni
Contributor Author

Let's keep the milestone a bit. I'm not sure I'll have the time to take care of this documentation for the release. If not that will be in v2.6.

@remi-kazeroni remi-kazeroni modified the milestones: v2.5.0, v2.6.0 Mar 7, 2022
@sloosvel
Contributor

Any chance this can make it to 2.6? There isn't any PR open.

@remi-kazeroni remi-kazeroni modified the milestones: v2.6.0, v2.7.0 Jul 19, 2022
@remi-kazeroni
Contributor Author

> Any chance this can make it to 2.6? There isn't any PR open.

I wanted to work on that soon but I'd rather not delay the release for this.
