
Application of the broken recipe policy for releases #3065

Closed
remi-kazeroni opened this issue Mar 1, 2023 · 11 comments · Fixed by #3129

Comments

@remi-kazeroni
Contributor

Is your feature request related to a problem? Please describe.

We have had an official policy for broken recipes in our docs for a few months now: https://docs.esmvaltool.org/en/latest/community/broken_recipe_policy.html. It states:

> affected recipe will be listed under “broken recipes” in the Changelog, together with the version number of the last known release in which the recipe was still working. If a recipe continues to be broken for three releases of the ESMValTool (about one year) and no recipe maintainer could be found during this time, the affected recipe and diagnostics will be retired.

I'd like to discuss here how to apply our policy in practice. We have a few cases where recipes are known to be broken: issues have been opened with the next milestone, maintenance has not happened for various reasons, and the issues keep being bumped to the next milestone. This makes the job of the release manager harder, as that person runs these recipes without knowing they are broken, needs to figure out the problem again, and has to scroll through issues and comments to check whether it was already known.

For v2.8, I'm planning to add a first "broken recipes" section to the Changelog and list the broken recipes there. This might not be so readable in the long run if lists of broken recipes are split across the changelogs of different releases.

We discussed yesterday in a Tech Lead Team meeting that it could be useful to have a full list/table of broken recipes in our documentation. This would help users who try to run broken recipes and are not sure whether they did something wrong. It would also simplify the release process a bit and hopefully increase the willingness to maintain broken recipes. The table could be linked at the bottom of the recipe index: https://docs.esmvaltool.org/en/latest/recipes/index.html
The table could look like:

| Broken recipe | Last known working version | Problem | GitHub issue |
| --- | --- | --- | --- |
| recipe_perfmetrics_land_CMIP5.yml | v2.? | ESMValCore issue | #2594 |
| recipe_check_obs.yml | v2.4? | ESMValCore issue | ESMValGroup/ESMValCore#1388 |

Possible problems would be:

  • ESMValCore issue
  • Diagnostic issue
  • Missing data
  • Problem with dependency

If the table gets too long, we could think of having one table per type of problem.
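Since our documentation is built from reStructuredText, the table above could be written as a `list-table` directive. A rough sketch, reusing the two example entries from the table above (the exact caption and layout are just a suggestion):

```rst
.. list-table:: Broken recipes
   :header-rows: 1

   * - Broken recipe
     - Last known working version
     - Problem
     - GitHub issue
   * - ``recipe_perfmetrics_land_CMIP5.yml``
     - v2.?
     - ESMValCore issue
     - #2594
   * - ``recipe_check_obs.yml``
     - v2.4?
     - ESMValCore issue
     - ESMValGroup/ESMValCore#1388
```

A `list-table` keeps diffs small when rows are added or removed, which matters if the table is updated at every release.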

Also, it could be good practice to open an issue listing newly broken recipes once a release is done. By using milestones, we can check whether these recipes have received attention before reaching the limit of 3 releases. An example: #2943

Would you be able to help out?
Would you have the time and skills to implement the solution yourself?

Yes, I could add this table to the documentation and update the release instructions accordingly. But I'd be happy to get feedback from @ESMValGroup/esmvaltool-coreteam first 🍻

@bettina-gier
Contributor

I like this a lot, Remi! One note: perhaps add another column for "partially working" (or however we word it). Listing ESMValCore as the problem for both entries in the table makes it sound to most users like an unfixable issue, whereas reading the GitHub issues reveals that the problem in recipe_perfmetrics_land_CMIP5.yml lies in the handling of a single dataset, which users might be fine with excluding for their purposes. Similarly, it seems that only parts of recipe_check_obs.yml are affected by the issue of overwriting CMOR table info when deriving custom variables. To me, these cases are easier for users to deal with when the focus isn't on the broken parts, compared with a diagnostic not running at all and producing no output.

On a side note: if we change the file for the recipe index, I'd be strongly in favour of adding the recipe filenames after the titles in brackets. I know that at least I sometimes can't remember the titles, only the recipe filenames, which makes recipes harder to look up =D

@schlunma
Contributor

schlunma commented Mar 1, 2023

Thanks @remi-kazeroni, these are really good suggestions. This will make the life of the release manager so much easier, and also helps users in choosing which recipes to run.

Two suggestions:

  • I think it would be helpful to mark a recipe as "broken" in its documentation, maybe with a footnote in the recipe index or in the actual recipe documentation that links to this table of broken recipes?
  • Could we put recipes with missing data (on ESGF) which otherwise work perfectly fine and are maintained into a different table? I happen to know some of these recipes, but would not like them to be listed as "broken" 😅 We could mark them with another footnote.

@remi-kazeroni
Contributor Author

I have created a link that points to the overview table of successful/failed recipe runs for the last release. That could also be used to inform our users which recipes can't be run with the stable release of ESMValTool: https://esmvaltool.dkrz.de/shared/esmvaltool/stable_release/debug.html

@alistairsellar
Contributor

alistairsellar commented Mar 8, 2023

Thanks @remi-kazeroni, it sounds like a good approach to me. To confirm, I am OK with recipe_autoassess_landsurface_soilmoisture being marked as broken under this policy, and accept the badge of shame that comes with it :sad:. Let me know if I need to do anything associated with this, like updating the doc page.

I like the suggestion of adding relevant statement to doc page. In this case I could make it clear that the recipe only works on Jasmin, and that there is a plan to fix it for other machines.

@alistairsellar
Contributor

Sorry, some weird side effect of copy-and-paste there - was not intending to close the issue! Was trying to reply constructively!

@remi-kazeroni
Contributor Author

> Thanks @remi-kazeroni it sounds like a good approach to me. To confirm, I am OK with recipe_autoassess_landsurface_soilmoisture being marked as broken under this policy, and accept the badge of shame that comes with it. :sad:. Let me know if I need to do anything associated with this, like updating the doc page.
>
> I like the suggestion of adding relevant statement to doc page. In this case I could make it clear that the recipe only works on Jasmin, and that there is a plan to fix it for other machines.

Thanks a lot for your opinion on this @alistairsellar. I agree it would probably make sense to flag this recipe as broken with the explanation that it only runs on Jasmin at the moment. Hopefully, that could be improved in the future.

Would you know why the recipe_autoassess_landsurface_permafrost.yml runs fine (see test) even if the same climatology files seem to be required? Could it be that the climatology files are actually not used?

In both cases the recipes already mention that the files can only be accessed on Jasmin but the doc pages (here and here) could probably be clarified. It would be great if you could take a look.

@remi-kazeroni
Contributor Author

After the first round of recipe testing for the release v2.8 (see #3076), it seems that we have only two families of recipes that can't be run due to data that is not publicly available: recipe_autoassess_landsurface_*.yml and recipe_schlund20jgr_gpp*. What is important from a release point of view is that these two families of recipes do not get tested again and again if that implies extra work for each release manager: finding the missing data, contacting the maintainers of these recipes, and investigating the same issues during each round of recipe testing.

To handle these recipes, I would propose one of the two following approaches:

  1. Flag the corresponding recipes as broken. This could give some incentive to fix the recipes (e.g. finding a solution to make all necessary files publicly available). This seems to be the approach favoured for recipe_autoassess_landsurface_*.yml (see this comment in #3065).
  2. Remove these recipes from the list of recipes to be tested for releases. This can easily be done in the scripts that we've started to use to run all recipes at once (see #2883). But in such cases, it should be properly documented in the online docs and in the recipe which files are missing and who should be contacted with questions (e.g. the maintainer). It would then also be the responsibility of the maintainer to test their recipes with release candidates of ESMValCore, so that the release manager does not have to do it any more. This seems to be the approach favoured for recipe_schlund20jgr_gpp* (see this comment in #3076).
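Approach 2 could amount to a small skip list in the release run scripts. A minimal, purely illustrative POSIX shell sketch; the variable names and the hard-coded recipe list are hypothetical, not the actual contents of the batch scripts from #2883 (which would glob the real `esmvaltool/recipes/*.yml` files):

```shell
#!/bin/sh
# Hypothetical skip list for recipes whose input data is not public.
SKIP_RECIPES="recipe_autoassess_landsurface_soilmoisture.yml"

# Hard-coded stand-in for the full recipe list, so the sketch is
# self-contained; a real script would glob the recipes directory.
ALL_RECIPES="recipe_example.yml recipe_autoassess_landsurface_soilmoisture.yml"

TO_TEST=""
SKIPPED=""
for recipe in $ALL_RECIPES; do
    case " $SKIP_RECIPES " in
        # Recipe is on the skip list: record why instead of running it.
        *" $recipe "*) SKIPPED="$SKIPPED$recipe ";;
        *) TO_TEST="$TO_TEST$recipe ";;
    esac
done
echo "to test: $TO_TEST"
echo "skipped (missing data, contact maintainer): $SKIPPED"
```

Keeping the skip list next to a comment naming the maintainer and the missing files would satisfy the documentation requirement of approach 2 in one place.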

I think we need to be pragmatic here. The goal is to make the release process as simple as possible. The discussion on how to deal with "missing data recipes" should be handled on a case by case basis between the maintainers, the Science and the Tech Lead Teams.

@valeriupredoi
Contributor

> Thanks @remi-kazeroni it sounds like a good approach to me. To confirm, I am OK with recipe_autoassess_landsurface_soilmoisture being marked as broken under this policy, and accept the badge of shame that comes with it. :sad:. Let me know if I need to do anything associated with this, like updating the doc page.
>
> I like the suggestion of adding relevant statement to doc page. In this case I could make it clear that the recipe only works on Jasmin, and that there is a plan to fix it for other machines.

@alistairsellar @remi-kazeroni let's mark that as broken, OK. But for the love of the motor racing gods, let's get that data public before 2.9 - am looking at you, Met Office 😁

@alistairsellar
Contributor

> Would you know why the recipe_autoassess_landsurface_permafrost.yml runs fine (see test) even if the same climatology files seem to be required? Could it be that the climatology files are actually not used?

Yes, you are right. It turns out that it is not actually dependent on Jasmin - only the soilmoisture one is.

> In both cases the recipes already mention that the files can only be accessed on Jasmin but the doc pages (here and here) could probably be clarified. It would be great if you could take a look.

Yes, they do need clarifying: to emphasise the dependency in one case, and to remove the implication of a dependency in the other. I've opened #3103 for this. The soilmoisture documentation now mentions (in #3103) that the recipe is broken, as part of highlighting the Jasmin dependency. But for the application of the broken recipe policy, should the word "broken" also appear in bold or something at the top of the documentation page? Suggestions welcome (or feel free to edit the branch directly).

@bouweandela
Member

> I think we need to be pragmatic here. The goal is to make the release process as simple as possible. The discussion on how to deal with "missing data recipes" should be handled on a case by case basis between the maintainers, the Science and the Tech Lead Teams.

It would be good to talk about this again at the next workshop and see if we can update our policy to capture the ideas people have on this.

@bouweandela
Member

> add another column of partially working or I don't know how to word it, seeing esmvalcore as the problem for both entries in the table sounds to most users as an unfixable issue, when reading the github issues reveals the problem in recipe_perfmetrics_land_CMIP5.yml to be in the handling of a single dataset which users might be fine with excluding for their purposes.

If a recipe is failing because of a single dataset, we could just remove that dataset so the recipe works again.
