Recipe testing and output comparison for release 2.9.0 #3239

Closed
bouweandela opened this issue Jun 22, 2023 · 41 comments

@bouweandela
Member

bouweandela commented Jun 22, 2023

This issue documents recipe testing for the v2.9.0 release

Installation and environment

$ git clone git@github.com:ESMValGroup/ESMValTool
$ cd ~/ESMValTool
$ git checkout v2.9.x
$ mamba env create -n esmvaltool-v2.9.x-2023-06-20 -f environment.yml
$ conda activate esmvaltool-v2.9.x-2023-06-20
$ pip install .
$ cd ..
$ git clone git@github.com:scitools/iris
$ cd iris
$ git checkout v3.6.x
$ pip install .
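
As a quick sanity check (a suggestion, not part of the original procedure), the installed versions can be verified before running anything:

$ esmvaltool version
$ python -c "import iris; print(iris.__version__)"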

Environment file

environment.txt

Config user file

All options at their default values for Levante, except search_esgf: when_missing.
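
For reference, a minimal sketch of the relevant part of the user configuration file (only the single non-default option is shown; everything else keeps the Levante defaults):

# config-user.yml (excerpt)
search_esgf: when_missing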

@bouweandela
Member Author

bouweandela commented Jun 22, 2023

Overview of the results

Webpage: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0rc1/debug.html

Successful: 135 out of 153

Recipes that failed with DiagnosticError

Recipes that failed because of missing data

Recipes that failed because they used too much memory (probably due to SciTools/iris#5338)

  • recipe_climate_change_hotspot.yml - did not exit (killed by SLURM)
  • recipe_daily_era5.yml - did not exit (killed by SLURM)
  • recipe_eady_growth_rate.yml (diagnostic ran out of memory and was killed by SLURM)
  • recipe_lauer22jclim_fig3-4_zonal.yml - did not exit (killed by SLURM because the extract_levels preprocessor realizes source levels, see ESMValCore#2114)
  • recipe_lauer22jclim_fig5_lifrac.yml - did not exit (killed by SLURM)
  • recipe_mpqb_xch4.yml - did not exit (killed by SLURM)
  • recipe_sea_surface_salinity.yml - did not exit (killed by SLURM)
  • recipe_thermodyn_diagtool.yml - did not exit (killed by SLURM)

@bouweandela
Member Author

And here are the results of the comparison with the results for v2.8.0. 61 recipes have different results, which seems like a lot. I will have a preliminary look at them and report some results here later.

@remi-kazeroni added this to the v2.9.0 milestone Jun 23, 2023
@valeriupredoi
Contributor

61 is not a lot - it's in fact bang-on; I think I had some 50-60 myself for 2.7 and Remi had a bucketload too. Matplotlib changed, so expect plots to differ only. Great work, bud 🍻

@valeriupredoi
Contributor

well, rainfarm died because you didn't install Julia; where are @schlunma's schlund*gpp ones that failed?

@valeriupredoi
Contributor

carvalhais looks like a proper bug in iris - cube.core_data() when data is just a point, I'll have a closer look

@valeriupredoi
Contributor

valeriupredoi commented Jun 23, 2023

Both lauer22 didn't actually exit - they seem to have been killed by SLURM mid-analysis, either a node failure or memory-related? Same with mpqb and sea surface salinity - gonna update your list there

@remi-kazeroni
Contributor

Great work with this recipe testing, @bouweandela!

Both lauer22 didn't actually exit - they seem to have been killed by SLURM mid-analysis, either a node failure or memory-related? Same with mpqb and sea surface salinity - gonna update your list there

Indeed, we have 7 recipes (last section of this #3239 (comment)) that couldn't be run due to time/memory limitations, even though they could be run in the previous release. Possible reasons:

  • Issues on Levante (filesystem, default SLURM settings silently changed during maintenance (has happened before), ...)
  • Iris issue with lazy aggregation (@schlunma's fix not yet merged)

And 4-5 diagnostic failures that we should try to fix together next week 🍻

@valeriupredoi
Contributor

so @ledm has just reported to CEDA/JASMIN that a bunch of his runs get kicked out of SLURM on the basis of resources, but with nothing obviously out of the ordinary (i.e. memory consumption of order 2 GB) - he still has to confirm that he's using the latest Core, but the symptoms look identical to the recipes here that didn't exit but were instead kicked out by the scheduler

@valeriupredoi
Contributor

carvalhais looks like a proper bug in iris - cube.core_data() when data is just a point, I'll have a closer look

lemme look into this now 🇵🇹

@valeriupredoi
Contributor

OK @ledm is running with latest:

esmvalcore                2.9.0.dev28+g179312e41          pypi_0    pypi
esmvaltool                2.9.0.dev32+ge17d1c8f6          pypi_0    pypi

and is seeing exactly what Bouwe saw for lauer22 etc - bizarre kickouts

@ledm
Contributor

ledm commented Jun 23, 2023

Okay, glad it's not just me. Though I think that some of my recent issues originated with #3240 and
ESMValGroup/ESMValCore#2110

@valeriupredoi
Contributor

valeriupredoi commented Jun 23, 2023

@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks

@ledm
Contributor

ledm commented Jun 26, 2023

I'm still struggling with this. I've changed the preprocessing stage of the recipe (removing the custom order) and I've tried a few older versions of esmvalcore (2.8.0, 2.8.1, and 2.8.1rc1). No matter what, I'm still getting the "killed" message after 0.3 GB of RAM usage.

@bouweandela
Member Author

@ledm Would it be OK to open a separate issue for your recipe? This issue is about testing recipes for the upcoming v2.9 release. Re the recipes listed above that are running out of memory: this is probably due to SciTools/iris#5338.

@bouweandela
Member Author

bouweandela commented Jun 26, 2023

I ran the comparison tool again because there was an issue with it: NaN values did not compare as equal in arrays, so the tool flagged too many recipe runs as changed. The new comparison results are available here. The number of recipe runs that differ from v2.8.0 is now 61. I will add a more detailed comparison once it is ready.
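
Purely as an illustration of the fix (not the comparison tool's actual code), a NaN-aware equality check avoids flagging arrays that differ only in having NaN at the same positions:

import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([1.0, np.nan, 3.0])

print(np.array_equal(a, b))                  # False: NaN != NaN element-wise
print(np.array_equal(a, b, equal_nan=True))  # True: NaN positions compare as equal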

@bouweandela
Member Author

A detailed comparison, including all kinds of information on what exactly changed in the output, is available here. Please scroll through the document to find your recipe. The indentation indicates which recipe/variable/coordinate the information applies to.

@valeriupredoi
Contributor

@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks

@bouweandela have you rerun this recipe to have it finish OK, by any chance?

@remi-kazeroni
Contributor

@bouweandela carvalhais goes through fine with iris=3.6.0 (deployed) and 3.6.x (pip-installed) - that's either a fluke or something related to distributed (which I did not run with); same dask as yours too; try rerunning the recipe, methinks

@bouweandela have you rerun this recipe to have it finish OK, by any chance?

Just did with iris 3.6.1 and I'm still getting the same error.

@remi-kazeroni
Contributor

By increasing the resources (from the interactive to the compute partition, or from compute to compute 512G), I was able to successfully run a few more recipes: recipe_climate_change_hotspot, recipe_mpqb_xch4, recipe_sea_surface_salinity, recipe_thermodyn_diagtool. But these recipes now have longer runtimes and higher memory consumption than found during the v2.8 release (factor >2). I could run these recipes successfully before and after the latest iris PR was merged, without any differences in terms of performance.

@schlunma
Contributor

By increasing the resources (from the interactive to the compute partition, or from compute to compute 512G), I was able to successfully run a few more recipes: recipe_climate_change_hotspot, recipe_mpqb_xch4, recipe_sea_surface_salinity, recipe_thermodyn_diagtool. But these recipes now have longer runtimes and higher memory consumption than found during the v2.8 release (factor >2). I could run these recipes successfully before and after the latest iris PR was merged, without any differences in terms of performance.

I re-ran some of these recipes with a distributed scheduler using the following .esmvaltool/dask.yml file:

cluster:
  type: distributed.LocalCluster
  n_workers: 8
  threads_per_worker: 4
  memory_limit: 32 GiB

(no idea how optimal those settings are).

  • recipe_thermodyn_diagtool: ran with pretty much the same resources as during the v2.8.0 testing (1:30:00 h, ~30 GiB memory). Note that >90% of the run time of that recipe is taken up by the diagnostic, so I wouldn't expect a lot of improvement with the distributed scheduler (assuming the diagnostic is not optimized).
  • recipe_climate_change_hotspot: failed, see below.
  • recipe_mpqb_xch4: failed, see below.
  • recipe_sea_surface_salinity: failed, see below.
  • recipe_daily_era5.yml: ran successfully in 18:00 min with 161 GiB of memory, but had a lot of warnings like "gc.collect() took 1.933s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help." and "Worker process still alive after 3.1999995422363288 seconds, killing". This is certainly not optimal (for v2.8.0, the recipe ran in 12 min with 31 GiB of memory), but at least it runs. I am pretty sure this could be optimized with smarter dask settings.

The runs that failed exited with errors like "concurrent.futures._base.CancelledError: compute_and_return_warnings-3ee342d0-affa-4e69-9f2a-09fc9e7cbe92" during the save step. I also see a lot of warnings like "Sending large graph of size ..." or "full garbage collections took XX% CPU time recently" in these runs, so my guess is that some of the preprocessors used in these recipes are not fully lazy yet.

Bottom line: I think that for the fully lazy preprocessors we can achieve a similar (hopefully much better, with smarter configuration) run time and memory usage when using dask distributed, but the non-lazy preprocessors crash with it.
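
For anyone who wants to check whether a particular preprocessor keeps the data lazy, a minimal sketch (the file name is a placeholder and annual_statistics is just an arbitrary example of a preprocessor function):

import iris
from esmvalcore.preprocessor import annual_statistics

cube = iris.load_cube("example.nc")  # iris loads lazily
print(cube.has_lazy_data())          # True after loading

result = annual_statistics(cube, operator="mean")
print(result.has_lazy_data())        # False would mean this step realized the data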

@bouweandela
Member Author

I tested recipe_daily_era5.yml with both iris 3.4 and 3.6.1 without configuring Dask Distributed: it runs fine with iris 3.4 but runs out of memory with 3.6.1.

@valeriupredoi
Contributor

carvalhais goes through with dask=2023.3.0 but fails a couple of lines further down the diagnostic, at plotting:

Traceback (most recent call last):
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 819, in <module>
    main(config)
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 757, in main
    _plot_single_map(plot_path_mod, tau_ctotal,
  File "/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py", line 654, in _plot_single_map
    plt.imshow(_get_data_to_plot(_dat.data),
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/matplotlib/pyplot.py", line 2695, in imshow
    __ret = gca().imshow(
            ^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/mpl/geoaxes.py", line 318, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/mpl/geoaxes.py", line 1331, in imshow
    img, extent = warp_array(img,
                  ^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/img_transform.py", line 192, in warp_array
    array = regrid(array, source_native_xy[0], source_native_xy[1],
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/b/b382109/miniconda3/envs/tool-latest/lib/python3.11/site-packages/cartopy/img_transform.py", line 278, in regrid
    _, indices = kdtree.query(target_xyz, k=1)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "_ckdtree.pyx", line 795, in scipy.spatial._ckdtree.cKDTree.query
ValueError: 'x' must be finite, check for nan or inf values

so cartopy is breaking it now

@valeriupredoi
Contributor

valeriupredoi commented Jun 28, 2023

the problem with downgrading dask is that we actually have to isolate a bug at their end, but beats me if I can do that - what I did was save the scalar cube that was causing the issue and try to repeat the operation that dask did not like:

import iris

def make_cube():
    """Make a scalar cube and save it."""
    cube = iris.load_cube("tauxxx.nc")
    return cube


def main():
    """Run the show."""
    cube = make_cube()
    # Force lazy data: the cube in the diagnostic has lazy data, but a
    # freshly loaded scalar cube comes with realized data.
    cube.data = cube.lazy_data()
    print(cube.has_lazy_data())
    print(cube)
    x = float(cube.core_data())
    print(x)


if __name__ == "__main__":
    main()

Problem is, that works fine - no issue. Note that I needed to force the lazy data because the cube in the diagnostic does have lazy data, whereas if you load an already-scalar cube it has realized data. Am stumped 😖

@valeriupredoi
Contributor

valeriupredoi commented Jun 29, 2023

carvalhais goes through with dask=2023.3.0 but fails a couple of lines further down the diagnostic, at plotting (traceback quoted above) - so cartopy is breaking it now

this is signaled and raised here SciTools/cartopy#2199 and it looks like they already have a PR to fix it 🥳

@bouweandela
Member Author

bouweandela commented Jun 29, 2023

Summary of relevant differences:

Recipes where the plots are identical, but (some) data differs

  • recipe_albedolandcover.yml
  • recipe_bock20jgr_fig_1-4.yml
  • recipe_bock20jgr_fig_6-7.yml
  • recipe_bock20jgr_fig_8-10.yml
  • recipe_clouds_bias.yml
  • recipe_clouds_ipcc.yml
  • recipe_extract_shape.yml
  • recipe_flato13ipcc_figure_98.yml
  • recipe_flato13ipcc_figures_92_95.yml
  • recipe_hydro_forcing.yml
  • recipe_ipccwg1ar6ch3_atmosphere.yml
  • recipe_ipccwg1ar6ch3_fig_3_43.yml
  • recipe_lauer22jclim_fig1_clim.yml
  • recipe_lauer22jclim_fig1_clim_amip.yml
  • recipe_lauer22jclim_fig6_interannual.yml
  • recipe_lauer22jclim_fig7_seas.yml
  • recipe_lauer22jclim_fig8_dyn.yml
  • recipe_perfmetrics_CMIP5.yml
  • recipe_perfmetrics_land_CMIP5.yml
  • recipe_smpi.yml
  • recipe_smpi_4cds.yml

For the above recipes, I suspect we might want to try to make the check of whether the data is the same more tolerant, since the plots are considered identical enough.
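
As a sketch of what "more tolerant" could look like (illustrative only, not the tool's implementation), a relative-tolerance check passes where bitwise equality does not:

import numpy as np

a = np.array([1.0000000, 2.0, np.nan])
b = np.array([1.0000001, 2.0, np.nan])

print(np.array_equal(a, b, equal_nan=True))  # False: the first values differ bitwise
np.testing.assert_allclose(a, b, rtol=1e-6)  # passes: within tolerance, NaNs compare equal by default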

The recipes below need a check by the @ESMValGroup/esmvaltool-recipe-maintainers (please let us know if you are no longer interested in being a maintainer, e.g. by commenting in this issue):

Recipes where (some) plots differ, but not the data

(Some) plots and data are different

The results from running the recipes with v2.8.0 of the tool are available here and the results from running them with v2.9.0 of the tool are available here.

You can see very specifically what is different in the comparison tool output files: detailed comparison part 1 and detailed comparison part 2.

@bouweandela
Member Author

bouweandela commented Jun 29, 2023

@ESMValGroup/technical-lead-development-team Because we're seeing a considerable performance regression for some recipes when using the basic/default Dask scheduler, I would propose that we make the release with iris pinned to v3.4 or later instead of v3.6 or later. That would allow users who are really keen on using the basic scheduler to do so. Please let me know if there are objections to that.
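
In terms of the environment file, that would amount to relaxing the pin to something along these lines (a sketch; the exact line in environment.yml may look different):

dependencies:
  - iris >=3.4.0  # instead of >=3.6.0, so the basic scheduler remains usable with iris 3.4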

@ledm
Contributor

ledm commented Jun 29, 2023

Out of interest, what was the last ESMValTool version that uses iris 3.4?

@remi-kazeroni
Contributor

Out of interest, what was the last ESMValTool version that uses iris 3.4?

For the release of v2.8.0, we used iris-3.4.1 to test all recipes

@katjaweigel
Contributor

@bouweandela thanks for the list! Only the colours changed in the plot from recipe_spei.yml (R chooses different ones every time; I should set fixed ones there, but I guess there is not enough time to change that before the release now).
For recipe_flato13ipcc_figures_938_941_cmip3.yml, recipe_flato13ipcc_figures_938_941_cmip6.yml and recipe_weigel21gmd_figures_13_16.yml there is a somewhat different setting of the fx files now (see #3215), which explains the changed values.

@bouweandela
Member Author

Thanks for checking @katjaweigel! I was planning to make the release on Thursday next week, so if you still want to fix the plotting in recipe_spei.yml before that time you can do it.

@remi-kazeroni
Contributor

Regarding the list of "Recipes that failed because they used too much memory" (in this #3239 (comment)), I was able to run all but one of them successfully on Levante by increasing the requested computational resources (see #3216). This does not really tell us where the problems come from, but it will nevertheless help the RM, who will get fewer failures.

The biggest increase in required resources is for the 2 cloud recipes. Using 1024 GB compute nodes and max_parallel_tasks=3, I was able to run recipe_lauer22jclim_fig5_lifrac (Time for running the recipe was: 5:20:57.188249, Maximum memory used (estimate): 881.6 GB) but not recipe_lauer22jclim_fig3-4_zonal (it ends with 1 task running, 1 task waiting for ancestors, 14/16 done after the 8 h limit). FYI, @axel-lauer.
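
For reference, a rough sketch of what such a batch job could look like (illustrative only; the partition, time limit, and recipe are examples, and max_parallel_tasks is set in the config user file):

#!/bin/bash
#SBATCH --partition=compute   # instead of the interactive partition
#SBATCH --mem=0               # request all of the node's memory
#SBATCH --time=08:00:00

conda activate esmvaltool-v2.9.x-2023-06-20
esmvaltool run recipe_lauer22jclim_fig5_lifrac.yml  # with max_parallel_tasks: 3 in the config user file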

@bouweandela
Member Author

bouweandela commented Jul 6, 2023

We did another round of test runs. The results are available here: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.9.0/

Here is the environment.yml and here are the comparison results and the detailed comparison results. The comparison was done with the previous round of testing for this release. Because of disk space limitations, I had to remove the v2.9.0rc1 results from the webpage, but the results are still available in /work/bd0854/b381141/esmvaltool_output/esmvaltool-v2.9.x-2023-06-20 for those who have access to Levante.

All recipes ran successfully except for:

@valeriupredoi
Contributor

@bouweandela here's your Portuguese issue, bud #3281

@valeriupredoi
Contributor

valeriupredoi commented Aug 2, 2023

I think we can safely close this now that we have #3287 - @bouweandela @remi-kazeroni since you put so much effort into it, I'd feel bad closing it myself; will you do the honours, please 🎖️

@maritsandstad
Contributor

Hi, I've been away on vacation and didn't see the mention to check the differing plots before now. To me it looks like only consecutive dry days have changed, and I wondered if it's just that the version of consecutive dry days has changed from the one that caps the number at the end of a year to the one that runs over multiple years when the longest dry period extends across calendar years. That would explain the plot discrepancy and the weird end of the original plot (in 2.8), and would anyway be the more sensible thing to plot. I.e. I think this is an improvement.
