Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Several weekly GPU longruns are broken #2934

Closed
szy21 opened this issue Apr 20, 2024 · 23 comments
Closed

Several weekly GPU longruns are broken #2934

szy21 opened this issue Apr 20, 2024 · 23 comments
Assignees

Comments

@szy21
Copy link
Member

szy21 commented Apr 20, 2024

[build]((https://buildkite.com/clima/climaatmos-gpulongruns/builds/253) on 6/14
longrun_aquaplanet_clearsky_1M: unstable after ~17 weeks. t_end was extended to 120 days for this run so it is broken.
Status: not fixed
longrun_aquaplanet_rhoe_equil_55km_nz63_clearsky_tvinsol_0M_slabocean: stable, but conservation test failed due to the change to SSP.
Status: fixed by #3159, which ensures the surface state uses the correct precipitation tendency.

build on 6/14
longrun_aquaplanet_clearsky_1M: unstable after ~5 days. It seems 1M doesn't have a regression test now. Maybe related to #3084?
Status: fixed by #3095. Now it runs for longer and becomes unstable eventually, which is a separate issue.

build on 6/7
longrun_bw_rhoe_equil_highres. This is from PR #3074 which removes the reference state. Reordering the u3 tendency to make it similar to before in this commit fixes it. This suggests that we may be close to the instability regime. We will see if it works with SSPKnoth.
Status: fixed by switching to SSP
longrun_aquaplanet_rhoe_equil_55km_nz63_clearsky_0M_earth. This is from PR #3074 which removes the reference state.
Status: not fixed

build on 5/27
longrun_sphere_hydrostatic_balance_rhoe. This is from a change to a higher horizontal and vertical resolution (PR #3028)
Status: fixed by switching to SSP
longrun_aquaplanet_rhoe_equil_55km_nz63_gray_0M. PR #2855 is the only one that changes MSE this week.
Status: fixed by fixing a bug in PR #2855

build on 4/19
longrun_aquaplanet_rhoe_equil_55km_nz63_clearsky_0M_earth_sleve. cc @akshaysridhar . Changing SST to a function of surface height results in behavior changes in this job.
Status: not fixed
longrun_aquaplanet_clearsky_1M . cc @trontrytel . I don't know why this job has behavior changes.
Status: fixed (as in, it works now, I don't know why)

@szy21 szy21 added this to the Maintenance and Improvements milestone Apr 20, 2024
@Sbozzolo
Copy link
Member

SSP baroclinic wave also seems consistenly broken:
https://buildkite.com/clima/climaatmos-longruns/builds/1572#018f6607-605b-447a-b511-1111a4869090

@szy21
Copy link
Member Author

szy21 commented May 15, 2024

SSP baroclinic wave also seems consistenly broken: https://buildkite.com/clima/climaatmos-longruns/builds/1572#018f6607-605b-447a-b511-1111a4869090

Yeah, we have been ignoring this one. Hope we don't need it anymore after SSPKnoth:)

@trontrytel
Copy link
Member

Would it make sense to look at all of those cases after SSPKnoth lands?

@szy21
Copy link
Member Author

szy21 commented May 28, 2024

Would it make sense to look at all of those cases after SSPKnoth lands?

Yes, I just tagged you to let you know there is an open issue. Also make sure they don't break diagnostic EDMF:)

@akshaysridhar
Copy link
Member

akshaysridhar commented May 28, 2024

We will revisit this after SSPKnoth; I'm adding some fixes to a bycolumn op (in the use of column_mapreduce) in our cached vars to see if I can fix the 2 breaking cases as of Friday's run (these were both working as of April 12)

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

I found that PR #2855 breaks one run that works otherwise with SSPKnoth

@szy21
Copy link
Member Author

szy21 commented Jun 5, 2024

I found that PR #2855 breaks one run that works otherwise with SSPKnoth

Interesting, so that PR breaks two of the working runs, although I see no reason that it should cause physical changes. What is the job that failed?

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

I found that PR #2855 breaks one run that works otherwise with SSPKnoth

Interesting, so that PR breaks two of the working runs, although I see no reason that it should cause physical changes. What is the job that failed?

https://buildkite.com/clima/climaatmos-ci/builds/19073#018fe547-837c-4eab-a70f-2691b203a9fa

@szy21
Copy link
Member Author

szy21 commented Jun 5, 2024

Could someone take another look at the PR #2855 to see if the change in MSE is only from the change in the order of computation? Maybe @juliasloan25?

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

I am looking at it right now because it is blocking my work on Rosenbrock.

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

I took one integration step (for the job that is failing for me) and compared before and after the patch. The results are very different:

julia> working.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ
ClimaCore.Geometry.AxisTensor{Float32, 2, Tuple{ClimaCore.Geometry.CovariantAxis{(3,)}, ClimaCore.Geometry.CovariantAxis{(1, 2)}}, StaticArraysCore.SMatrix{1, 2, Float32, 2}}-valued Field:
  components: 
    data: 
      1: Float32[-319.072, -378.13, -442.548, -463.268, -337.013, -383.945, -441.563, -477.315, -352.969, -396.443  …  -384.247, -446.425, -296.626, -348.484, -427.235, -464.666, -319.072, -378.13, -442.548, -463.268]
      2: Float32[-28.7554, -25.6094, -20.8332, -12.6396, -32.5333, -30.2907, -28.3866, -19.9347, -42.1406, -43.148  …  -392.587, -418.496, -344.003, -365.153, -395.624, -399.4, -347.827, -377.507, -388.069, -371.485]

julia> greg.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ
ClimaCore.Geometry.AxisTensor{Float32, 2, Tuple{ClimaCore.Geometry.CovariantAxis{(3,)}, ClimaCore.Geometry.CovariantAxis{(1, 2)}}, StaticArraysCore.SMatrix{1, 2, Float32, 2}}-valued Field:
  components: 
    data: 
      1: Float32[-695.62, -742.976, -759.328, -734.481, -681.863, -710.018, -727.113, -730.004, -653.286, -684.591  …  -596.337, -668.212, -443.825, -503.773, -595.453, -632.766, -444.601, -514.036, -581.144, -592.805]
      2: Float32[347.81, 345.716, 315.054, 284.463, 306.303, 295.782, 269.1, 252.041, 241.099, 233.439  …  -180.496, -189.359, -199.373, -209.864, -220.374, -218.469, -222.301, -239.188, -241.123, -229.592]

@szy21
Copy link
Member Author

szy21 commented Jun 5, 2024

I suggest that we revert it, unless @akshaysridhar or @glwagner understand why there is a substantial change in surface fluxes from that PR?

@szy21
Copy link
Member Author

szy21 commented Jun 5, 2024

@Sbozzolo Could you check to see if ρ_flux_h_tot and ρ_flux_q_tot are very different as well?

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

Steps to reproduce:

First, checkout commit before/after change, e.g.

git checkout b51150ee4 

Then

import ClimaAtmos as CA
import SciMLBase
working = CA.get_simulation(CA.AtmosConfig("config/mpi_configs/mpi_sphere_aquaplanet_rhoe_equilmoist_clearsky.yml"))
SciMLBase.step!(working.integrator)
println(working.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ)

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

@Sbozzolo Could you check to see if ρ_flux_h_tot and ρ_flux_q_tot are very different as well?

After the first step, they are the same.

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

I suggest that we revert it, unless @akshaysridhar or @glwagner understand why there is a substantial change in surface fluxes from that PR?

(Maybe @dennisYatunin too)

@szy21
Copy link
Member Author

szy21 commented Jun 5, 2024

wait I think there is a bug in that PR.

f = C12(f₁₃ * xz + f₂₃ * xz, L)
here is should be f = C12(f₁₃ * xz + f₂₃ * yz, L)?

@szy21
Copy link
Member Author

szy21 commented Jun 5, 2024

@Sbozzolo could you try if that fixes the issue? (And we should somehow prevent this from happening in the future...)

@akshaysridhar
Copy link
Member

wait I think there is a bug in that PR.

f = C12(f₁₃ * xz + f₂₃ * xz, L)

here is should be f = C12(f₁₃ * xz + f₂₃ * yz, L)?

Yes, this is a bug

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

Will try this now.

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

Yes, that works!

@Sbozzolo
Copy link
Member

Sbozzolo commented Jun 5, 2024

Opening a PR now

@szy21
Copy link
Member Author

szy21 commented Jul 28, 2024

The longrun experiments have been cleaned up and most of these are irrelevant now. Closing the issue for now.

@szy21 szy21 closed this as completed Jul 28, 2024
@szy21 szy21 unpinned this issue Jul 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants