Unstable simulations are not caught by buildkite, LES_driven_SCM actually breaking #574

ilopezgp · 2021-11-18T07:50:34Z

I just realized that the MSE buildkite tests can pass without the simulation reaching the end time. This happens frequently in the LES_driven_SCM test. Examples from closed PRs:

In PR Add iterator for traversing grid #567, the simulation errored 97% of the way through the run. The MSE comparison is at time t=21390s.
In PR Make more variables climacore fields #564, the simulation errored 89% of the way through the run. The MSE comparison is at time t=19710s.

I attach an example of this behavior from the buildkite of PR #564. This issue has two components:

The current integration tests are fault tolerant; they shouldn't be.
The default LES_driven_SCM case breaks.

jakebolewski · 2021-11-18T14:32:42Z

I think for this you probably want some sort of state dump upon aborting (e.g. simulation time, number of steps, run parameters). Then the CI can check whether this abort state dump file exists for a particular run and raise a proper error exit code at the end. This will help with calibration as well, because the error will carry some context about its state, so you don't have to do the extra work of figuring out which parameters caused the abort and at which time.
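A minimal sketch of the abort-dump idea, written in Python purely for illustration (the project itself is Julia). The file name `abort_state.json` and the functions `write_abort_dump`, `find_abort_dumps`, and `ci_exit_code` are hypothetical and do not come from the repository.

```python
import json
import sys
from pathlib import Path

ABORT_FILE = "abort_state.json"  # hypothetical file name, not from the repository


def write_abort_dump(out_dir, t, n_steps, params):
    """Record when a run aborted and which parameters it was using."""
    path = Path(out_dir) / ABORT_FILE
    state = {"t": t, "n_steps": n_steps, "params": params}
    path.write_text(json.dumps(state, indent=2))
    return path


def find_abort_dumps(root):
    """Return every abort dump found under root (one per aborted run)."""
    return sorted(Path(root).rglob(ABORT_FILE))


def ci_exit_code(root):
    """Return 0 if no run aborted, 1 otherwise, printing each dump for context."""
    dumps = find_abort_dumps(root)
    for dump in dumps:
        print(f"aborted run in {dump.parent}: {dump.read_text()}", file=sys.stderr)
    return 1 if dumps else 0
```

A final CI step could then call `sys.exit(ci_exit_code(output_root))`, where `output_root` is whatever directory the runs write to, so the job fails only after every case has had a chance to run and export its data.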

575: Add post-run tests, error if we do not run to t_max r=costachris a=charleskawczynski Closes #574. Co-authored-by: Charles Kawczynski <[email protected]>

charleskawczynski · 2021-11-19T01:06:14Z

Hmm, #575 adds a test on the ODE integrator that we've reached t_max, and adds post-run tests (which ensure NaN is not in the solution). But maybe we just need the NaN check? Alternatively, we could return a success flag and throw the error after the state dump / comparison / MSE computations. Yeah, that's probably more helpful. Will open a PR.
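The ordering proposed here (compute a success flag, export, then throw) might look roughly like the sketch below. It is in Python for illustration only, and `run_and_check`, `integrate`, and `export` are hypothetical placeholders rather than the package's actual driver functions.

```python
import math


def run_and_check(integrate, export, t_max):
    """Run a simulation, always export the result, then fail on an abort.

    integrate() is assumed to return (t_final, solution), where solution is
    a flat sequence of floats; export(solution) writes the output. Both are
    stand-ins for the real driver.
    """
    t_final, solution = integrate()

    problems = []
    if t_final < t_max:
        problems.append(f"stopped at t={t_final} s before t_max={t_max} s")
    if any(math.isnan(x) for x in solution):
        problems.append("solution contains NaN")

    # Export before raising, so the data from a failed run can be inspected.
    export(solution)

    if problems:
        raise RuntimeError("simulation failed: " + "; ".join(problems))
    return solution
```

Raising only after `export` keeps the test strict while still leaving the solution on disk for debugging. A real check would probably need a tolerance on `t_final`, since an integrator may not land exactly on `t_max`.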

581: Error on aborted simulations after solution export r=charleskawczynski a=charleskawczynski This PR moves the error on early aborted simulations to _after_ the solution is exported. This way we can more easily look at the solution / data to see what went wrong. [Discussed here](#574 (comment)). I'm hoping that this will shed light on #577. Co-authored-by: Charles Kawczynski <[email protected]>

ilopezgp added the bug and help wanted labels Nov 18, 2021

ilopezgp assigned charleskawczynski Nov 18, 2021

charleskawczynski mentioned this issue Nov 18, 2021

Add post-run tests, error if we do not run to t_max #575

Merged

bors bot added a commit that referenced this issue Nov 18, 2021

Merge #575

bbfc9eb


bors bot closed this as completed in 74f799e Nov 19, 2021

charleskawczynski mentioned this issue Nov 19, 2021

Error on aborted simulations after solution export #581

Merged
