some restart vector indices incorrect relative to io_idx_co_1st #1046

rgknox · 2023-06-23T17:14:50Z

When we translate from the site -> patch -> cohort system of memory to the vectorized restart system of memory, we make use of an index called "io_idx_co_1st". This index, is supposed to be the first index, in the global restart vector, for the current site of interest. We use the subscript "co" to indicate that this is for an allocation that can hold stuff up to the cohort level. So if we had 100 sites, and needed enough space to hold 1000 cohorts per sites, we would have a vector of 100x1000 at our disposal for reading and writing restart info. io_idx_co_1st would be 1 for the first site, and 1001 for the second site and so on.

However, we have co-opted this index to also give us a starter index on the patch level variables. However, after we increase this value, we then switch back and translate over some site level variables. This is problematic, we should not done things in this order.

See here: https://github.com/NGEET/fates/blob/sci.1.65.7_api.25.4.0/main/FatesRestartInterfaceMod.F90#L2360-L2380
and
See here: https://github.com/NGEET/fates/blob/sci.1.65.7_api.25.4.0/main/FatesRestartInterfaceMod.F90#L2423

This translation should happen before we execute the patch loop, specifically before we start updating io_idx_co_1st.

A user may get lucky, and they will not run out of vector space. If the run does not run out of vector space, then I do expect things to be bit for bit, because we make the same mistake on the retrieval and writing routine. However, I did just trigger a memory exceedance error on an SP run. I'm guessing because the number of cohorts per site is set to be so small.

I'm going to include a fix in my radiation refactor branch, but we can apply the fix earlier if need be.

rgknox · 2023-06-23T17:18:52Z

fyi: @rosiealice , in case you have been experiencing crashes during SP, this could potentially fix

rgknox · 2023-06-23T18:50:49Z

To get tests working, I needed to add another fix, we hadn't been calculating nlevsclass before its use in setting the restart allocation size.

This change addresses comments above: rgknox@5211412
This change addresses the nlevsclass order:
rgknox@5b2129a

rgknox · 2023-07-03T18:50:40Z

as of PR #958 , this has been fixed. For instance, see updated order here: https://github.com/NGEET/fates/blob/sci.1.66.0_api.25.5.0/main/FatesRestartInterfaceMod.F90#L2106-L2136

rgknox self-assigned this Jun 23, 2023

rgknox closed this as completed Jul 3, 2023

glemieux mentioned this issue Jul 3, 2023

Drought deciduous phenology updates #958

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

some restart vector indices incorrect relative to io_idx_co_1st #1046

some restart vector indices incorrect relative to io_idx_co_1st #1046

rgknox commented Jun 23, 2023

rgknox commented Jun 23, 2023

rgknox commented Jun 23, 2023 •

edited

Loading

rgknox commented Jul 3, 2023

some restart vector indices incorrect relative to io_idx_co_1st #1046

some restart vector indices incorrect relative to io_idx_co_1st #1046

Comments

rgknox commented Jun 23, 2023

rgknox commented Jun 23, 2023

rgknox commented Jun 23, 2023 • edited Loading

rgknox commented Jul 3, 2023

rgknox commented Jun 23, 2023 •

edited

Loading