Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some restart vector indices incorrect relative to io_idx_co_1st #1046

Closed
rgknox opened this issue Jun 23, 2023 · 3 comments · Fixed by #958
Closed

some restart vector indices incorrect relative to io_idx_co_1st #1046

rgknox opened this issue Jun 23, 2023 · 3 comments · Fixed by #958
Assignees

Comments

@rgknox
Copy link
Contributor

rgknox commented Jun 23, 2023

When we translate from the site -> patch -> cohort system of memory to the vectorized restart system of memory, we make use of an index called "io_idx_co_1st". This index, is supposed to be the first index, in the global restart vector, for the current site of interest. We use the subscript "co" to indicate that this is for an allocation that can hold stuff up to the cohort level. So if we had 100 sites, and needed enough space to hold 1000 cohorts per sites, we would have a vector of 100x1000 at our disposal for reading and writing restart info. io_idx_co_1st would be 1 for the first site, and 1001 for the second site and so on.

However, we have co-opted this index to also give us a starter index on the patch level variables. However, after we increase this value, we then switch back and translate over some site level variables. This is problematic, we should not done things in this order.

See here: https://github.com/NGEET/fates/blob/sci.1.65.7_api.25.4.0/main/FatesRestartInterfaceMod.F90#L2360-L2380
and
See here: https://github.com/NGEET/fates/blob/sci.1.65.7_api.25.4.0/main/FatesRestartInterfaceMod.F90#L2423

This translation should happen before we execute the patch loop, specifically before we start updating io_idx_co_1st.

A user may get lucky, and they will not run out of vector space. If the run does not run out of vector space, then I do expect things to be bit for bit, because we make the same mistake on the retrieval and writing routine. However, I did just trigger a memory exceedance error on an SP run. I'm guessing because the number of cohorts per site is set to be so small.

I'm going to include a fix in my radiation refactor branch, but we can apply the fix earlier if need be.

@rgknox rgknox self-assigned this Jun 23, 2023
@rgknox
Copy link
Contributor Author

rgknox commented Jun 23, 2023

fyi: @rosiealice , in case you have been experiencing crashes during SP, this could potentially fix

@rgknox
Copy link
Contributor Author

rgknox commented Jun 23, 2023

To get tests working, I needed to add another fix, we hadn't been calculating nlevsclass before its use in setting the restart allocation size.

This change addresses comments above: rgknox@5211412
This change addresses the nlevsclass order:
rgknox@5b2129a

@rgknox
Copy link
Contributor Author

rgknox commented Jul 3, 2023

as of PR #958 , this has been fixed. For instance, see updated order here: https://github.com/NGEET/fates/blob/sci.1.66.0_api.25.5.0/main/FatesRestartInterfaceMod.F90#L2106-L2136

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant