When we translate from the site -> patch -> cohort system of memory to the vectorized restart system of memory, we make use of an index called "io_idx_co_1st". This index is supposed to be the first index in the global restart vector for the current site of interest. We use the subscript "co" to indicate that this is for an allocation that can hold data up to the cohort level. So if we had 100 sites and needed enough space to hold 1000 cohorts per site, we would have a vector of length 100x1000 at our disposal for reading and writing restart info. io_idx_co_1st would be 1 for the first site, 1001 for the second site, and so on.
However, we have co-opted this index to also give us a starting index for the patch-level variables. After we increase this value inside the patch loop, we then switch back and translate over some site-level variables using the already-incremented index. This is problematic; we should not do things in this order.
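As a minimal illustration of that layout (a Python sketch, not FATES code; `first_cohort_index` and its arguments are hypothetical names used only for this example), the 1-based starting index of each site's block could be computed like this:

```python
def first_cohort_index(site: int, max_cohorts_per_site: int) -> int:
    """Return the 1-based first index of a site's block in the global restart vector.

    `site` is 1-based, mirroring the Fortran-style indexing described in the issue.
    """
    if site < 1:
        raise ValueError("site numbers are 1-based")
    return (site - 1) * max_cohorts_per_site + 1


max_cohorts_per_site = 1000
starts = [first_cohort_index(s, max_cohorts_per_site) for s in (1, 2, 100)]
print(starts)  # [1, 1001, 99001]
```

Each site owns the index range from its starting index up to the starting index plus `max_cohorts_per_site - 1`; anything beyond that belongs to the next site, or lies past the end of the vector for the last site.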
See here: https://github.com/NGEET/fates/blob/sci.1.65.7_api.25.4.0/main/FatesRestartInterfaceMod.F90#L2360-L2380
and
See here: https://github.com/NGEET/fates/blob/sci.1.65.7_api.25.4.0/main/FatesRestartInterfaceMod.F90#L2423
The translation of these site-level variables should happen before we execute the patch loop, specifically before we start updating io_idx_co_1st.
A user may get lucky and not run out of vector space. If the run does not run out of vector space, then I do expect results to be bit for bit, because we make the same mistake in both the retrieval and the writing routines. However, I did just trigger a memory exceedance error on an SP run. I'm guessing this is because the number of cohorts per site is set to be so small.
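The ordering problem can be sketched as follows. This is a simplified Python illustration, not the actual Fortran in FatesRestartInterfaceMod.F90; the function name, the per-patch strides, and the tiny block size are all assumptions chosen to make the effect visible:

```python
def site_level_index(first_index, patch_strides, site_vars_first):
    """Return the index at which site-level variables would be translated.

    patch_strides: hypothetical amounts by which the running index
    advances for each patch processed in the patch loop.
    """
    idx = first_index
    # Fixed order: capture the index before the patch loop touches it.
    site_idx = idx if site_vars_first else None
    for stride in patch_strides:
        idx += stride  # the patch loop advances the running index
    if site_idx is None:
        # Problematic order: the index has already been advanced.
        site_idx = idx
    return site_idx


first, capacity = 1, 4           # a small per-site block (assumed values)
last = first + capacity - 1      # last index owned by this site
buggy = site_level_index(first, [2, 2], site_vars_first=False)
fixed = site_level_index(first, [2, 2], site_vars_first=True)
print(buggy, buggy > last)  # 5 True: past the end of this site's block
print(fixed)                # 1
```

With a large per-site allocation the advanced index would still land inside the site's own block, which is consistent with the problem going unnoticed until the block is small.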
I'm going to include a fix in my radiation refactor branch, but we can apply the fix earlier if need be.