lib-4412 : UNRECOVERABLE library error of Cray compiler CCE/14.0.0 on Crusher #5001

Closed
grnydawn opened this issue May 31, 2022 · 15 comments · Fixed by #5369

@grnydawn
Contributor

The error message is:

6: lib-4412 : UNRECOVERABLE library error
6: An argument in the DEALLOCATE statement is a disassociated pointer, an
6: unallocated array, or a pointer not allocated as a pointer.

Test case name is:

SMS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates_eca
SMS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates_rd
ERS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates

@grnydawn added the Crusher and Cray (Cray compiler related issues) labels May 31, 2022
@sarats
Member

sarats commented Jun 29, 2022

Please add location of the error for reference.

@grnydawn
Contributor Author

@sarats, I could not find the source location of this error because the e3sm log file contains no location information. I suspect that pinpointing it will be difficult because the error is raised inside a binary library.

@sarats
Member

sarats commented Jul 1, 2022

Naive question: this is a runtime error, right?
Just a sanity check, did any component log file leave a pointer to where the issue might be?

@grnydawn
Contributor Author

grnydawn commented Jul 1, 2022

@sarats, yes, it is a runtime error. The error message gives no indication of where the issue occurred. The one thing we know from the e3sm log file is that the error occurred after many balance check warnings were printed. Here is the relevant part of the e3sm log file:
...
40: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
40: nstep = 138
40: errsol = -1.04153343727375614E-7
3:
3: lib-4412 : UNRECOVERABLE library error
3: An argument in the DEALLOCATE statement is a disassociated pointer, an
3: unallocated array, or a pointer not allocated as a pointer.
...
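
Since the log gives no source location, one way to narrow things down is to isolate the output of the task that hit the error. The sketch below assumes the number before the first colon on each line identifies the task that printed it, and it builds a small stand-in file from the excerpt above. The variables `log`, `rank`, and `rank_lines` are illustrative names, not part of E3SM.

```shell
#!/bin/sh
# Stand-in for the real e3sm log file (assumption: a temporary file
# filled with the excerpt quoted in this comment).
log=$(mktemp)
cat > "$log" <<'EOF'
40: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
40: nstep = 138
40: errsol = -1.04153343727375614E-7
3:
3: lib-4412 : UNRECOVERABLE library error
3: An argument in the DEALLOCATE statement is a disassociated pointer, an
3: unallocated array, or a pointer not allocated as a pointer.
EOF

# Prefix (text before the first colon) of the first line reporting the error.
rank=$(grep 'lib-4412' "$log" | head -n 1 | cut -d: -f1)
echo "failing task: $rank"

# Every line printed by that task, to see what it reported around the abort.
rank_lines=$(grep "^${rank}:" "$log")
printf '%s\n' "$rank_lines"
```

On a real run, point `log` at the e3sm log file instead of the temporary file. This only shows what the failing task printed; it does not recover the source line of the DEALLOCATE statement.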

@amametjanov
Member

The IBM compiler on Ascent/Summit has a similar issue for these 3 cases. There's a workaround for IBM in NGEET/fates#824.

Pinging @rgknox and @glemieux . Have you been able to get accounts on Summit (which would also enable access to Crusher)?

IBM: https://my.cdash.org/viewTest.php?onlyfailed&buildid=2223558
Cray: https://my.cdash.org/viewTest.php?onlyfailed&buildid=2223909

@glemieux
Contributor

glemieux commented Oct 3, 2022

@amametjanov I've got access to Summit, but addressing NGEET/fates#702 has been low on the priority list. The solar radiation issue is a known issue as well (NGEET/fates#794). I'll chat with Ryan and Charlie about prioritizing this soon.

@sarats
Member

sarats commented Jan 24, 2023

@amametjanov Does the above workaround you devised for IBM work for Cray compilers as well?
@grnydawn You can check out Az's fix and check as well.

@glemieux Getting this fix incorporated would be good to get things working on Crusher/Frontier.

cc @rljacob

@rljacob
Member

rljacob commented Jan 24, 2023

@glemieux Crusher access is included with your OLCF/Summit access. So will Frontier when it's available.

@grnydawn
Contributor Author

@amametjanov @sarats The source file (biogeochem/EDCohortDynamicsMod.F90) does not exist in the current master branch.

Az: can you point me to the file I should look at?

@amametjanov
Member

In the master version from Jan 20, those files are under components/elm/src/external_models/fates.

@grnydawn
Contributor Author

@sarats @amametjanov Thanks for the info. FYI, I tried copying your fixes into the latest E3SM and still got a similar error at one of the deallocate statements in "PRTGenericMod.F90", shown below.

do i_var = 1, prt_global%num_vars
   deallocate( &
         & this%variables(i_var)%val, &
         & this%variables(i_var)%val0, &
         & this%variables(i_var)%net_alloc, &
         & this%variables(i_var)%turnover, &
         & this%variables(i_var)%burned, &
         & stat=istat, errmsg=smsg )
   if (istat/=0) call endrun(msg='deallocate stat/=0:'//trim(smsg)//errMsg(sourcefile, __LINE__))
end do

@amametjanov
Member

Possibly something went wrong with copying. I just rebased my branch onto latest E3SM (FATES submodule hash def6b3e76f9ff3043150a777f403883b3e659374).
All 3 cases still pass with Cray compiler.
Just for reference, steps to reproduce:

cd components/elm/src/external_models/fates
git remote -v
git remote add newfork git@github.com:amametjanov/fates.git
git remote -v
git fetch newfork
git checkout azamat/fix-ibm-dealloc-errors
cd -
./cime/scripts/create_test \
  SMS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates_eca \
  SMS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates_rd \
  ERS_Ld20.f45_f45.IELMFATES.crusher_crayclang.elm-fates \
  -t 20230125-chk-fates-pr

@grnydawn
Contributor Author

@amametjanov @sarats Yes, after following your directions, I could run the three test cases without failure using the Cray compiler. However, a memory leak is detected with the AMD compiler in the following test case:

SMS_Ld20.f45_f45.IELMFATES.crusher_amdclang.elm-fates_rd

Even though the memory leak with the AMD compiler remains, I think it is still better to have this fix implemented.

@glemieux
Contributor

FYI, this should be fixed by NGEET/fates#824 the next time we update to point the fates submodule to tag sci.1.63.2_api.25.1.0.

@grnydawn
Contributor Author

> FYI, this should be fixed by NGEET/fates#824 the next time we update to point the fates submodule to tag sci.1.63.2_api.25.1.0.

@glemieux Thanks for the fix. I will test it on my side on Crusher once the fix is available in the E3SM master branch.

peterdschwartz added a commit that referenced this issue Mar 22, 2023
…pi' into next (PR #5369)

This pull request updates the ELM-FATES API to provide FATES with the lightning and population density data from FireMod.F90.
This provides ELM-FATES users access to the additional SPITFIRE run modes. The design adapts the framework developed for CLM-FATES.
The design document discussing the background and general design is available in the FATES Developer's Guide.
All non-fates tests should be b4b as this PR only adds access to additional FATES modes which are not yet covered by existing tests.

This also updates the fates pointer to tag sci.1.63.2_api.25.1.0, bringing in the fix for the Cray and IBM pointer deallocation issue to resolve #5001.

Fixes #5001

[B4B]