Trying to update GNU compilers on pm-cpu, encounter hang/FPE with certain tests #6516
It looks like when I simply try the machine defaults, the tests that were failing are OK. That is an easy fix, but it means that to upgrade other compilers (such as Intel), we will need different versions of things like mpich depending on the compiler. Which is maybe fine -- I will make a branch/PR (a sketch of what compiler-dependent module selections could look like is below). Another issue is that I already tested and merged a PR to the scream repo that updates pm-gpu with the "ideal" version of the GNU compiler. Nothing stops us from having two different versions on the two different compute clusters, but of course it is best to have them the same.
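For illustration only, a minimal sketch (not the actual E3SM machine config; the Intel version is a placeholder) of what per-compiler module selections on pm-cpu could look like in shell form:

```sh
#!/bin/bash
# Hypothetical sketch: pick module versions per compiler on pm-cpu.
# COMPILER would normally be set by the case/test scripts.
COMPILER=${COMPILER:-gnu}

case "$COMPILER" in
  gnu)
    module load gcc-native/12.3      # newer GNU, the "ideal" version
    module load cray-mpich/8.1.28    # mpich version paired with the newer GNU
    ;;
  intel)
    module load intel/2023.2.0       # placeholder Intel version (assumption)
    module load cray-mpich/8.1.25    # possibly a different mpich for Intel
    ;;
esac
```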
Well, whaddya know... coming back to this, I verified I still see the same hang with the newer GCC for at least two of the tests above, and then tried again with [...]. So using [...]
With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, there were some hangs in init for certain cases at higher node counts. Setting the environment variable `FI_MR_CACHE_MONITOR=kdreg2` avoids any issues so far. kdreg2 is another option for memory-cache monitoring: it is an open-source-licensed Linux kernel module that comes with the HPE Slingshot host software distribution (optionally installed) and may one day be the default. Performance seems about the same; for one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some with lower node counts) that this fixes:

Fixes #6516
Fixes #6451
Fixes #6521

[bfb]
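As a hedged illustration (the batch directives and executable name are placeholders, not taken from the actual case), setting the variable for a Slurm job on Perlmutter might look like:

```sh
#!/bin/bash
#SBATCH --nodes=256          # placeholder node count, matching the HR F-case example
#SBATCH --constraint=cpu     # assumption: pm-cpu partition constraint

# Switch libfabric's memory-registration cache monitor to the kdreg2
# kernel module instead of the default monitor.
export FI_MR_CACHE_MONITOR=kdreg2

srun ./e3sm.exe              # placeholder executable name
```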
…' into next (PR #6702)

On pm-cpu we were using an updated Intel compiler and other compatible module versions, but had not yet updated the others due to #6516. After #6687, I think those issues are resolved and we can now update. The main change here is updating the gcc compiler, but other module versions are updated at the same time. Also, try to clean up the machine config settings across all NERSC machines to be more consistent.

| module | current machine defaults | in this PR |
| --- | --- | --- |
| gcc | 12.2.0 | 12.3 (gcc-native) |
| PrgEnv-gnu | 8.3.3 | 8.5.0 |
| cray-libsci | 23.02.1.1 | 23.12.5 |
| cray-mpich | 8.1.25 | 8.1.28 |
| craype | 2.7.20 | 2.7.30 |
| cray-hdf5-parallel | 1.12.2.3 | 1.12.2.9 |
| cray-netcdf-hdf5parallel | 4.9.0.3 | 4.9.0.9 |
| cray-parallel-netcdf | 1.12.3.3 | 1.12.3.9 |

On muller-cpu, I had already tried updating the compiler versions and was testing a work-around with special compiler flags for GNU. With this PR, that work-around can be removed. Removing FI_CXI_RX_MATCH_MODE=software for machines other than the primary pm-cpu. Will remove this in another PR, as the default FI_CXI_RX_MATCH_MODE=hybrid now seems fine. We might see some cases that are not BFB using the GNU compiler. For the e3sm-developer tests (what we test nightly), the only test that did not pass baseline compare was ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu
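Purely as an illustration of the version bump (the exact load order and any extra modules in the machine config may differ), the updated set corresponds roughly to:

```sh
# Rough sketch of the updated module set for pm-cpu GNU builds (order is illustrative).
module load PrgEnv-gnu/8.5.0
module load gcc-native/12.3
module load craype/2.7.30
module load cray-libsci/23.12.5
module load cray-mpich/8.1.28
module load cray-hdf5-parallel/1.12.2.9
module load cray-netcdf-hdf5parallel/4.9.0.9
module load cray-parallel-netcdf/1.12.3.9
```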
I've been trying to update several of the module versions on pm-cpu to "ideal" versions. Most tests (with intel, gnu, nvidia) seem OK, but a few are still problematic and I wanted to save some notes here. For GNU, I want to update to 12.3, which NERSC calls the module `gcc-native/12.3`. Also updating these at the same time (some are required as a package update) -- currently trying those in "ideal" below:

Running e3sm_integration, the only tests with issues are DEBUG-built tests, and unfortunately they hang during init. I managed to find two files where I can alter the compiler flags and get an FPE instead, which I consider an improvement, though it may or may not be the same issue causing the hang. For the following 2 files, if I add `-O` (i.e., disable `-O0`), I can get the following error with several tests:
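Since GCC takes the last optimization flag given on the command line, appending `-O` after the DEBUG `-O0` re-enables optimization for just those files. A hedged sketch of the idea, assuming the Cray `ftn` wrapper and illustrative file names; the flags other than `-O`/`-O0` are assumptions, not taken from the actual build:

```sh
# DEBUG build normally compiles with -O0; when -O is appended later on the
# command line it takes effect, so only the listed file gets optimization back.
ftn -c -g -O0 -ffpe-trap=invalid,zero,overflow some_module.F90      # hypothetical file, default DEBUG flags
ftn -c -g -O0 -ffpe-trap=invalid,zero,overflow -O some_module.F90   # with the override: -O wins over -O0
```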
So far, here are the tests showing the error:
I can also log in to a compute node (on the similar machine muller-cpu) during a hang and view where one process is sitting (see the attach/backtrace sketch at the end of this comment):

where it looks like at frame `#14` it errors before reaching the MPI stack? Noting some tests that complete:
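As a hedged aside (the executable name and PID are placeholders, and this assumes gdb is available on the compute node), the kind of attach-and-backtrace used to see where a hung rank is sitting looks roughly like:

```sh
# On the compute node where the job is hung, find a rank and attach to it.
pgrep -fl e3sm.exe              # placeholder executable name; note a PID from the output
gdb -p <PID> -batch -ex bt      # attach, print that process's backtrace, then detach
```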