
Trying to update GNU compilers on pm-cpu, encounter hang/FPE with certain tests #6516

Closed
ndkeen opened this issue Jul 19, 2024 · 2 comments · Fixed by #6687
Labels
GNU (GNU compiler related issues), Machine Files, pm-cpu (Perlmutter at NERSC, CPU-only nodes)

Comments

ndkeen (Contributor) commented Jul 19, 2024

I've been trying to update several of the module versions on pm-cpu to "ideal" versions. Most tests (with intel, gnu, nvidia) seem OK, but a few are still problematic and I wanted to save some notes here. For GNU, I want to update to 12.3, which NERSC provides as the module gcc-native/12.3. I'm also updating these at the same time (some are required as part of the package update) -- currently trying the versions in the "ideal" column below:

module                      current        machine defaults   ideal
gcc                         11.2.0         12.2.0             12.3 (gcc-native)
PrgEnv-gnu                  8.3.3          8.5.0              8.5.0
cray-libsci                 23.02.1.1      23.02.1.1          23.12.5
cray-mpich                  8.1.25         8.1.25             8.1.28
craype                      2.7.20         2.7.20             2.7.30
cray-hdf5-parallel          1.12.2.3       1.12.2.3           1.12.2.9
cray-netcdf-hdf5parallel    4.9.0.3        4.9.0.3            4.9.0.9
cray-parallel-netcdf        1.12.3.3       1.12.3.3           1.12.3.9
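
For concreteness, a minimal shell sketch of loading this "ideal" stack on pm-cpu; the module names and versions come from the table above, while the load order and exact availability are assumptions:

# Hedged sketch: versions taken from the "ideal" column above; availability may vary.
module load PrgEnv-gnu/8.5.0 gcc-native/12.3 craype/2.7.30
module load cray-libsci/23.12.5 cray-mpich/8.1.28
module load cray-hdf5-parallel/1.12.2.9 cray-netcdf-hdf5parallel/4.9.0.9
module load cray-parallel-netcdf/1.12.3.9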

Running e3sm_integration, the only tests with issues are DEBUG-built tests, and unfortunately they hang during init. I found two files where altering the compiler flags gives an FPE instead of a hang, which I consider an improvement, though it may or may not be the same issue that causes the hang. For the following two files, if I add -O (i.e., override -O0)

+  eam/src/dynamics/se/inidat.F90
+  eam/src/dynamics/se/dyn_comp.F90

I can get the following error with several tests:

 561: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
 561: 
 561: Backtrace for this error:
 561: #0  0x14c47a423372 in ???
 561: #1  0x14c47a422505 in ???
 561: #2  0x14c479853dbf in ???
 561: #3  0x1a35e91 in __edge_mod_base_MOD_edgevunpack_nlyr
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/homme/src/share/edge_mod_base.F90:903
 561: #4  0x26be308 in __inidat_MOD_read_inidat
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/dynamics/se/inidat.F90:643
 561: #5  0x1cd9e85 in __startup_initialconds_MOD_initial_conds
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/control/startup_initialconds.F90:18
 561: #6  0x19dbe8e in __inital_MOD_cam_initial
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/dynamics/se/inital.F90:67
 561: #7  0x66335a in __cam_comp_MOD_cam_init
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/control/cam_comp.F90:162
 561: #8  0x651769 in __atm_comp_mct_MOD_atm_init_mct
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/components/eam/src/cpl/atm_comp_mct.F90:371
 561: #9  0x49d9bc in __component_mod_MOD_component_init_cc
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/component_mod.F90:248
 561: #10  0x484b80 in __cime_comp_mod_MOD_cime_init
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_comp_mod.F90:1488
 561: #11  0x4964d5 in cime_driver
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_driver.F90:122
 561: #12  0x496611 in main
 561:   at /global/cfs/cdirs/e3sm/ndk/repos/pr/ndk_mf_pm-cpu-update-compiler-versions/driver-mct/main/cime_driver.F90:23

So far, here are the tests showing the error:

SMS_D_Ld1.ne30pg2_EC30to60E2r2.F2010.pm-cpu_gnu
SMS_D_Ln5.ne30pg2_EC30to60E2r2.F2010.pm-cpu_gnu.eam-p3
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.F2010
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.CRYO1850-DISMF
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_gnu.allactive-wcprod
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_gnu.allactive-wcprodssp

I can also log in to a compute node (on the similar machine muller-cpu) during a hang and view where one process is sitting:

#0  cxi_eq_peek_event (eq=0x22e12dc8) at /usr/include/cxi_prov_hw.h:1531
#1  cxip_ep_ctrl_eq_progress (ep_obj=0x22e25790, ctrl_evtq=0x22e12dc8, tx_evtq=true, ep_obj_locked=true) at prov/cxi/src/cxip_ctrl.c:318
#2  0x00001503828591dd in cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:186
#3  0x000015038285e969 in cxip_util_cq_progress (util_cq=0x22e15220) at prov/cxi/src/cxip_cq.c:112
#4  0x000015038283a301 in ofi_cq_readfrom (cq_fid=0x22e15220, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#5  0x00001503860fa0f2 in MPIR_Wait_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#6  0x0000150386c9b926 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_intel.so.12
#7  0x0000150386ca7685 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_intel.so.12
#8  0x0000150386bd232d in MPIR_Alltoall_intra_brucks () from /opt/cray/pe/lib64/libmpi_intel.so.12
#9  0x00001503855bee8a in MPIR_Alltoall_intra_auto.part.0 () from /opt/cray/pe/lib64/libmpi_intel.so.12
#10 0x00001503855bf05c in MPIR_Alltoall_impl () from /opt/cray/pe/lib64/libmpi_intel.so.12
#11 0x00001503855bf83f in PMPI_Alltoall () from /opt/cray/pe/lib64/libmpi_intel.so.12
#12 0x0000150387c4364e in pmpi_alltoall__ () from /opt/cray/pe/lib64/libmpifort_intel.so.12
#13 0x0000000000bcad8f in mpialltoallint (sendbuf=..., sendcnt=1, recvbuf=..., recvcnt=1, comm=-1006632954) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/wrap_mpi.F90:1143
#14 0x0000000002b93c02 in phys_grid::transpose_block_to_chunk (record_size=88, block_buffer=<error reading variable: value requires 2509056 bytes, which is more than max-value-size>, chunk_buffer=<error reading variable: value requires 2452032 bytes, which is more than max-value-size>,
    window=<error reading variable: Cannot access memory at address 0x0>) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/physics/cam/phys_grid.F90:4137
#15 0x0000000005304965 in dp_coupling::d_p_coupling (phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/dp_coupling.F90:242
#16 0x0000000003719020 in stepon::stepon_run1 (dtime_out=1800, phys_state=..., phys_tend=..., pbuf2d=0x26500aa0, dyn_in=..., dyn_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/dynamics/se/stepon.F90:244
#17 0x0000000000948d7c in cam_comp::cam_run1 (cam_in=..., cam_out=...) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/control/cam_comp.F90:251
#18 0x0000000000905530 in atm_comp_mct::atm_init_mct (eclock=..., cdata_a=..., x2a_a=..., a2x_a=..., nlfilename=..., .tmp.NLFILENAME.len_V$5bab=6) at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/components/eam/src/cpl/atm_comp_mct.F90:499
#19 0x00000000004a7045 in component_mod::component_init_cc (eclock=..., comp=..., infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., .tmp.NLFILENAME.len_V$7206=6, .tmp.SEQ_FLDS_X2C_FLUXES.len_V$7209=4096, .tmp.SEQ_FLDS_C2X_FLUXES.len_V$720c=4096)
    at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/component_mod.F90:257
#20 0x000000000045d9d6 in cime_comp_mod::cime_init () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_comp_mod.F90:2370
#21 0x000000000049dfc2 in cime_driver () at /mscratch/sd/n/ndk/repos/ndk_mf_update-muller-module-versions/driver-mct/main/cime_driver.F90:122

where it looks like frame #14 is the first frame above the MPI stack -- so perhaps the error originates there, before MPI?
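
For reference, a hedged sketch of how a backtrace like this might be captured from a hung rank; the node name, process-match pattern, and PID below are placeholders, not values from the actual run:

ssh nid004242                                 # a compute node assigned to the hung job (placeholder name)
pgrep -u "$USER" -f e3sm.exe                  # find the PID of one MPI rank
gdb -q -p 123456 -ex bt -ex detach -ex quit   # print its backtrace, then detach without killing the run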

Noting some tests that complete:

(all other tests in e3sm_integration)
SMS_D.ne30_oECv3_gis.IGELM_MLI.pm-cpu_gnu.elm-extrasnowlayers
SMS_D.ne30pg2_r05_IcoswISC30E3r5.GPMPAS-JRA.pm-cpu_gnu.mosart-rof_ocn_2way
SMS_D_Ln3.ne4pg2_ne4pg2.FAQP.pm-cpu_gnu
SMS_D.ne30pg2_ne30pg2.IELMTEST.pm-cpu_gnu
ndkeen added the Machine Files, GNU, and pm-cpu labels on Jul 19, 2024
ndkeen (Contributor, Author) commented Jul 19, 2024

It looks like when I simply use the machine defaults, the tests that were failing are OK. That is an easy fix, but it means that to upgrade other compilers (such as intel), we will need different versions of things like cray-mpich depending on the compiler. Which is maybe fine -- will make a branch/PR.
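
As a conceptual sketch only -- the real selection lives in the machine config files, and the $COMPILER variable here is purely illustrative -- the compiler-dependent module idea looks roughly like this, with version strings taken from the tables in this issue:

# Illustrative only; not the actual machine-config mechanism.
case "$COMPILER" in
  gnu)   module load PrgEnv-gnu/8.5.0 gcc/12.2.0 cray-mpich/8.1.25 ;;  # stay near machine defaults
  intel) module load PrgEnv-intel cray-mpich/8.1.28 ;;                 # newer stack for intel
esac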

Another issue is that I already tested and merged a PR to the scream repo that updates pm-gpu to the "ideal" version of the GNU compiler. Nothing stops us from having two different versions on the two compute clusters, but of course it's best to keep them the same.

module                      current        machine defaults   ideal
gcc                         11.2.0         12.2.0             12.3 (gcc-native)
PrgEnv-gnu                  8.3.3          8.5.0              8.5.0
cray-libsci                 23.02.1.1      23.02.1.1          23.12.5
cray-mpich                  8.1.25         8.1.25             8.1.28
craype                      2.7.20         2.7.20             2.7.30
cray-hdf5-parallel          1.12.2.3       1.12.2.3           1.12.2.9
cray-netcdf-hdf5parallel    4.9.0.3        4.9.0.3            4.9.0.9
cray-parallel-netcdf        1.12.3.3       1.12.3.3           1.12.3.9

#6517

ndkeen (Contributor, Author) commented Oct 15, 2024

Well whadyaknow... coming back to this, I verified that I still see the same hang with the newer GCC for at least two of the tests above, but trying again with kdreg2 (#6687), at least one case no longer hangs.

So using kdreg2 may help here as well. I will do more testing and can hopefully update the GCC compiler.
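
A minimal sketch of that workaround; in E3SM this would normally be set through the machine config rather than by hand, and the launch line below is only a placeholder:

export FI_MR_CACHE_MONITOR=kdreg2   # switch the libfabric memory-registration cache monitor to kdreg2
srun ./e3sm.exe                     # placeholder launch; real cases use the CIME-generated run script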

ndkeen added a commit that referenced this issue Oct 19, 2024
With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, there were some hangs in init for
certain cases at higher node counts. Using the environment variable FI_MR_CACHE_MONITOR=kdreg2 avoids any issues so far.
kdreg2 is another option for memory-cache monitoring -- it is a Linux kernel module using open-source licensing.
It comes with the HPE Slingshot host software distribution (optionally installed) and may one day be the default.

Regarding performance, it seems about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some with lower node-count) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]
ndkeen added a commit that referenced this issue Oct 21, 2024
…' into next (PR #6702)

On pm-cpu we were using an updated Intel compiler and other compatible module versions, but had not yet updated the others due to #6516. After #6687, I think those issues are resolved and we can now update.

The main change here is updating the gcc compiler, but other module versions are updated at the same time.
This also tries to clean up the machine config settings across all NERSC machines to make them more consistent.

module                      current        machine defaults (in this PR)
gcc                         12.2.0         12.3 (gcc-native)
PrgEnv-gnu                  8.3.3          8.5.0
cray-libsci                 23.02.1.1      23.12.5
cray-mpich                  8.1.25         8.1.28
craype                      2.7.20         2.7.30
cray-hdf5-parallel          1.12.2.3       1.12.2.9
cray-netcdf-hdf5parallel    4.9.0.3        4.9.0.9
cray-parallel-netcdf        1.12.3.3       1.12.3.9
On muller-cpu, I had already tried updating the compiler versions and was testing a work-around with special compiler flags for GNU. With this PR, that work-around can be removed.

Removing FI_CXI_RX_MATCH_MODE=software for machines other than the primary pm-cpu. It will be removed for pm-cpu itself in another PR, as the default FI_CXI_RX_MATCH_MODE=hybrid now seems fine.

We might see some cases that are not BFB with the GNU compiler. For the e3sm-developer tests (what we test nightly), the only test that did not pass baseline comparison was ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu.