Busyring crashes in Nernst on ARM (GH200 CPU) with non-power-of-two thread counts #2284

Open
thorstenhater opened this issue Jul 4, 2024 · 0 comments

The crash sometimes masquerades as an MPI crash.

Example of the crash surfacing in MPI/UCX (17 threads):

$ srun --exclusive -A zam -N 1 -n 1 --cpus-per-gpu=17 --gpus=1 --gpus-per-task=1 --gres=gpu:1 bin/busyring input.json
gpu:      yes
threads:  17
mpi:      yes
ranks:    1

start=1720081941
cell stats: 2048 cells; 303110 branches; 2831618 compartments;
#cpu=2048 #gpu=0
#cell=2048 #local=2048 #groups=17
model-init=1720081945
running simulation

  0% |                                                  |             0ms[1720081945.285889] [jpbot-001-20:559978:0]
        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[jpbot-001-20:559978] *** Process received signal ***
[jpbot-001-20:559978] Signal: Segmentation fault (11)
[jpbot-001-20:559978] Signal code: Address not mapped (1)
[jpbot-001-20:559978] Failing at address: 0x103bcae285ed0
[jpbot-001-20:559978:0:560001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd4e742890)
[jpbot-001-20:559978:1:559978] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3c0310f1d70)
[1720081945.285889] [jpbot-001-20:559978:1]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285927] [jpbot-001-20:559978:2]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285934] [jpbot-001-20:559978:0]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285941] [jpbot-001-20:559978:1]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285931] [jpbot-001-20:559978:3]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285954] [jpbot-001-20:559978:4]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285959] [jpbot-001-20:559978:5]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285953] [jpbot-001-20:559978:2]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285971] [jpbot-001-20:559978:5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285964] [jpbot-001-20:559978:6]           debug.c:1294 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[jpbot-001-20:559978:2:559992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd031d9a50)
[jpbot-001-20:559978:5:559991] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd099f5b90)
[jpbot-001-20:559978:3:559987] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc6491bf30)
[jpbot-001-20:559978:6:559988] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc9653bd90)
[jpbot-001-20:559978:4:560002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc925f2e20)
[jpbot-001-20:559978] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffbb9e07f0]
[jpbot-001-20:559978] [ 1] bin/busyring[0x4be968]
[jpbot-001-20:559978] [ 2] bin/busyring[0x4dbbb0]
[jpbot-001-20:559978] [ 3] bin/busyring[0x4fabb0]
[jpbot-001-20:559978] [ 4] bin/busyring[0x460bf0]
[jpbot-001-20:559978] [ 5] bin/busyring[0x467d44]
[jpbot-001-20:559978] [ 6] /p/software/jedi/stages/2024/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xd693c)[0xffffbb6a693c]
[jpbot-001-20:559978] [ 7] /lib64/libc.so.6(+0x80698)[0xffffbb390698]
[jpbot-001-20:559978] [ 8] /lib64/libc.so.6(+0xeabdc)[0xffffbb3fabdc]
[jpbot-001-20:559978] *** End of error message ***
srun: error: jpbot-001-20: task 0: Segmentation fault (core dumped)

Same test case with a different number of CPUs per GPU (16 threads) runs to completion:

$ srun --exclusive -A zam -N 1 -n 1 --cpus-per-gpu=16 --gpus=1 --gpus-per-task=1 --gres=gpu:1 bin/busyring input.json
gpu:      yes
threads:  16
mpi:      yes
ranks:    1

start=1720081984
cell stats: 2048 cells; 303110 branches; 2831618 compartments;
#cpu=2048 #gpu=0
#cell=2048 #local=2048 #groups=16
model-init=1720081988
running simulation

100% |--------------------------------------------------|            25ms
model-run=1720082064

2563 spikes generated at rate of 0.00975419 ms between spikes

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                      3.433        1826.736
model-run                      76.748          54.316
meter-total                    80.181        1881.051
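
To narrow down which thread counts trigger the crash, the two invocations above could be wrapped in a small sweep. This is only a sketch: the `srun` flags are copied from the commands in this report, while the helper names `run_busyring` and `sweep` and the `DRY_RUN` switch are hypothetical additions for illustration. By default it only prints the commands and submits nothing.

```shell
#!/bin/sh
# Sketch: run busyring with several --cpus-per-gpu values and report exit codes.
# Assumptions (not from the report): helper names and the DRY_RUN switch.
set -u

run_busyring() {
    threads="$1"
    # Same flags as in the report; only the thread count varies.
    set -- srun --exclusive -A zam -N 1 -n 1 "--cpus-per-gpu=${threads}" \
        --gpus=1 --gpus-per-task=1 --gres=gpu:1 bin/busyring input.json
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: print the command instead of submitting it.
        echo "$*"
        return 0
    fi
    "$@"
}

sweep() {
    for t in "$@"; do
        run_busyring "$t"
        # A non-zero exit code marks a failed run (e.g. the segfault above).
        echo "threads=${t} exit=$?"
    done
}

# The two thread counts from this report.
sweep 16 17
```

Setting `DRY_RUN=0` actually submits the jobs; calling `sweep` with further values (for example 15 and 18) would show whether other non-power-of-two counts fail as well.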

A different stack trace shows the problem originating in Arbor itself:

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1154ccb34d740)
==== backtrace (tid: 277735) ====
 0 0x00000000004be968 arb::default_catalogue::kernel_nernst::compute_currents()  ???:0
 1 0x00000000004dbbb0 arb::fvm_lowered_cell_impl<arb::multicore::backend>::integrate()  ???:0
 2 0x00000000004fabb0 arb::cable_cell_group::advance()  ???:0
 3 0x0000000000460bf0 std::_Function_handler<void (), arb::threading::task_group::wrap<arb::threading::parallel_for::apply<arb::simulation_state::foreach_group_index<arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}>(arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}&&)::{lambda(int)#1}>(int, int, int, arb::threading::task_system*, arb::simulation_state::foreach_group_index<arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}>(arb::simulation_state::run(double, double)::{lambda(arb::epoch)#2}::operator()(arb::epoch) const::{lambda(std::unique_ptr<arb::cell_group, std::default_delete<arb::cell_group> >&, int)#1}&&)::{lambda(int)#1})::{lambda()#1}> >::_M_invoke()  ???:0
 4 0x00000000004595b4 arb::threading::task_group::wait()  ???:0
 5 0x0000000000553de8 arb::simulation_state::run()  :0
 6 0x0000000000412c48 main()  ???:0
 7 0x0000000000027300 __libc_start_call_main()  ???:0
 8 0x00000000000273d8 __libc_start_main_alias_2()  :0
 9 0x00000000004179b0 _start()  ???:0
=================================
[jpbot-001-03:277735] *** Process received signal ***
[jpbot-001-03:277735] Signal: Segmentation fault (11)
[jpbot-001-03:277735] Signal code:  (-6)
[jpbot-001-03:277735] Failing at address: 0x191c00043ce7
[jpbot-001-03:277735] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffadf007f0]
[jpbot-001-03:277735] [ 1] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4be968]
[jpbot-001-03:277735] [ 2] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4dbbb0]
[jpbot-001-03:277735] [ 3] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4fabb0]
[jpbot-001-03:277735] [ 4] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x460bf0]
[jpbot-001-03:277735] [ 5] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4595b4]
[jpbot-001-03:277735] [ 6] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x553de8]
[jpbot-001-03:277735] [ 7] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x412c48]
[jpbot-001-03:277735] [ 8] /lib64/libc.so.6(+0x27300)[0xffffad857300]
[jpbot-001-03:277735] [ 9] /lib64/libc.so.6(__libc_start_main+0x98)[0xffffad8573d8]
[jpbot-001-03:277735] [10] /p/project1/chpsadm/alvarez/tests_users/test_arbor_jedi/./busyring[0x4179b0]
[jpbot-001-03:277735] *** End of error message ***