We're seeing the following error message during finalization of an OpenSHMEM program running on top of UCX 1.3.0:
Caught signal 7 (Bus error: nonexistent physical address)
The generated core dump contains the following stack trace:
#0 uct_mm_ep_update_cached_tail (ep=0x119dd020, ep=0x119dd020) at sm/mm/mm_ep.c:202
#1 uct_mm_ep_flush (tl_ep=0x119dd020, flags=0, comp=) at sm/mm/mm_ep.c:420
#2 0x0000ffff7a095a5c in uct_ep_flush (comp=0x119fb788, flags=, ep=0x119dd020) at /home/hpp/ucx/src/uct/api/uct.h:2050
#3 ucp_ep_flush_progress (req=req@entry=0x119fb700) at rma/flush.c:48
#4 0x0000ffff7a095fa8 in ucp_ep_flush_internal (ep=ep@entry=0x119dcfb0, uct_flags=uct_flags@entry=0, req_cb=req_cb@entry=0x0, req_flags=req_flags@entry=0,
flushed_cb=flushed_cb@entry=0xffff7a089ac8 <ucp_ep_close_flushed_callback>) at rma/flush.c:215
#5 0x0000ffff7a08bfb4 in ucp_ep_close_nb (ep=0x119dcfb0, mode=mode@entry=1) at core/ucp_ep.c:614
#6 0x0000ffff7a2be698 in blocking_ep_disconnect (ep=) at ucx-init.c:279
#7 disconnect_all_endpoints () at ucx-init.c:317
#8 shmemc_ucx_finalize () at ucx-init.c:457
#9 0x0000ffff7a2be004 in shmemc_finalize () at shmemc-init.c:56
#10 0x0000ffff7a2f7038 in finalize_helper () at init.c:54
#11 shmem_finalize () at init.c:156
#12 0x0000000000400b8c in main ()
The OpenSHMEM program that triggered this error is a simple hello world:
#include <stdio.h>
#include <shmem.h>

int main(int argc, char **argv) {
    shmem_init();
    int pe = shmem_my_pe();
    int npes = shmem_n_pes();
    printf("Hi from %d / %d\n", pe, npes);
    shmem_finalize();
    return 0;
}
A bus error usually means the remote segment was detached before another process finished writing to it.
deregister_memory_regions() should probably be called only after some out-of-band barrier.
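To illustrate the ordering suggested above, here is a minimal sketch that uses only POSIX `fork`/`mmap` and C11 atomics. It is not UCX or OSSS-UCX code: `world_t`, `oob_barrier`, `pe_run` and `run_teardown_demo` are hypothetical names, and the `registered` flags merely stand in for whatever deregister_memory_regions() tears down. Each simulated PE touches every peer's segment, waits at an out-of-band barrier, and only then marks its own segment as detached. The function returns how many accesses hit an already-detached segment.

```c
#define _DEFAULT_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PES 16

/* Hypothetical shared state standing in for the per-PE memory segments. */
typedef struct {
    atomic_int arrived;              /* PEs that have reached the barrier */
    atomic_int registered[MAX_PES];  /* 1 while PE i's segment is attached */
    atomic_int late_writes;          /* accesses to a detached segment */
} world_t;

/* One-shot out-of-band barrier: wait until all npes PEs have arrived. */
static void oob_barrier(world_t *w, int npes)
{
    atomic_fetch_add(&w->arrived, 1);
    while (atomic_load(&w->arrived) < npes)
        sched_yield();
}

/* Body of one simulated PE. */
static void pe_run(world_t *w, int pe, int npes)
{
    /* Stand-in for the final flush: touch every peer's segment. */
    for (int peer = 0; peer < npes; peer++)
        if (!atomic_load(&w->registered[peer]))
            atomic_fetch_add(&w->late_writes, 1);

    /* Nobody detaches until everybody has finished touching peers. */
    oob_barrier(w, npes);

    /* Stand-in for deregistering this PE's memory regions. */
    atomic_store(&w->registered[pe], 0);
}

/* Returns the number of late accesses (0 expected), or -1 on error. */
int run_teardown_demo(int npes)
{
    pid_t pids[MAX_PES];
    int started = 0, failed = 0;

    if (npes < 1 || npes > MAX_PES)
        return -1;

    /* Anonymous shared mapping, zero-filled and inherited across fork(). */
    world_t *w = mmap(NULL, sizeof(*w), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (w == MAP_FAILED)
        return -1;
    for (int i = 0; i < npes; i++)
        atomic_store(&w->registered[i], 1);

    for (; started < npes; started++) {
        pids[started] = fork();
        if (pids[started] == 0) {
            pe_run(w, started, npes);
            _exit(0);
        }
        if (pids[started] < 0) {
            failed = 1;
            break;
        }
    }

    /* If a fork failed, the barrier can never complete: stop the children. */
    if (failed)
        for (int i = 0; i < started; i++)
            kill(pids[i], SIGKILL);

    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }

    int result = failed ? -1 : atomic_load(&w->late_writes);
    munmap(w, sizeof(*w));
    return result;
}
```

Calling `run_teardown_demo(4)` mirrors the 4-PE run described in this issue. In a real runtime the barrier would presumably have to go through a channel that does not depend on the endpoints being closed (for example the launcher's own barrier), which is an assumption here rather than something the thread confirms.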
Hi all,
We're running on top of the OSSS implementation of OpenSHMEM (https://bitbucket.org/sbuopenshmem/osss-ucx/commits/all?search=) using the xpmem transport. This is a small shared memory run, 4 PEs in a single box running on 4 ARM cores.
Please let me know what other information I can provide.