Bus error on UCX 1.3.0 #2601

Open
agrippa opened this issue May 13, 2018 · 3 comments

Comments

@agrippa

agrippa commented May 13, 2018

Hi all,

We're seeing the following error message during finalization of an OpenSHMEM program running on top of UCX 1.3.0:

Caught signal 7 (Bus error: nonexistent physical address)

The generated core dump contains the following stack trace:

#0 uct_mm_ep_update_cached_tail (ep=0x119dd020, ep=0x119dd020) at sm/mm/mm_ep.c:202
#1 uct_mm_ep_flush (tl_ep=0x119dd020, flags=0, comp=) at sm/mm/mm_ep.c:420
#2 0x0000ffff7a095a5c in uct_ep_flush (comp=0x119fb788, flags=, ep=0x119dd020) at /home/hpp/ucx/src/uct/api/uct.h:2050
#3 ucp_ep_flush_progress (req=req@entry=0x119fb700) at rma/flush.c:48
#4 0x0000ffff7a095fa8 in ucp_ep_flush_internal (ep=ep@entry=0x119dcfb0, uct_flags=uct_flags@entry=0, req_cb=req_cb@entry=0x0, req_flags=req_flags@entry=0,
flushed_cb=flushed_cb@entry=0xffff7a089ac8 <ucp_ep_close_flushed_callback>) at rma/flush.c:215
#5 0x0000ffff7a08bfb4 in ucp_ep_close_nb (ep=0x119dcfb0, mode=mode@entry=1) at core/ucp_ep.c:614
#6 0x0000ffff7a2be698 in blocking_ep_disconnect (ep=) at ucx-init.c:279
#7 disconnect_all_endpoints () at ucx-init.c:317
#8 shmemc_ucx_finalize () at ucx-init.c:457
#9 0x0000ffff7a2be004 in shmemc_finalize () at shmemc-init.c:56
#10 0x0000ffff7a2f7038 in finalize_helper () at init.c:54
#11 shmem_finalize () at init.c:156
#12 0x0000000000400b8c in main ()

The OpenSHMEM program that triggered this error is a simple hello world:

#include <stdio.h>
#include <shmem.h>

int main(int argc, char **argv) {
    shmem_init();
    int pe = shmem_my_pe();
    int npes = shmem_n_pes();
    printf("Hi from %d / %d\n", pe, npes);
    shmem_finalize();
    return 0;
}

We're running on top of the OSSS implementation of OpenSHMEM (https://bitbucket.org/sbuopenshmem/osss-ucx/commits/all?search=) using the xpmem transport. This is a small shared-memory run: 4 PEs in a single box, running on 4 ARM cores.

Please let me know what other information I can provide.

@agrippa
Author

agrippa commented May 13, 2018

@hppritcha @tonycurtis

@yosefe
Contributor

yosefe commented May 13, 2018

A bus error usually means the remote segment was detached before another process finished writing to it.
deregister_memory_regions() should probably be called only after some out-of-band barrier.
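
To illustrate why the barrier matters, here is a minimal single-process simulation of the teardown ordering. It is only a sketch: `sim_pe`, `sim_remote_access`, `count_faults` and `MAX_PES` are hypothetical names invented for this example, and nothing here is UCX or OSSS-UCX code. Each simulated PE touches every peer's segment (standing in for the flush seen in the stack trace) and then detaches its own segment, either immediately or only after all PEs have finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PES 64  /* arbitrary limit for this sketch */

/* Simulated per-PE state: is this PE's shared segment still attached? */
struct sim_pe {
    bool attached;
};

/* A simulated remote access. Returns 1 if the target segment is already
 * detached (the case that would raise a bus error), otherwise 0. */
static int sim_remote_access(const struct sim_pe *target)
{
    assert(target != NULL);
    return target->attached ? 0 : 1;
}

/* Count the accesses that would hit a detached segment during teardown.
 * Returns -1 if npes is outside [1, MAX_PES]. */
int count_faults(int npes, bool use_barrier)
{
    struct sim_pe pes[MAX_PES];
    int faults = 0;

    if (npes < 1 || npes > MAX_PES) {
        return -1;
    }

    for (int i = 0; i < npes; i++) {
        pes[i].attached = true;
    }

    for (int me = 0; me < npes; me++) {
        /* Flush phase: PE `me` touches every peer's segment. */
        for (int peer = 0; peer < npes; peer++) {
            if (peer != me) {
                faults += sim_remote_access(&pes[peer]);
            }
        }
        /* Without a barrier, PE `me` detaches as soon as it is done,
         * even though later PEs have not flushed yet. */
        if (!use_barrier) {
            pes[me].attached = false;
        }
    }

    /* With a barrier, no PE detaches until every PE has finished flushing. */
    if (use_barrier) {
        for (int i = 0; i < npes; i++) {
            pes[i].attached = false;
        }
    }

    return faults;
}
```

With 4 PEs, as in the report, `count_faults(4, false)` returns 6 and `count_faults(4, true)` returns 0. In a real run the ordering is nondeterministic, so the failure would show up only intermittently; the simulation just picks one unlucky schedule.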

@tonycurtis
Contributor

Related to #2050?
