Failure in uct_sm_ep_[put|get] on ThunderX with UCX + XPMEM #1798

Closed
gmegan opened this issue Aug 29, 2017 · 2 comments

Comments

@gmegan

gmegan commented Aug 29, 2017

This failure occurs randomly on ThunderX processors when UCX is built on top of the XPMEM kernel module (https://github.com/hjelmn/xpmem).

I have seen the failure in uct_sm_ep_get_bcopy and uct_sm_ep_put_short. In each case, the function calls memcpy, which fails with:

[thunderx2p-1:83303:0] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace ====
===================
[thunderx2p-1:83303:0] Process frozen.

Running gtest sometimes triggers the failure; other times, all tests succeed. I have seen this with both 1.2.0 and the master branch.

I can reliably reproduce this bug using OpenSHMEM calls in Open MPI 2.1.0 built on UCX+XPMEM. Below are an error log and a backtrace for an OpenSHMEM implementation of integer sorting (ISx).

$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

$ uname -a
Linux thunderx2p-1 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:10:33 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

$ /home/ubuntu/local/opt/ompi-ucx-xpmem/bin/shmemrun -mca pml ucx -mca spml ucx --mca btl '^vader,tcp,openib' -n 2 /home/ubuntu/local/src/analytics-benchmarks/ISx-shmem-mpi/SHMEM/bin/isx.strong 134217728 /home/ubuntu/benchmark-output/shmem-bench/isx.strong-shmem-2.log
ISx v1.1
  Number of Keys per PE: 67108864
  Max Key Value: 32
  Bucket Width: 16
  Number of Iterations: 1
  Number of PEs: 2
  STRONG Scaling!
Rank 0: Initial Keys: 8 22 27 0 12 22 13 14 9 13 25 27 28 4 3 20 25 27 1 2 23 8 5 9 19 1 10 0 12 2 1 25 31 8 13 15 11 3 6 13 5 28 7 31 30 1 5 27 12 25 0 6 29 14 9 29 28 13 16 14 13 10 21 28 
Rank 1: Initial Keys: 17 20 23 19 22 25 6 19 2 11 25 11 29 17 4 8 12 9 14 4 23 2 13 16 5 18 27 25 17 10 14 5 19 21 6 31 12 16 23 22 20 18 0 12 3 14 17 9 6 19 22 27 28 4 11 26 31 19 18 15 17 16 4 5 
Rank 0: local bucket sizes: 33558219 33550645 
Rank 1: local bucket sizes: 33554081 33554783 
Rank 0: local bucket offsets: 0 33558219 
Rank 1: local bucket offsets: 0 33554081 
Rank 0: local bucketed keys: 8 0 12 13 14 9 13 4 3 1 2 8 5 9 1 10 0 12 2 1 8 13 15 11 3 6 13 5 7 1 5 12 0 6 14 9 13 14 13 10 4 6 8 7 15 10 9 6 1 8 14 11 14 4 15 9 2 8 8 4 13 9 13 12 
Rank 1: local bucketed keys: 6 2 11 11 4 8 12 9 14 4 2 13 5 10 14 5 6 12 0 12 3 14 9 6 4 11 15 4 5 4 2 13 5 1 8 7 7 8 13 15 1 11 7 5 1 7 8 4 12 5 8 14 2 15 3 11 15 9 0 2 2 12 10 1 
Rank: 0 Target: 1 Offset into target: 33554783 Offset into myself: 33558219 Send Size: 33550645
Rank: 1 Target: 0 Offset into target: 33558219 Offset into myself: 0 Send Size: 33554081
Rank 0: Bucket Size 67112300 | Total Keys Sent: 33550645 | Keys after exchange:8 0 12 13 14 9 13 4 3 1 2 8 5 9 1 10 0 12 2 1 8 13 15 11 3 6 13 5 7 1 5 12 0 6 14 9 13 14 13 10 4 6 8 7 15 10 9 6 1 8 14 11 14 4 15 9 2 8 8 4 13 9 13 12 
Rank 1: Bucket Size 67105428 | Total Keys Sent: 33554081 | Keys after exchange:17 20 23 19 22 25 19 25 29 17 23 16 18 27 25 17 19 21 31 16 23 22 20 18 17 19 22 27 28 26 31 19 18 17 16 25 31 18 22 26 18 21 31 21 22 23 29 25 27 27 21 19 23 18 24 28 23 19 20 29 22 26 22 26 
Rank 0: Bucket Size 67112300 | Local Key Counts:4191384 4194067 4196928 4196591 4195637 4192152 4192941 4193204 4193298 4196391 4193292 4194074 4197427 4193663 4194635 4196616 
Rank 1: Bucket Size 67105428 | Local Key Counts:4195621 4194277 4196886 4190317 4197286 4192834 4193278 4194636 4192128 4194087 4194998 4192014 4193419 4191925 4194659 4197063 
Rank 0: Initial Keys: 8 22 27 0 12 22 13 14 9 13 25 27 28 4 3 20 25 27 1 2 23 8 5 9 19 1 10 0 12 2 1 25 31 8 13 15 11 3 6 13 5 28 7 31 30 1 5 27 12 25 0 6 29 14 9 29 28 13 16 14 13 10 21 28 
Rank 1: Initial Keys: 17 20 23 19 22 25 6 19 2 11 25 11 29 17 4 8 12 9 14 4 23 2 13 16 5 18 27 25 17 10 14 5 19 21 6 31 12 16 23 22 20 18 0 12 3 14 17 9 6 19 22 27 28 4 11 26 31 19 18 15 17 16 4 5 
Rank 0: local bucket sizes: 33558219 33550645 
Rank 1: local bucket sizes: 33554081 33554783 
Rank 0: local bucket offsets: 0 33558219 
Rank 1: local bucket offsets: 0 33554081 
Rank 0: local bucketed keys: 8 0 12 13 14 9 13 4 3 1 2 8 5 9 1 10 0 12 2 1 8 13 15 11 3 6 13 5 7 1 5 12 0 6 14 9 13 14 13 10 4 6 8 7 15 10 9 6 1 8 14 11 14 4 15 9 2 8 8 4 13 9 13 12 
Rank 1: local bucketed keys: 6 2 11 11 4 8 12 9 14 4 2 13 5 10 14 5 6 12 0 12 3 14 9 6 4 11 15 4 5 4 2 13 5 1 8 7 7 8 13 15 1 11 7 5 1 7 8 4 12 5 8 14 2 15 3 11 15 9 0 2 2 12 10 1 
Rank: 0 Target: 1 Offset into target: 33554783 Offset into myself: 33558219 Send Size: 33550645
[thunderx2p-1:83302:0] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace ====
===================
[thunderx2p-1:83302:0] Process frozen...
Rank: 1 Target: 0 Offset into target: 33558219 Offset into myself: 0 Send Size: 33554081
[thunderx2p-1:83303:0] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace ====
===================
[thunderx2p-1:83303:0] Process frozen...

$ gdb -p 83302
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 83302
[New LWP 83304]
[New LWP 83305]
[New LWP 83308]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffff935276b4 in __libc_pause () at ../sysdeps/unix/sysv/linux/generic/pause.c:34
34	../sysdeps/unix/sysv/linux/generic/pause.c: No such file or directory.
(gdb) backtrace
#0  0x0000ffff935276b4 in __libc_pause () at ../sysdeps/unix/sysv/linux/generic/pause.c:34
#1  0x0000ffff91680b78 in ucs_debug_freeze () at ../../../src/ucs/debug/debug.c:709
#2  0x0000ffff9168103c in ucs_error_freeze (error_type=0xffff9168daf8 "nonexistent physical address", 
    message=0xffff915ef000 "Caught signal 7 (Bus error: nonexistent physical address)") at ../../../src/ucs/debug/debug.c:828
#3  0x0000ffff916818b0 in ucs_handle_error (error_type=0xffff9168daf8 "nonexistent physical address", message=0xffff9168dc28 "Caught signal %d (%s: %s%s)")
    at ../../../src/ucs/debug/debug.c:992
#4  0x0000ffff91681504 in ucs_debug_handle_error_signal (signo=7, cause=0xffff9168daf8 "nonexistent physical address", fmt=0xffff9168dc48 "")
    at ../../../src/ucs/debug/debug.c:932
#5  0x0000ffff916815ec in ucs_error_signal_handler (signo=7, info=0xffff915ef7a0, context=0xffff915ef820) at ../../../src/ucs/debug/debug.c:949
#6  <signal handler called>
#7  memcpy () at ../sysdeps/aarch64/memcpy.S:121
#8  0x0000ffff9163dc50 in uct_sm_ep_put_short (tl_ep=0x432ed770, buffer=0xfffec3bfeb3c, length=134202580, remote_addr=138499916, rkey=281470069276672)
    at ../../../src/uct/sm/base/sm_ep.c:20
#9  0x0000ffff916c6f4c in uct_ep_put_short (rkey=281470069276672, remote_addr=138499916, length=134202580, buffer=0xfffec3bfeb3c, ep=0x432ed770)
    at /home/ubuntu/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/api/uct.h:1491
#10 ucp_put (ep=0x432ed6f0, buffer=0xfffec3bfeb3c, length=134202580, remote_addr=138499916, rkey=0x432f65d0) at ../../../src/ucp/rma/basic_rma.c:300
#11 0x0000ffff900b1554 in mca_spml_ucx_put () from /home/ubuntu/local/opt/ompi-ucx-xpmem/lib/openmpi/mca_spml_ucx.so
#12 0x0000ffff93576f3c in shmem_int_put () from /home/ubuntu/local/opt/ompi-ucx-xpmem/lib/liboshmem.so.20
#13 0x00000000004027b8 in exchange_keys (send_offsets=0x432f52e0, local_bucket_sizes=0x432f48e0, my_local_bucketed_keys=0xfffebbbfb010) at isx.c:446
#14 0x0000000000401c80 in bucket_sort () at isx.c:213
#15 0x0000000000401828 in main (argc=3, argv=0xffffd0e7a0b8) at isx.c:85
(gdb) frame 13
#13 0x00000000004027b8 in exchange_keys (send_offsets=0x432f52e0, local_bucket_sizes=0x432f48e0, my_local_bucketed_keys=0xfffebbbfb010) at isx.c:446
446	    shmem_int_put(&(my_bucket_keys[write_offset_into_target]), 
(gdb) list
441	#ifdef DEBUG
442	    printf("Rank: %d Target: %d Offset into target: %lld Offset into myself: %d Send Size: %d\n",
443	        my_rank, target_pe, write_offset_into_target, read_offset_from_self, my_send_size);
444	#endif
445	
446	    shmem_int_put(&(my_bucket_keys[write_offset_into_target]), 
447	                  &(my_local_bucketed_keys[read_offset_from_self]), 
448	                  my_send_size, 
449	                  target_pe);
450	
(gdb) frame 10
#10 ucp_put (ep=0x432ed6f0, buffer=0xfffec3bfeb3c, length=134202580, remote_addr=138499916, rkey=0x432f65d0) at ../../../src/ucp/rma/basic_rma.c:300
300	            status = UCS_PROFILE_CALL(uct_ep_put_short, ep->uct_eps[rkey->cache.rma_lane],
(gdb) l
295	        do {
296	            /* testing shows that for put message rate it is better to finish
297	             * put_short here instead of doing it once, getting NO_RESOURCE 
298	             * and continuing to ucp_rma_blocking()
299	             */
300	            status = UCS_PROFILE_CALL(uct_ep_put_short, ep->uct_eps[rkey->cache.rma_lane],
301	                                      buffer, length, remote_addr, rkey->cache.rma_rkey);
302	            if (ucs_likely(status != UCS_ERR_NO_RESOURCE)) {
303	                goto out_unlock;
304	            }
(gdb) frame 9
#9  0x0000ffff916c6f4c in uct_ep_put_short (rkey=281470069276672, remote_addr=138499916, length=134202580, buffer=0xfffec3bfeb3c, ep=0x432ed770)
    at /home/ubuntu/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/api/uct.h:1491
1491	    return ep->iface->ops.ep_put_short(ep, buffer, length, remote_addr, rkey);
(gdb) list
1486	 * @brief
1487	 */
1488	UCT_INLINE_API ucs_status_t uct_ep_put_short(uct_ep_h ep, const void *buffer, unsigned length,
1489	                                             uint64_t remote_addr, uct_rkey_t rkey)
1490	{
1491	    return ep->iface->ops.ep_put_short(ep, buffer, length, remote_addr, rkey);
1492	}
1493	
1494	
1495	/**
(gdb) frame 8
#8  0x0000ffff9163dc50 in uct_sm_ep_put_short (tl_ep=0x432ed770, buffer=0xfffec3bfeb3c, length=134202580, remote_addr=138499916, rkey=281470069276672)
    at ../../../src/uct/sm/base/sm_ep.c:20
20	        memcpy((void *)(rkey + remote_addr), buffer, length);
(gdb) list
15	ucs_status_t uct_sm_ep_put_short(uct_ep_h tl_ep, const void *buffer,
16	                                 unsigned length, uint64_t remote_addr,
17	                                 uct_rkey_t rkey)
18	{
19	    if (ucs_likely(length != 0)) {
20	        memcpy((void *)(rkey + remote_addr), buffer, length);
21	        uct_sm_ep_trace_data(remote_addr, rkey, "PUT_SHORT [buffer %p size %u]",
22	                             buffer, length);
23	    } else {
24	        ucs_trace_data("PUT_SHORT [zero-length]");

@shamisp
Contributor

shamisp commented Sep 14, 2017

@gmegan is this still relevant?

@gmegan
Author

gmegan commented Sep 14, 2017

Closing this for now because it did not reproduce on another ThunderX/Ubuntu system. I will open a new issue if I can narrow down the problem.

@gmegan gmegan closed this as completed Sep 14, 2017