ofi+verbs;ofi_rxm segfault in vrb_poll_cq() #5653
Recompiled with --enable-debug and got a bit better trace this time. It looks like ctx->ep is NULL at the point of failure. The issue is much harder to reproduce with libfabric debug enabled.
Core was generated by `tests/iv_server -v 3'.
A similar failure is seen on the client side:
Do you know if this occurs at the end of the test that you're running? If you can display the contents of 'wc' when you hit this, it could help isolate the problem. I'll work on a change that might help.
When the servers crash, it does appear to happen towards the end of the test, when the servers are instructed to shut down/finalize. It's unclear at which stage the client crashed, as the test does the following: launch 5 servers, then launch the client sequentially a number of times with different arguments, each run sending a specific RPC to the server; only 1 client is active at any given time. The segfault on the client side happens randomly during one of the RPC sends, but I don't know yet at precisely which point of the client's operation.
From the servers' coredump:
From the clients' coredump:
Fixes ofiwg#5653 If an operation completes in error, then the only valid field in a verbs work completion is the wr_id. In order to determine if a work completion corresponds to a send or receive, we need to track it ourselves. Allocate a vrb_context structure for all posted receive operations (similar to the send side). This allows casting the wr_id to a vrb_context always to access the necessary data stored prior to posting the work request. Signed-off-by: Sean Hefty <[email protected]>
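To make the commit message concrete, here is a minimal sketch of the receive-side tracking it describes. The names (op_kind, my_context, post_recv) are hypothetical stand-ins, not the provider's actual vrb_context layout:

```c
/* Sketch only: track send vs. receive ourselves by carrying an allocated
 * context in wr_id, so the completion path never has to trust wc.opcode. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

enum op_kind { OP_SEND, OP_RECV };      /* hypothetical */

struct my_context {                     /* stand-in for vrb_context */
	enum op_kind kind;              /* what was posted */
	void *ep;                       /* endpoint/user state saved at post time */
};

/* Allocate a context for every posted receive (the send side already does
 * this), so an error completion can always be resolved from wr_id alone. */
static int post_recv(struct ibv_qp *qp, struct ibv_sge *sge, void *ep)
{
	struct ibv_recv_wr wr = { 0 }, *bad_wr;
	struct my_context *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return -1;
	ctx->kind = OP_RECV;
	ctx->ep = ep;

	wr.wr_id = (uintptr_t) ctx;     /* recovered in the completion handler */
	wr.sg_list = sge;
	wr.num_sge = 1;
	return ibv_post_recv(qp, &wr, &bad_wr);
}
```

On the completion side, wc.wr_id is cast back to the context, and ctx->kind (rather than wc.opcode) decides how the completion is routed, whether it succeeded or failed.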
I think PR #5661 may fix the issue. When the work completion status is an error, most of the other fields are uninitialized. However, we're relying on the opcode to be valid, when it's not.
From your report:
that is the problem area. A 'flush_err' status relates to a posted receive operation, but the opcode indicates a send. That mismatch results in the null dereference that you're hitting.
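As an illustration of that failure mode (a hypothetical handler, not the provider's actual code): only wr_id is defined when wc.status is an error, so dispatching on wc.opcode can route a flushed receive down the send path and dereference state that was never set, matching the NULL ctx->ep reported above.

```c
/* Illustration only: with an error status such as IBV_WC_WR_FLUSH_ERR the
 * opcode field is undefined, so this dispatch is unreliable for failed
 * completions. */
#include <infiniband/verbs.h>

static void handle_completion(struct ibv_wc *wc)
{
	if (wc->opcode & IBV_WC_RECV) {
		/* receive completion path */
	} else {
		/* send completion path: for a flushed receive this branch can
		 * run anyway and touch send-only state (e.g. a NULL ctx->ep) */
	}
}
```

The fix sketched after the commit message above avoids this entirely by recovering the operation kind from the context stored in wr_id.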
I've just tried it out (git fetch origin pull/5661/head:TEST) and, using that local branch, have not hit the segfault issue in 20 runs so far. In addition, a different problem of the test taking an extremely long time to complete (20 seconds per RPC) also seems to be resolved now.
When running a single-node test with 5 servers / 1 client using CaRT with the ofi+verbs provider, a frequent segfault is observed.
OFI: 955f3a0
Program terminated with signal 11, Segmentation fault.
#0 0x00007f711e707222 in vrb_poll_cq () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libatomic-4.8.5-28.el7_5.1.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 numactl-libs-2.0.9-7.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00007f711e707222 in vrb_poll_cq () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#1 0x00007f711e7077e0 in vrb_cq_trywait () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#2 0x00007f711e70810d in vrb_trywait () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#3 0x00007f711e71b0c2 in rxm_ep_trywait_cq () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#4 0x00007f711e6dda87 in util_wait_fd_try () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#5 0x00007f711e6dddb8 in ofi_trywait () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#6 0x00007f711f9d4836 in fi_trywait (count=1, fids=0x7f7119bcc480, fabric=) at /home/aaoganez/github/liwei/cart/install/Linux/include/rdma/fi_eq.h:315
#7 na_ofi_poll_try_wait (na_class=, context=) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/na/na_ofi.c:4304
#8 0x00007f711f7ca728 in hg_poll_wait (poll_set=0xbe0950, timeout=timeout@entry=1, progressed=progressed@entry=0x7f7119bcc81f "")
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/util/mercury_poll.c:427
#9 0x00007f711fbf37a3 in hg_core_progress_poll (context=0xbdc580, timeout=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:3280
#10 0x00007f711fbf898c in HG_Core_progress (context=, timeout=timeout@entry=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:4877
#11 0x00007f711fbf046d in HG_Progress (context=context@entry=0xbdc560, timeout=timeout@entry=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury.c:2243
#12 0x00007f71206f43ab in crt_hg_progress (hg_ctx=hg_ctx@entry=0xbd7318, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1373
#13 0x00007f71206b615a in crt_progress (crt_ctx=0xbd7300, timeout=timeout@entry=1000, cond_cb=cond_cb@entry=0x0, arg=arg@entry=0x0) at src/cart/crt_context.c:1253
#14 0x0000000000402e62 in tc_progress_fn (data=0x612510 <g_main_ctx>) at src/test/tests_common.h:137
#15 0x00007f7120269e25 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f711f4f8bad in clone () from /lib64/libc.so.6
For cross-referencing purposes, this is also tracked as https://jira.hpdd.intel.com/browse/CART-853