ofi+verbs;ofi_rxm segfault in vrb_poll_cq() #5653

Closed
frostedcmos opened this issue Feb 19, 2020 · 7 comments · Fixed by #5661
frostedcmos commented Feb 19, 2020

When running a single-node test with 5 servers / 1 client using CaRT with the ofi+verbs;ofi_rxm provider, a frequent segfault is observed.
OFI: 955f3a0

Program terminated with signal 11, Segmentation fault.
#0 0x00007f711e707222 in vrb_poll_cq () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libatomic-4.8.5-28.el7_5.1.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 numactl-libs-2.0.9-7.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00007f711e707222 in vrb_poll_cq () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#1 0x00007f711e7077e0 in vrb_cq_trywait () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#2 0x00007f711e70810d in vrb_trywait () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#3 0x00007f711e71b0c2 in rxm_ep_trywait_cq () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#4 0x00007f711e6dda87 in util_wait_fd_try () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#5 0x00007f711e6dddb8 in ofi_trywait () from /home/aaoganez/github/liwei/cart/install/Linux/lib/libfabric.so.1
#6 0x00007f711f9d4836 in fi_trywait (count=1, fids=0x7f7119bcc480, fabric=) at /home/aaoganez/github/liwei/cart/install/Linux/include/rdma/fi_eq.h:315
#7 na_ofi_poll_try_wait (na_class=, context=) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/na/na_ofi.c:4304
#8 0x00007f711f7ca728 in hg_poll_wait (poll_set=0xbe0950, timeout=timeout@entry=1, progressed=progressed@entry=0x7f7119bcc81f "")
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/util/mercury_poll.c:427
#9 0x00007f711fbf37a3 in hg_core_progress_poll (context=0xbdc580, timeout=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:3280
#10 0x00007f711fbf898c in HG_Core_progress (context=, timeout=timeout@entry=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:4877
#11 0x00007f711fbf046d in HG_Progress (context=context@entry=0xbdc560, timeout=timeout@entry=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury.c:2243
#12 0x00007f71206f43ab in crt_hg_progress (hg_ctx=hg_ctx@entry=0xbd7318, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1373
#13 0x00007f71206b615a in crt_progress (crt_ctx=0xbd7300, timeout=timeout@entry=1000, cond_cb=cond_cb@entry=0x0, arg=arg@entry=0x0) at src/cart/crt_context.c:1253
#14 0x0000000000402e62 in tc_progress_fn (data=0x612510 <g_main_ctx>) at src/test/tests_common.h:137
#15 0x00007f7120269e25 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f711f4f8bad in clone () from /lib64/libc.so.6

For cross referencing purposes this is also tracked as
https://jira.hpdd.intel.com/browse/CART-853

frostedcmos (Author) commented:

Recompiled with --enable-debug and got a somewhat better trace this time. It looks like ctx->ep is NULL at the point of failure. The issue is much harder to reproduce with libfabric debug enabled.

Core was generated by `tests/iv_server -v 3'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f21b4f8f267 in vrb_poll_cq (cq=0x25a0bb0, wc=0x7f21b041f250) at prov/verbs/src/verbs_cq.c:242
242 ctx->ep->tx_credits++;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libatomic-4.8.5-28.el7_5.1.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 libpsm2-10.3.8-3.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 numactl-libs-2.0.9-7.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00007f21b4f8f267 in vrb_poll_cq (cq=0x25a0bb0, wc=0x7f21b041f250) at prov/verbs/src/verbs_cq.c:242
#1 0x00007f21b4f8f8c3 in vrb_cq_trywait (cq=0x25a0bb0) at prov/verbs/src/verbs_cq.c:387
#2 0x00007f21b4f90f6b in vrb_trywait (fabric=0x258c6b0, fids=0x7f21b041f350, count=1) at prov/verbs/src/verbs_domain.c:345
#3 0x00007f21b4fb0444 in fi_trywait (fabric=0x258c6b0, fids=0x7f21b041f350, count=1) at ./include/rdma/fi_eq.h:315
#4 0x00007f21b4fb74c3 in rxm_ep_trywait_cq (arg=0x258db20) at prov/rxm/src/rxm_ep.c:2185
#5 0x00007f21b4f4ba4c in util_wait_fd_try (wait=0x2590a60) at prov/util/src/util_wait.c:250
#6 0x00007f21b4f4b3d2 in ofi_trywait (fabric=0x258a780, fids=0x7f21b041f460, count=1) at prov/util/src/util_wait.c:70
#7 0x00007f21b62c819e in fi_trywait (count=1, fids=0x7f21b041f460, fabric=) at /home/aaoganez/github/liwei/cart/install/Linux/include/rdma/fi_eq.h:315
#8 na_ofi_poll_try_wait (na_class=, context=) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/na/na_ofi.c:4298
#9 0x00007f21b60bd738 in hg_poll_wait (poll_set=0x25c9ab0, timeout=timeout@entry=1, progressed=progressed@entry=0x7f21b041f7ff "")
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/util/mercury_poll.c:427
#10 0x00007f21b64e8856 in hg_core_progress_poll (context=0x25c56f0, timeout=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:3151
#11 0x00007f21b64ee77f in HG_Core_progress (context=, timeout=timeout@entry=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:4476
#12 0x00007f21b64e4f22 in HG_Progress (context=context@entry=0x25c56d0, timeout=timeout@entry=1) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury.c:1964
#13 0x00007f21b6fe9f8b in crt_hg_progress (hg_ctx=hg_ctx@entry=0x25c0488, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1373
#14 0x00007f21b6fac10a in crt_progress (crt_ctx=0x25c0470, timeout=timeout@entry=1000, cond_cb=cond_cb@entry=0x0, arg=arg@entry=0x0) at src/cart/crt_context.c:1253
#15 0x0000000000402e62 in tc_progress_fn (data=0x612510 <g_main_ctx>) at src/test/tests_common.h:137
#16 0x00007f21b6b5fe25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f21b5debbad in clone () from /lib64/libc.so.6
(gdb) frame 0
#0 0x00007f21b4f8f267 in vrb_poll_cq (cq=0x25a0bb0, wc=0x7f21b041f250) at prov/verbs/src/verbs_cq.c:242
242 ctx->ep->tx_credits++;
(gdb) p ctx
$1 = (struct vrb_context *) 0x7f21aec00e20
(gdb) p ctx->ep
$2 = (struct vrb_ep *) 0x0
(gdb)

frostedcmos (Author) commented:

A similar failure is seen on the client side:
Core was generated by `tests/iv_client -o shutdown -r 4 -o shutdown -r 3 -o shutdown -r 2 -o shutdown'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f40776af267 in vrb_poll_cq (cq=0x6c4b90, wc=0x7f4072d42da0) at prov/verbs/src/verbs_cq.c:242
242 ctx->ep->tx_credits++;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libatomic-4.8.5-28.el7_5.1.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 libpsm2-10.3.8-3.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 numactl-libs-2.0.9-7.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00007f40776af267 in vrb_poll_cq (cq=0x6c4b90, wc=0x7f4072d42da0) at prov/verbs/src/verbs_cq.c:242
#1 0x00007f40776af58a in vrb_cq_read (cq_fid=0x6c4b90, buf=0x7f4072d42e80, count=1) at prov/verbs/src/verbs_cq.c:324
#2 0x00007f40776d9139 in fi_cq_read (cq=0x6c4b90, buf=0x7f4072d42e80, count=1) at ./include/rdma/fi_eq.h:386
#3 0x00007f40776de987 in rxm_ep_do_progress (util_ep=0x6b1b00) at prov/rxm/src/rxm_cq.c:1373
#4 0x00007f40776deb30 in rxm_ep_progress (util_ep=0x6b1b00) at prov/rxm/src/rxm_cq.c:1409
#5 0x00007f40776649db in ofi_cq_progress (cq=0x6b4980) at prov/util/src/util_cq.c:596
#6 0x00007f4077663c5e in ofi_cq_readfrom (cq_fid=0x6b4980, buf=0x7f4072d430f0, count=16, src_addr=0x7f4072d43070) at prov/util/src/util_cq.c:246
#7 0x00007f40789ee076 in fi_cq_readfrom (src_addr=0x7f4072d43070, count=16, buf=0x7f4072d430f0, cq=0x6b4980) at /home/aaoganez/github/liwei/cart/install/Linux/include/rdma/fi_eq.h:392
#8 na_ofi_cq_read (max_count=16, actual_count=, src_err_addrlen=, src_err_addr=, src_addrs=0x7f4072d43070,
cq_events=0x7f4072d430f0, context=0x6eb2e0) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/na/na_ofi.c:2535
#9 na_ofi_progress (na_class=na_class@entry=0x69c950, context=context@entry=0x6eb2e0, timeout=timeout@entry=0)
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/na/na_ofi.c:4349
#10 0x00007f40789efb1f in na_ofi_msg_send_unexpected (na_class=0x69c950, context=0x6eb2e0, callback=, arg=, buf=0x7f4072d46038, buf_size=76,
plugin_data=0x6ed6d0, dest_addr=0x9ad850, dest_id=0 '\000', tag=1, op_id=0x6ed678) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/na/na_ofi.c:3740
#11 0x00007f4078c08bda in NA_Msg_send_unexpected (op_id=0x6ed678, tag=, dest_id=, dest_addr=, plugin_data=,
buf_size=, buf=, arg=0x6ed550, callback=0x7f4078c0b300 <hg_core_send_input_cb>, context=, na_class=)
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/na/na.h:1506
#12 hg_core_forward_na (hg_core_handle=0x6ed550) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:1975
#13 0x00007f4078c0d03d in HG_Core_forward (handle=0x6ed550, callback=callback@entry=0x7f4078c015c0 <hg_core_forward_cb>, arg=arg@entry=0x6f0330, flags=0 '\000', payload_size=48)
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:4395
#14 0x00007f4078c04caa in HG_Forward (handle=0x6f0330, callback=callback@entry=0x7f40796ffa20 <crt_hg_req_send_cb>, arg=arg@entry=0x9ad4b0, in_struct=in_struct@entry=0x9ad4d0)
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury.c:1899
#15 0x00007f4079706f89 in crt_hg_req_send (rpc_priv=rpc_priv@entry=0x9ad4b0) at src/cart/crt_hg.c:1199
#16 0x00007f4079744c2a in crt_req_send_immediately (rpc_priv=) at src/cart/crt_rpc.c:1030
#17 crt_req_send_internal (rpc_priv=rpc_priv@entry=0x9ad4b0) at src/cart/crt_rpc.c:1099
#18 0x00007f4079749143 in crt_req_hg_addr_lookup_cb (hg_addr=0x70c370, arg=0x9ad4b0) at src/cart/crt_rpc.c:513
#19 0x00007f40796fe0d2 in crt_hg_addr_lookup_cb (hg_cbinfo=) at src/cart/crt_hg.c:291
#20 0x00007f4078c016b5 in hg_core_addr_lookup_cb (callback_info=) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury.c:447
#21 0x00007f4078c0d868 in hg_core_trigger_lookup_entry (hg_core_op_id=0x70c2e0) at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:3308
#22 hg_core_trigger (context=0x6e9120, timeout=, timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x7f4072d4389c)
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:3250
#23 0x00007f4078c0e81b in HG_Core_trigger (context=, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x7f4072d4389c)
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury_core.c:4494
#24 0x00007f4078c04fd2 in HG_Trigger (context=context@entry=0x69ca70, timeout=timeout@entry=0, max_count=max_count@entry=4294967295, actual_count=actual_count@entry=0x7f4072d4389c)
at /home/aaoganez/github/liwei/cart/_build.external-Linux/mercury/src/mercury.c:1983
#25 0x00007f4079700baa in crt_hg_trigger (hg_ctx=hg_ctx@entry=0x6e3f68) at src/cart/crt_hg.c:1335
#26 0x00007f4079709efd in crt_hg_progress (hg_ctx=hg_ctx@entry=0x6e3f68, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1368
#27 0x00007f40796cc10a in crt_progress (crt_ctx=0x6e3f50, timeout=timeout@entry=1000, cond_cb=cond_cb@entry=0x0, arg=arg@entry=0x0) at src/cart/crt_context.c:1253
#28 0x0000000000403db1 in progress_function (data=0x60a3f0 <g_crt_ctx>) at src/test/iv_client.c:396
#29 0x00007f407927fe25 in start_thread () from /lib64/libpthread.so.0
#30 0x00007f407850bbad in clone () from /lib64/libc.so.6


shefty commented Feb 22, 2020

Do you know if this occurs at the end of the test that you're running? If you can display the contents of 'wc' when you hit this, it could help isolate the problem.

I'll work on a change that might help.


frostedcmos commented Feb 22, 2020

When the servers crash, it does appear to happen toward the end of the test, when the servers are instructed to shut down/finalize. It's unclear at which stage the client crashed. The test does the following: 5 servers are launched, then the client is launched sequentially a number of times with different arguments, each run sending a specific RPC to the server; only 1 client is active at any given time. The segfault on the client side happens randomly during one of the RPC sends, but I don't yet know the precise point in the client's operation.

From servers coredump:
(gdb) p wc
$1 = (struct ibv_wc *) 0x7f1c4fa8cd50
(gdb) p *wc
$2 = {wr_id = 139759551428240, status = IBV_WC_WR_FLUSH_ERR, opcode = IBV_WC_SEND, vendor_err = 244, byte_len = 32540, {imm_data = 39115168, invalidated_rkey = 39115168}, qp_num = 39785,
src_qp = 1336463264, wc_flags = 32540, pkey_index = 0, slid = 0, sl = 0 '\000', dlid_path_bits = 0 '\000'}

From clients coredump:
(gdb) p wc
$1 = (struct ibv_wc *) 0x7f3dfc46ce10
(gdb) p *wc
$2 = {wr_id = 139904191717632, status = IBV_WC_WR_FLUSH_ERR, opcode = 32573, vendor_err = 245, byte_len = 0, {imm_data = 20333280, invalidated_rkey = 20333280}, qp_num = 39997,
src_qp = 20342480, wc_flags = 0, pkey_index = 37256, slid = 310, sl = 0 '\000', dlid_path_bits = 0 '\000'}

shefty added a commit to shefty/libfabric that referenced this issue Feb 22, 2020
Fixes ofiwg#5653

If an operation completes in error, then the only valid field
in a verbs work completion is the wr_id.  In order to determine
if a work completion corresponds to a send or receive, we need
to track it ourselves.  Allocate a vrb_context structure for
all posted receive operations (similar to the send side).  This
allows casting the wr_id to a vrb_context always to access the
necessary data stored prior to posting the work request.

Signed-off-by: Sean Hefty <[email protected]>

shefty commented Feb 22, 2020

I think PR #5661 may fix the issue. When the work completion status is an error, most of the other fields are uninitialized. However, we're relying on the opcode to be valid, when it's not.


shefty commented Feb 22, 2020

From your report:

status = IBV_WC_WR_FLUSH_ERR, opcode = IBV_WC_SEND,

this is the problem area. A 'flush_err' status is related to a posted receive operation. But the opcode indicates send. That results in the null-dereference that you're hitting.

frostedcmos (Author) commented:

I've just tried it out (using `git fetch origin pull/5661/head:TEST` and building that local branch), and so far in 20 runs I have not hit the segfault anymore.

In addition, a different problem, where the test was taking an extremely long time to complete (20 seconds per RPC), also seems to be resolved now.

shefty added three further commits to shefty/libfabric that referenced this issue on Feb 24, 2020, all carrying the same commit message as above.