
DAOS: verbs;rxm - fi_cancel() error handling issue #7287

Closed

frostedcmos opened this issue Dec 6, 2021 · 25 comments

@frostedcmos

DAOS is hitting a problem with handling errors from fi_cancel().
Mercury issues fi_cancel() when DAOS detects a timeout (an RPC did not respond within the set amount of time) and expects an FI_ECANCELED event. Based on this event, mercury will then complete DAOS's RPC.

What appears to be happening is that fi_cancel() can return an error, but mercury still expects an eventual completion event in that scenario; however, it appears that the event never arrives, essentially blocking the client's RPC indefinitely.

We are still in the process of verifying that this is what is actually happening; we will update the ticket once we have more detailed logs.
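
For reference, the flow we expect looks roughly like this (a minimal sketch; ep, cq, and op_ctx stand in for mercury's endpoint, CQ, and the context of the timed-out operation - this is not the actual mercury code):

/* Sketch: cancel a timed-out operation and wait for some completion for it,
 * either a normal entry or an error entry such as FI_ECANCELED. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static int cancel_and_wait(struct fid_ep *ep, struct fid_cq *cq, void *op_ctx)
{
    struct fi_cq_tagged_entry comp;
    struct fi_cq_err_entry err = { 0 };
    ssize_t rc;

    rc = fi_cancel(&ep->fid, op_ctx);       /* ask the provider to cancel */
    if (rc)
        return (int)rc;

    for (;;) {                              /* real code would bound this by the RPC timeout */
        rc = fi_cq_read(cq, &comp, 1);
        if (rc == 1 && comp.op_context == op_ctx)
            return 0;                       /* completed normally */
        /* other completions would be dispatched here in real code */
        if (rc == -FI_EAVAIL &&
            fi_cq_readerr(cq, &err, 0) == 1 &&
            err.op_context == op_ctx)
            return -err.err;                /* e.g. -FI_ECANCELED */
        /* -FI_EAGAIN: nothing yet, keep progressing */
    }
}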

@shefty
Member

shefty commented Dec 6, 2021

The rxm cancel operation should always return success. Any return value is basically useless anyway. Cancel is only coded to handle receives. Send operations are expected to complete normally within some timeout window. I will check rxm/verbs to see how it handles failed send operations.

@liuxuezhao
Contributor

liuxuezhao commented Dec 7, 2021

I did some tests and checked that the problem does look like fi_cancel() not behaving as expected.

The scenario is -
Target server_B is killed, then server_A sends a message (by calling fi_tsend()) to server_B.
On server_A, no OFI completion event or error event is reported for that send operation (the fi_tsend() returns 0).
Later DAOS detects the RPC timeout and tries to abort the RPC; as part of that, server_A calls fi_cancel() on the send operation. fi_cancel() returns 0, but no FI_ECANCELED error event is reported.
That causes the upper layer to hang.

I also tested that a RECV operation can be canceled as expected - fi_cancel() returns 0, followed by an FI_ECANCELED error event.

I tried to work around it at the mercury level in mercury-hpc/mercury#532, and with it my test can pass.
But it looks like it would be better to fix/refine this at the OFI level; we still need to check more details in OFI and try to fix it there.

@shefty
Member

shefty commented Dec 7, 2021

Are you getting any completion for the send operation? You can't rely on an error value of FI_ECANCELED, but you should get something. I'm assuming not, and that's the underlying issue. But can you confirm?

@shefty
Member

shefty commented Dec 7, 2021

I think you answered my question here: "On server_A no OFI completion event or error event reported for that send operation". There is no send completion of any sort.

@shefty
Member

shefty commented Dec 7, 2021

Can you tell me the approximate size of the send?

@frostedcmos
Author

At least in one of the scenarios (DAOS-9096 ticket) where the client side experienced a similar hang, the RPC payload was ~64 bytes plus the RPC headers for cart/mercury, so the total size is roughly less than 200 bytes.

@frostedcmos
Author

I'll let @liuxuezhao comment on the sizes in his case, as he was using a different app/test.

@shefty
Member

shefty commented Dec 7, 2021

Verbs does not support/implement cancel. RxM only implements cancel for receive operations. The expectation is that a send, once posted to the QP, will generate some sort of completion, even if it's an error. Based on the issue description, that is not occurring.

I'm looking at adding a second layer of send tracking to either the rxm or verbs provider. That should allow SW to generate an error completion for any sends that are outstanding at the time a connection is closed. I need to consider this more, though, since as a general rule, libfabric does not guarantee that completions will be generated for outstanding operations on an endpoint that is closed. (e.g. usnic cannot provide this semantic, or really rdma QPs for that matter).
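
For illustration, a rough sketch of what such software-side send tracking could look like (purely hypothetical structure and function names; this is not the actual rxm/verbs code):

/* Track sends per connection so that an error completion can be generated
 * for anything still outstanding when the connection is closed. */
#include <stdlib.h>
#include <rdma/fi_errno.h>

struct tracked_send {
    struct tracked_send *next;
    void *op_context;               /* app context passed to fi_tsend() */
};

struct conn_track {
    struct tracked_send *sq_head;   /* sends posted but not yet completed */
};

/* Called when a send is posted on the connection. */
static void track_send(struct conn_track *c, struct tracked_send *ts)
{
    ts->next = c->sq_head;
    c->sq_head = ts;
}

/* Called when the connection is closed/shut down: every outstanding send is
 * reported back to the application as an error completion.  Assumes the
 * tracked_send entries were heap-allocated when the sends were posted. */
static void flush_tracked_sends(struct conn_track *c,
                                void (*report_err)(void *ctx, int err))
{
    struct tracked_send *ts;

    while ((ts = c->sq_head)) {
        c->sq_head = ts->next;
        report_err(ts->op_context, FI_ECANCELED);
        free(ts);
    }
}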

@shefty
Member

shefty commented Dec 7, 2021

Is DAOS continuing to process the EQ during the time that it's waiting for the RPC to complete/cancel?

OR

Is auto-progress enabled (info->domain_attr->data_progress == FI_PROGRESS_AUTO) or (env var FI_OFI_RXM_DATA_AUTO_PROGRESS=true)?

If you have logging enabled, I'm looking for this text: "Starting auto-progress thread".

Based on reading the IB spec, any sends queued on the QP when it transitions into the error state (happens during shutdown) should generate an error completion.
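
For reference, one way to request auto data progress through fi_getinfo() hints (a sketch only; mercury configures this internally, and the prov_name string here is an assumption):

#include <string.h>
#include <rdma/fabric.h>

static struct fi_info *get_auto_progress_info(void)
{
    struct fi_info *hints, *info = NULL;

    hints = fi_allocinfo();
    if (!hints)
        return NULL;
    /* Request auto data progress; exporting FI_OFI_RXM_DATA_AUTO_PROGRESS=true
     * is the environment-variable route mentioned above. */
    hints->domain_attr->data_progress = FI_PROGRESS_AUTO;
    hints->fabric_attr->prov_name = strdup("verbs;ofi_rxm");

    if (fi_getinfo(FI_VERSION(1, 14), NULL, NULL, 0, hints, &info))
        info = NULL;
    fi_freeinfo(hints);
    return info;
}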

@frostedcmos
Author

DAOS progresses manually all the time, so yes, there will be progress while waiting for the RPC to complete.

Internally, for verbs;rxm, mercury sets the progress mode to FI_PROGRESS_MANUAL.

@liuxuezhao
Contributor

@shefty thanks very much for looking into the problem. Some more info below.

"Can you tell me the approximate size of the send"
The RPC is a meta-data operation; the RPC size is relatively small, about 100~200 bytes.

For progress, I tested setting "info->domain_attr->data_progress = FI_PROGRESS_AUTO" in mercury.
It made a few differences in the test, but it still does not work as expected.
The difference observed in the test is when fi_tsend() sends the RPC request msg to the killed target server -

  • in manual progress mode, fi_tsend() returns 0
  • in AUTO progress mode, fi_tsend() returns FI_EAGAIN (and keeps returning FI_EAGAIN when the send is retried many times)

In both progress modes, OFI does not report a completion event for the fi_tsend() OP, and also does not report an FI_ECANCELED error event after calling fi_cancel() (although fi_cancel() returned 0).

BTW, regardless of whether mercury sets MANUAL or AUTO progress mode, DAOS/mercury always progresses (by calling fi_cq_read) during the test run.

@liuxuezhao
Contributor

A few FI logs during the failure (manual progress mode) -

Killing target server
fi_tsend() the RPC request msg
libfabric:33969:1638946654:ofi_rxm:ep_data:rxm_ep_msg_normal_send():976 Posting send with length: 224 tag: 0x100000008
libfabric:33969:1638946654:ofi_rxm:ep_ctrl:rxm_process_shutdown():625 shutdown conn 0x3267810 (state 3)
libfabric:33969:1638946654:ofi_rxm:ep_ctrl:rxm_close_conn():64 closing conn 0x3267810
fi_tsend() returned 0, op_id: 0x2a7bbf0
libfabric:33969:1638946654:verbs:domain:vrb_ep_close():500 EP 0x329f340 is being closed
libfabric:33969:1638946654:ofi_rxm:ep_ctrl:rxm_free_conn():267 free conn 0x3267810
...
DAOS rpc timeout detect
Mercury cancel recv op 0x2a7be10 (the pre-posted RPC reply recv), fi_cancel() returned 0
Mercury cancel send op 0x2a7bbf0, fi_cancel() returned 0
mercury na_ofi_cq_read(): got FI_ECANCELED event on operation ID 0x2a7be10 (recv OP)
....

No completion event for send OP (0x2a7bbf0), also no FI_ECANCELED event.

@liuxuezhao
Contributor

The libfabric used in my above test is v1.14.0.

Just now I tried the latest libfabric master branch (bb8bcc7), and the test got a segfault -
(gdb) bt
#0 0x00007f1a10f4ffe5 in __strcasestr_sse42 () from /lib64/libc.so.6
#1 0x00007f1a0b4af9e8 in rxm_open_conn (conn=0x7f19ec555bb0, msg_info=0x7f19ec555110) at prov/rxm/src/rxm_conn.c:203
#2 0x00007f1a0b4b144e in rxm_process_connreq (ep=0x7f19ec117cf0, cm_entry=0x211a760) at prov/rxm/src/rxm_conn.c:677
#3 0x00007f1a0b4b18d2 in rxm_handle_event (ep=0x7f19ec117cf0, event=1, cm_entry=0x211a760, len=40) at prov/rxm/src/rxm_conn.c:758
#4 0x00007f1a0b4b19fe in rxm_conn_progress (ep=0x7f19ec117cf0) at prov/rxm/src/rxm_conn.c:784
#5 0x00007f1a0b4c592b in rxm_ep_do_progress (util_ep=0x7f19ec117cf0) at prov/rxm/src/rxm_cq.c:1941
#6 0x00007f1a0b4c5a0f in rxm_ep_progress (util_ep=0x7f19ec117cf0) at prov/rxm/src/rxm_cq.c:1961
#7 0x00007f1a0b43f008 in ofi_cq_progress (cq=0x7f19ec103450) at prov/util/src/util_cq.c:567
#8 0x00007f1a0b43e0e0 in ofi_cq_readfrom (cq_fid=0x7f19ec103450, buf=0x211aab0, count=16, src_addr=0x211aa30) at prov/util/src/util_cq.c:230
#9 0x00007f1a0fa8e45f in fi_cq_readfrom (src_addr=0x211aa30, count=16, buf=0x211aab0, cq=0x7f19ec103450) at /home/xliu9/src/daos_m/install/prereq/debug/ofi/include/rdma/fi_eq.h:400
#10 na_ofi_cq_read (max_count=16, actual_count=, src_err_addrlen=, src_err_addr=, src_addrs=0x211aa30, cq_events=0x211aab0,
context=0x7f19ec1067f0) at /home/xliu9/src/daos_m/build/external/debug/mercury/src/na/na_ofi.c:3235
...
(gdb) f 1
#1 0x00007f1a0b4af9e8 in rxm_open_conn (conn=0x7f19ec555bb0, msg_info=0x7f19ec555110) at prov/rxm/src/rxm_conn.c:203
203 if (!strcasestr(msg_info->fabric_attr->prov_name, "tcp")) {
(gdb) p msg_info->fabric_attr
$1 = (struct fi_fabric_attr *) 0x7f19ec555530
(gdb) p *msg_info->fabric_attr
$2 = {fabric = 0x0, name = 0x7f19ec555510 "IB-0xfe8", '0' <repeats 13 times>, prov_name = 0x0, prov_version = 7471104, api_version = 0}
(gdb) p msg_info->fabric_attr->prov_name
$3 = 0x0
(gdb) p msg_info->fabric_attr->prov_name[0]
Cannot access memory at address 0x0

By git blame, it seems related to the recent change c4862bf.
This is a different issue, unrelated to this ticket - just FYI.

@shefty
Member

shefty commented Dec 8, 2021

Thanks, the log helps. Copying the log:

libfabric:33969:1638946654:ofi_rxm:ep_data:rxm_ep_msg_normal_send():976 Posting send with length: 224 tag: 0x100000008
libfabric:33969:1638946654:ofi_rxm:ep_ctrl:rxm_process_shutdown():625 shutdown conn 0x3267810 (state 3)
libfabric:33969:1638946654:ofi_rxm:ep_ctrl:rxm_close_conn():64 closing conn 0x3267810
fi_tsend() returned 0, op_id: 0x2a7bbf0
libfabric:33969:1638946654:verbs:domain:vrb_ep_close():500 EP 0x329f340 is being closed
libfabric:33969:1638946654:ofi_rxm:ep_ctrl:rxm_free_conn():267 free conn 0x3267810

vrb_ep_close() should flush all outstanding send operations. I have a patch (#7291) in the works that manually forces this in case the NIC does not flush all sends for some reason.

From the log it looks like the tsend occurred just prior to shutdown being handled. I wonder if the QP was already in an error state at that point and verbs directed the send into the void...

Can you open a separate issue for the problem you're seeing with main and the null provider name? That looks like a problem with the verbs provider not formatting the fi_info correctly.
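
For context, the usual way software can force flush completions from a verbs QP looks like this (an illustrative sketch of the general technique only, not the code in vrb_ep_close() or #7291):

#include <infiniband/verbs.h>

static void flush_send_queue(struct ibv_qp *qp, struct ibv_cq *cq)
{
    struct ibv_qp_attr attr = { .qp_state = IBV_QPS_ERR };
    struct ibv_wc wc;

    /* Moving the QP to the error state should cause the HCA to complete
     * every outstanding WR with IBV_WC_WR_FLUSH_ERR (IB spec C10-42). */
    ibv_modify_qp(qp, &attr, IBV_QP_STATE);

    /* A real implementation would keep polling until every tracked WR is
     * accounted for, and dispatch non-flush completions to their owners. */
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_WR_FLUSH_ERR) {
            /* hand wc.wr_id back to the upper layer as a failed operation */
        }
    }
}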

@shefty
Member

shefty commented Dec 8, 2021

I'm still working toward trying to reproduce the problem with a simpler app. But PR #7291 is now passing our CI and is targeted as a solution.

@shefty
Member

shefty commented Dec 8, 2021

The changes in #7291 won't cherry-pick cleanly into v1.14.x. So this is a cherry-picked and updated version for v1.14.x:

https://github.com/shefty/libfabric/tree/v1.14.x

See the top 2 patches -- compile tested only.

@liuxuezhao
Contributor

liuxuezhao commented Dec 9, 2021

@shefty thanks for providing the patch.

I tested the backported version on v1.14.0; my test passes now without the mercury workaround.

Killing target server
fi_tsend() to the killed target
libfabric:25533:1639016882:ofi_rxm:ep_data:rxm_ep_msg_normal_send():976 Posting send with length: 224 tag: 0x100000007
fi_tsend() returns 0, op_id 0x255b260
libfabric:25533:1639016882:ofi_rxm:ep_ctrl:rxm_process_shutdown():625 shutdown conn 0x2d476b0 (state 3)
libfabric:25533:1639016882:ofi_rxm:ep_ctrl:rxm_close_conn():64 closing conn 0x2d476b0
libfabric:25533:1639016882:verbs:domain:vrb_ep_close():539 EP 0x2d4c3e0 is being closed
libfabric:25533:1639016882:ofi_rxm:ep_ctrl:rxm_free_conn():267 free conn 0x2d476b0
mercury got FI_ECANCELED event on op_id 0x255b260

With the fix, when the target disconnect is detected, OFI reports an FI_ECANCELED event for the SEND OP, even before mercury calls fi_cancel() on that SEND OP. That is fine and can be handled by mercury.

I did not test on the master branch because of the segfault mentioned above; I have created #7300 for it.

@liuxuezhao
Contributor

liuxuezhao commented Dec 9, 2021

@shefty just to confirm one point: in your patch #7291, only the send OP is added to the list (ep->sq_list). Do you think RMA (fi_read/fi_write) operations need similar handling, or does the send OP handling already cover RMA OPs?
In DAOS we do bulk/RMA cancel for some error handling, which corresponds to fi_cancel() on an outgoing RMA OP and un-registering the MR. I know that an RMA cannot really be canceled, but we need to ensure it completes, similar to canceling a SEND OP. Thanks.

@johannlombardi

Thanks @shefty for the patch. Another question is whether it is now considered safe to integrate into our imminent DAOS release.

@shefty
Member

shefty commented Dec 9, 2021

I think the changes work for all transmit operations. I'll verify.

It's still not clear to me what the underlying problem actually is. According to the IB spec (C10-42 compliance statement or something like that -- somewhere in chapter 10), the HCA is supposed to generate error completions (flush status) for all send operations when a QP is transitioned into an error state. That includes sends posted when the QP is in an error state. The report suggests that this is not happening, but I would not yet rule out some other issue in libfabric.

@liuxuezhao - The operation is completing as canceled because OFI detected that the connection has failed, and outstanding sends that were actively queued on the connection are being returned as failed. This does bring up a point, though: a better error code might be EIO rather than ECANCELED. The actual state of the transfer at the peer is unknown. Canceled suggests that it was never attempted. It's possible that the transfer was received and only the ACK from the peer was discarded.

@johannlombardi - No, I don't think this is ready yet. I need to verify what the right return code should be and verify this works for RMA.

@shefty
Member

shefty commented Dec 9, 2021

Btw, I should have an updated version later today. IMO, the patch should be relatively safe to integrate, even if there might be some other issue.

@shefty
Member

shefty commented Dec 9, 2021

Analyzing this more, I think the proposed changes are sufficient and should move forward. I've opened a PR with the v1.14.x changes.

The tracking works for any operation posted to the QP's send queue -- that includes RMA, atomics, etc.

According to C10-42 in the IB spec, any outstanding send operations should generate a completion with the status set to flush. The proposed changes match that behavior. So, the returned error code (ECANCELED) matches what an app will already see from flushed work requests.
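
For illustration, the mapping described above might look roughly like this: a flushed verbs work completion is reported to the app as a libfabric error entry with err = FI_ECANCELED (a hypothetical sketch; field and function names are illustrative, and wr_id is treated here as the operation's context):

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

static void wc_to_err_entry(const struct ibv_wc *wc,
                            struct fi_cq_err_entry *err)
{
    memset(err, 0, sizeof(*err));
    /* illustrative: treat wr_id as the context tied to the operation */
    err->op_context = (void *)(uintptr_t)wc->wr_id;
    if (wc->status == IBV_WC_WR_FLUSH_ERR)
        err->err = FI_ECANCELED;   /* matches flushed work request semantics */
    else
        err->err = FI_EIO;
    err->prov_errno = wc->status;
}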

@shefty
Member

shefty commented Dec 9, 2021

Changes were merged into both main and v1.14.x (#7301)

@shefty shefty closed this as completed Dec 10, 2021
@gnailzenh

We actually ran into issues with 7301; we got a lot of errors like this after killing a rank:
ofi_rxm:ep_ctrl:rxm_handle_error():655 fi_eq_readerr: err: Connection refused (111), prov_err: Unknown error -8 (-8)

Not sure whether this is something exposed by the patch or an issue with the patch. But I do have another question: PR-7301 only initializes a few members of ibv_wc in vrb_flush_sq(); does it have any impact if the rest of the members are only zeroed?

@shefty
Member

shefty commented Dec 13, 2021

Connection refused is going to happen if you've killed a rank. Sends targeting that rank will try to form a new connection, which will continue to fail. I can rate limit this message if needed.

Verbs itself only guarantees that 2 of the ibv_wc fields are initialized when an error occurs (wr_id and status). Zeroing the other fields actually initializes them more than verbs itself would.
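
For illustration, synthesizing a flush completion the way described above might look like this (a hypothetical sketch, not the actual vrb_flush_sq() code):

#include <string.h>
#include <infiniband/verbs.h>

static void make_flush_wc(struct ibv_wc *wc, uint64_t wr_id)
{
    memset(wc, 0, sizeof(*wc));        /* remaining members stay zeroed */
    wc->wr_id = wr_id;                 /* identifies the flushed WR */
    wc->status = IBV_WC_WR_FLUSH_ERR;  /* flush status per the IB spec */
}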
