DAOS: verbs;rxm - fi_cancel() error handling issue #7287
Comments
The rxm cancel operation should always return success. Any return value is basically useless anyway. Cancel is only coded to handle receives. Send operations are expected to complete normally within some timeout window. I will check rxm/verbs to see how it handles failed send operations. |
I ran some tests and confirmed that the problem does look like fi_cancel() not behaving as expected. The scenario is - I also tested that a RECV operation can be canceled as expected: fi_cancel() returns 0 and an FI_ECANCELED error event is reported afterwards. I tried to work around it at the mercury level (mercury-hpc/mercury#532); with that change my test passes. |
Are you getting any completion for the send operation? You can't rely on an error value of FI_ECANCELED, but you should get something. I'm assuming not, and that's the underlying issue. But can you confirm? |
I think you answered my question here: "On server_A no OFI completion event or error event reported for that send operation". There is no send completion of any sort. |
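To illustrate the point that a caller should expect some completion for a send but should not depend on the error value being FI_ECANCELED, here is a minimal sketch in plain C. It does not call libfabric; `tracked_send`, `send_on_completion` and `send_is_finished` are hypothetical names, and the errno-style codes stand in for whatever the error completion carries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical caller-side state for one posted send; not a libfabric type. */
enum send_state { SEND_PENDING, SEND_DONE_OK, SEND_DONE_FAILED };

struct tracked_send {
    enum send_state state;
    int err;    /* 0 on success, positive errno-style code on failure */
};

/*
 * Apply one completion to a tracked send. 'err' is 0 for a normal
 * completion or a positive error code for an error completion.
 * Any completion is terminal: the code does not wait specifically
 * for ECANCELED after a cancel request.
 */
static void send_on_completion(struct tracked_send *s, int err)
{
    if (s->state != SEND_PENDING)
        return;    /* already finished; ignore a duplicate report */
    s->err = err;
    s->state = (err == 0) ? SEND_DONE_OK : SEND_DONE_FAILED;
}

/* True once the send has reached either terminal state. */
static bool send_is_finished(const struct tracked_send *s)
{
    return s->state != SEND_PENDING;
}
```

With this shape, a send that completes with EIO, ECANCELED, or success all release the waiting RPC; the hang described in this issue corresponds to `send_on_completion` never being called at all.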
Can you tell me the approximate size of the send? |
In at least one of the scenarios (ticket DAOS-9096) where the client side experienced a similar hang, the RPC payload was ~64 bytes plus the cart/mercury RPC headers, so the total size is roughly under 200 bytes.
I'll let @liuxuezhao comment on the sizes in his case, as he was using a different app/test. |
Verbs does not support/implement cancel. RxM only implements cancel for receive operations. The expectation is that a send, once posted to the QP, will generate some sort of completion, even if it's an error. Based on the issue description, that is not occurring. I'm looking at adding a second layer of send tracking to either the rxm or verbs provider. That should allow SW to generate an error completion for any sends that are outstanding at the time a connection is closed. I need to consider this more, though, since as a general rule, libfabric does not guarantee that completions will be generated for outstanding operations on an endpoint that is closed. (e.g. usnic cannot provide this semantic, or really rdma QPs for that matter). |
Is DAOS continuing to process the EQ during the time that it's waiting for the RPC to complete/cancel? OR is auto-progress enabled (info->domain_attr->data_progress == FI_PROGRESS_AUTO, or env var FI_OFI_RXM_DATA_AUTO_PROGRESS=true)? If you have logging enabled, I'm looking for this text: "Starting auto-progress thread". Based on reading the IB spec, any sends queued on the QP when it transitions into the error state (which happens during shutdown) should generate an error completion. |
DAOS progresses manually all the time, so yes, there will be progress while waiting for the RPC to complete. Internally, for verbs;rxm, mercury sets the progress mode to FI_PROGRESS_MANUAL |
@shefty thanks very much for looking into the problem. Some more info below. "Can you tell me the approximate size of the send" For progress, I tested setting info->domain_attr->data_progress to FI_PROGRESS_AUTO in mercury.
In both progress modes, OFI reports neither a completion event for the fi_tsend() operation nor an FI_ECANCELED error event after fi_cancel() is called (although fi_cancel() returned 0). BTW, regardless of whether mercury sets MANUAL or AUTO progress mode, DAOS/mercury always drives progress (by calling fi_cq_read) during the test run. |
a few FI logs during the failure (Manual progress mode) - Killing target server No completion event for send OP (0x2a7bbf0), also no FI_ECANCELED event. |
The libfabric used in my test above is v1.14.0. I just tried the latest libfabric master branch (bb8bcc7), and the test got a segfault - According to git blame, it seems related to the recent change c4862bf. |
Thanks, the log helps. Copying the log:
vrb_ep_close() should flush all outstanding send operations. I have a patch (#7291) in the works that manually forces this in case the NIC does not flush all sends for some reason. From the log it looks like the tsend occurred just prior to shutdown being handled. I wonder if the QP was already in an error state at that point and verbs directed the send into the void... Can you open a separate issue for the problem you're seeing with main and the null provider name? That looks like a problem with the verbs provider not formatting the fi_info correctly. |
I'm still working toward trying to reproduce the problem with a simpler app. But PR #7291 is now passing our CI and is targeted as a solution. |
The changes in #7291 won't cherry-pick cleanly into v1.14.x. So, this should be a picked and updated version for v1.14.x: https://github.com/shefty/libfabric/tree/v1.14.x See the top 2 patches -- compile tested only. |
@shefty thanks for providing the patch. I tested the backported version on v1.14.0, and my test now passes without the mercury workaround. Killing target server With the fix, when the target disconnect is detected, OFI reports an FI_ECANCELED event for the SEND operation, even before mercury calls fi_cancel() on that SEND operation. That is fine and can be handled by mercury. I did not test on the master branch because of the segfault mentioned above; I have created #7300 for it. |
@shefty just to confirm one point: in your patch #7291 only the send operation is added to the list (ep->sq_list). Do you think RMA operations (fi_read/fi_write) need similar handling? Or does the send handling already cover RMA operations? |
Thanks @shefty for the patch. Another question is whether it is now considered safe to integrate into our imminent DAOS release? |
I think the changes work for all transmit operations. I'll verify. It's still not clear to me what the underlying problem actually is. According to the IB spec (C10-42 compliance statement or something like that -- somewhere in chapter 10), the HCA is supposed to generate error completions (flush status) for all send operations when a QP is transitioned into an error state. That includes sends posted when the QP is in an error state. The report suggests that this is not happening, but I would not yet rule out some other issue in libfabric. @liuxuezhao - The operation is completing as canceled because OFI detected that the connection has failed, and outstanding sends that were actively queued on the connection are being returned as failed. You brought up a point in that a better error code might be EIO rather than ECANCELED. The actual state of the transfer at the peer is unknown. Canceled suggests that it was never attempted. It's possible that the transfer was received and only the ACK from the peer was discarded. @johannlombardi - No, I don't think this is ready yet. I need to verify what the right return code should be and verify this works for RMA. |
Btw, I should have an updated version later today. IMO, the patch should be relatively safe to integrate, even if there might be some other issue. |
Analyzing this more, I think the proposed changes are sufficient and should move forward. I've opened a PR with the v1.14.x changes. The tracking works for any operation posted to the QP's send queue -- that includes RMA, atomics, etc. According to C10-42 in the IB spec, any outstanding send operations should generate a completion with the status set to flush. The proposed changes match that behavior. So, the returned error code (ECANCELED) matches what an app will already see from flushed work requests. |
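The tracking idea described above (remember every operation posted to the send queue, then report whatever is still outstanding as flushed when the connection goes away) can be sketched in plain C. This is an illustration only, not the code from #7291 or #7301: `tracked_ep`, `sw_completion`, `ep_track_send`, `ep_complete_send`, `ep_flush_sends` and the fixed `MAX_TRACKED` capacity are all hypothetical, and ECANCELED from `<errno.h>` stands in for the error code reported to the application.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64    /* arbitrary capacity chosen for this sketch */

/* Hypothetical software-generated completion record. */
struct sw_completion {
    uint64_t wr_id;
    int      err;    /* 0 = success, otherwise a positive errno-style code */
};

/* Hypothetical endpoint that remembers the sends it has posted. */
struct tracked_ep {
    uint64_t pending[MAX_TRACKED];    /* wr_ids in posting order */
    size_t   npending;
};

/* Record a send at post time. Returns 0, or -ENOSPC if the table is full. */
static int ep_track_send(struct tracked_ep *ep, uint64_t wr_id)
{
    if (ep->npending == MAX_TRACKED)
        return -ENOSPC;
    ep->pending[ep->npending++] = wr_id;
    return 0;
}

/* Forget a send once its real completion arrives. Returns 0 if it was found. */
static int ep_complete_send(struct tracked_ep *ep, uint64_t wr_id)
{
    for (size_t i = 0; i < ep->npending; i++) {
        if (ep->pending[i] == wr_id) {
            /* Shift the tail down so posting order is preserved. */
            for (size_t j = i + 1; j < ep->npending; j++)
                ep->pending[j - 1] = ep->pending[j];
            ep->npending--;
            return 0;
        }
    }
    return -ENOENT;
}

/*
 * On connection teardown: write an error completion for each send that is
 * still outstanding, oldest first, up to max_out entries. Returns the
 * number of completions written; those entries are removed from the list.
 */
static size_t ep_flush_sends(struct tracked_ep *ep,
                             struct sw_completion *out, size_t max_out)
{
    size_t n = 0;

    while (n < ep->npending && n < max_out) {
        out[n].wr_id = ep->pending[n];
        out[n].err = ECANCELED;
        n++;
    }
    for (size_t j = n; j < ep->npending; j++)
        ep->pending[j - n] = ep->pending[j];
    ep->npending -= n;
    return n;
}
```

Because the list is keyed only on what was posted to the send queue, the same bookkeeping would cover sends, RMA, and atomics without distinguishing between them, which is the property the comment above relies on.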
Changes were merged into both main and v1.14.x (#7301) |
We actually ran into issues with #7301: we got a lot of errors like this after killing a rank: Not sure whether it is something exposed by this patch or an issue in the patch itself. But I do have another question: PR #7301 initializes only a few members of ibv_wc in vrb_flush_sq(). Does it have any impact if the remaining members are only zeroed? |
Connection refused is going to happen if you've killed a rank. Sends targeting that rank will try to form a new connection, which will continue to fail. I can rate limit this message if needed. Verbs itself only guarantees that 2 of the ibv_wc fields are initialized when an error occurs (wr_id and status). The other fields being zero is actually formatting those fields more than verbs would. |
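As a small illustration of the answer above (on an error completion only the work request ID and the status are meaningful, so zeroing everything else is harmless), here is a hedged sketch in plain C. `wc_sketch` is a hypothetical stand-in and deliberately not `struct ibv_wc`; `make_flush_wc` and `wc_bytes_or_error` are likewise invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status values; not the verbs enum. */
enum wc_status_sketch { WC_OK = 0, WC_FLUSHED = 1 };

/* Hypothetical stand-in for a work completion. */
struct wc_sketch {
    uint64_t wr_id;                  /* always valid */
    enum wc_status_sketch status;    /* always valid */
    uint32_t byte_len;               /* meaningful only when status == WC_OK */
    uint32_t opcode;                 /* meaningful only when status == WC_OK */
};

/* Build a software flush completion: zero everything, then set the two
 * fields a consumer is allowed to look at on error. */
static void make_flush_wc(struct wc_sketch *wc, uint64_t wr_id)
{
    memset(wc, 0, sizeof(*wc));
    wc->wr_id = wr_id;
    wc->status = WC_FLUSHED;
}

/* Consumer side: check status first and never read the other fields on
 * error. Returns the byte count on success, or -1 on any error status. */
static int64_t wc_bytes_or_error(const struct wc_sketch *wc)
{
    if (wc->status != WC_OK)
        return -1;
    return (int64_t)wc->byte_len;
}
```

A consumer written this way behaves the same whether the unused fields are zeroed or left uninitialized, since it branches on the status before touching anything else.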
DAOS is hitting a problem with handling errors from fi_cancel().
Mercury issues fi_cancel() when DAOS detects a 'timeout' situation (the RPC did not get a response within the configured time) and expects an FI_ECANCELED event. Based on this event, mercury then completes DAOS's RPC.
What appears to be happening is that fi_cancel() can return an error while mercury still expects an eventual completion event; that event never arrives, which blocks the client's RPC indefinitely.
We are still in the process of verifying that this is the actual situation; we will update the ticket once we have more detailed logs.
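The timeout-then-cancel flow described above can be modeled as a small state machine. The sketch below is plain C with hypothetical names (`rpc`, `rpc_check_timeout`, `rpc_on_event`, `rpc_awaiting_event`); it is not mercury or DAOS code, and the place where a real caller would invoke fi_cancel() is only marked with a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical RPC lifecycle as seen by the layer that owns the timeout. */
enum rpc_state { RPC_IN_FLIGHT, RPC_CANCEL_REQUESTED, RPC_COMPLETED };

struct rpc {
    enum rpc_state state;
    uint64_t deadline_ms;     /* absolute time at which the RPC times out */
    int cancel_requests;      /* how many times a cancel was requested */
};

/*
 * Called from the progress loop. Returns true exactly once, at the moment
 * the deadline has passed and a cancel should be issued (a real caller
 * would call fi_cancel() at that point).
 */
static bool rpc_check_timeout(struct rpc *r, uint64_t now_ms)
{
    if (r->state == RPC_IN_FLIGHT && now_ms >= r->deadline_ms) {
        r->state = RPC_CANCEL_REQUESTED;
        r->cancel_requests++;
        return true;
    }
    return false;
}

/* Called when any completion or error event for the operation is read. */
static void rpc_on_event(struct rpc *r)
{
    r->state = RPC_COMPLETED;
}

/* True while a cancel was requested but no event has arrived yet. */
static bool rpc_awaiting_event(const struct rpc *r)
{
    return r->state == RPC_CANCEL_REQUESTED;
}
```

In this model the reported hang is an RPC that stays in `RPC_CANCEL_REQUESTED` forever because `rpc_on_event` is never reached.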