CommunicationTest.SendRecv/UCC hangs. #3120
@samnordmann would you mind taking a look?
I am able to reproduce the issue on a viking H100 DGX node and can explain what is going on.

**What**

There is a known incompatibility between user stream operations and UCX using NVLink over cuda-IPC, which can cause hangs. This is what we are seeing here: both UCX and nvFuser post operations on the stream, and this causes a deadlock.

**Temporary workaround**

We can disable the usage of cuda-IPC in UCX by setting the flags sketched below.
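The exact flags were elided in this transcript; as a minimal sketch, assuming UCX's documented `UCX_TLS` transport-selection variable (a `^` prefix excludes the listed transport):

```bash
# Sketch only: the exact flags from the original comment were not preserved here.
# Excluding the cuda_ipc transport keeps UCX from posting cuda-IPC operations
# on the CUDA stream, avoiding the collision with nvFuser's own stream work.
export UCX_TLS=^cuda_ipc
```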
With those flags, the test executes smoothly. UCX will then use GPU-direct RDMA, so we probably need the node to have a capable NIC. GPU-direct RDMA is stream-less, therefore there is no deadlock issue.

**Long-term fix**

The UCX and UCC teams are working on a solution, as part of POR: https://redmine.mellanox.com/issues/3831841

**Backtraces, for the record**

Threads involved:
Backtrace of the main thread:
Backtrace of the ucc-progress thread:
@xwang233 I created a PR here but I am not managing to request your review. I might have done something wrong; let me know: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/merge_requests/13
LGTM. Thanks for the reminder. Feel free to cc me internally on MRs in the future. 😄
POR ticket for the long term fix: https://redmine.mellanox.com/issues/3831841 |
It runs OK in GitHub CI, which runs with V100x4 and A100x4, but fails consistently on H100.
@csarofeen and I managed to reproduce this on viking-prod-231 in partition viking-prod-pjnl. Other tests pass with CommunicationTest.SendRecv/UCC excluded.
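For reference, a sketch of how such an exclusion is typically written with googletest's standard `--gtest_filter` flag; the binary name `nvfuser_tests` is hypothetical, and the exact registered test name may differ:

```bash
# A leading "-" in --gtest_filter negates the pattern, so every test
# except CommunicationTest.SendRecv/UCC runs ("nvfuser_tests" is a
# hypothetical binary name used for illustration).
./nvfuser_tests --gtest_filter='-CommunicationTest.SendRecv/UCC'
```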