[Distributed] P2P Operations on NCCL do not respect tag #125079
Comments
This is actually a known limitation, though we should document it better. NCCL's API does not support tags, so there isn't a clear way for us to make use of the argument, even though our APIs expose it (so that it can be used by backends that do support tags). https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/p2p.html#c.ncclSend
Existing documentation on isend/irecv also applies to send/recv. This PR copies the doc/warning to send/recv ops as well. Note: tag may be supplied, but will be ignored when used with nccl backend. Fixes #94819 #125079 ghstack-source-id: caf8308608ac82433d8d1c76d17524b7d0e2154d Pull Request resolved: #125278
Existing documentation on isend/irecv also applies to send/recv. This PR copies the doc/warning to send/recv ops as well. Note: tag may be supplied, but will be ignored when used with nccl backend. Fixes #94819 #125079 Pull Request resolved: #125278 Approved by: https://github.com/kwen2501
Closing as fixed by updating docs.
Wanted to note that tagging is not intended to support out-of-order send/recv calls (in particular the blocking versions). Neither NCCL nor MPI would be able to support the example in this issue.
In that case wouldn't we at least expect hangs (with MPI)? Wouldn't both processes block on differently tagged send/recv?
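To make the question above concrete, here is a toy model of two ranks running blocking point-to-point calls in program order. It is only a sketch under a strong assumption: every send is synchronous (rendezvous), so it cannot complete until the matching receive is posted. Real MPI implementations may buffer small messages in standard-mode sends, so an actual hang is not guaranteed. The `simulate` function and its arguments are hypothetical names used for illustration and are not part of MPI, NCCL, or torch.distributed.

```python
def simulate(send_tags, recv_tags, match_tags):
    """Toy model: two ranks issue blocking ops strictly in order.

    Assumes rendezvous semantics: a send completes only when the peer's
    current receive matches it. Returns "completed" or "deadlock".
    """
    s = r = 0
    while s < len(send_tags) and r < len(recv_tags):
        if match_tags and send_tags[s] != recv_tags[r]:
            # Sender waits for a receive with its tag, receiver waits for a
            # send with a different tag: neither rank can make progress.
            return "deadlock"
        s += 1
        r += 1
    # Leftover unmatched ops on either side would also block forever.
    if s == len(send_tags) and r == len(recv_tags):
        return "completed"
    return "deadlock"


# Sender posts tag 1 then tag 2; receiver posts tag 2 then tag 1.
tag_aware = simulate([1, 2], [2, 1], match_tags=True)
tag_blind = simulate([1, 2], [2, 1], match_tags=False)
```

Under these assumptions the tag-aware case deadlocks, while the tag-blind case completes by pairing calls in posting order.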
🐛 Describe the bug
When using NCCL with Send/Recv operations we expect the tag argument to be respected for send/recv matching. This doesn't occur in practice.
Example program:
Here we expect tens to be a tensor of 1s and tens2 to be a tensor of 2s when received, or at least a hang. Instead the opposite happens: the receives complete, and the contents are swapped.
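The behaviour described above can be illustrated without GPUs by a small pure-Python model. This is a sketch, not the original example program and not how NCCL or PyTorch is implemented: `FifoChannel`, `TaggedChannel`, and `run` are hypothetical names, and the order of calls in `run` (send 1s with tag 1, send 2s with tag 2, then receive tag 2 before tag 1) is an assumption about the kind of program that triggers the report.

```python
from collections import deque


class FifoChannel:
    """Accepts a tag but ignores it: each recv takes the oldest pending send."""

    def __init__(self):
        self._pending = deque()

    def send(self, payload, tag=0):
        self._pending.append((tag, payload))

    def recv(self, tag=0):
        _, payload = self._pending.popleft()
        return payload


class TaggedChannel(FifoChannel):
    """Honours the tag: each recv takes the oldest pending send with that tag."""

    def recv(self, tag=0):
        for i, (pending_tag, payload) in enumerate(self._pending):
            if pending_tag == tag:
                del self._pending[i]
                return payload
        raise LookupError(f"no pending send with tag {tag}")


def run(channel):
    # Sender side: 1s with tag 1, then 2s with tag 2.
    channel.send([1, 1, 1], tag=1)
    channel.send([2, 2, 2], tag=2)
    # Receiver side: posts the receives in the opposite order.
    tens2 = channel.recv(tag=2)
    tens = channel.recv(tag=1)
    return tens, tens2


fifo_result = run(FifoChannel())      # tags ignored
tagged_result = run(TaggedChannel())  # tags respected
```

With the tag-ignoring channel, `tens` ends up holding 2s and `tens2` holds 1s, which mirrors the swap reported here. With the tag-aware channel, each buffer receives the payload whose tag it asked for.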
This should be the same issue as #94819, but for send/recv instead of isend/irecv.
Versions
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k