Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

guard all2all from empty transfer #727

Merged

Conversation

mfoerste4
Copy link
Contributor

Adding guards to prevent zero sized sends/receives during NCCL all2all block. This helped to circumvent the issue observed in #722 .

@mfoerste4 mfoerste4 added the bug Something isn't working label Dec 5, 2022
@mfoerste4 mfoerste4 linked an issue Dec 5, 2022 that may be closed by this pull request
@magnatelee
Copy link
Contributor

@mfoerste4 looks like NCCL fixed the old issue that it required all the participants to participate the exchange; in the past I had to make sure every transfer is non-empty. glad to see that issue has been fixed.

@mfoerste4
Copy link
Contributor Author

@mfoerste4 looks like NCCL fixed the old issue that it required all the participants to participate the exchange; in the past I had to make sure every transfer is non-empty. glad to see that issue has been fixed.

Note that this is just a workaround for the issue. NCCL should also accept zero sized messages as long as the send/recv dimensions are pairwise consistent.

@magnatelee
Copy link
Contributor

Note that this is just a workaround for the issue. NCCL should also accept zero sized messages as long as the send/recv dimensions are pairwise consistent.

FWIW, I think there's a bigger question of whether an empty kernel should be allowed. You can't launch a kernel with empty launch domain in CUDA, and CUDA libraries may have chosen to be consistent with this behavior. Disallowing / under-specifying behavior of empty transfers can be one such example. I agree that it'd be certainly nicer if NCCL handled empty transfers gracefully.

@mfoerste4 mfoerste4 added category:bug-fix PR is a bug fix and will be classified as such in release notes and removed bug Something isn't working labels Dec 7, 2022
@mfoerste4 mfoerste4 merged commit 53778f3 into nv-legate:branch-22.12 Dec 14, 2022
@mfoerste4 mfoerste4 deleted the sort_guard_all2all_issue_722 branch December 14, 2022 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:bug-fix PR is a bug fix and will be classified as such in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sort hangs due to stuck NCCL kernels in sort on 48GPUs on Summit
2 participants