[core][distributed] add stateless process group #10216
Signed-off-by: youkaichao <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Merging for RLHF integration.
LGTM.
General questions:
(1) Do you want the new PG to be able to handle object collectives? If so, its method surface would be similar to today's PG method surface. In that case, can we (PyTorch) rework the `new_group` API so that it does not require calling `init_process_group` first? IIUC, `init_process_group` is the main API that creates the global "state".
(2) If you don't need object collectives, would creating separate "Backend" objects work for you? I guess the "Backend" object would be similar to the `PyNcclCommunicator` here.
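To make the "global state" point concrete, here is a minimal sketch (not the PR's implementation) of a group object that carries its own rank, world size, and key-value store, so an object collective needs no module-level initialization. The names `InMemoryStore`, `StatelessGroup`, and `run_ranks` are hypothetical, the in-memory store stands in for a networked store, and ranks are simulated with threads.

```python
import pickle
import threading
from dataclasses import dataclass


class InMemoryStore:
    """Hypothetical stand-in for a networked key-value store."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key, timeout=5.0):
        # Block until some rank has published the key, or time out.
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._data, timeout):
                raise TimeoutError(key)
            return self._data[key]


@dataclass
class StatelessGroup:
    """All state lives on the instance; nothing is registered globally."""
    rank: int
    world_size: int
    store: InMemoryStore
    _bcast_counter: int = 0

    def broadcast_obj(self, obj, src):
        # Each broadcast gets a fresh key so successive calls do not collide.
        key = f"broadcast/{src}/{self._bcast_counter}"
        self._bcast_counter += 1
        if self.rank == src:
            self.store.set(key, pickle.dumps(obj))
            return obj
        return pickle.loads(self.store.get(key))


def run_ranks(world_size, payload):
    """Simulate `world_size` ranks with threads and broadcast from rank 0."""
    store = InMemoryStore()
    results = [None] * world_size

    def worker(rank):
        group = StatelessGroup(rank, world_size, store)
        obj = payload if rank == 0 else None
        results[rank] = group.broadcast_obj(obj, src=0)

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each `StatelessGroup` owns its store reference, several independent groups can coexist in one process, which is the property a global `init_process_group` call makes awkward.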
assert dist.get_backend(group) != dist.Backend.NCCL, (
    "PyNcclCommunicator should be attached to a non-NCCL group.")
typo?
It's by design: we only use the gloo group to initialize this NCCL communicator. gloo handles control-plane communication, while NCCL handles the data plane.
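The split can be sketched as follows. This is a simplified, hypothetical illustration rather than vLLM's `PyNcclCommunicator`: `ControlGroup` and `DataPlaneCommunicator` are made-up names, the backend is a plain string, and a random UUID stands in for the unique ID that a real NCCL bootstrap exchanges. The control group carries only the small handshake message; bulk tensors would travel over the data-plane communicator.

```python
import uuid


class ControlGroup:
    """Hypothetical control-plane group (think gloo): small messages only."""

    def __init__(self, rank, backend, mailbox):
        self.rank = rank
        self.backend = backend
        self._mailbox = mailbox  # shared dict standing in for the network

    def broadcast_obj(self, obj, src):
        # The source rank publishes; every rank returns the published value.
        if self.rank == src:
            self._mailbox[src] = obj
        return self._mailbox[src]


class DataPlaneCommunicator:
    """Hypothetical data-plane communicator bootstrapped over a control group."""

    def __init__(self, group):
        # Mirror the check discussed above: the bootstrap group must not be NCCL.
        assert group.backend != "nccl", (
            "the communicator should be attached to a non-NCCL group")
        # Rank 0 creates the unique ID; the other ranks receive it.
        proposed = uuid.uuid4().hex if group.rank == 0 else None
        self.unique_id = group.broadcast_obj(proposed, src=0)


mailbox = {}
# Ranks are simulated sequentially, rank 0 first, so the mailbox is filled
# before rank 1 reads it.
comms = [DataPlaneCommunicator(ControlGroup(r, "gloo", mailbox))
         for r in range(2)]
```

After construction, both simulated ranks hold the same `unique_id`, and passing a group whose backend is `"nccl"` trips the assertion.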
Re (1): yes, that would be great!
Re (2): I'm not sure how that would work. I remember PyTorch's NCCL backend does some strange things, like bucketing transfers and adding a logging wrapper for debugging. We don't want these.
Signed-off-by: youkaichao <[email protected]> Signed-off-by: OmerD <[email protected]>
Signed-off-by: youkaichao <[email protected]> Signed-off-by: Sumit Dubey <[email protected]>
No description provided.