Ensure MG makes the same number of allreduce calls in mean_stddev for sparse matrices to avoid hanging #6141

Open
wants to merge 1 commit into
base: branch-24.12
from

Conversation

lijinf2
Contributor

@lijinf2 lijinf2 commented Nov 22, 2024

The hang occurs when one GPU receives a sparse matrix whose values are all zero while the other GPUs receive non-zero values.
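To illustrate the failure mode described above, here is a minimal host-only sketch. It does not use the actual cuML or communicator code: `Shard`, `count_allreduce_calls`, and `would_hang` are hypothetical names, and the assumption that mean_stddev issues two allreduce calls (one for the mean, one for the standard deviation) is made purely for illustration. The sketch only counts how many collective calls each simulated rank would issue, with and without an early exit for a shard that has no non-zero entries.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one rank's shard of a sparse matrix column:
// only the stored (non-zero) values plus the number of local rows.
struct Shard {
  std::vector<double> values;  // stored non-zero entries
  std::size_t n_rows;          // local row count, including implicit zeros
};

// Counts how many collective calls each simulated rank would issue.
// skip_when_empty = true models the hazardous pattern: a rank with no
// non-zero entries takes an early exit and never reaches the collectives.
std::vector<int> count_allreduce_calls(const std::vector<Shard>& shards,
                                       bool skip_when_empty)
{
  std::vector<int> calls(shards.size(), 0);
  for (std::size_t rank = 0; rank < shards.size(); ++rank) {
    if (skip_when_empty && shards[rank].values.empty()) { continue; }
    calls[rank] += 1;  // assumed allreduce of partial sums (mean)
    calls[rank] += 1;  // assumed allreduce of squared deviations (stddev)
  }
  return calls;
}

// A blocking collective completes only when every rank joins it, so any
// mismatch in per-rank call counts leaves some rank waiting forever.
bool would_hang(const std::vector<int>& calls)
{
  for (int c : calls) {
    if (c != calls.front()) { return true; }
  }
  return false;
}
```

With two shards, one holding `{1.0, 3.0}` and one holding no stored values, `count_allreduce_calls(shards, true)` yields mismatched counts and `would_hang` reports a hang. With `skip_when_empty` set to `false`, every rank joins both collectives, the empty rank simply contributes zeros, and the counts agree.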

@lijinf2 lijinf2 requested a review from a team as a code owner November 22, 2024 01:23
@lijinf2 lijinf2 added the improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), and 2 - In Progress (Currently a work in progress) labels Nov 22, 2024
@lijinf2 lijinf2 added the 3 - Ready for Review (Ready for review by team) label Nov 26, 2024
Labels
2 - In Progress (Currently a work in progress), 3 - Ready for Review (Ready for review by team), CUDA/C++, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
1 participant