What's the difference in the allreduce busBw factor between NCCL and nccl-tests? #274

Open
networkResearcher opened this issue Dec 9, 2024 · 2 comments

Comments

@networkResearcher

I found a difference in the algBw-to-busBw conversion factor between NCCL and nccl-tests.
For allreduce in NCCL:
If the algorithm is ring, busBw = 2(n-1)/n * algBw, which is the same as in nccl-tests.
If the algorithm is tree, busBw = 2 * algBw, which differs from nccl-tests.
How should I understand this discrepancy?

@sjeaugey
Member

sjeaugey commented Dec 9, 2024

The AlgBw->BusBW conversion is theoretical: it assumes a flat hardware topology and point-to-point transfers to execute the operation. The NCCL perf tests are a benchmark; they cannot see inside NCCL, and they do not detect the topology or the hardware. So the BusBW is a simple re-calibration of the algorithm bandwidth to reflect the fact that we need to transmit a different amount of data depending on the number of ranks.

If BusBW peaks at, say, 50 GB/s, it's telling you that your system is equivalent to a system where each GPU is connected to a flat fabric at 50 GB/s.
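
As a rough illustration of that re-calibration, here is a minimal Python sketch of the allreduce conversion, assuming the 2(n-1)/n factor quoted earlier in this thread. The function names and the sample numbers are hypothetical; they are not part of NCCL or the perf tests.

```python
def allreduce_alg_bw(size_bytes: float, time_s: float) -> float:
    """Algorithm bandwidth in GB/s: payload size divided by elapsed time."""
    return size_bytes / time_s / 1e9


def allreduce_bus_bw(alg_bw: float, n_ranks: int) -> float:
    """Rescale algorithm bandwidth by 2(n-1)/n, the factor quoted in this thread."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


# Hypothetical measurement: 1 GB reduced across 8 ranks in 25 ms.
alg = allreduce_alg_bw(1e9, 0.025)   # about 40 GB/s
bus = allreduce_bus_bw(alg, 8)       # about 40 * 14/8 = 70 GB/s
print(f"algBw = {alg:.1f} GB/s, busBw = {bus:.1f} GB/s")
```

Note that the factor depends only on the number of ranks, not on which algorithm NCCL picked internally, which is why the benchmark can apply it without seeing inside NCCL.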

Tree is not bandwidth optimal, so the real bus bandwidth may be higher in some places. The same applies if your hardware is hierarchical, or if you're using NVLink SHARP or IB SHARP (as those are not based on point-to-point communication but on HW-offloaded collectives instead).
But you can still use the BusBW to compare between algorithms and see which one is faster.
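
A minimal sketch of such a comparison, assuming the same allreduce size has been timed on the same number of ranks once per algorithm. The timings below are made-up placeholders, not measurements, and `bus_bw` is a hypothetical helper that applies the 2(n-1)/n factor from this thread.

```python
def bus_bw(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s, using the same 2(n-1)/n factor for every run."""
    alg_bw = size_bytes / time_s / 1e9
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


# Placeholder timings in seconds (hypothetical, for illustration only).
timings_s = {"ring": 0.030, "tree": 0.036}
size_bytes, n_ranks = 1e9, 8

results = {algo: bus_bw(size_bytes, t, n_ranks) for algo, t in timings_s.items()}
fastest = max(results, key=results.get)

for algo, bw in results.items():
    print(f"{algo}: busBw = {bw:.1f} GB/s")
print(f"fastest in this made-up example: {fastest}")
```

Because the same factor is applied to every run, ranking by BusBW gives the same order as ranking by elapsed time; the number is useful for comparison even when it does not match the real traffic on any given link.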

@networkResearcher
Author

Thank you very much!
We are trying to decide which algorithm to use for running AllReduce within a single machine. It seems that the conversion formula between algorithm bandwidth and bus bandwidth is not the right basis for making this decision. Could you tell us which algorithm to choose for running AllReduce within a single machine, both with and without NVLink?
