What's the difference in allreduce for factor between the NCCL and NCCL-test? #274

networkResearcher · 2024-12-09T09:36:02Z

I noticed that the algBw-to-busBw conversion factor differs between NCCL and NCCL-test.
For AllReduce in NCCL:
If the algorithm is ring, busBw = 2(n-1)/n * algBw, which is the same as in NCCL-test.
If the algorithm is tree, busBw = 2 * algBw, which differs from NCCL-test.
How should I understand this difference?


sjeaugey · 2024-12-09T09:48:30Z

The AlgBw->BusBW conversion is theoretical: it assumes a flat hardware topology and point-to-point transfers to execute the operation. The NCCL perf tests are a benchmark; they cannot see inside NCCL, and they do not detect the topology or the hardware. So the BusBW is a simple rescaling of the algorithm bandwidth to reflect the fact that we need to transmit a different amount of data depending on the number of ranks.

If BusBW peaks at, say, 50GB/s, it's telling you that your system is equivalent to one where each GPU is connected to a flat fabric at 50GB/s.
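To make the rescaling concrete, here is a minimal sketch of the ring AllReduce conversion quoted in the question, busBw = 2(n-1)/n * algBw. The function names and the sample measurement are hypothetical and only for illustration; this is not code taken from NCCL or the perf tests.

```python
def allreduce_bus_factor(n_ranks: int) -> float:
    """Correction factor 2(n-1)/n for AllReduce, as quoted in the question."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be >= 1")
    return 2.0 * (n_ranks - 1) / n_ranks


def allreduce_bus_bw(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Return the bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    # Algorithm bandwidth: data size divided by elapsed time.
    alg_bw = size_bytes / time_s / 1e9
    # Rescale by a factor that depends only on the number of ranks.
    return alg_bw * allreduce_bus_factor(n_ranks)


if __name__ == "__main__":
    # Hypothetical measurement: 1e9 bytes reduced across 8 ranks in 35 ms.
    print(f"{allreduce_bus_bw(1e9, 0.035, 8):.1f} GB/s")
```

With these made-up numbers the algorithm bandwidth is about 28.6 GB/s and the factor for 8 ranks is 1.75, so the bus bandwidth comes out at about 50 GB/s. Note that the factor uses only the rank count; nothing about the topology or the algorithm enters the calculation.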

Tree is not bandwidth-optimal, so the real bus bandwidth may be higher in some places. The same applies if your hardware is hierarchical, or if you're using NVLink SHARP or IB SHARP (as those are based on HW-offloaded collectives rather than on point-to-point communication).
But you can still use the BusBW to compare between algorithms and see which one is faster.
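As a sketch of that comparison: assuming you have run the benchmark once per algorithm (for instance by forcing one with NCCL's `NCCL_ALGO` environment variable) and collected the reported BusBW per message size, you can pick the faster algorithm at each size. The helper name and all numbers below are hypothetical placeholders, not measurements.

```python
def best_algo_per_size(results: dict[str, dict[int, float]]) -> dict[int, str]:
    """Map each message size to the algorithm with the highest bus bandwidth.

    results: {algorithm name: {message size in bytes: BusBW in GB/s}}
    Only sizes measured for every algorithm are compared.
    """
    common_sizes = set.intersection(*(set(r) for r in results.values()))
    return {
        size: max(results, key=lambda algo: results[algo][size])
        for size in sorted(common_sizes)
    }


if __name__ == "__main__":
    # Hypothetical BusBW values in GB/s, keyed by message size in bytes.
    measured = {
        "Ring": {1 << 20: 12.0, 1 << 30: 48.0},
        "Tree": {1 << 20: 15.5, 1 << 30: 41.0},
    }
    for size, algo in best_algo_per_size(measured).items():
        print(f"{size:>12} bytes: {algo}")
```

Because the same rank-dependent factor is applied to every run, ranking by BusBW at a given size and rank count gives the same order as ranking by algorithm bandwidth or by elapsed time.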

networkResearcher · 2024-12-09T10:09:02Z

Thank you very much!
We are trying to decide which algorithm to use for running AllReduce within a single machine. It seems that the conversion formula between algorithm bandwidth and bus bandwidth is not the right basis for this decision. Could you tell us which algorithm to choose for running AllReduce within a single machine, both with and without NVLink?
