I noticed that the conversion factor from algBw to busBw differs between NCCL and NCCL-test.
For allreduce in NCCL:
If the algorithm is ring, busBw = 2(n-1)/n * algBw, which is the same as in NCCL-test.
If the algorithm is tree, busBw = 2 * algBw, which is different from NCCL-test.
How should I understand this difference?
The AlgBw->BusBW conversion is theoretical: it assumes a flat hardware topology in which the operation is executed with point-to-point transfers. The NCCL perf tests are a benchmark; they cannot see inside NCCL, and they detect neither the topology nor the hardware. So the BusBW is a simple recalibration of the algorithm bandwidth to reflect the fact that the amount of data we need to transmit depends on the number of ranks.
If BusBW peaks at, say, 50 GB/s, it's telling you that your system is equivalent to a system where each GPU is connected to a flat fabric at 50 GB/s.
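As a rough illustration of that recalibration, here is a minimal Python sketch that applies the allreduce factor 2(n-1)/n quoted above to a measured time. The function name and the input numbers (1 GB, 35 ms, 8 ranks) are made up for the example; they are not taken from NCCL or from the perf tests.

```python
def allreduce_bus_bandwidth(size_bytes: float, time_s: float, n_ranks: int):
    """Return (algBw, busBw) in GB/s for an allreduce.

    Assumes the flat-topology factor 2(n-1)/n discussed in this thread.
    """
    if n_ranks < 1 or time_s <= 0:
        raise ValueError("need n_ranks >= 1 and time_s > 0")
    # Algorithm bandwidth: data size divided by elapsed time.
    alg_bw = size_bytes / time_s / 1e9
    # Bus bandwidth: rescale by the per-rank traffic factor for allreduce.
    bus_bw = alg_bw * 2 * (n_ranks - 1) / n_ranks
    return alg_bw, bus_bw


# Hypothetical measurement: 1 GB allreduced across 8 ranks in 35 ms.
alg_bw, bus_bw = allreduce_bus_bandwidth(size_bytes=1e9, time_s=0.035, n_ranks=8)
print(f"algBw = {alg_bw:.2f} GB/s, busBw = {bus_bw:.2f} GB/s")
```

With these invented inputs the bus bandwidth comes out at about 50 GB/s, which would read as "equivalent to a flat fabric at 50 GB/s per GPU" in the sense described above.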
Tree is not bandwidth optimal, so the real bus bandwidth may be higher in some places. The same is true if your hardware is hierarchical, or if you're using NVLink SHARP or IB SHARP, as those are not based on point-to-point communication but on HW-offloaded collectives instead.
But you can still use the BusBW to compare between algorithms and see which one is faster.
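A small sketch of such a comparison, assuming you have timed the same allreduce (same size, same rank count) once per algorithm. The timings below are invented placeholders, not measurements, and the helper name is hypothetical. Because the same factor is applied to both runs, the ranking by BusBW matches the ranking by elapsed time.

```python
def bus_bw_gbps(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s using the allreduce factor 2(n-1)/n."""
    alg_bw = size_bytes / time_s / 1e9
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


# Hypothetical timings in seconds for a 1 GB allreduce on 8 ranks.
# Replace these with your own measurements; they say nothing about
# which algorithm is actually faster on real hardware.
timings_s = {"ring": 0.040, "tree": 0.050}

results = {algo: bus_bw_gbps(1e9, t, 8) for algo, t in timings_s.items()}
fastest = max(results, key=results.get)

for algo, bw in results.items():
    print(f"{algo}: busBw = {bw:.2f} GB/s")
print(f"fastest in this made-up example: {fastest}")
```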
Thank you very much!
We are trying to decide which algorithm to choose for running AllReduce within a single machine. It seems that the conversion formula between algorithm bandwidth and bus bandwidth is not the right basis for making this decision. Could you tell us which algorithm to choose for running AllReduce within a single machine, both with and without NVLink?