We are running an nccl_test experiment with varied network latency.
Topology: Servers 1, 2, 3, and 4 are connected to local network 1, and Servers 5, 6, 7, and 8 are connected to local network 2. Latency within each local network is below 5 us. Local networks 1 and 2 are joined by a link whose cross-network latency we can control.
We run a ring Allreduce test with the ring topology Server 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8, i.e. a 64-GPU ring Allreduce.
At low cross-network latency (say 20 us between local networks 1 and 2) we get the maximum bus bandwidth;
however, when we increase the cross-network latency to several hundred us, the bus bandwidth drops by almost 50%.
Another data point: when we halve the number of ranks, i.e. run a 32-GPU ring Allreduce test with the ring topology Server 1 - 2 - 5 - 6, we always get the maximum bus bandwidth regardless of the cross-network latency.
Could anyone give a hint as to why the busBW drops with the 64-GPU ring at large cross-network latency?
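
For reference, here is a small sketch that tallies the two rings described above. It assumes 8 GPUs per server (implied by 64 GPUs across 8 servers), that the ring closes from the last server back to the first, and the textbook ring Allreduce step count of 2(n-1). NCCL's actual chunking and pipelining may differ. The helper `ring_stats` and the `network_of` mapping are made-up names for illustration only.

```python
def ring_stats(servers, network_of, gpus_per_server=8):
    """Return (ranks, textbook ring-allreduce steps, cross-network ring edges)."""
    n_ranks = len(servers) * gpus_per_server
    # Textbook ring allreduce: (n-1) reduce-scatter steps + (n-1) all-gather steps.
    steps = 2 * (n_ranks - 1)
    # Pair each server with its ring successor; the last wraps around to the first.
    successors = servers[1:] + servers[:1]
    cross_edges = sum(
        1 for a, b in zip(servers, successors) if network_of[a] != network_of[b]
    )
    return n_ranks, steps, cross_edges


# Servers 1-4 sit on local network 1, servers 5-8 on local network 2.
network_of = {s: 1 if s <= 4 else 2 for s in range(1, 9)}

print(ring_stats([1, 2, 3, 4, 5, 6, 7, 8], network_of))  # (64, 126, 2)
print(ring_stats([1, 2, 5, 6], network_of))              # (32, 62, 2)
```

Under these assumptions both rings cross between the two local networks the same number of times; what differs is the number of ranks, and therefore the number of steps each chunk takes around the ring.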