AllReduce Bus Bandwidth decreases with larger network latency #241

Open
chenzhu99 opened this issue Jul 29, 2024 · 0 comments

chenzhu99 commented Jul 29, 2024

We are running an nccl_test experiment under varying network latency.

Topology: Servers 1, 2, 3 and 4 are connected to local network 1, and Servers 5, 6, 7 and 8 are connected to local network 2. Latency within each local network is below 5 us. Local networks 1 and 2 are connected by a link whose cross-network latency we can control.

We run a ring AllReduce test with the ring topology Server 1 - 2 - 3 - 4 - 5 - 6 - 7 - 8, i.e. a 64-GPU ring AllReduce.

  1. At low cross-network latency (say 20 us between local networks 1 and 2), we get the maximum bus bandwidth.
  2. When we increase the cross-network latency to several hundred us, the bus bandwidth drops by almost 50%.
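
For reference, the bus bandwidth that nccl-tests reports for AllReduce is derived from the measured time: the algorithm bandwidth (message size divided by elapsed time) is scaled by 2(n-1)/n, where n is the number of ranks. A minimal Python sketch of that conversion follows; the function and parameter names are illustrative and not taken from nccl-tests.

```python
def allreduce_bus_bandwidth(size_bytes, elapsed_s, n_ranks):
    """Convert a measured AllReduce time into bus bandwidth (bytes/s).

    algbw = size / time
    busbw = algbw * 2 * (n - 1) / n   (AllReduce correction factor)
    """
    if elapsed_s <= 0 or n_ranks < 2:
        raise ValueError("elapsed_s must be > 0 and n_ranks must be >= 2")
    algbw = size_bytes / elapsed_s
    return algbw * 2 * (n_ranks - 1) / n_ranks


# Illustrative numbers only (not measurements from this experiment):
# a 1 GB AllReduce that takes 100 ms on 64 ranks.
busbw = allreduce_bus_bandwidth(1e9, 0.1, 64)
print(f"busBW = {busbw / 1e9:.2f} GB/s")
```

Because the factor 2(n-1)/n is nearly the same for 64 ranks (about 1.97) and 32 ranks (about 1.94), a 50% drop in reported busBW corresponds to the AllReduce itself taking roughly twice as long.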

Another data point: when we halve the number of AllReduce ranks, i.e. a 32-GPU ring AllReduce test with the ring topology Server 1 - 2 - 5 - 6, we always get the maximum bus bandwidth regardless of the cross-network latency.

Could anyone help explain why busBW drops with the 64-GPU ring at large cross-network latency?
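
One way to frame the question is the textbook alpha-beta cost model for ring AllReduce: 2(n-1) steps, each moving size/n bytes and each paying the latency of the slowest hop in the ring. The sketch below is only that simplified model, under the assumption that steps do not overlap; it does not describe NCCL's actual pipelined implementation, and the link bandwidth, latency, and message size are hypothetical values, not measurements from this setup.

```python
def ring_allreduce_time(size_bytes, n_ranks, link_bw, step_latency_s):
    """Textbook ring AllReduce time: 2(n-1) non-overlapping steps.

    Each step sends size/n bytes over a link of bandwidth link_bw (bytes/s)
    and pays step_latency_s, assumed here to be the slowest hop in the ring.
    """
    steps = 2 * (n_ranks - 1)
    chunk_bytes = size_bytes / n_ranks
    return steps * (step_latency_s + chunk_bytes / link_bw)


def modeled_bus_bandwidth(size_bytes, n_ranks, link_bw, step_latency_s):
    """Bus bandwidth implied by the model, using the 2(n-1)/n factor."""
    t = ring_allreduce_time(size_bytes, n_ranks, link_bw, step_latency_s)
    return (size_bytes / t) * 2 * (n_ranks - 1) / n_ranks


# Hypothetical parameters, chosen only to illustrate the trend.
LINK_BW = 25e9        # bytes/s per link (assumption)
SIZE = 480e6          # bytes per AllReduce (assumption)

for n in (32, 64):
    for latency in (20e-6, 300e-6):
        bw = modeled_bus_bandwidth(SIZE, n, LINK_BW, latency)
        print(f"n={n:2d} latency={latency * 1e6:5.0f} us "
              f"busBW={bw / 1e9:6.2f} GB/s")
```

In this model the bus bandwidth simplifies to link_bw / (1 + n * latency * link_bw / size), so the latency penalty grows with the rank count and shrinks with the message size. That is consistent in direction with the 64-GPU ring being more sensitive than the 32-GPU ring, but the model alone would not predict a 32-GPU ring that is completely unaffected.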
