We are building supercomputing infrastructure for an internal GPU cluster with up to thousands of expensive GPUs. We are deciding whether to adopt mpi-operator or Slurm.
Slurm is widely adopted in large-scale HPC computing, so its scalability is well tested.
Are there any benchmark results for mpi-operator on a cluster with more than 3000 GPUs?
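For context on what scaling mpi-operator means in practice, here is a minimal sketch of an MPIJob sized for roughly 3000 GPUs. The job name, image, command, and replica counts are assumptions for illustration, not anything from this thread:

```yaml
# Hypothetical example: 384 workers x 8 GPUs = 3072 GPUs total.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: large-scale-benchmark        # assumed name
spec:
  slotsPerWorker: 8                  # one MPI slot per GPU on each worker
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: mpi-benchmark:latest          # placeholder image
            command: ["mpirun", "-np", "3072", "/opt/benchmark"]  # placeholder binary
    Worker:
      replicas: 384
      template:
        spec:
          containers:
          - name: worker
            image: mpi-benchmark:latest          # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 8
```

The controller itself only creates the launcher and worker pods and builds the hostfile; the heavy lifting at this scale is done by the scheduler and the interconnect.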
@terrytangyuan @alculquicondor Do you have any benchmark results?
At that scale, the limitations don't come from mpi-operator, but from the network and how pods are placed across it.
Do you have more details?
Agreed. I don't think there's anything on the controller side that blocks scaling. I don't have any public benchmarks.
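The placement concern raised above can be addressed with standard Kubernetes scheduling hints, for example a pod-affinity rule that encourages workers of the same job to land in the same network topology block. This is only a sketch; the topology label key and job name are assumptions, since real clusters expose their own topology labels:

```yaml
# Hypothetical worker-pod affinity fragment for an MPIJob's Worker template.
# Goal: keep inter-node MPI traffic within one network block where possible.
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              training.kubeflow.org/job-name: my-mpijob   # assumed job name
          topologyKey: network.example.com/block          # assumed topology label
```

Because this is a `preferred` rather than `required` rule, the scheduler still places pods when a single block cannot hold the whole job; whether that trade-off is acceptable depends on how sensitive the workload is to cross-block latency.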