How to run inference of a (very) large model across multiple GPUs? #2007
Please see the llama multiprocess example. Multi-GPU inference there is implemented by creating parallelized linear layers:
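A rough sketch of the idea behind those parallelized layers, not the exact code from the example: each rank keeps only a slice of a weight matrix and computes its slice of the output, and the shard outputs are recombined (in the real example via NCCL all-gather/all-reduce across CUDA devices). The `ColumnParallelLinear` name and the "two ranks in one process" simulation below are illustrative only.

```rust
// Illustrative sketch of a column-parallel linear layer: each rank holds a
// slice of the weight matrix and computes its slice of the output. The real
// example uses one CUDA device per process and NCCL to recombine shards.
use candle_core::{Device, Result, Tensor};

struct ColumnParallelLinear {
    weight: Tensor, // shape: (out_features / world_size, in_features)
}

impl ColumnParallelLinear {
    /// `rank` / `world_size` select which slice of the full weight this shard owns.
    fn new(full_weight: &Tensor, rank: usize, world_size: usize) -> Result<Self> {
        let (out_features, _in_features) = full_weight.dims2()?;
        let shard = out_features / world_size;
        let weight = full_weight.narrow(0, rank * shard, shard)?;
        Ok(Self { weight })
    }

    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        // Each shard produces (batch, out_features / world_size); the full
        // output is recovered by concatenating (all-gathering) shard outputs.
        x.matmul(&self.weight.t()?)
    }
}

fn main() -> Result<()> {
    let dev = Device::Cpu; // stand-in; each rank would use its own CUDA device
    let full_weight = Tensor::randn(0f32, 1.0, (8, 4), &dev)?;
    let x = Tensor::randn(0f32, 1.0, (2, 4), &dev)?;
    // Simulate two ranks in a single process just to show the sharding logic.
    let y0 = ColumnParallelLinear::new(&full_weight, 0, 2)?.forward(&x)?;
    let y1 = ColumnParallelLinear::new(&full_weight, 1, 2)?.forward(&x)?;
    let y = Tensor::cat(&[y0, y1], 1)?; // equals x.matmul(&full_weight.t()?)
    println!("{:?}", y.dims()); // [2, 8]
    Ok(())
}
```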
That example is for a single node. How about multiple nodes? Can we just run the example with
Update: I guess I must modify the code to support the world rank for MPI. I think sticking to NCCL as a backend might be better, but then does cudarc have support for cross-node communication? I found this library: https://github.com/oddity-ai/async-cuda
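For what it's worth, a rough sketch of what a cross-node NCCL bootstrap could look like with cudarc's NCCL bindings, assuming the NCCL unique id is exchanged through a path visible to every node (an MPI broadcast of the id would be the more usual mechanism). `RANK`, `WORLD_SIZE`, and `/shared/nccl_id` are illustrative choices, not candle or cudarc API:

```rust
// Hedged sketch: joining a cross-node NCCL communicator with cudarc.
// RANK / WORLD_SIZE and the shared id file are illustrative; a launcher such
// as mpirun or srun would normally provide the rank and distribute the id.
use cudarc::driver::safe::CudaDevice;
use cudarc::nccl::safe::{Comm, Id};
use std::{env, fs, path::Path, thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rank: usize = env::var("RANK")?.parse()?;
    let world_size: usize = env::var("WORLD_SIZE")?.parse()?;
    let id_path = "/shared/nccl_id"; // hypothetical path on a shared filesystem

    // Rank 0 creates the NCCL unique id and publishes it; other ranks wait for
    // it and reconstruct the same id, so all ranks join one communicator.
    let id = if rank == 0 {
        let id = Id::new().map_err(|e| format!("nccl error {e:?}"))?;
        let bytes: Vec<u8> = id.internal().iter().map(|&c| c as u8).collect();
        fs::write(id_path, bytes)?;
        id
    } else {
        while !Path::new(id_path).exists() {
            thread::sleep(Duration::from_millis(100));
        }
        let bytes = fs::read(id_path)?;
        let mut internal = [0i8; 128];
        for (dst, &src) in internal.iter_mut().zip(bytes.iter()) {
            *dst = src as i8;
        }
        Id::uninit(internal)
    };

    // One GPU per process here; with several processes per node, the local
    // rank would pick the device ordinal instead of 0.
    let device = CudaDevice::new(0).map_err(|e| format!("cuda error {e:?}"))?;
    let comm = Comm::from_rank(device, rank, world_size, id)
        .map_err(|e| format!("nccl error {e:?}"))?;
    println!("rank {rank}/{world_size} joined the NCCL communicator");
    // `comm` can now back the all-reduce / all-gather calls inside the
    // tensor-parallel layers.
    drop(comm);
    Ok(())
}
```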
I started a draft here for splitting a model across multiple GPUs on different nodes. There is a mapping feature as I linked above.
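Not the draft's actual mapping feature, just an illustration of the general "map layers to devices" idea it builds on, shown within a single process with two CUDA devices; across nodes the activation hop would become a network transfer instead of a device copy:

```rust
// Illustrative only: a naive per-layer device map within one process, putting
// the first half of a toy model on GPU 0 and the second half on GPU 1, and
// moving activations between them.
use candle_core::{DType, Device, Result, Tensor};
use candle_nn::{linear, Linear, Module, VarBuilder, VarMap};

fn main() -> Result<()> {
    let dev0 = Device::new_cuda(0)?;
    let dev1 = Device::new_cuda(1)?;

    // Each "half" of the model keeps its parameters on its own device.
    let vm0 = VarMap::new();
    let vb0 = VarBuilder::from_varmap(&vm0, DType::F32, &dev0);
    let first: Linear = linear(16, 32, vb0.pp("first"))?;

    let vm1 = VarMap::new();
    let vb1 = VarBuilder::from_varmap(&vm1, DType::F32, &dev1);
    let second: Linear = linear(32, 8, vb1.pp("second"))?;

    let x = Tensor::randn(0f32, 1.0, (4, 16), &dev0)?;
    let h = first.forward(&x)?;
    // Move the activation to the second device before the second half runs;
    // across nodes this hop would be a network transfer (e.g. NCCL send/recv).
    let h = h.to_device(&dev1)?;
    let y = second.forward(&h)?;
    println!("{:?}", y.dims()); // [4, 8]
    Ok(())
}
```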
I have the same question.
It is mentioned in the README that candle supports multi-GPU inference, using NCCL under the hood. How can this be implemented? I wonder if there is any available example to look at.
Also, I know PyTorch has things like DDP and FSDP; is candle's support for multi-GPU inference comparable to these techniques?