Clarification for multi GPU #4238
Comments
I'm not a maintainer here, but in case it helps: I think the instructions are in the READMEs too. Instructions to build llama.cpp are in the main README here. You've quoted the build commands already. There are loads of different ways of using llama.cpp (e.g. Python bindings, shell scripts, the REST server, etc.); check the examples directory here. If you then want to launch the server, the instructions are here. I think it will automatically use all GPUs that are visible to it. You can confirm by checking GPU utilisation.
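A minimal sketch of what I mean, assuming a CUDA or HIP build and the stock server example (the model path, the -ngl value and the monitoring tool are placeholders for whatever matches your setup):

# Launch the server with all layers offloaded to the GPU(s)
./server -m ./models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080

# In another terminal, watch utilisation while a request is in flight
# (nvidia-smi on CUDA systems, rocm-smi on ROCm/HIP systems)
watch -n 1 nvidia-smi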
@ruped no offense, but I am tired of useless answers. Do you think I have not looked at the README? It does not automatically use all the GPUs; you have to use tensor splitting, which treats the GPUs' VRAM as a shared resource. However, it runs very slowly to date.
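For anyone landing here later, this is a sketch of the tensor-splitting flags being referred to (the model path and the ratios are placeholders; -ts / --tensor-split and -mg / --main-gpu are the relevant options):

# Offload all layers and split the tensors roughly 60/40 across two GPUs;
# -mg selects which GPU holds the small intermediate buffers
./main -m ./models/model.gguf -ngl 99 -ts 60,40 -mg 0 -p "Hello"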
I assume I'm being pinged due to a recent comment I made regarding splitting the model layer-wise instead of tensor-wise. That was mainly just a theory I had; I don't own any GPUs to test it on. Even if you could do so, it would only be an improvement over splitting tensor-wise if the GPUs have significantly different performance characteristics, and even then you'd need to use tricks like speculation to utilize the GPUs simultaneously. Tensor-wise splitting should be performant for most cases. Unfortunately, I'm not a maintainer and don't touch any of the GPU-related code, so that's about as far as I can help. Perhaps some specific use cases, logs, and hardware specs could help other developers with your issue.
I have an open enhancement discussion on layer splitting as well (#4055).
The MPI backend is already pipeline parallel, so as soon as I finish fixing it, it would be trivial to run GPUs in pipeline parallel (run two MPI processes, each one only using a single GPU). However, that introduces a bit of overhead from MPI communication. I believe most popular runtimes will do shared-memory communication when on the same host, so it's probably not significant. Keep in mind, however, that pipeline parallelism is really only used in certain scenarios, precisely because of what the CUDA developer mentioned: without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead. There are cases where it makes sense, but they are few and far between, so it's not surprising the CUDA maintainer doesn't want to add support directly.
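To make that concrete, here is a rough sketch of what such a pipeline-parallel run might look like once the MPI backend is fixed; the build line is from the main README, while the one-GPU-per-rank wrapper and the Open MPI rank variable are my own assumptions, not something llama.cpp ships:

# Build with MPI support
make clean && make CC=mpicc CXX=mpicxx LLAMA_MPI=1

# Hypothetical wrapper that pins each MPI rank to its own GPU
cat > one_gpu_per_rank.sh <<'EOF'
#!/bin/sh
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"
EOF
chmod +x one_gpu_per_rank.sh

# Two ranks on one host, each process seeing only a single GPU
mpirun -n 2 ./one_gpu_per_rank.sh ./main -m ./models/model.gguf -p "Hello"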
This issue was closed because it has been inactive for 14 days since being marked as stale.
From what I can see there aren't any docs that make it clear how to leverage multiple GPUs, outside of disparate threads in the issues.
This is all I have found; what are the actual instructions to use multi-GPU?
make clean && LLAMA_HIPBLAS=1 HIP_VISIBLE_DEVICES=0,1 make -j
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_80
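Putting the pieces from those threads together, my best guess at the full flow is to enable the GPU backend at build time and control the multi-GPU split at run time, roughly like this (the flag values and model path are guesses on my part; happy to be corrected):

# Example: CUDA build with the GPU backend enabled
make clean && make -j LLAMA_CUBLAS=1

# Run with all layers offloaded, splitting tensors evenly across two GPUs
./main -m ./models/model.gguf -ngl 99 -ts 1,1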