Clarification for multi GPU #4238
Comments
I'm not a maintainer here, but in case it helps: I think the instructions are in the READMEs too. Instructions to build llama.cpp are in the main README here. You've quoted the build commands already. There are loads of different ways of using llama.cpp (e.g. Python bindings, shell scripts, the REST server, etc.); check the examples directory here. If you then want to launch the server, the instructions are here. I think it will automatically use all GPUs that are visible to it. You can confirm by checking GPU utilisation.
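A minimal sketch of what I mean, assuming a CUDA or HIP build and the stock server example (the model path, the -ngl value and the monitoring tool are placeholders for whatever matches your setup):

# Launch the server with all layers offloaded to the GPU(s)
./server -m ./models/model.gguf -ngl 99 --host 0.0.0.0 --port 8080

# In another terminal, watch utilisation while a request is in flight
# (nvidia-smi on CUDA systems, rocm-smi on ROCm/HIP systems)
watch -n 1 nvidia-smi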
@ruped no offense, but I am tired of useless answers. Do you think I have not looked at the README? It does not automatically use all the GPUs; you have to use tensor splitting, which treats the GPUs' VRAM as a shared resource. However, it runs very slowly to date.
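For anyone landing here later, this is a sketch of the tensor-splitting flags being referred to (the model path and the ratios are placeholders; -ts / --tensor-split and -mg / --main-gpu are the relevant options):

# Offload all layers and split the tensors roughly 60/40 across two GPUs;
# -mg selects which GPU holds the small intermediate buffers
./main -m ./models/model.gguf -ngl 99 -ts 60,40 -mg 0 -p "Hello"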
I assume I'm being pinged due to a recent comment I made regarding splitting the model layer-wise instead of tensor-wise. That was mainly just a theory I had; I don't own any GPUs to test it on. Even if you could do so, it would only be an improvement over splitting tensor-wise if the GPUs have significantly different performance characteristics, and even then you'd need to use tricks like speculation to utilize the GPUs simultaneously. Tensor-wise splitting should be performant for most cases. Unfortunately, I'm not a maintainer and don't touch any of the GPU-related code, so that's about as far as I can help. Perhaps some specific use cases, logs, and hardware specs could help other developers with your issue.
I have an open enhancement discussion on layer splitting as well (#4055).
The MPI backend is already pipeline parallel, so as soon as I finish fixing it, it would be trivial to run GPUs in pipeline parallel (run two MPI processes, each one only using a single GPU). However, that introduces a bit of overhead from MPI communication. I believe most popular runtimes will do shared-memory communication when on the same host, so it's probably not significant. Keep in mind, however, that pipeline parallelism is really only used in certain scenarios, precisely because of what the CUDA developer mentioned: without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead. There are cases where it makes sense, but they are few and far between, so it's not surprising the CUDA maintainer doesn't want to add support directly.
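To make that concrete, here is a rough sketch of what such a pipeline-parallel run might look like once the MPI backend is fixed; the build line is from the main README, while the one-GPU-per-rank wrapper and the Open MPI rank variable are my own assumptions, not something llama.cpp ships:

# Build with MPI support
make clean && make CC=mpicc CXX=mpicxx LLAMA_MPI=1

# Hypothetical wrapper that pins each MPI rank to its own GPU
cat > one_gpu_per_rank.sh <<'EOF'
#!/bin/sh
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"
EOF
chmod +x one_gpu_per_rank.sh

# Two ranks on one host, each process seeing only a single GPU
mpirun -n 2 ./one_gpu_per_rank.sh ./main -m ./models/model.gguf -p "Hello"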
This issue was closed because it has been inactive for 14 days since being marked as stale.
From what I can see there aren't any docs that make it clear how to leverage multiple GPUs, outside of disparate threads in the issues.
This is all I have found; what are the actual instructions to use multi-GPU?
make clean && LLAMA_HIPBLAS=1 HIP_VISIBLE_DEVICES=0,1 make -j
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_80
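Putting the pieces from those threads together, my best guess at the full flow is to enable the GPU backend at build time and control the multi-GPU split at run time, roughly like this (the flag values and model path are guesses on my part; happy to be corrected):

# Example: CUDA build with the GPU backend enabled
make clean && make -j LLAMA_CUBLAS=1

# Run with all layers offloaded, splitting tensors evenly across two GPUs
./main -m ./models/model.gguf -ngl 99 -ts 1,1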