Allow using multiple GPUs without tensor parallelism #1031
Comments
Pipeline parallelism is trash for performance (latency; for throughput it's probably the best). You can try it relatively easily by removing the capture of. Expect bad latency imho (but I'd be happy to revisit my opinion if it's actually ok). Also there might be ways to actually allow the splitting on non-divisible heads using some zero padding (you can check out TensorParallelEmbeddings and TensorParallelHead for ideas). If you're trying something like that we'd be happy to review!
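For what the zero-padding idea could look like in practice, here is a rough, untested sketch; it is not TGI's actual implementation, and the function name, shapes, and Falcon-like dimensions are purely illustrative:

```python
# Sketch only: pad the head dimension up to the next multiple of the world size
# so a weight whose rows are grouped by head can be split evenly across shards.
import torch


def shard_headwise_weight(weight: torch.Tensor,
                          num_heads: int,
                          head_dim: int,
                          world_size: int,
                          rank: int) -> torch.Tensor:
    """Split `weight` (shape [num_heads * head_dim, hidden]) across `world_size`
    shards, zero-padding extra heads when num_heads is not divisible."""
    padded_heads = -(-num_heads // world_size) * world_size  # ceil to multiple
    if padded_heads != num_heads:
        pad_rows = (padded_heads - num_heads) * head_dim
        padding = torch.zeros(pad_rows, weight.shape[1], dtype=weight.dtype)
        weight = torch.cat([weight, padding], dim=0)
    heads_per_shard = padded_heads // world_size
    start = rank * heads_per_shard * head_dim
    end = start + heads_per_shard * head_dim
    return weight[start:end]


# Example: 71 heads, head_dim 64, 2 GPUs -> 72 padded heads, 36 per shard,
# with the last shard partly zero-padded.
w = torch.randn(71 * 64, 4544)
shard0 = shard_headwise_weight(w, num_heads=71, head_dim=64, world_size=2, rank=0)
print(shard0.shape)  # torch.Size([2304, 4544])
```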
Hi, I am wondering if it is possible to do data parallelism with TGI? For example, I have 8 GPUs and I want to have 8 separate LLMs, one loaded on each of them. Would it be possible for TGI to handle this? Of course I could run 8 Docker containers, but I need central control to balance the load between the GPUs. Thank you in advance.
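For reference, a minimal sketch of what such central control could look like outside of TGI: one TGI container per GPU plus a tiny round-robin proxy. The ports, payload shape, and absence of health checks or queueing are assumptions for illustration only:

```python
# Sketch: "data parallelism by hand" over several single-GPU TGI instances,
# e.g. containers started with CUDA_VISIBLE_DEVICES=0..7 on ports 8080..8087.
import itertools

import requests

BACKENDS = [f"http://localhost:{8080 + i}" for i in range(8)]
_next_backend = itertools.cycle(BACKENDS)  # naive round-robin


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    backend = next(_next_backend)
    resp = requests.post(
        f"{backend}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    resp.raise_for_status()
    # TGI's /generate endpoint returns a JSON body with "generated_text".
    return resp.json()["generated_text"]


if __name__ == "__main__":
    print(generate("What is tensor parallelism?"))
```

A real deployment would add health checks and least-busy routing, but the point is only that the balancing layer sits outside TGI today.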
Would be great to see pipeline parallelism in TGI for applications that require high throughput but don't care about latency. Here is my intuition for why a cluster of 4080s / 4090s combined with pipeline parallelism would achieve the best possible cost per token for larger models. Please correct me if I'm wrong:
If anyone has experimented with the method laid out by @Narsil, please share your results. Otherwise I'll be experimenting soon.
@Hannibal046 Indeed, that's what you need to do. @lapp0 For point 2, you can always use a GPTQ/AWQ version of those models on a single 4090; I think that's probably the best solution. Also, don't assume too much from theoretical numbers, they usually end up quite far from reality very fast. If you manage to experiment, I'd be glad to hear if you pull something nice off. Godspeed!
Does it mean that if I have 8 GPUs and I deploy a model that doesn't support tensor parallelism, I should start 8 TGI instances and then do another layer of load balancing myself?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
@scse-l Unfortunately, it appears that TGI doesn't fall back to pipeline parallelism under the conditions @Narsil described. In my review of the code and documentation a few months ago, I found that TGI cannot support "true" pipeline parallelism. I didn't take good notes, but here are some resources.
@lapp0 Got it. I'll check the refs. Thanks a lot.
Feature request
Currently, to use multiple GPUs one must set `--num-shards` to >1. This enables tensor parallelism, but multiple GPUs can be used in other ways as well. In fact, in the code, `from_pretrained` already has an argument `device_map` set to `"auto"`, which would use multiple GPUs if the single shard had them available. This means that most likely it's not much work to rework TGI to allow that.
Motivation
This would allow more customization of the LLM deployment.
Also, some models don't work with tensor parallelism. E.g. `falcon-7b-instruct` has 71 heads, which means it can work only on 1 or 71 shards. With, e.g., two Nvidia Tesla T4s available, Falcon 7B won't fit on a single one; it would fit on two, but we can't do that with TGI.
Your contribution
I'm happy to test the solution.