[Usage]: How to use Multi-instance in vLLM? (Model replication on multiple GPUs) #6155
Comments
It works fine with the online mode - you just create multiple servers (even reusing the same GPUs!), but indeed it doesn't work with the offline mode. Here is an example on an 8x H100 node: the first model comes up fine, and then it hangs while initializing the 2nd model.
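A minimal sketch of the kind of offline script being described (the model name and tensor-parallel size are illustrative assumptions, not the exact code from the original report):

```python
from vllm import LLM, SamplingParams

# First LLM instance (tensor parallel across 4 GPUs); this one initializes fine.
llm_a = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)

# Second instance created in the same process; this is where the reported hang occurs.
llm_b = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)

params = SamplingParams(max_tokens=32)
print(llm_a.generate(["Hello, world"], params)[0].outputs[0].text)
print(llm_b.generate(["Hello, world"], params)[0].outputs[0].text)
```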
The problem seems to be in some internal state that is not being isolated; even if I add explicit isolation between the two instantiations, it still hangs in the init of the 2nd model.
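The exact isolation attempt from the original comment is not reproduced here; one common form of it, given as a hypothetical reconstruction rather than the reporter's actual code, is to tear down the first engine and free GPU memory before building the second:

```python
import gc

import torch
from vllm import LLM

llm_a = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)

# Attempted isolation (hypothetical reconstruction, not the reporter's exact code):
# drop the first engine and release its GPU memory before building the second one.
del llm_a
gc.collect()
torch.cuda.empty_cache()

# Per the report, this second construction still hangs despite the cleanup above.
llm_b = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)
```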
@stas00 I have been debugging at least the latter case and will open a fix today. I can check whether it also works with concurrent LLMs, but I expect there may be additional isolation changes needed for that.
Thanks a lot for working on that, @njhill - that will help with disaggregation-type offline use of vLLM.
@stas00 I wonder if it's possible to create multiple servers on the same GPU if GPU memory is not an issue?
With the online setup, yes, it'd work (see the sketch below), but this is an offline recipe; please read #6155 (comment)
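For reference, a sketch of the online approach: two independent OpenAI-compatible servers sharing one GPU by splitting its memory. The model name, ports, and memory fractions are illustrative assumptions:

```python
import subprocess

# Launch two OpenAI-compatible API servers on the same GPU,
# each claiming a little under half of the GPU memory.
common = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--gpu-memory-utilization", "0.45",
]
server_a = subprocess.Popen(common + ["--port", "8000"])
server_b = subprocess.Popen(common + ["--port", "8001"])

# Each server is a separate process, so there is no shared engine state
# to isolate; clients simply target different ports.
server_a.wait()
server_b.wait()
```

The offline case discussed in this issue is different because both engines live inside one Python process.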
@stas00 did the patch from njhill fix the issue you raised?
@njhill hasn't posted an update since his last note #6155 (comment), so I wasn't able to validate it. Do you have the PR link? I would be happy to re-test.
Thank you for the update, @njhill - there was progress and it no longer hangs in the init of the 2nd model, but now it hangs further along:
tb:
I'm running this example #6155 (comment) (same problem w/ or w/o 0.6.3post1 here).
😢 thanks @stas00, we can keep this open and I'll try to get to it soon.
Thank you, @njhill - my example can serve as a repro. And of course we want to have more than one model. For example, one small model could be used to augment the prompt, and then the larger model could do the normal generation using the extended prompt (see the sketch below).
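A sketch of that two-model offline pipeline as described (the model names are illustrative assumptions; creating both engines in one process is exactly the case that currently hangs):

```python
from vllm import LLM, SamplingParams

# Small model augments/extends the user prompt (illustrative model name).
augmenter = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Larger model generates from the extended prompt (illustrative model name).
generator = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=4)

prompt = "Summarize the benefits of paged attention."
extended = augmenter.generate(
    [f"Rewrite this prompt with more useful detail: {prompt}"],
    SamplingParams(max_tokens=128),
)[0].outputs[0].text

result = generator.generate([extended], SamplingParams(max_tokens=256))
print(result[0].outputs[0].text)
```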
I would like to use techniques such as the Multi-instance Support provided by the tensorrt-llm backend. In its documentation, I can see that multiple models are served using modes like Leader mode and Orchestrator mode. Does vLLM support this functionality separately, or should I implement it similarly to the tensorrt-llm backend?
For reference: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#leader-mode