When there are multiple GPUs, only one GPU is used #7664
Comments
Hi @gyr66, thanks for raising this issue and thanks for trying the Triton CLI! As Olga mentioned, yes, the default configs produced are currently for a "quickstart" path and are pre-defined as a single Triton model instance. Running multiple model instances requires further knowledge of the TRT-LLM backend, and may not work exactly the same as other backends because the current implementation uses MPI for communication. There is a guide with more comprehensive details and documentation on the various components involved in serving multiple TRT-LLM model instances; please check it out: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#running-multiple-instances-of-llama-model-on-multiple-gpus. Hopefully the Triton CLI generated configs give you a good functional starting point for a single instance, which can then be tweaked by following this guide to support multi-instance. CC @Tabrizian for viz
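For a rough sense of the kind of change that guide describes, the tensorrt_llm model's config.pbtxt in the tensorrtllm_backend examples exposes a gpu_device_ids parameter for pinning a model copy to specific GPUs. The snippet below is only an illustrative sketch drawn from those examples, not something confirmed in this thread; the exact keys and values to use come from the guide:

```
# Illustrative sketch only -- adapted from the tensorrtllm_backend example configs,
# not from this issue. Pins this copy of the tensorrt_llm model to GPU 0.
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0"
  }
}
```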
@gyr66, let us know if there's anything else we can help you with. Feel free to close this issue.
Thank you so much for your patient and detailed responses! I am wondering: if I don't use TP, could I simply start an independent server process for each GPU and place an NGINX load balancer in front? Would this be consistent with leader mode?
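For concreteness, the setup described in that question would look roughly like the sketch below: one tritonserver process per GPU, each pinned with CUDA_VISIBLE_DEVICES, with a load balancer distributing requests across the two HTTP ports. The port numbers are assumptions chosen only for illustration; the maintainers do not confirm this approach in this thread.

```
# Illustrative sketch only: one independent Triton process per GPU (no TP),
# each pinned to a single device via CUDA_VISIBLE_DEVICES.
# Port numbers are assumptions picked to avoid conflicts.
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=/root/models \
  --http-port 8000 --grpc-port 8001 --metrics-port 8002 &
CUDA_VISIBLE_DEVICES=1 tritonserver --model-repository=/root/models \
  --http-port 8010 --grpc-port 8011 --metrics-port 8012 &
# An external load balancer (e.g. NGINX) would then distribute requests
# across localhost:8000 and localhost:8010.
```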
Description
When there are multiple GPUs, only one GPU is used.
Triton Information
Container: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
To Reproduce
Follow the instructions at https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llama2/trtllm_guide.md
The model configuration file (/root/models/llama-3.1-8b-instruct/config.pbtxt) is:
I have clearly specified it to use GPU 0 and GPU 1.
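(The actual config.pbtxt contents were not preserved above. As a sketch of the kind of stanza being described, assuming a standard Triton instance_group was used, it might look like the following; the count and exact values are assumptions.)

```
# Illustrative sketch only -- the original file contents were not included in this issue.
# A standard Triton instance_group pinning the model to GPUs 0 and 1:
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```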
The postprocessing, preprocessing, and tensorrt_llm models are left unchanged.
Expected behavior
The model should be loaded on GPU 0 and GPU 1, and requests should be handled on either GPU based on load.
Here is what I got:
The model is only loaded on GPU 0.
When I run a benchmark, only GPU 0 is used:
Here is the Triton server log: