[REQUEST] Does DeepSpeed support dynamic batching during inference? #3455
Comments
Model size: 6B. When I use DeepSpeed for single-card inference, QPS does not exceed 2 and GPU utilization sits around 52%. When will DeepSpeed support dynamic batch sizes so that GPU utilization improves?
You can check NVIDIA Triton, which supports dynamic batching and other features to increase GPU utilization. I am not sure where or how DeepSpeed could support this without extending its scope considerably.
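For reference, Triton's dynamic batching is enabled per model in its `config.pbtxt`. A minimal sketch; the values below are illustrative, not tuned:

```
# config.pbtxt (excerpt): enable server-side dynamic batching for this model.
dynamic_batching {
  # Batch sizes the scheduler should try to form from queued requests.
  preferred_batch_size: [ 4, 8 ]
  # How long (in microseconds) a request may wait for others to join its batch.
  max_queue_delay_microseconds: 100
}
```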
You can look at AWS's DeepJavaLibrary Serving: https://github.com/deepjavalibrary/djl-serving. It uses Netty/Java to dispatch requests to the inference workers and can be configured to batch requests dynamically based on a time window. The umbrella project is https://djl.ai/. There are also some good tutorials on using DJL Serving with DeepSpeed for inference: https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop
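For illustration, djl-serving is typically configured per model through a `serving.properties` file. A minimal sketch assuming a DeepSpeed engine; the values are placeholders and the property names should be checked against the djl-serving docs for your version:

```
# serving.properties (excerpt)
engine=DeepSpeed
# Maximum number of requests merged into one inference batch.
batch_size=8
# Time window (ms) a request may wait for the batch to fill before dispatch.
max_batch_delay=100
```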
Thanks.
@colynhn @trianxy @stan-kirdey In case it is not too late: I am adding dynamic batch sizes (and the corresponding LR scaling) to DeepSpeed in PR 5237, as part of the data analysis module. Stay tuned.
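As background on the LR-scaling part: a common heuristic is to rescale the learning rate with the effective batch size, linearly (Goyal et al., 2017) or by the square root for adaptive optimizers. A minimal Python sketch of that idea, independent of how PR 5237 actually implements it:

```python
def scale_lr(base_lr: float, base_batch_size: int, batch_size: int,
             rule: str = "linear") -> float:
    """Rescale a reference learning rate when the effective batch size changes."""
    ratio = batch_size / base_batch_size
    if rule == "linear":   # linear scaling rule
        return base_lr * ratio
    if rule == "sqrt":     # gentler scaling, often used with Adam-style optimizers
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# Reference LR tuned at batch size 32; the dynamic batcher just produced a batch of 128.
print(scale_lr(1e-4, 32, 128))          # 4e-04
print(scale_lr(1e-4, 32, 128, "sqrt"))  # 2e-04
```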