[Misc] benchmark: Add option to set max concurrency #9390
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
In general LGTM. One question I have is how this should be used together with request rate (QPS). Are we expecting only one of them to be set at a time?
For me, maximum concurrency is mainly used to avoid request buffer overflow on the inference engine side (IIRC, if we send >1000 requests to vLLM, TGI, or other inference engines, there will be request failures due to request buffer overflow). This should always be set when benchmarking many requests under very high QPS to prevent request failures.
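As a side note on how the two knobs can be combined, here is a minimal sketch (the helper names are illustrative, not the benchmark's actual functions): --request-rate controls how quickly requests are launched, while --max-concurrency caps how many are in flight at any moment, so the two can be used together.

import asyncio
import random
import time

async def send_one_request(prompt: str) -> float:
    # Stand-in for the real HTTP call to the inference server.
    start = time.perf_counter()
    await asyncio.sleep(0.05)
    return time.perf_counter() - start

async def run_benchmark(prompts, request_rate: float, max_concurrency=None):
    # Optional semaphore caps in-flight requests; None means unbounded.
    semaphore = (asyncio.Semaphore(max_concurrency)
                 if max_concurrency is not None else None)

    async def limited_send(prompt):
        if semaphore is None:
            return await send_one_request(prompt)
        async with semaphore:
            return await send_one_request(prompt)

    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(limited_send(prompt)))
        if request_rate != float("inf"):
            # Poisson arrivals: exponentially distributed gaps between launches.
            await asyncio.sleep(random.expovariate(request_rate))
    return await asyncio.gather(*tasks)

# Example: launch at ~20 QPS but allow at most 8 requests in flight.
# asyncio.run(run_benchmark(["hello"] * 100, request_rate=20, max_concurrency=8))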
benchmarks/benchmark_serving.py
Outdated
parser.add_argument("--max-concurrency",
                    type=int,
                    default=None,
                    help="Maximum number of concurrent requests.")
It would be great if you could explain more on why one would set --max-concurrency given that --request-rate is already available. The main reason for me is to avoid request failures due to sending too many requests to the inference engine, but I'm not sure.
To give some context - I think the main motivation here is that sometimes inference servers are set up with max concurrency as a metric for autoscaling.
Currently in vLLM, we don't have a way to "reject" requests depending on the server load, so very often users set up this concurrency control at a higher level. It would therefore be great if we could simulate this kind of setup in the benchmark framework as well.
I think it'd be great to add more information for this argument based on my reply to Cody's comment above if that makes sense.
I expanded the help text. Let me know what you think!
Add a new flag to `benchmark_serving.py` that allows you to specify the maximum number of concurrent requests. If not specified, it defaults to the current behavior of unbounded concurrency. Signed-off-by: Russell Bryant <[email protected]>
The message looks good to me. In addition, please also update other parts such as the logging and the dumped file name. Please search ...
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
I added some logging. I wasn't sure about the default filename, though.
Just make it simple like the following?
max_concurrency_str = f"-concurrency{args.max_concurrency}" if args.max_concurrency is not None else ""
file_name = f"{backend}-{args.request_rate}qps{max_concurrency_str}-{base_model_id}-{current_dt}.json"  # noqa
Signed-off-by: Russell Bryant <[email protected]>
Sure, that works. Done!
LGTM
CI is currently failing on the use of nullcontext as an async context manager; asyncio support was only added to nullcontext in Python 3.10: python/cpython#85715
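One way to sidestep that version requirement, sketched here with a placeholder request coroutine rather than taken from this PR's final diff, is to branch on whether a concurrency limit was requested instead of relying on nullcontext supporting async with:

import asyncio

async def do_request():
    # Placeholder for the benchmark's real request coroutine.
    await asyncio.sleep(0.01)
    return "ok"

async def fetch(semaphore=None):
    if semaphore is None:
        # Unbounded concurrency: no context manager needed at all.
        return await do_request()
    async with semaphore:
        # Bounded concurrency; works on any Python version with asyncio.
        return await do_request()

async def main():
    sem = asyncio.Semaphore(2)
    print(await asyncio.gather(*(fetch(sem) for _ in range(5))))

asyncio.run(main())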
Signed-off-by: Russell Bryant <[email protected]>
LGTM
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: charlifu <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Vinay Damodaran <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Alvant <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Amit Garg <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: qishuai <[email protected]>
Signed-off-by: Russell Bryant <[email protected]> Signed-off-by: Sumit Dubey <[email protected]>
Add a new flag to `benchmark_serving.py` that allows you to specify the maximum number of concurrent requests. If not specified, it defaults to the current behavior of unbounded concurrency.
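For illustration, a typical invocation combining the new flag with the existing rate control might look like the following (the model name is a placeholder, and the other flags may differ depending on the vLLM version):

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --request-rate 16 \
    --max-concurrency 64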
Closes #3127
Signed-off-by: Russell Bryant [email protected]