[Misc] benchmark: Add option to set max concurrency #9390

Merged · 5 commits merged into vllm-project:main on Oct 18, 2024

Conversation

@russellb (Collaborator) commented Oct 15, 2024:

Add a new flag to benchmark_serving.py that allows you to specify
the maximum number of concurrent requests. If not specified, it
defaults to the current behavior of unbounded concurrency.

Closes #3127

Signed-off-by: Russell Bryant [email protected]
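For context, here is a minimal sketch of the general pattern behind such a flag: wrap each request in an asyncio.Semaphore when a limit is given, and skip the limiter otherwise. This is a hedged illustration, not the code added in this PR; `send_request` and the prompt list are hypothetical stand-ins for the benchmark's real request coroutine.

```python
# Hedged sketch (not this PR's actual code) of bounding in-flight requests
# with an asyncio.Semaphore; send_request and the prompts are hypothetical.
import asyncio
from typing import List, Optional


async def send_request(prompt: str) -> str:
    # Placeholder for the real HTTP call to the serving endpoint.
    await asyncio.sleep(0.1)
    return f"response for {prompt!r}"


async def run_benchmark(prompts: List[str],
                        max_concurrency: Optional[int]) -> List[str]:
    # None preserves the existing behavior: every request is in flight at once.
    semaphore = (asyncio.Semaphore(max_concurrency)
                 if max_concurrency is not None else None)

    async def limited_send(prompt: str) -> str:
        if semaphore is None:
            return await send_request(prompt)
        async with semaphore:
            return await send_request(prompt)

    return await asyncio.gather(*(limited_send(p) for p in prompts))


if __name__ == "__main__":
    out = asyncio.run(run_benchmark([f"prompt-{i}" for i in range(8)],
                                    max_concurrency=2))
    print(len(out), "responses")
```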


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac (Collaborator) left a comment:

In general LGTM. One question I have is how this should be used together with request rate (QPS). Are we expecting only one of them to be set at a time?

Review thread on benchmarks/benchmark_serving.py (outdated, resolved)
@KuntaiDu (Collaborator) commented:

> In general LGTM. One question I have is how this should be used together with request rate (QPS). Are we expecting only one of them to be set at a time?

For me, maximum concurrency is mainly used to avoid request buffer overflow on the inference engine side (IIRC, if we send >1000 requests to vLLM, TGI, or other inference engines, there will be request failures due to request buffer overflow). This should always be set when benchmarking many requests under very high QPS, to prevent request failures.

Comment on lines 781 to 784
parser.add_argument("--max-concurrency",
type=int,
default=None,
help="Maximum number of concurrent requests.")
Collaborator:

It would be great if you could explain more about why one would set --max-concurrency given that --request-rate is already available. The main reason for me is to avoid request failures due to sending too many requests to the inference engine, but I'm not sure.

@ywang96 (Member) commented Oct 16, 2024:

To give some context: I think the main motivation here is that inference servers are sometimes set up with max concurrency as a metric for autoscaling.

Currently vLLM doesn't have a way to "reject" requests depending on server load, so users very often set up this concurrency control at a higher level. It would therefore be great if we could simulate this kind of setup in the benchmark framework as well.

@ywang96 (Member) commented:

I think it'd be great to add more information for this argument based on my reply to Cody's comment above, if that makes sense.

@russellb (Collaborator, Author) replied:

I expanded the help text. Let me know what you think!

Commit pushed: "Add a new flag to `benchmark_serving.py` that allows you to specify the maximum number of concurrent requests. If not specified, it defaults to the current behavior of unbounded concurrency." (Signed-off-by: Russell Bryant <[email protected]>)
@comaniac (Collaborator) commented:

The message looks good to me. In addition, please also update other parts such as the logging and the dumped file name. Please search for request_rate in this file and make sure max_concurrency is reflected in those places as well.

@russellb (Collaborator, Author) replied:

> The message looks good to me. In addition, please also update other parts such as the logging and the dumped file name. Please search for request_rate in this file and make sure max_concurrency is reflected in those places as well.

I added some logging and added max_concurrency to the output JSON data.

I wasn't sure about the default filename, though. request-rate always has a value (it defaults to inf), while max-concurrency defaults to None. At some point, it seems reasonable to just fall back to setting custom filenames. If you feel strongly about this, though, I'll update it to conditionally include the concurrency setting when it is set.
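As an illustration only (the helper, field names, and log wording below are assumptions rather than the PR's actual code), surfacing the setting in the logs and in the dumped JSON alongside request_rate could look like:

```python
# Hedged sketch: echo the concurrency setting next to the request rate and
# carry it into the benchmark's JSON output. Field names are illustrative.
import json
from typing import Optional


def dump_run_metadata(request_rate: float, max_concurrency: Optional[int],
                      path: str) -> None:
    print(f"Request rate: {request_rate}")
    print(f"Maximum request concurrency: {max_concurrency}")
    metadata = {
        "request_rate": request_rate if request_rate < float("inf") else "inf",
        "max_concurrency": max_concurrency,  # None means unbounded
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)


dump_run_metadata(request_rate=float("inf"), max_concurrency=64,
                  path="run-metadata.json")
```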

@comaniac (Collaborator) replied:

> The message looks good to me. In addition, please also update other parts such as the logging and the dumped file name. Please search for request_rate in this file and make sure max_concurrency is reflected in those places as well.

> I added some logging and added max_concurrency to the output JSON data.

> I wasn't sure about the default filename, though. request-rate always has a value (it defaults to inf), while max-concurrency defaults to None. At some point, it seems reasonable to just fall back to setting custom filenames. If you feel strongly about this, though, I'll update it to conditionally include the concurrency setting when it is set.

Just make it simple like the following?

max_concurrency_str = f"-concurrency{args.max_concurrency}" if args.max_concurrency is not None else ""
file_name = f"{backend}-{args.request_rate}qps{max_concurrency_str}-{base_model_id}-{current_dt}.json"  #noqa

@russellb (Collaborator, Author) replied:

> Just make it simple like the following?

Sure, that works. Done!

@comaniac (Collaborator) left a comment:

LGTM

@comaniac added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 16, 2024
@russellb (Collaborator, Author) commented:

CI is currently failing on the use of contextlib.nullcontext(). It looks like it worked for me because I was using a newer version of Python, while it's failing with 3.9 in CI.

asyncio support was added to nullcontext in 3.10: python/cpython#85715
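One possible workaround (a sketch, not necessarily what this PR ended up doing) is to avoid `async with contextlib.nullcontext()` on Python 3.9 by building a no-op async context manager with contextlib.asynccontextmanager, which is available on 3.9:

```python
# Sketch of a no-op async context manager that also works on Python 3.9,
# where contextlib.nullcontext() cannot be used with `async with`.
import asyncio
import contextlib
from typing import Optional


@contextlib.asynccontextmanager
async def async_nullcontext():
    # Does nothing; stands in for the semaphore when concurrency is unbounded.
    yield


async def issue_request(max_concurrency: Optional[int]) -> None:
    limiter = (asyncio.Semaphore(max_concurrency)
               if max_concurrency is not None else async_nullcontext())
    async with limiter:
        await asyncio.sleep(0.01)  # placeholder for the real request


asyncio.run(issue_request(max_concurrency=None))
```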

@KuntaiDu (Collaborator) left a comment:

LGTM

@comaniac comaniac merged commit 7dbe738 into vllm-project:main Oct 18, 2024
38 checks passed
charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Oct 23, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: none yet
Linked issue that merging this pull request may close: Benchmarking script does not limit the maximum concurrency (#3127)
4 participants