Make _prepare_sample non blocking and pin memory of CPU input buffers #2207

Merged

WoosukKwon merged 9 commits into vllm-project:main from hanzhi713:less_blocking

Dec 20, 2023

Contributor

hanzhi713 commented Dec 19, 2023 (edited)

Hide the latency of _prepare_sample behind model execution, since it does not appear to depend on the model's forward pass.

We could use a separate copy stream for the host-to-device (H2D) copies in _prepare_sample, but we probably don't need to, because these copies are very short.
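As a rough illustration of the idea in this description, the sketch below stages sampling parameters in pinned host memory and issues non-blocking copies on the current stream. The helper name `async_h2d` and the example values are hypothetical and not taken from this PR's diff; the sketch assumes PyTorch is installed and falls back to a plain CPU tensor when CUDA is unavailable.

```python
import torch


def async_h2d(data, dtype, device, pin_memory):
    """Build a CPU tensor from a Python list and start copying it to `device`.

    With a pinned (page-locked) source buffer and non_blocking=True, a copy
    to a CUDA device is enqueued on the current stream and the call returns
    without waiting for the transfer to finish.
    """
    cpu_tensor = torch.tensor(data, dtype=dtype, pin_memory=pin_memory)
    return cpu_tensor.to(device=device, non_blocking=True)


# Pinned memory needs a CUDA runtime, so use pageable memory on CPU-only hosts.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Hypothetical per-sequence sampling parameters (illustrative values only).
temperatures = async_h2d([0.7, 1.0, 0.0], torch.float32, device, pin_memory=use_cuda)
top_ks = async_h2d([50, 40, 1], torch.int64, device, pin_memory=use_cuda)
```

Because the copies are enqueued on the current stream, kernels launched later on that stream see the data without explicit synchronization. A dedicated copy stream would additionally need an event or a `wait_stream` call before the data is consumed.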


          less blocking

8de791a

Collaborator

Yard1 commented Dec 19, 2023

Looks good. Unfortunately, we cannot use pinned memory when running under WSL; can we add a check for that?

Yard1 reviewed

vllm/worker/model_runner.py (outdated, resolved)

hanzhi713 added 2 commits

December 19, 2023 23:08


          add wsl check

30337bd


          format

f131669

Contributor Author

hanzhi713 commented Dec 19, 2023

> Looks good - unfortunately we cannot use pinned memory if we are in wsl, can we add a check for that?

Just added. Please check
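A check of this kind could look roughly like the sketch below. It is an assumption about one possible approach, not necessarily the code merged in this PR. It relies on WSL kernels reporting "microsoft" in their uname fields; the `uname_fields` parameter is hypothetical and exists only so the function can be exercised without running under WSL.

```python
import platform
from typing import Iterable, Optional


def in_wsl(uname_fields: Optional[Iterable[str]] = None) -> bool:
    """Return True if the uname fields suggest a WSL kernel."""
    fields = platform.uname() if uname_fields is None else uname_fields
    return "microsoft" in " ".join(fields).lower()


# Request pinned memory only when the host does not look like WSL.
pin_memory = not in_wsl()
```

The resulting flag can then be passed wherever a CPU input buffer is allocated, so that WSL hosts fall back to ordinary pageable memory.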

Yard1 reviewed

vllm/worker/model_runner.py (outdated, resolved)

Collaborator

WoosukKwon commented Dec 19, 2023

@hanzhi713 A quick question: did you happen to evaluate the performance impact of this change? Just wondering, because the input preparation part only takes 1-3% of the overall running time in our profiling results.

Contributor Author

hanzhi713 commented Dec 19, 2023 (edited)

> @hanzhi713 A quick question: do you happen to evaluate the performance impact of this change? Just wondering, because the input preparation part only takes 1~3% of the overall running time in our profiling results.

About 1% for a 70B model with tensor parallel size 4 (tp4) and batch size 64. Just a minor optimization. Merge at your discretion 😃


          Update vllm/worker/model_runner.py

39cb1bf

Co-authored-by: Antoni Baum

WoosukKwon reviewed

vllm/worker/model_runner.py (outdated, resolved)


          add comma and format

57a87bb

WoosukKwon reviewed

vllm/worker/model_runner.py (outdated, resolved)

WoosukKwon reviewed

vllm/worker/model_runner.py (outdated, resolved)

hanzhi713 and others added 4 commits

December 19, 2023 16:40


          Update vllm/worker/model_runner.py

f967c54

Co-authored-by: Woosuk Kwon


          name fix

4d5f604


          Update vllm/worker/model_runner.py

3f5fff5

Co-authored-by: Woosuk Kwon


          format

97e309e

WoosukKwon approved these changes

Collaborator

WoosukKwon left a comment

@hanzhi713 LGTM! Thanks for the PR!

WoosukKwon merged commit 31bff69 into vllm-project:main

2 checks passed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request


          Make _prepare_sample non-blocking and use pinned memory for input buffers (vllm-project#2207)

93c7ace
