Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Add MultiprocessingGPUExecutor #4539

Merged
merged 10 commits into from
May 14, 2024

Conversation

njhill
Copy link
Member

@njhill njhill commented May 1, 2024

This introduces a python multiprocessing-based executor that can be used as an alternative to Ray for single-node inferencing.

With the changes in this PR, Ray will continue to be used for parallel workers if it's installed, otherwise vanilla python multiprocessing is used. It can also be overridden with --no-worker-use-ray --distributed-executor-backed=mp.

By default, worker process are started using spawn. This can be changed to fork by setting env var VLLM_WORKER_MULTIPROC_METHOD=fork. fork mode has a benefit of starting faster.

The existing distributed tests have been updated to run with/without Ray.

Worker processes are shut down when the LLMEngine is garbage collected.

This replaces original PRs #3466 #2898 and #4345. It was originally co-authored by @sahilsuneja1.

This introduces a python multiprocessing-based executor that can be used as an alternative to Ray for single-node inferencing.

With the changes in this PR, Ray will continue to be used for parallel workers if it's installed, otherwise vanilla python multiprocessing is used. It can also be overridden with --no-worker-use-ray.

The existing distributed tests have been updated to run with/without Ray.

Worker processes are shut down when the LLMEngine is garbage collected.

Co-authored-by: SAHIL SUNEJA <[email protected]>
@vrdn-23
Copy link
Contributor

vrdn-23 commented May 1, 2024

This is really great work @njhill. Thanks for all the effort!
Will this change also enable ray to become an optional dependency?

@youkaichao
Copy link
Member

I think --no-worker-use-ray is bad. I suggest something like --distributed-executor-backend, which can be either ray or mp , and we might have more in the future.

@njhill
Copy link
Member Author

njhill commented May 1, 2024

I think --no-worker-use-ray is bad. I suggest something like --distributed-executor-backend, which can be either ray or mp , and we might have more in the future.

@youkaichao that sounds very reasonable, but maybe it could be a separate PR? This is not actually a newly introduced arg - there is already a boolean --worker-use-ray arg, it's just the (argparse-standard) way of specifying a "false" value for that arg.

@njhill
Copy link
Member Author

njhill commented May 1, 2024

This is really great work @njhill. Thanks for all the effort! Will this change also enable ray to become an optional dependency?

Yes, although ray is already optional if you are only using single-GPU. I do have some changes to make it an optional "extra" from a python package installation pov but was thinking of a follow-on PR to avoid making this one bigger.

@youkaichao
Copy link
Member

If it is possible, I suggest add --distributed-executor-backend in this pr, and route --worker-use-ray to --distributed-executor-backend=ray . This PR can be enabled under --distributed-executor-backend=multiprocessing .

@njhill
Copy link
Member Author

njhill commented May 2, 2024

@youkaichao any idea why the ray distributed CI test might be failing now due to a gloo timeout? I think it's something to do with a second engine using TP being created in the same pytest process after the first one is shut down (the test now runs with mp executor followed by ray executor). This wasn't a problem with an earlier version of this PR, but I know you've made changes in this area. I will dig in more but just wanted to check if it's anything obvious to you.

@youkaichao
Copy link
Member

Try to merge the main branch in? I'm not sure, but the latest commit i merge into main can pass the ci test

@njhill
Copy link
Member Author

njhill commented May 2, 2024

@youkaichao fyi the problem is still there after pulling in your latest fix commit, I'll try to narrow it down tomorrow.

@youkaichao
Copy link
Member

My suspection is improper clean up. You can try to have one test for mp, and another for ray. Then they will not have interference.

@youkaichao
Copy link
Member

@njhill maybe we should cancel the test for this pr, until you figure it out locally? otherwise the ci will be blocked.

@rkooo567 rkooo567 self-assigned this May 2, 2024
njhill added 2 commits May 2, 2024 09:51
…c-gpu-executor

# Conflicts:
#	tests/distributed/test_basic_distributed_correctness.py
@vrdn-23
Copy link
Contributor

vrdn-23 commented May 7, 2024

Sorry to keep asking, but is there any update on this?
@njhill @rkooo567

@njhill
Copy link
Member Author

njhill commented May 14, 2024

My suspection is improper clean up. You can try to have one test for mp, and another for ray. Then they will not have interference.

@youkaichao I've updated this now to run in separate tests. Do you think it would be worth opening a separate issue to address the distributed cleanup issue? Currently it seems you can't create an LLM with tensor parallel, delete it, and then create another one in the same process.

If it is possible, I suggest add --distributed-executor-backend in this pr, and route --worker-use-ray to --distributed-executor-backend=ray . This PR can be enabled under --distributed-executor-backend=multiprocessing .

I've now made this update as requested, can enable with --distributed-executor-backend=mp.

@youkaichao @rkooo567 hopefully this is now ready to merge? 🙏 the failing tests look unrelated (same failures on main branch).

We could discuss in a follow-on whether it makes sense to change the default from ray to mp.

@rkooo567
Copy link
Collaborator

I will finish review it by today!

@youkaichao
Copy link
Member

Do you think it would be worth opening a separate issue to address the distributed cleanup issue? Currently it seems you can't create an LLM with tensor parallel, delete it, and then create another one in the same process.

I have a comment on this: #4508 (comment)

TL;DR is Python garbage collection is unreliable. If we want to address the distributed cleanup issue, we need some user-interface change, like context manager to explicitly control the cleanup.

Copy link
Member

@youkaichao youkaichao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the efforts!

@njhill
Copy link
Member Author

njhill commented May 14, 2024

Thanks again @youkaichao @rkooo567 @zhuohan123! And thanks for your patience @vrdn-23!

@njhill njhill merged commit 676a999 into vllm-project:main May 14, 2024
52 of 55 checks passed
@njhill njhill deleted the multiproc-gpu-executor branch May 14, 2024 17:39
@kerthcet
Copy link
Contributor

Hi all, thanks for the efforts, just have one question, is there any performance difference between python multiprocessing and ray? Thanks in advance.

@njhill
Copy link
Member Author

njhill commented Jun 12, 2024

@kerthcet we found multiprocessing to be faster, but since #4894 the difference probably isn't very significant, especially for larger models.

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants