
[Core] remove cupy dependency #3625

Merged (29 commits) on Mar 27, 2024
Conversation

youkaichao (Member)

Separated code from #3442; this PR only removes the cupy dependency.
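For context, this is roughly the idea behind the replacement, as a minimal hedged sketch: drive NCCL directly through ctypes and raw pointers from PyTorch tensors, so no cupy is required. The library name, enum constants, and single-rank demo setup below are assumptions about the standard NCCL C API, not code from this PR's pynccl.py.

```python
# Hedged sketch: call libnccl directly via ctypes instead of going through cupy.
import ctypes
import torch

nccl = ctypes.CDLL("libnccl.so.2")  # assumes a system-installed NCCL

class NcclUniqueId(ctypes.Structure):
    # ncclUniqueId is a 128-byte opaque struct in the NCCL public headers
    _fields_ = [("internal", ctypes.c_byte * 128)]

ncclComm_t = ctypes.c_void_p

# ncclResult_t ncclGetUniqueId(ncclUniqueId* uniqueId)
nccl.ncclGetUniqueId.restype = ctypes.c_int
nccl.ncclGetUniqueId.argtypes = [ctypes.POINTER(NcclUniqueId)]

# ncclResult_t ncclCommInitRank(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank)
nccl.ncclCommInitRank.restype = ctypes.c_int
nccl.ncclCommInitRank.argtypes = [
    ctypes.POINTER(ncclComm_t), ctypes.c_int, NcclUniqueId, ctypes.c_int
]

# ncclResult_t ncclAllReduce(const void* send, void* recv, size_t count,
#                            ncclDataType_t dtype, ncclRedOp_t op,
#                            ncclComm_t comm, cudaStream_t stream)
nccl.ncclAllReduce.restype = ctypes.c_int
nccl.ncclAllReduce.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t,
    ctypes.c_int, ctypes.c_int, ncclComm_t, ctypes.c_void_p
]

def all_reduce_in_place(tensor: torch.Tensor, comm: ncclComm_t) -> None:
    # ncclFloat = 7, ncclSum = 0 in NCCL's public enums
    stream = torch.cuda.current_stream().cuda_stream
    nccl.ncclAllReduce(tensor.data_ptr(), tensor.data_ptr(), tensor.numel(),
                       7, 0, comm, ctypes.c_void_p(stream))

if __name__ == "__main__":
    # Single-process, single-rank demo: with nranks=1 the all-reduce is a
    # no-op copy, but it exercises the whole ctypes path end to end.
    uid = NcclUniqueId()
    assert nccl.ncclGetUniqueId(ctypes.byref(uid)) == 0
    comm = ncclComm_t()
    assert nccl.ncclCommInitRank(ctypes.byref(comm), 1, uid, 0) == 0
    t = torch.ones(8, device="cuda")
    all_reduce_in_place(t, comm)
    torch.cuda.synchronize()
    print(t)
```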

@youkaichao (Member, Author)

Supporting evidence from #3617: this PR solved the cupy initialization issue 👍

@youkaichao changed the title from [WIP][Core] remove cupy dependency to [Core] remove cupy dependency on Mar 26, 2024
@WoosukKwon (Collaborator)

Can we get confirmation from the AMD folks that we can do the same thing to replace PyTorch RCCL? While I believe we don't have to do this in the current PR, I'd like to get confirmation and find someone to take it on.

@simon-mo (Collaborator) left a comment

I made an initial pass and it looks good overall. I think we may be lacking a few more tests in test_pynccl.py?
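As a point of reference, here is a minimal sketch of the kind of extra coverage being suggested: a multi-process all-reduce correctness check. It uses torch.distributed as a stand-in for vLLM's pynccl wrapper, and the two-GPU requirement, port number, and function names are illustrative assumptions, not part of this PR.

```python
# Hedged sketch of a multi-process all-reduce correctness test.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # arbitrary free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Each rank contributes a distinct value so the reduced result is checkable.
    x = torch.full((1024,), float(rank + 1), device="cuda")
    dist.all_reduce(x)  # default op is SUM
    expected = sum(range(1, world_size + 1))
    assert torch.all(x == expected), "all-reduce result mismatch"
    dist.destroy_process_group()

def test_all_reduce_two_gpus():
    world_size = 2  # assumes a 2-GPU machine
    mp.spawn(_worker, args=(world_size,), nprocs=world_size)

if __name__ == "__main__":
    test_all_reduce_two_gpus()
```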

Resolved review threads on:
CMakeLists.txt (outdated)
.buildkite/test-pipeline.yaml (outdated)
tests/distributed/test_comm_ops.py (two threads, one outdated)
tests/distributed/test_pynccl.py
vllm/model_executor/parallel_utils/pynccl.py (three threads, outdated)
@youkaichao (Member, Author)

@WoosukKwon @simon-mo thanks for the review! I talked to the AMD folks to ask for a review. Let's see if we can get feedback today; if not, we can merge first and let them send fix PRs later.

@simon-mo added the release-blocker label (This PR/issue blocks the next release, therefore deserves highest priority) on Mar 27, 2024

@WoosukKwon (Collaborator) left a comment

Thanks for the great work! Left some questions.

Resolved review threads on:
vllm/model_executor/parallel_utils/pynccl.py (two threads)
Comment on lines 849 to 854:

    -   # Delete the CUDA graphs before deleting the CuPy NCCL communicator.
    +   # Delete the CUDA graphs before deleting the pynccl communicator.
        # NOTE(woosuk): This is necessary because otherwise deadlocks can
        # happen.
        # FIXME(woosuk): This is a bit hacky. Find a more robust solution.
        self.graph_runners.clear()
        self.cupy_nccl_backend = None

Collaborator:

I think this part can be deleted now that we've removed CuPy?

Member Author:

I don't know why this code was needed before, so I don't want to change it 🤣 If you have more context, we can discuss whether we can delete it.

Collaborator:

When using CuPy + CUDA graphs, deadlocks happen when the CuPy backend is deleted before the CUDA graphs using it are deleted. I actually don't know the reason for this, but it doesn't happen when using NCCL through PyTorch, probably because the NCCL communicator managed by PyTorch is deleted at the very end.

Collaborator:

@youkaichao could you check whether we can delete this? You can simply run python examples/llm_engine_example.py -tp 2 and see whether the process hangs.
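A hedged sketch of that hang check, wrapping the example in a wall-clock timeout so a deadlock surfaces as an exception rather than a stuck terminal (the 600-second limit is an arbitrary assumption):

```python
# Hedged sketch: run the example under a timeout so a hang is detected.
import subprocess

try:
    subprocess.run(
        ["python", "examples/llm_engine_example.py", "-tp", "2"],
        check=True,
        timeout=600,  # arbitrary limit; adjust to your hardware
    )
    print("completed without hanging")
except subprocess.TimeoutExpired:
    print("possible hang/deadlock: process exceeded the timeout")
```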

Member Author:

Okay, I'll give it a try.

Collaborator:

BTW, I do agree with you that we can remove this code some time later. If you'd like to do so, could you please add a TODO in the code?

Member Author:

I tested the code (with the above two lines removed) 100 times using for i in {1..100}; do python3 examples/llm_engine_example.py -tp=2 && echo "Test passed $i times" || break; done, and I don't see any deadlocks. This gives us confidence to remove the code later. I left a comment there to remove it in v0.4.1.

Resolved review threads on:
vllm/worker/model_runner.py
vllm/model_executor/parallel_utils/pynccl.py

@youkaichao (Member, Author)

Overall, I think we should take a small step with each release. Distributed bugs like hangs and deadlocks are highly unstable and difficult to test.

My plan is to use pynccl as an equivalent replacement for cupy in v0.4.0, and collect user feedback to decide the next move for v0.4.1 (e.g., removing tricky code written for cupy).

cc @WoosukKwon

@WoosukKwon (Collaborator) left a comment

Thanks again for the great work!

@jojac47 commented Jul 10, 2024

Hello everyone,
I was able to resolve this myself as well without removing cupy as a dependency. Instead of using the vLLM or NVIDIA base containers, I used the official CuPy image (which includes the dependencies for the NVIDIA drivers) as my new base image, followed the Linux install instructions inside the container, and committed it to a new image.
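A minimal sketch of that workaround as a Dockerfile, assuming the official cupy/cupy image on Docker Hub; the image tag and the pip-based vLLM install are illustrative assumptions, not details taken from the comment:

```dockerfile
# Hedged sketch: start from the official CuPy image (which already bundles the
# CUDA runtime pieces CuPy needs), then install vLLM per its Linux instructions.
# In practice, pin the tag to one matching your CUDA driver.
FROM cupy/cupy:latest
RUN pip install vllm
```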

Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request on Sep 6, 2024
Labels: release-blocker (This PR/issue blocks the next release, therefore deserves highest priority)
Projects: none yet
4 participants