
[WIP][Core] try to manage nccl version #3802

Closed · wants to merge 6 commits

Conversation

youkaichao (Member)

First step to manage the NCCL version.

@youkaichao (Member, Author)

I tried to manage the NCCL version in the framework under pip; it is very difficult.

  • Version management and specification is a disaster: the package at https://github.com/vllm-project/vllm-nccl itself has a version, and we also have an NCCL version and a CUDA version. It ends up with something like vllm-nccl==0.1.0.nccl2.18.3.cuda11, or with specifying via an environment variable, e.g. VLLM_INSTALL_NCCL=2.18+cu12 pip install vllm-nccl.
  • If we want pip to be able to remove the data, we have to use data_files. But that puts NCCL inside the wheel, which again hits the 100 MB size limit of PyPI.

In the end, I will go with the cupy approach, i.e. provide a script for users to download NCCL if they want. Then we can have as many args as we want:

python -m vllm.tools.install_nccl --cuda 11 --nccl 2.18.3
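
Below is a minimal sketch of what such a tool could look like. The module path vllm.tools.install_nccl comes from the command above; the URL template, mirror host, and ~/.config/vllm target directory are assumptions for illustration, not the real implementation.

```python
# vllm/tools/install_nccl.py -- hypothetical sketch, not the real tool.
# The URL template is a placeholder; the real script would point at
# NVIDIA's download site or a vLLM-owned storage bucket.
import argparse
import pathlib
import urllib.request

URL_TEMPLATE = ("https://example-bucket.invalid/nccl/"
                "cu{cuda}/libnccl.so.{nccl}")


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Download a pinned NCCL build for vLLM.")
    parser.add_argument("--cuda", choices=["11", "12"], required=True,
                        help="CUDA major version")
    parser.add_argument("--nccl", default="2.18.3",
                        help="NCCL version to download")
    args = parser.parse_args()

    # Store the library outside site-packages so pip never has to track it.
    target_dir = (pathlib.Path.home() / ".config" / "vllm" / "nccl"
                  / f"cu{args.cuda}")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"libnccl.so.{args.nccl}"

    url = URL_TEMPLATE.format(cuda=args.cuda, nccl=args.nccl)
    print(f"Downloading {url} -> {target}")
    urllib.request.urlretrieve(url, str(target))


if __name__ == "__main__":
    main()
```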

@simon-mo (Collaborator) commented Apr 3, 2024

Constraints:

  • We cannot vendor NCCL into the vLLM bdist because of the size issue.
  • We cannot install NVIDIA's NCCL package from PyPI because PyTorch already depends on it.
  • We cannot redistribute NCCL on PyPI because of the size issue.

Goal: the out-of-the-box experience for users should come with no memory increase.

Therefore, we can upload two packages to PyPI: vllm-nccl-cu11==2.18.3 and vllm-nccl-cu12==2.18.3.

The workflow:
vLLM (bdist) -> vllm-nccl (sdist) -> download the .so from the NVIDIA site (but in case NVIDIA breaks the downloads, we can upload the file to our own storage bucket).
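
As a rough sketch, the sdist could hook the download into pip install via a custom install command. This works precisely because an sdist executes setup.py at install time (a wheel would not); the URL and package metadata below are placeholders, not the actual vllm-nccl code.

```python
# setup.py for the hypothetical vllm-nccl-cu12 sdist.
import pathlib
import urllib.request

from setuptools import setup
from setuptools.command.install import install

# Placeholder mirror; in practice this would be NVIDIA's site with a
# vLLM-owned storage bucket as a fallback.
NCCL_URL = "https://example-bucket.invalid/nccl/cu12/libnccl.so.2.18.3"


class DownloadNccl(install):
    """Standard install, plus a post-install download of the NCCL .so."""

    def run(self):
        super().run()
        target_dir = pathlib.Path.home() / ".config" / "vllm" / "nccl" / "cu12"
        target_dir.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(NCCL_URL,
                                   str(target_dir / "libnccl.so.2.18.3"))


setup(
    name="vllm-nccl-cu12",
    version="2.18.3",
    cmdclass={"install": DownloadNccl},
)
```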

  • vLLM by default should pin to vllm-nccl-cu12==2.18.3 (and torch==2.2, which doesn't really matter now).
  • The vLLM cu11 distribution should pin to vllm-nccl-cu11 if possible; if not, just update the docs to add a new line:
    # Install vLLM with CUDA 11.8.
    export VLLM_VERSION=0.4.0
    export PYTHON_VERSION=39
    pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
    + pip install vllm-nccl-cu11==2.18.3

The sdist will just download the NCCL .so into ~/.config/vllm/nccl/cu{11,12}/*.so. Then, when vLLM is running, it knows the current CUDA version just through torch.cuda..., and it can append the right path to load the .so file for NCCL.
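
A minimal sketch of that runtime side, assuming the directory layout above (torch.version.cuda is a real PyTorch attribute; the helper name and everything else are illustrative):

```python
# Hypothetical runtime helper: find and load the downloaded NCCL library.
import ctypes
import pathlib

import torch


def load_pinned_nccl(nccl_version: str = "2.18.3") -> ctypes.CDLL:
    # torch.version.cuda looks like "12.1"; only the major version is
    # needed to pick the cu11 or cu12 directory.
    cuda_major = torch.version.cuda.split(".")[0]
    so_path = (pathlib.Path.home() / ".config" / "vllm" / "nccl"
               / f"cu{cuda_major}" / f"libnccl.so.{nccl_version}")
    # RTLD_GLOBAL exports the symbols so later lookups can resolve NCCL
    # from this library; this has to run before anything else pulls in
    # a different libnccl.
    return ctypes.CDLL(str(so_path), mode=ctypes.RTLD_GLOBAL)
```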

To incorporate the vLLM version into the wheel version (this feels a bit over-engineered to me): vllm-nccl-cu12==2.18.3.0.4.1, which is the combo of the NCCL version and the vLLM version. See https://packaging.python.org/en/latest/specifications/version-specifiers/#final-releases
