Add dockerfile #1350
Conversation
Hi @skrider, great work so far. This is a good start. A few comments:
@simon-mo - thanks!
Noted, will do. I am working on some tooling to help prune images automatically here: https://github.com/skrider/pip-prune. So far I have been able to reduce the pip install footprint from 6 GB to 3 GB.

Regarding the entrypoint: I considered adding one, but thought that sometimes users might want to start the container with Ray and then exec into it.

Regarding the dev/prod containers: should these install vLLM from PyPI or from the repo directly?

Apologies for the late reply; I need to fix my GitHub notifications.
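(To illustrate the start-with-Ray-then-exec pattern mentioned above: a minimal sketch, assuming the image is tagged vllm and has no fixed entrypoint, so the first argument is the command to run.)

$ docker run -d --gpus all --name vllm-node vllm ray start --head --block
$ docker exec -it vllm-node python3 -m vllm.entrypoints.api_server --model facebook/opt-13b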
Force-pushed from ee4ef93 to 9f9c659.
Left a few comments. Trying to reconcile this with #1415.
Dockerfile (outdated)
COPY vllm vllm
COPY pyproject.toml pyproject.toml
COPY README.md README.md
COPY MANIFEST.in MANIFEST.in
Ideally, here we would copy only the csrc folder and build only the C++ code. If you copy the vllm folder, any change to the Python code causes a slow rebuild of the C++ code, which is usually not needed.
Also, the README.md file is more or less required during the build of the C++ code, but we can copy an empty README.md while building it.
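(A rough sketch of the layering idea, assuming the usual requirements.txt / setup.py layout; this is not the PR's actual Dockerfile:)

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS build
WORKDIR /workspace

# Assumed prerequisite: Python and pip available in the image.
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# Dependencies first; this layer stays cached while requirements.txt is unchanged.
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

# Copy only what the extension build needs, so Python-only edits keep this layer cached.
COPY csrc csrc
COPY setup.py pyproject.toml ./
COPY vllm/__init__.py vllm/__init__.py

# An empty README.md satisfies setup.py without tying the cache to the real file.
RUN touch README.md
RUN python3 setup.py build_ext --inplace

# Python sources come in last; changing them only re-runs layers from here on.
COPY vllm vllm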
Makes sense. Do you think it would be possible to build the extensions separately from the wheel, so that the wheel-building step only bundles everything without having to rebuild?
I'm not sure how to build the wheel with already-built extensions, but it might be possible. I don't think we need to build the wheel. If you want, we could add another container stage for building the wheel, in case you want this to be used to publish wheel files to PyPI?
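(For illustration, such a stage could look roughly like the sketch below, assuming a build stage that already contains the sources; the stage names and use of the pypa build package are assumptions, not the PR's final Dockerfile.)

FROM build AS build-wheel
# Produce a .whl in /workspace/dist without affecting the runtime stages.
RUN pip3 install build
RUN python3 -m build --wheel --no-isolation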
OK, I just saw you added another stage to build the wheel. I think this is fine.
I think for now there is no need to build the wheel in the Dockerfile; we can do that later if we decide to consolidate Docker/CI.
I don't think we need this. Users who want a particular version would be able to just …
@NikolaBorisov Took your comments into account; this should be ready to merge. I am still working on pruning the image, but that is not critical.
Looks good to me. Just left 2 more suggestions.
Dockerfile (outdated)
FROM api_server AS openai_api_server
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
Suggested change:

FROM api_server AS openai_api_server
# extra dependencies for the openai server
RUN pip install accelerate fschat
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
.. code-block:: console

    $ DOCKER_BUILDKIT=1 docker build . --target prod --tag vllm --build-arg max_jobs=8
Suggested change:

$ DOCKER_BUILDKIT=1 docker build . --target api_server --tag vllm --build-arg max_jobs=8
$ DOCKER_BUILDKIT=1 docker build . --target openai_api_server --tag vllm-openai --build-arg max_jobs=8
@simon-mo I think this is ready to merge after @skrider fixes two small things. I'm going to close #1415 in favor of this one. The only other thing we might want to fix is the names of the last two stages in the file. Instead of 'api_server' and 'openai_api_server', I think 'vllm' and 'vllm-openai' are better names.
@NikolaBorisov thank you for your feedback! @skrider it would be great to address them. I also created …
@simon-mo Yes, will do; should be done EOD.
Did you check that it also works for multi-GPU inference (a model loaded on at least two GPUs)?
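(For anyone verifying this, a multi-GPU smoke test might look like the command below; the image tag, port, and shared-memory size are assumptions. It relies on the image's entrypoint being the API server, so the trailing flags are forwarded to it, and --tensor-parallel-size is vLLM's flag for splitting a model across GPUs.)

$ docker run --gpus all --shm-size=8g -p 8000:8000 vllm --model facebook/opt-13b --tensor-parallel-size 2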
@@ -0,0 +1,72 @@
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev
Suggested change:

ARG MAX_JOBS=4
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev
Forward-declare the argument and set a sensible default.
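(As background on why the re-declaration matters, not part of the suggestion itself: an ARG declared before the first FROM is only visible to FROM lines, so any stage that needs the value must re-declare it. A minimal sketch of how the pieces fit together:)

# Global default; an ARG before the first FROM is only visible to FROM lines.
ARG MAX_JOBS=4
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev
# Re-declare inside the stage to bring the value (or its default) into scope.
ARG MAX_JOBS
# Ninja picks up MAX_JOBS from the environment during the extension build.
ENV MAX_JOBS=${MAX_JOBS}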
COPY vllm/__init__.py vllm/__init__.py

# max jobs used by Ninja to build extensions
ENV MAX_JOBS=$max_jobs
Suggested change:

ENV MAX_JOBS=$MAX_JOBS
.. code-block:: console

    $ DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg max_jobs=8
Suggested change:

$ DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg MAX_JOBS=8
Implements #1293
Adds a Dockerfile to run vLLM in a containerized environment. This has been tested to work with facebook/opt-13b on 2 g5.2xlarge AWS instances, using Ray for communication. Output of benchmarks/benchmark_serving.py:

The image is quite large, however, at 6.93 GB.

Is this testing setup sufficient?

Additionally, using cuda-base reduces the image size significantly but prevents installation of development dependencies, among other things preventing unit tests from being run. Is this a concern?

Should we also consider pruning the image and removing unused dependencies from the pip install to further reduce image size?
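(For context, a possible build-and-run flow for the OpenAI-compatible server, using the stage name from this PR and the MAX_JOBS build arg suggested above; the image tag, port, and model flag are illustrative, and the run command assumes the stage's entrypoint forwards the trailing arguments to the server.)

$ DOCKER_BUILDKIT=1 docker build . --target openai_api_server --tag vllm-openai --build-arg MAX_JOBS=8
$ docker run --gpus all --shm-size=8g -p 8000:8000 vllm-openai --model facebook/opt-13b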