Add dockerfile #1350

Merged
merged 13 commits into vllm-project:main on Oct 31, 2023

Conversation

@skrider (Contributor) commented Oct 14, 2023

Implements #1293

Adds a Dockerfile to run vLLM in a containerized environment. This has been tested to work with facebook/opt-13b on 2 g5.2xlarge AWS instances, using Ray for communication. Output of benchmarks/benchmark_serving.py:

Total time: 16.32 s
Throughput: 61.28 requests/s
Average latency: 10.59 s
Average latency per token: 0.88 s
Average latency per output token: 2.12 s

The image is quite large, however, at 6.93 GB.

Is this testing setup sufficient?

Additionally, using cuda-base reduces the image size significantly but prevents installation of development dependencies, which among other things prevents unit tests from being run - is this a concern?

Should we also consider pruning the image and removing unused dependencies from the pip install to further reduce image size?

@skrider marked this pull request as draft October 14, 2023 01:57
@simon-mo (Collaborator)

Hi @skrider

Great work so far. This is a good start. A few comments:

  • We should absolutely try to prune the image as much as possible.
  • Can we add an ENTRYPOINT so users don't need to type python3 -m vllm.entrypoints.api_server each time?
  • Consider adding EXPOSE for the default port as well?
  • Ideally, we should have a "dev" container that contains all build dependencies so we can run CI and help contributors set up everything, and another "prod" container with the smallest size possible to run models or build upon (a rough sketch of such a layout is below).
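
A rough sketch of what such a dev/prod split with EXPOSE and an ENTRYPOINT could look like (illustrative only - the stage contents, package paths, and the requirements-dev.txt file are assumptions, not the Dockerfile from this PR):

    # "dev" stage: CUDA devel image with build and test dependencies
    FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev
    RUN apt-get update && apt-get install -y python3 python3-pip
    COPY requirements.txt requirements-dev.txt ./
    RUN pip install -r requirements.txt -r requirements-dev.txt

    # "prod" stage: smaller runtime base; naively reuses the dev site-packages,
    # so real pruning would install only runtime dependencies here instead
    FROM nvidia/cuda:11.8.0-base-ubuntu22.04 AS prod
    RUN apt-get update && apt-get install -y python3 python3-pip
    COPY --from=dev /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
    EXPOSE 8000
    ENTRYPOINT ["python3", "-m", "vllm.entrypoints.api_server"]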

@skrider (Contributor, Author) commented Oct 18, 2023

@simon-mo - thanks!

We should absolutely try to prune the image as much as possible.

Noted, will do. I am working on some tooling to help prune images automatically here: https://github.com/skrider/pip-prune. So far I have been able to reduce the pip install footprint from 6 GB to 3 GB.

Regarding the entrypoint - I considered adding this, but thought that sometimes users might want to start the container with Ray and then exec vllm.entrypoints.api_server later. Perhaps it would be good to provide an entrypoint.sh script that checks for an environment variable and optionally starts up/waits for Ray (a rough sketch of the wiring is below)? It could get a bit clunky. Or should we just ignore Ray?
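
A minimal sketch of how such a wrapper could be wired into the Dockerfile, assuming a hypothetical entrypoint.sh next to it (neither the script, the START_RAY variable, nor these lines are part of this PR):

    # entrypoint.sh would check e.g. $START_RAY and run `ray start` (or wait for
    # a head node) before exec'ing whatever command the container was given
    COPY entrypoint.sh /usr/local/bin/vllm-entrypoint.sh
    RUN chmod +x /usr/local/bin/vllm-entrypoint.sh
    ENTRYPOINT ["/usr/local/bin/vllm-entrypoint.sh"]
    CMD ["python3", "-m", "vllm.entrypoints.api_server"]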

Regarding the dev/prod containers - should these install vLLM from PyPI or from the repo directly?

Apologies for the late reply - I need to fix my GitHub notifications.

@simon-mo (Collaborator)

  • Users can override the entrypoint using the --entrypoint flag of docker run, as well as similar means in K8s.
  • It would be great to allow customizing the version using a Docker ARG, defaulting to the latest version (a rough sketch is below). As long as it is easy enough to build it for any version, it should be fine.
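
A minimal sketch of what a version ARG might look like if the image installed vLLM from PyPI (hypothetical - this PR builds from the repository checkout instead):

    # hypothetical build argument; override with --build-arg VLLM_VERSION=0.2.0
    ARG VLLM_VERSION=latest
    # install the pinned release, or the newest published one by default
    RUN if [ "$VLLM_VERSION" = "latest" ]; then \
            pip install vllm; \
        else \
            pip install "vllm==${VLLM_VERSION}"; \
        fi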

@simon-mo mentioned this pull request Oct 23, 2023

@NikolaBorisov (Contributor) left a comment:

Left a few comments. Trying to reconcile this with #1415.

Dockerfile Outdated
Comment on lines 20 to 23
COPY vllm vllm
COPY pyproject.toml pyproject.toml
COPY README.md README.md
COPY MANIFEST.in MANIFEST.in

Contributor:

Ideally here we would copy only the csrc folder and build only the C++ code. If you copy the vllm folder, any change to the Python code causes a slow rebuild of the C++ code, which most of the time is not needed.

Also, the README.md file is more or less required during the build of the C++ code, but we can copy an empty README.md while building the C++ code.
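
A minimal sketch of the layer ordering being suggested (illustrative - the exact file list and build command are assumptions):

    # copy only what the C++/CUDA extension build needs, so this expensive layer
    # stays cached as long as csrc/ and the build configuration are unchanged
    COPY csrc csrc
    COPY setup.py pyproject.toml ./
    COPY vllm/__init__.py vllm/__init__.py
    RUN python3 setup.py build_ext --inplace

    # Python sources come afterwards; editing them no longer invalidates
    # the extension-build layer above
    COPY vllm vllm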

Contributor Author:

Makes sense. Do you think it would be possible to build the extensions separately from the wheel, so that the wheel-building step only bundles everything without having to rebuild?

Contributor:

I'm not sure how to build the wheel with already-built extensions, but it might be possible. I don't think we need to build the wheel. If you want, we can add another container build path for the wheel, in case you want this to be used to publish the wheel files to PyPI.

Contributor:

OK, I just saw you did another stage to build the wheel. This is OK, I think.

Contributor Author:

I think for now there's no need to build the wheel in the Dockerfile; we can do that later if we decide to consolidate Docker/CI.

@NikolaBorisov (Contributor)

* It would be great to allow customizing version using docker `ARG` and default to the latest version. As long as it is easy enough to build it for any version, it should be fine.

I don't think we need this. Users who want a particular version will be able to just docker pull vllm:0.2.0. If they want to build from scratch, they can check out the right branch/tag and run docker build there.

@skrider (Contributor, Author) commented Oct 25, 2023

@NikolaBorisov I took your comments into account - this should be ready to merge. I am still working on pruning the image, but that is not critical.

@NikolaBorisov (Contributor) left a comment:

Looks good to me. Just left 2 more suggestions.

Dockerfile Outdated
Comment on lines 50 to 51
FROM api_server AS openai_api_server
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

Contributor:

Suggested change:
- FROM api_server AS openai_api_server
- ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
+ FROM api_server AS openai_api_server
+ # extra dependencies for openai server
+ RUN pip install accelerate fschat
+ ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]


.. code-block:: console

$ DOCKER_BUILDKIT=1 docker build . --target prod --tag vllm --build-arg max_jobs=8

Contributor:

Suggested change:
- $ DOCKER_BUILDKIT=1 docker build . --target prod --tag vllm --build-arg max_jobs=8
+ $ DOCKER_BUILDKIT=1 docker build . --target api_server --tag vllm --build-arg max_jobs=8
+ $ DOCKER_BUILDKIT=1 docker build . --target openai_api_server --tag vllm-openai --build-arg max_jobs=8

@NikolaBorisov (Contributor)

@simon-mo I think this is ready to merge after @skrider fixes 2 small things. I'm going to close #1415 in favor of this one.

The only other thing we might want to fix is the names of the last two stages in the file. Instead of 'api_server' and 'openai_api_server', I think 'vllm' and 'vllm-openai' are better names.

@simon-mo (Collaborator)

@NikolaBorisov thank you for your feedback! @skrider it would be great to address these. I also created the vllm Docker user, and I'm planning to publish official images along with releases. Before we merge, @skrider, can you add comments to the Dockerfile and group the different stages together so they are more maintainable in the future?

@skrider (Contributor, Author) commented Oct 27, 2023

@simon-mo Yes, will do - should be done by EOD.

@skrider marked this pull request as ready for review October 27, 2023 23:09
@Extremys

(quoting the PR description above)

Did you check that it also works for multi-GPU inference (a model loaded on at least two GPUs)? Are you building the container from the target host with the NVIDIA runtime enabled?

@@ -0,0 +1,72 @@
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev

Collaborator:

Suggested change:
- FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev
+ ARG MAX_JOBS=4
+ FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev

Forward-declare the argument and set a sensible default.
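
One general Docker detail worth noting here (not part of the review itself): an ARG declared before the first FROM is global-scoped and must be redeclared inside a stage before it can be referenced there, e.g.:

    # global default, usable in FROM lines; redeclared below to enter the stage
    ARG MAX_JOBS=4
    FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev
    # redeclare (no default needed) so the value is visible inside this stage
    ARG MAX_JOBS
    # now the extension build can see it, e.g. to limit Ninja parallelism
    ENV MAX_JOBS=$MAX_JOBS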

COPY vllm/__init__.py vllm/__init__.py

# max jobs used by Ninja to build extensions
ENV MAX_JOBS=$max_jobs

Collaborator:

Suggested change:
- ENV MAX_JOBS=$max_jobs
+ ENV MAX_JOBS=$MAX_JOBS


.. code-block:: console

$ DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg max_jobs=8

Collaborator:

Suggested change:
- $ DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg max_jobs=8
+ $ DOCKER_BUILDKIT=1 docker build . --target vllm --tag vllm --build-arg MAX_JOBS=8

@simon-mo merged commit 9cabcb7 into vllm-project:main Oct 31, 2023
2 checks passed
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
sjchoi1 pushed a commit to casys-kaist-internal/vllm that referenced this pull request May 7, 2024