This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 03 14 #127

Merged on Mar 15, 2024 (114 commits)
Commits
d7f3964
Update comment (#2934)
ronensc Feb 22, 2024
5574081
Added early stopping to completion APIs (#2939)
Maxusmusti Feb 22, 2024
344020c
Migrate MistralForCausalLM to LlamaForCausalLM (#2868)
esmeetu Feb 22, 2024
95529e3
Use Llama RMSNorm custom op for Gemma (#2974)
WoosukKwon Feb 22, 2024
93dc5a2
chore(vllm): codespell for spell checking (#2820)
mspronesti Feb 22, 2024
fd5dcc5
Optimize GeGLU layer in Gemma (#2975)
WoosukKwon Feb 22, 2024
c530e2c
[FIX] Fix a bug in initializing Yarn RoPE (#2983)
44670 Feb 22, 2024
6f32cdd
Remove Flash Attention in test env (#2982)
WoosukKwon Feb 22, 2024
4caf704
Include tokens from prompt phase in `counter_generation_tokens` (#2802)
ronensc Feb 22, 2024
57f0449
Fix nvcc not found in vllm-openai image (#2781)
zhaoyang-star Feb 22, 2024
f7c1234
[Fix] Fix assertion on YaRN model len (#2984)
WoosukKwon Feb 23, 2024
ef978fe
Port metrics from `aioprometheus` to `prometheus_client` (#2730)
hmellor Feb 25, 2024
70f3e8e
Add LogProbs for Chat Completions in OpenAI (#2918)
jlcmoore Feb 26, 2024
cfc15a1
Optimize Triton MoE Kernel (#2979)
pcmoritz Feb 26, 2024
d6e4a13
[Minor] Remove gather_cached_kv kernel (#3043)
WoosukKwon Feb 26, 2024
d9f726c
[Minor] Remove unused config files (#3039)
esmeetu Feb 27, 2024
c1c0d00
Don't use cupy when `enforce_eager=True` (#3037)
esmeetu Feb 27, 2024
4dd6416
Fix stablelm (#3038)
esmeetu Feb 27, 2024
48a8f4a
Support Orion model (#2539)
dachengai Feb 27, 2024
2410e32
fix `get_ip` error in pure ipv6 environment (#2931)
Jingru Feb 27, 2024
4bd18ec
[Minor] Fix type annotation in fused moe (#3045)
WoosukKwon Feb 27, 2024
e0ade06
Support logit bias for OpenAI API (#3027)
dylanwhawk Feb 27, 2024
8b430d7
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046)
WoosukKwon Feb 27, 2024
71bcaf9
Enable GQA support in the prefix prefill kernels (#3007)
sighingnow Feb 27, 2024
a868310
multi-lora documentation fix (#3064)
ElefHead Feb 28, 2024
e46fa5d
Restrict prometheus_client >= 0.18.0 to prevent errors when importing…
AllenDou Feb 28, 2024
3b7178c
[Neuron] Support inference with transformers-neuronx (#2569)
liangfu Feb 28, 2024
929b4f2
Add LoRA support for Gemma (#3050)
WoosukKwon Feb 28, 2024
01a5d18
Add Support for 2/3/8-bit GPTQ Quantization Models (#2330)
chu-tianxiang Feb 29, 2024
a6d471c
Fix: `AttributeError` in OpenAI-compatible server (#3018)
jaywonchung Feb 29, 2024
9289e57
add cache_config's info to prometheus metrics. (#3100)
AllenDou Feb 29, 2024
bfdcfa6
Support starcoder2 architecture (#3089)
sh0416 Feb 29, 2024
2c08ff2
Fix building from source on WSL (#3112)
aliencaocao Feb 29, 2024
29a8d6a
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#…
njhill Feb 29, 2024
703e42e
Add guided decoding for OpenAI API server (#2819)
felixzhu555 Feb 29, 2024
54d3544
Fix: Output text is always truncated in some models (#3016)
HyperdriveHustle Mar 1, 2024
27ca23d
Remove exclude_unset in streaming response (#3143)
sh0416 Mar 1, 2024
49d849b
docs: Add tutorial on deploying vLLM model with KServe (#2586)
terrytangyuan Mar 1, 2024
90fbf12
fix relative import path of protocol.py (#3134)
Huarong Mar 1, 2024
c0c2335
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
robertgshaw2-neuralmagic Mar 1, 2024
82091b8
Bump up to v0.3.3 (#3129)
WoosukKwon Mar 1, 2024
29e70e3
allow user chose log level by --log-level instead of fixed 'info'. (#…
AllenDou Mar 1, 2024
baee28c
Reorder kv dtype check to avoid nvcc not found error on AMD platform …
cloudhan Mar 2, 2024
ce4f5a2
Add Automatic Prefix Caching (#2762)
SageMoore Mar 2, 2024
d65fac2
Add vLLM version info to logs and openai API server (#3161)
jasonacox Mar 3, 2024
996d095
[FIX] Fix styles in automatic prefix caching & add a automatic prefix…
zhuohan123 Mar 3, 2024
17c3103
Make it easy to profile workers with nsight (#3162)
pcmoritz Mar 4, 2024
d0fae88
[DOC] add setup document to support neuron backend (#2777)
liangfu Mar 4, 2024
901cf4c
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171)
gty111 Mar 4, 2024
27a7b07
Add document for vllm paged attention kernel. (#2978)
pian13131 Mar 4, 2024
9cbc7e5
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
AllenDou Mar 4, 2024
76e8a70
[Minor fix] The domain dns.google may cause a socket.gaierror excepti…
ttbachyinsda Mar 4, 2024
22de452
Push logprob generation to LLMEngine (#3065)
Yard1 Mar 4, 2024
ff578ca
Add health check, make async Engine more robust (#3015)
Yard1 Mar 4, 2024
9a4548b
Fix the openai benchmarking requests to work with latest OpenAI apis …
wangchen615 Mar 4, 2024
05af6da
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#…
hongxiayang Mar 5, 2024
8999ec3
Store `eos_token_id` in `Sequence` for easy access (#3166)
njhill Mar 5, 2024
2efce05
[Fix] Avoid pickling entire LLMEngine for Ray workers (#3207)
njhill Mar 6, 2024
24aecf4
[Tests] Add block manager and scheduler tests (#3108)
rkooo567 Mar 6, 2024
a33ce60
[Testing] Fix core tests (#3224)
cadedaniel Mar 6, 2024
4cb3b92
Add tqdm `dynamic_ncols=True` (#3242)
chujiezheng Mar 6, 2024
d3c04b6
Add GPTQ support for Gemma (#3200)
TechxGenus Mar 7, 2024
cbf4c05
Update requirements-dev.txt to include package for benchmarking scrip…
wangchen615 Mar 7, 2024
2daf23a
Separate attention backends (#3005)
WoosukKwon Mar 7, 2024
385da2d
Measure model memory usage (#3120)
mgoin Mar 7, 2024
8cbba46
Possible fix for conflict between Automated Prefix Caching (#2762) an…
jacobthebanana Mar 7, 2024
b35cc93
Fix auto prefix bug (#3239)
ElizaWszola Mar 8, 2024
d2339d6
Connect engine healthcheck to openai server (#3260)
njhill Mar 8, 2024
c59e120
Feature add lora support for Qwen2 (#3177)
whyiug Mar 8, 2024
1ece1ae
[Minor Fix] Fix comments in benchmark_serving (#3252)
gty111 Mar 8, 2024
99c3cfb
[Docs] Fix Unmocked Imports (#3275)
ywang96 Mar 8, 2024
1cb0cc2
[FIX] Make `flash_attn` optional (#3269)
WoosukKwon Mar 8, 2024
c2c5e09
Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241)
mgoin Mar 8, 2024
f48c679
[FIX] Fix prefix test error on main (#3286)
zhuohan123 Mar 9, 2024
8437bae
[Speculative decoding 3/9] Worker which speculates, scores, and appli…
cadedaniel Mar 9, 2024
0bba88d
Enhance lora tests with more layer and rank variations (#3243)
tterrysun Mar 10, 2024
e4a28e5
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUD…
dllehr-amd Mar 10, 2024
9e8744a
[BugFix] Fix get tokenizer when using ray (#3301)
esmeetu Mar 11, 2024
4b59f00
[Fix] Fix best_of behavior when n=1 (#3298)
njhill Mar 11, 2024
2f8844b
Re-enable the 80 char line width limit (#3305)
zhuohan123 Mar 11, 2024
657061f
[docs] Add LoRA support information for models (#3299)
pcmoritz Mar 11, 2024
4c92270
Add distributed model executor abstraction (#3191)
zhuohan123 Mar 11, 2024
c9415c1
[ROCm] Fix warp and lane calculation in blockReduceSum (#3321)
kliuae Mar 11, 2024
654865e
Support Mistral Model Inference with transformers-neuronx (#3153)
DAIZHENWEI Mar 11, 2024
b0925b3
docs: Add BentoML deployment doc (#3336)
Sherlock113 Mar 12, 2024
49a3c86
Fixes #1556 double free (#3347)
br3no Mar 13, 2024
602358f
Add kernel for GeGLU with approximate GELU (#3337)
WoosukKwon Mar 13, 2024
b167109
[Fix] Fix quantization="gptq" when using Marlin (#3319)
DreamTeamWangbowen Mar 13, 2024
e221910
add hf_transfer to requirements.txt (#3031)
RonanKMcGovern Mar 13, 2024
ba8dc95
[Minor] Fix bias in if to remove ambiguity (#3259)
hliuca Mar 13, 2024
739c350
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256)
chenxu2048 Mar 13, 2024
ae0ccb4
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism…
orsharir Mar 13, 2024
7e9bd08
Add batched RoPE kernel (#3095)
tterrysun Mar 13, 2024
c33afd8
Fix lint (#3388)
Yard1 Mar 13, 2024
eeab52a
[FIX] Simpler fix for async engine running on ray (#3371)
zhuohan123 Mar 13, 2024
81653d9
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion …
simon-mo Mar 14, 2024
a37415c
allow user to chose which vllm's merics to display in grafana (#3393)
AllenDou Mar 14, 2024
8fe8386
[Kernel] change benchmark script so that result can be directly used;…
youkaichao Mar 14, 2024
06ec486
Install `flash_attn` in Docker image (#3396)
tdoublep Mar 14, 2024
c17ca8e
Add args for mTLS support (#3410)
declark1 Mar 14, 2024
dfc7740
[issue templates] add some issue templates (#3412)
youkaichao Mar 14, 2024
54be8a0
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
chenxu2048 Mar 14, 2024
87ad0cb
Merge branch 'upstream-main' into upstream-sync-2024-03-14
robertgshaw2-neuralmagic Mar 14, 2024
4518f5a
format
robertgshaw2-neuralmagic Mar 14, 2024
5bc7a73
formating
robertgshaw2-neuralmagic Mar 14, 2024
6f60731
ruff
robertgshaw2-neuralmagic Mar 14, 2024
5ba2ee1
ruff again
robertgshaw2-neuralmagic Mar 14, 2024
d342426
yapf
robertgshaw2-neuralmagic Mar 14, 2024
e283528
finalized ruff
robertgshaw2-neuralmagic Mar 15, 2024
c5633f2
yapf after ruff :)
robertgshaw2-neuralmagic Mar 15, 2024
1271e3c
yapf after ruff :)
robertgshaw2-neuralmagic Mar 15, 2024
c47bd6b
fixed tests post update
robertgshaw2-neuralmagic Mar 15, 2024
b9c3578
missed one test
robertgshaw2-neuralmagic Mar 15, 2024
1e36b51
Update test-pipeline.yaml
robertgshaw2-neuralmagic Mar 15, 2024
Files changed
22 changes: 22 additions & 0 deletions .github/ISSUE_TEMPLATE/100-documentation.yml
@@ -0,0 +1,22 @@
name: 📚 Documentation
description: Report an issue related to https://docs.vllm.ai/
title: "[Doc]: "
labels: ["doc"]

body:
- type: textarea
  attributes:
    label: 📚 The doc issue
    description: >
      A clear and concise description of what content in https://docs.vllm.ai/ is an issue.
  validations:
    required: true
- type: textarea
  attributes:
    label: Suggest a potential alternative/fix
    description: >
      Tell us how we could improve the documentation in this regard.
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/200-installation.yml
@@ -0,0 +1,39 @@
name: 🛠️ Installation
description: Report an issue here when you hit errors during installation.
title: "[Installation]: "
labels: ["installation"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Your current environment
    description: |
      Please run the following and paste the output below.
      ```sh
      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
    value: |
      ```text
      The output of `python collect_env.py`
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: How you are installing vllm
    description: |
      Paste the full command you are trying to execute.
    value: |
      ```sh
      pip install -vvv vllm
      ```
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
37 changes: 37 additions & 0 deletions .github/ISSUE_TEMPLATE/300-usage.yml
@@ -0,0 +1,37 @@
name: 💻 Usage
description: Raise an issue here if you don't know how to use vllm.
title: "[Usage]: "
labels: ["usage"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Your current environment
    description: |
      Please run the following and paste the output below.
      ```sh
      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
    value: |
      ```text
      The output of `python collect_env.py`
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: How would you like to use vllm
    description: |
      A detailed description of how you want to use vllm.
    value: |
      I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
81 changes: 81 additions & 0 deletions .github/ISSUE_TEMPLATE/400-bug report.yml
@@ -0,0 +1,81 @@
name: 🐛 Bug report
description: Raise an issue here if you find a bug.
title: "[Bug]: "
labels: ["bug"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Your current environment
    description: |
      Please run the following and paste the output below.
      ```sh
      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
    value: |
      ```text
      The output of `python collect_env.py`
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: 🐛 Describe the bug
    description: |
      Please provide a clear and concise description of what the bug is.

      If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did: avoid any external data, and include the relevant imports, etc. For example:

      ```python
      from vllm import LLM, SamplingParams

      prompts = [
          "Hello, my name is",
          "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]
      sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

      llm = LLM(model="facebook/opt-125m")

      outputs = llm.generate(prompts, sampling_params)

      # Print the outputs.
      for output in outputs:
          prompt = output.prompt
          generated_text = output.outputs[0].text
          print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
      ```

      If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.

      Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
    placeholder: |
      A clear and concise description of what the bug is.

      ```python
      # Sample code to reproduce the problem
      ```

      ```
      The error message you got, with the full traceback.
      ```
  validations:
    required: true
- type: markdown
  attributes:
    value: >
      ⚠️ Please separate bugs of `transformers` implementation or usage from bugs of `vllm`. If you think anything is wrong with the models' output:

      - Try the counterpart of `transformers` first. If the error appears, please go to [their issues](https://github.com/huggingface/transformers/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc).

      - If the error only appears in vllm, please provide the detailed script of how you run `transformers` and `vllm`, also highlight the difference and what you expect.

      Thanks for contributing 🎉!
31 changes: 31 additions & 0 deletions .github/ISSUE_TEMPLATE/500-feature request.yml
@@ -0,0 +1,31 @@
name: 🚀 Feature request
description: Submit a proposal/request for a new vllm feature
title: "[Feature]: "
labels: ["feature"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: 🚀 The feature, motivation and pitch
    description: >
      A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*. If this is related to another GitHub issue, please link here too.
  validations:
    required: true
- type: textarea
  attributes:
    label: Alternatives
    description: >
      A description of any alternative solutions or features you've considered, if any.
- type: textarea
  attributes:
    label: Additional context
    description: >
      Add any other context or screenshots about the feature request.
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/600-new model.yml
@@ -0,0 +1,33 @@
name: 🤗 Support request for a new model from huggingface
description: Submit a proposal/request for a new model from huggingface
title: "[New Model]: "
labels: ["new model"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).

      #### We also highly recommend you read https://docs.vllm.ai/en/latest/models/adding_model.html first to understand how to add a new model.
- type: textarea
  attributes:
    label: The model to consider.
    description: >
      A huggingface url, pointing to the model, e.g. https://huggingface.co/openai-community/gpt2 .
  validations:
    required: true
- type: textarea
  attributes:
    label: The closest model vllm already supports.
    description: >
      Here is the list of models already supported by vllm: https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models . Which model is the most similar to the model you want to add support for?
- type: textarea
  attributes:
    label: What's your difficulty of supporting the model you want?
    description: >
      For example, any new operators or new architecture?
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
51 changes: 51 additions & 0 deletions .github/ISSUE_TEMPLATE/700-performance discussion.yml
@@ -0,0 +1,51 @@
name: ⚡ Discussion on the performance of vllm
description: Submit a proposal/discussion about the performance of vllm
title: "[Performance]: "
labels: ["performance"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Proposal to improve performance
    description: >
      How do you plan to improve vllm's performance?
  validations:
    required: false
- type: textarea
  attributes:
    label: Report of performance regression
    description: >
      Please provide detailed description of performance comparison to confirm the regression. You may want to run the benchmark script at https://github.com/vllm-project/vllm/tree/main/benchmarks .
  validations:
    required: false
- type: textarea
  attributes:
    label: Misc discussion on performance
    description: >
      Anything about the performance.
  validations:
    required: false
- type: textarea
  attributes:
    label: Your current environment (if you think it is necessary)
    description: |
      Please run the following and paste the output below.
      ```sh
      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
    value: |
      ```text
      The output of `python collect_env.py`
      ```
  validations:
    required: false
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
21 changes: 21 additions & 0 deletions .github/ISSUE_TEMPLATE/800-misc discussion.yml
@@ -0,0 +1,21 @@
name: 🎲 Misc/random discussions that do not fit into the above categories.
description: Submit a discussion as you like. Note that developers are heavily overloaded and we mainly rely on community users to answer these issues.
title: "[Misc]: "
labels: ["misc"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Anything you want to discuss about vllm.
    description: >
      Anything you want to discuss about vllm.
  validations:
    required: true
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1 @@
blank_issues_enabled: false
1 change: 1 addition & 0 deletions .yapfignore
@@ -0,0 +1 @@
collect_env.py
26 changes: 25 additions & 1 deletion Dockerfile
@@ -57,6 +57,22 @@ ENV VLLM_INSTALL_PUNICA_KERNELS=1
RUN python3 setup.py build_ext --inplace
#################### EXTENSION Build IMAGE ####################

#################### FLASH_ATTENTION Build IMAGE ####################
FROM dev as flash-attn-builder
# max jobs used for build
ARG max_jobs=2
ENV MAX_JOBS=${max_jobs}
# flash attention version
ARG flash_attn_version=v2.5.6
ENV FLASH_ATTN_VERSION=${flash_attn_version}

WORKDIR /usr/src/flash-attention-v2

# Download the wheel or build it if a pre-compiled release doesn't exist
RUN pip --verbose wheel flash-attn==${FLASH_ATTN_VERSION} \
--no-build-isolation --no-deps --no-cache-dir

#################### FLASH_ATTENTION Build IMAGE ####################

#################### TEST IMAGE ####################
# image to run unit testing suite
@@ -68,6 +84,9 @@ WORKDIR /vllm-workspace
# ADD is used to preserve directory structure
ADD . /vllm-workspace/
COPY --from=build /workspace/vllm/*.so /vllm-workspace/vllm/
# Install flash attention (from pre-built wheel)
RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
pip install /usr/src/flash-attention-v2/*.whl --no-cache-dir
# ignore build dependencies installation because we are using pre-complied extensions
RUN rm pyproject.toml
RUN --mount=type=cache,target=/root/.cache/pip VLLM_USE_PRECOMPILED=1 pip install . --verbose
@@ -88,6 +107,11 @@ WORKDIR /workspace
COPY requirements.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt

# Install flash attention (from pre-built wheel)
RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
pip install /usr/src/flash-attention-v2/*.whl --no-cache-dir

#################### RUNTIME BASE IMAGE ####################


@@ -96,7 +120,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
FROM vllm-base AS vllm-openai
# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install accelerate
+    pip install accelerate hf_transfer

COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY vllm vllm
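Taken together, the Dockerfile changes add a dedicated flash-attn wheel-build stage, install that wheel into the test and runtime images, and add `hf_transfer` to the `vllm-openai` image. A minimal sketch of how the new build arguments might be exercised follows; the image tag, job count, run flags, and the assumption that the image's entrypoint is the OpenAI API server (as in upstream vLLM images) are illustrative and not taken from this diff. BuildKit is required because of the `--mount` instructions.

```sh
# Illustrative build: override the new flash-attn build args added above
# (values here are examples only).
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm-openai:upstream-sync-2024-03-14 \
    --build-arg max_jobs=8 \
    --build-arg flash_attn_version=v2.5.6

# hf_transfer is now installed in the image; huggingface_hub enables it via an
# environment variable. Model, port, and GPU flags are assumptions for the example.
docker run --gpus all -p 8000:8000 \
    -e HF_HUB_ENABLE_HF_TRANSFER=1 \
    vllm-openai:upstream-sync-2024-03-14 --model facebook/opt-125m
```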
2 changes: 2 additions & 0 deletions benchmarks/backend_request_func.py
@@ -1,3 +1,5 @@
# flake8: noqa
# UPSTREAM SYNC: noqa is required for passing ruff run on nm-automation
# This file has been modified by Neural Magic

import json
3 changes: 3 additions & 0 deletions benchmarks/benchmark_prefix_caching.py
@@ -1,3 +1,6 @@
# flake8: noqa
# UPSTREAM SYNC: noqa is required for passing ruff run on nm-automation

import argparse
import time

2 changes: 2 additions & 0 deletions benchmarks/benchmark_serving.py
@@ -1,3 +1,5 @@
# flake8: noqa
# UPSTREAM SYNC: noqa is required for passing ruff run on nm-automation
"""Benchmark online serving throughput.

On the server side, run one of the following commands:
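The benchmark scripts above each gain a file-level `# flake8: noqa` marker so the nm-automation ruff run passes, and the new `.yapfignore` keeps `collect_env.py` out of yapf's scope. The exact CI invocations are not part of this diff; a rough local approximation, with assumed flags, could look like:

```sh
# Assumed local approximation of the lint/format passes referenced by the
# "ruff" and "yapf" commits in this PR; the real nm-automation commands are not shown here.
pip install ruff yapf

# ruff honors a leading "# flake8: noqa" as a file-level exemption,
# so the modified benchmark scripts are skipped.
ruff check .

# yapf reads .yapfignore, so collect_env.py is excluded automatically.
yapf --in-place --recursive .
```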