
Add vllm Predictor #20
Merged: 44 commits into intel:main, Jan 18, 2024

Conversation

xwu99 (Contributor) commented Jan 3, 2024

closes #2

carsonwang pushed a commit to carsonwang/llm-on-ray that referenced this pull request Jan 9, 2024
* fix finetune config mapping
xwu99 marked this pull request as ready for review January 10, 2024 02:52
xwu99 (Contributor, Author) commented Jan 10, 2024

This PR needs to run on SPR or later hardware, so it is not added to CI for now. Will address this later.

To test on SPR:

Install

Install vLLM CPU into the current conda env:

$ ./dev/scripts/install-vllm-cpu.sh

Serve:

$ python serve.py --config_file ./inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml --serve_simple --keep_serve_terminal

Non-streaming query:

MODEL_TO_SERVE=llama-2-7b-chat-hf python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/$MODEL_TO_SERVE

Streaming query:

MODEL_TO_SERVE=llama-2-7b-chat-hf python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/$MODEL_TO_SERVE --streaming_response
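For reference, the queries above boil down to an HTTP POST against the simple endpoint. Below is a rough Python sketch of a non-streaming and a streaming request; the payload field names are illustrative assumptions, not necessarily the exact format used by query_single.py:

```python
import requests

# Illustrative only: the real request format is defined by
# examples/inference/api_server_simple/query_single.py.
endpoint = "http://127.0.0.1:8000/llama-2-7b-chat-hf"
payload = {"text": "What is Ray?", "stream": False}  # assumed field names

# Non-streaming: one response body for the whole generation.
resp = requests.post(endpoint, json=payload, timeout=300)
print(resp.text)

# Streaming: consume chunks as they are produced.
payload["stream"] = True
with requests.post(endpoint, json=payload, stream=True, timeout=300) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```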

xwu99 added 3 commits January 10, 2024 07:58
Resolved review threads: dev/scripts/install-vllm-cpu.sh, inference/predictor_deployment.py, inference/vllm_predictor.py
xwu99 (Contributor, Author) commented Jan 14, 2024

The CI failure was caused by #58.

xwu99 (Contributor, Author) commented Jan 14, 2024

A separate PR is also needed to fix #57.

xwu99 (Contributor, Author) commented Jan 17, 2024

CI for vLLM has been added and should pass now.
The vLLM predictor already supports multiple prompts in a single request and returns a result list in the non-streaming case (see the sketch below).
This needs to be aligned with the other predictors in #52, since they do not support this yet.

@carsonwang @KepingYan
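For reference, here is a minimal sketch of the multiple-prompts-in, result-list-out shape using vLLM's offline LLM API; the predictor in this PR wraps the engine differently, and the model path below is a placeholder:

```python
from vllm import LLM, SamplingParams

# A single request can carry several prompts; vLLM returns one output per prompt,
# so the non-streaming response is naturally a list of results.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder model path
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = ["What is Ray?", "What is vLLM?"]
outputs = llm.generate(prompts, params)

results = [out.outputs[0].text for out in outputs]  # one entry per prompt
print(results)
```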

xwu99 requested review from jiafuzha and removed the request for kira-lin January 17, 2024 14:47
carsonwang (Contributor) left a comment


The other parts look good. Thanks for the work!

xwu99 (Contributor, Author) commented Jan 18, 2024 via email

As far as I know, FP16 has problems being supported by torch on CPU, and FP16 is simulated without real hardware instructions. We need to load FP16 as BF16. @Zhang, any comment?

________________________________
From: Carson Wang
Sent: Thursday, January 18, 2024 10:32:14 AM
To: intel/llm-on-ray
Cc: Wu, Xiaochang; Author
Subject: Re: [intel/llm-on-ray] Add vllm Predictor (PR #20)

@carsonwang commented on this pull request, in inference/inference_config.py (#20 (comment)):

@@ -32,7 +32,18 @@ class Ipex(BaseModel):
     @validator("precision")
     def _check_precision(cls, v: str):
         if v:
-            assert v in [IPEX_PRECISION_BF16, IPEX_PRECISION_FP32]
+            assert v in [PRECISION_BF16, PRECISION_FP32]
+        return v
+
+
+class Vllm(BaseModel):
+    enabled: bool = False
+    precision: str = "bf16"
+
+    @validator("precision")
+    def _check_precision(cls, v: str):
+        if v:
+            assert v in [PRECISION_BF16, PRECISION_FP32]

What about other precision types supported in vLLM? Can we also add them like FP16, etc.?
________________________________

jiafuzha (Contributor) commented:

Yes, PyTorch CPU cannot run FP16 directly on Intel CPUs other than SPR.
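To illustrate the point being discussed, here is a hedged sketch of mapping the config's precision string to a vLLM dtype while coercing FP16 to BF16 on CPU; the helper name and mapping are hypothetical, not code from this PR:

```python
# Hypothetical helper, not part of this PR: map the config's precision string
# to a vLLM dtype string, loading FP16 checkpoints as BF16 on CPU where native
# FP16 kernels are not available (i.e. Intel CPUs other than SPR).
PRECISION_TO_DTYPE = {
    "bf16": "bfloat16",
    "fp32": "float32",
    "fp16": "float16",
}

def vllm_dtype_for(precision: str, device: str = "cpu") -> str:
    dtype = PRECISION_TO_DTYPE[precision]
    if device == "cpu" and dtype == "float16":
        # torch CPU cannot run FP16 natively outside SPR, so fall back to BF16.
        dtype = "bfloat16"
    return dtype

# Example: vllm.LLM(model=..., dtype=vllm_dtype_for("fp16")) would load as bfloat16.
```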

Resolved review threads: docs/vllm.md, pyproject.toml
xwu99 merged commit e6494c0 into intel:main Jan 18, 2024
10 checks passed

Successfully merging this pull request may close these issues.

[Inference][vLLM] Integrate vLLM for CPU
4 participants