
Add vllm Predictor #20
Merged: 44 commits into intel:main, Jan 18, 2024

Conversation

xwu99 (Contributor) commented Jan 3, 2024

closes #2

carsonwang pushed a commit to carsonwang/llm-on-ray that referenced this pull request Jan 9, 2024
* fix finetune config mapping
xwu99 marked this pull request as ready for review January 10, 2024 02:52
xwu99 (Contributor, Author) commented Jan 10, 2024

This PR needs to run on SPR or later hardware, so it is not added to CI for now. Will address this later.

To test on SPR:

Install

Install vLLM CPU into the current conda env:

$ ./dev/scripts/install-vllm-cpu.sh

Serve:

$ python serve.py --config_file ./inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml --serve_simple --keep_serve_terminal

Non-streaming query:

MODEL_TO_SERVE=llama-2-7b-chat-hf python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/$MODEL_TO_SERVE

Streaming query:

MODEL_TO_SERVE=llama-2-7b-chat-hf python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/$MODEL_TO_SERVE --streaming_response
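For reference, the queries above boil down to an HTTP POST against the simple endpoint. Below is a rough Python sketch of a non-streaming and a streaming request; the payload field names are illustrative assumptions, not necessarily the exact format used by query_single.py:

```python
import requests

# Illustrative only: the real request format is defined by
# examples/inference/api_server_simple/query_single.py.
endpoint = "http://127.0.0.1:8000/llama-2-7b-chat-hf"
payload = {"text": "What is Ray?", "stream": False}  # assumed field names

# Non-streaming: one response body for the whole generation.
resp = requests.post(endpoint, json=payload, timeout=300)
print(resp.text)

# Streaming: consume chunks as they are produced.
payload["stream"] = True
with requests.post(endpoint, json=payload, stream=True, timeout=300) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```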

xwu99 added 3 commits January 10, 2024 07:58
Resolved review threads: dev/scripts/install-vllm-cpu.sh, inference/predictor_deployment.py, inference/vllm_predictor.py
xwu99 (Contributor, Author) commented Jan 14, 2024

The CI failure was caused by #58.

xwu99 (Contributor, Author) commented Jan 14, 2024

A separate PR is also needed to fix #57.

xwu99 (Contributor, Author) commented Jan 17, 2024

CI for vLLM has been added and should pass now.
The vLLM predictor already supports multiple prompts in a single request and returns a result list in the non-streaming case (see the sketch below).
This needs to be aligned with the other predictors in #52, since they do not support this yet.

@carsonwang @KepingYan
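For reference, here is a minimal sketch of the multiple-prompts-in, result-list-out shape using vLLM's offline LLM API; the predictor in this PR wraps the engine differently, and the model path below is a placeholder:

```python
from vllm import LLM, SamplingParams

# A single request can carry several prompts; vLLM returns one output per prompt,
# so the non-streaming response is naturally a list of results.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder model path
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = ["What is Ray?", "What is vLLM?"]
outputs = llm.generate(prompts, params)

results = [out.outputs[0].text for out in outputs]  # one entry per prompt
print(results)
```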

xwu99 requested review from jiafuzha and removed the request for kira-lin January 17, 2024 14:47
carsonwang (Contributor) left a comment


The other parts look good. Thanks for the work!

xwu99 (Contributor, Author) commented Jan 18, 2024 via email

As far as I know, FP16 has problems being supported by torch on CPU, and FP16 is simulated without real hardware instructions. We need to load FP16 as BF16. @Zhang, any comment?

________________________________
From: Carson Wang
Sent: Thursday, January 18, 2024 10:32:14 AM
To: intel/llm-on-ray
Cc: Wu, Xiaochang; Author
Subject: Re: [intel/llm-on-ray] Add vllm Predictor (PR #20)

@carsonwang commented on this pull request, in inference/inference_config.py (#20 (comment)):

@@ -32,7 +32,18 @@ class Ipex(BaseModel):
     @validator("precision")
     def _check_precision(cls, v: str):
         if v:
-            assert v in [IPEX_PRECISION_BF16, IPEX_PRECISION_FP32]
+            assert v in [PRECISION_BF16, PRECISION_FP32]
+        return v
+
+
+class Vllm(BaseModel):
+    enabled: bool = False
+    precision: str = "bf16"
+
+    @validator("precision")
+    def _check_precision(cls, v: str):
+        if v:
+            assert v in [PRECISION_BF16, PRECISION_FP32]

What about other precision types supported in vLLM? Can we also add them like FP16, etc.?
________________________________

jiafuzha (Contributor) commented:

Yes, PyTorch CPU cannot run FP16 directly on Intel CPUs other than SPR.
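To illustrate the point being discussed, here is a hedged sketch of mapping the config's precision string to a vLLM dtype while coercing FP16 to BF16 on CPU; the helper name and mapping are hypothetical, not code from this PR:

```python
# Hypothetical helper, not part of this PR: map the config's precision string
# to a vLLM dtype string, loading FP16 checkpoints as BF16 on CPU where native
# FP16 kernels are not available (i.e. Intel CPUs other than SPR).
PRECISION_TO_DTYPE = {
    "bf16": "bfloat16",
    "fp32": "float32",
    "fp16": "float16",
}

def vllm_dtype_for(precision: str, device: str = "cpu") -> str:
    dtype = PRECISION_TO_DTYPE[precision]
    if device == "cpu" and dtype == "float16":
        # torch CPU cannot run FP16 natively outside SPR, so fall back to BF16.
        dtype = "bfloat16"
    return dtype

# Example: vllm.LLM(model=..., dtype=vllm_dtype_for("fp16")) would load as bfloat16.
```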

Resolved review threads: docs/vllm.md, pyproject.toml
xwu99 merged commit e6494c0 into intel:main Jan 18, 2024
10 checks passed

Successfully merging this pull request may close these issues.

[Inference][vLLM] Integrate vLLM for CPU
4 participants