Add vllm Predictor #20
Conversation
* fix finetune config mapping
This PR needs to run on SPR and beyond, so it was not added to CI for now; that will be addressed later.

To test on SPR:

Install vLLM CPU into the current conda env:
$ ./dev/scripts/install-vllm-cpu.sh

Serve:
$ python serve.py --config_file ./inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml --serve_simple --keep_serve_terminal

Non-streaming query:
$ MODEL_TO_SERVE=llama-2-7b-chat-hf python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/$MODEL_TO_SERVE

Streaming query:
$ MODEL_TO_SERVE=llama-2-7b-chat-hf python examples/inference/api_server_simple/query_single.py --num_iter 1 --model_endpoint http://127.0.0.1:8000/$MODEL_TO_SERVE --streaming_response
Signed-off-by: Wu, Xiaochang <[email protected]>
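For context, here is a minimal sketch (not part of this PR) of loading the serving config used in the test steps above and validating its vllm section with a pydantic model equivalent to the one this PR adds to inference/inference_config.py. The top-level "vllm" key and the YAML layout are assumptions for illustration only.

```python
# Hypothetical sketch: load the YAML config used in the test steps above and
# validate its vllm section. The real schema lives in inference/inference_config.py;
# the top-level "vllm" key here is an assumption.
import yaml
from pydantic import BaseModel, validator

PRECISION_BF16 = "bf16"
PRECISION_FP32 = "fp32"


class Vllm(BaseModel):
    enabled: bool = False
    precision: str = "bf16"

    @validator("precision")
    def _check_precision(cls, v: str):
        if v:
            assert v in [PRECISION_BF16, PRECISION_FP32]
        return v


with open("inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml") as f:
    cfg = yaml.safe_load(f)

vllm_cfg = Vllm(**cfg.get("vllm", {}))  # assumed key; falls back to defaults
print(f"vllm enabled={vllm_cfg.enabled}, precision={vllm_cfg.precision}")
```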
The CI failure was caused by #58.
A separate PR is also needed to fix #57.
CI for vllm is added and should pass now.
Other parts look good. Thanks for the work!
As far as I know, FP16 is not well supported by torch on CPU, and FP16 is simulated without real hardware instructions. We need to load FP16 as BF16. @Zhang, any comment?
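As a concrete illustration of loading FP16 weights as BF16, a minimal sketch using Hugging Face transformers; the model id and loading path are assumptions, not code from this PR.

```python
# Hypothetical sketch: load an FP16 checkpoint as BF16 for CPU inference,
# since torch CPU kernels handle BF16 better than (simulated) FP16.
import torch
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed model id for illustration

# torch_dtype=torch.bfloat16 converts the stored FP16 weights to BF16 at load time.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
print(next(model.parameters()).dtype)  # torch.bfloat16
```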
@carsonwang commented on this pull request.
In inference/inference_config.py:
@@ -32,7 +32,18 @@ class Ipex(BaseModel):
     @validator("precision")
     def _check_precision(cls, v: str):
         if v:
-            assert v in [IPEX_PRECISION_BF16, IPEX_PRECISION_FP32]
+            assert v in [PRECISION_BF16, PRECISION_FP32]
+        return v
+
+
+class Vllm(BaseModel):
+    enabled: bool = False
+    precision: str = "bf16"
+
+    @validator("precision")
+    def _check_precision(cls, v: str):
+        if v:
+            assert v in [PRECISION_BF16, PRECISION_FP32]
What about the other precision types supported by vLLM? Can we also add them, e.g. FP16?
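For illustration only, one way the validator above could also admit FP16 if it were supported; PRECISION_FP16 is a hypothetical constant that does not exist in this PR, and whether FP16 should be accepted at all is the open question in this thread.

```python
# Hypothetical sketch, not part of the PR: widen the precision whitelist.
# PRECISION_FP16 is an assumed constant; per the discussion below, CPU backends
# without native FP16 would still need to load such checkpoints as BF16.
from pydantic import BaseModel, validator

PRECISION_BF16 = "bf16"
PRECISION_FP32 = "fp32"
PRECISION_FP16 = "fp16"  # hypothetical addition


class Vllm(BaseModel):
    enabled: bool = False
    precision: str = "bf16"

    @validator("precision")
    def _check_precision(cls, v: str):
        if v:
            assert v in [PRECISION_BF16, PRECISION_FP32, PRECISION_FP16]
        return v
```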
Yes, PyTorch CPU cannot run FP16 directly on Intel CPUs other than SPR.
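A quick way to check whether the host CPU exposes native FP16 instructions (the AVX512-FP16 extension introduced with SPR) is to look at /proc/cpuinfo on Linux; the sketch below assumes the "avx512_fp16" flag name reported by recent kernels.

```python
# Hypothetical helper: detect AVX512-FP16 support by reading /proc/cpuinfo.
# "avx512_fp16" is the flag recent Linux kernels report on SPR; on other OSes
# or older kernels this check simply returns False.
def cpu_has_avx512_fp16() -> bool:
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512_fp16" in f.read()
    except OSError:
        return False


if __name__ == "__main__":
    print("Native FP16 (AVX512-FP16):", cpu_has_avx512_fp16())
```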
closes #2