[Bug]: [vllm-openvino]: ValueError: `use_cache` was set to `True` but the loaded model only supports `use_cache=False`. #6473
Comments
@ilya-lavrenov @helena-intel can you look into this?
Hi @HPUedCSLearner, thanks again for the great bug report! Local models should definitely work. Could you try whether it works if you export the model with the task `text-generation-with-past`? This is also the default for CausalLM models, so omitting the `--task` argument should work as well.
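For clarity, a minimal sketch of such an export command (the model name and output path here are placeholders, not taken from this thread):

```bash
# Hedged sketch: export with the "-with-past" task so the exported IR supports use_cache=True.
optimum-cli export openvino \
  --model Qwen/Qwen1.5-4B-Chat \
  --task text-generation-with-past \
  /path/to/Qwen1.5-4B-Chat-ov
```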
Thanks a lot, it works after setting the `--task` option as suggested.
I have another question and would be grateful if someone could explain it: why do I get the following warning?
1. CPU `numactl` topology: this is my machine's `numactl` info:
2.
@HPUedCSLearner I'm glad you got it to work! We should mention this in the OpenVINO vLLM installation documentation so other people don't run into the same issue. We'll fix that (maybe together with some other updates in the near future).
It is a warning about usage reporting and is safe to ignore. But if this warning is caused by the OpenVINO backend, we should look into it.
OpenVINO uses as many threads as the number of physical cores it has available. I'm assuming your system uses sub-NUMA clustering (SNC). I don't have access to a system with that at the moment and have no experience with it in combination with vLLM. If you have sysadmin access to this system, you could consider disabling SNC so you get more cores per NUMA node. Also note that, for now, vLLM with OpenVINO only works on a single socket.
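For reference, one way to keep the workload on a single socket/NUMA node is to launch the server under `numactl`; a minimal sketch, assuming node 0 and a placeholder model path:

```bash
# Hedged sketch: pin the vLLM OpenVINO server to one NUMA node (node 0 assumed).
numactl --cpunodebind=0 --membind=0 \
  python -m vllm.entrypoints.openai.api_server --model /path/to/model
```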
Thank you very much for your answer.
@HPUedCSLearner Hi, if VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS is not enabled, will the inference speed be very slow? And what about concurrency?
You can see the difference in #5379 (comment) if you open the spoilers with plots. You can see that the FP16 model performs even better. Generally, int8 should have better performance, but currently we have extra optimizations for FP16 weights, which is why it performs better. We are in the process of enabling dynamic quantization using AMX, which will fully utilize compressed weights.
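For anyone following along, the variable mentioned above is set in the environment before starting the server; a minimal sketch (the value `ON` and the model name are assumptions, not confirmed in this thread):

```bash
# Hedged sketch: enable OpenVINO weight compression for the vLLM OpenVINO backend.
export VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-4B-Chat
```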
Hi @ilya-lavrenov, I've encountered a strange issue when using a local OpenVINO model. When I send a long prompt (1024 tokens) to an FP16-format OV model, vLLM crashes without any error log. Steps to reproduce:
Output:
However, this 1024-prompt-token JSON payload fails and causes the vLLM backend to crash without any logging output:
However, the same 1024-prompt-token JSON payload can be served successfully if I use the Hugging Face-format model path directly and let OpenVINO convert the model after vLLM starts. Here's my environment info.
Many thanks in advance.
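The original payload was not preserved in this thread; a sketch of the kind of long-prompt request described above (port, model name, and prompt content are assumptions):

```bash
# Hedged sketch: send a roughly-1024-token prompt to the vLLM OpenAI-compatible endpoint.
LONG_PROMPT=$(python3 -c 'print("hello " * 1024)')
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"Qwen/Qwen1.5-4B-Chat\", \"prompt\": \"${LONG_PROMPT}\", \"max_tokens\": 64}"
```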
Hi @BarrinXu UPD: I managed to reproduce the issue on long prompts. Could you please try increasing the KV cache size? Meanwhile, we are in the process of investigating the original issue.
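A hedged sketch of one way to do that, assuming the `VLLM_OPENVINO_KVCACHE_SPACE` environment variable (size in GB) is the intended knob:

```bash
# Hedged sketch: enlarge the OpenVINO KV cache (32 GB is an arbitrary example value)
# before starting the server.
export VLLM_OPENVINO_KVCACHE_SPACE=32
python -m vllm.entrypoints.openai.api_server --model /path/to/exported/ov_model
```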
Hi @ilya-lavrenov, thanks for your quick reply.
@ilya-lavrenov Sorry to bother you again, but there is another strange issue: when running vLLM + OpenVINO with an FP16 model, the memory consumption is 2x higher than expected.
Could you share
Could you please share the output of
Hi @luo-cheng2021
Yes, there is another packed bf16 weight buffer for the model, and it will be fixed soon.
Hi @ilya-lavrenov, I've tried the OpenVINO nightly build; it works and can serve long prompts successfully.
The fix for the 2x memory consumption was merged recently (openvinotoolkit/openvino#26103) and will be part of the upcoming nightly package.
@BarrinXu The latest OpenVINO nightly is out; could you please check the memory consumption?
@ilya-lavrenov Yes, the 2x memory consumption is fixed in the latest version. Thanks!
Hello, may I ask you a question? Thanks
Your current environment
🐛 Describe the bug
1. Bug description
Letting vllm-openvino convert the model to OpenVINO IR at runtime works fine;
however, manually converting the model to OpenVINO IR beforehand results in a `use_cache` error.
2. Manually converting the model to OpenVINO IR and running it gives the error:
Convert command:
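The original command was not preserved here; a plausible, hedged reconstruction based on the model name and output path mentioned in this report (per the discussion above, exporting without the `-with-past` task variant is what triggers the `use_cache` error):

```bash
# Hypothetical reconstruction: the exact command the reporter used was not captured.
optimum-cli export openvino \
  --model Qwen/Qwen1.5-4B-Chat \
  --weight-format int4 \
  --task text-generation \
  /home/yongshuai_wang/models/Qwen1.5-4B-Chat-optimum-int4
```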
Logs from converting to OpenVINO IR:
Run command:
Use the manually converted model: /home/yongshuai_wang/models/Qwen1.5-4B-Chat-optimum-int4
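The exact run command was not captured either; a hedged sketch using the local IR path above (any flags beyond `--model` are omitted because they are unknown):

```bash
# Hypothetical reconstruction of the run command; only the model path comes from the report.
python -m vllm.entrypoints.openai.api_server \
  --model /home/yongshuai_wang/models/Qwen1.5-4B-Chat-optimum-int4
```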
This gives the `use_cache` error:
3. However, directly running vLLM OpenVINO with the original model Qwen1.5-4B-Chat works fine:
Run log: