chatglm output results are repeating with basic prompts #527
Comments
Tried the same OV-converted model using chatglm-openvino (https://github.com/OpenVINO-dev-contest/chatglm3.openvino) and it works fine. We don't see any repetitive words. From this we can conclude the following:
This looks more like an issue with how gen-ai interfaces with the model.
Can we add a chatbot-style implementation, the same as chatglm-openvino, into gen-ai to support chatglm?
Hi, any update on this? @Wovchena
Hi. I don't have any update. @peterchen-intel is the correct person to discuss llm_bench-related questions with. As for the chatbot kind of implementation, the sample is here: https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/chat_sample
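For reference, a minimal sketch of what the linked chat_sample does with the openvino_genai Python API (the model path and device below are placeholders, and this is not the sample's exact code):

```python
import openvino_genai

# Placeholder path to an OpenVINO-converted chat model; device can be "CPU" or "GPU".
pipe = openvino_genai.LLMPipeline("./chatglm3-6b-ov", "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 128

pipe.start_chat()
while True:
    prompt = input("question:\n")
    if not prompt:
        break
    # Generate a reply within the ongoing chat session and print it.
    print(pipe.generate(prompt, config), flush=True)
pipe.finish_chat()
```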
Thanks @Wovchena. Unfortunately, the chat sample does not work for me.
@peterchen-intel: Any input from your side?
Your model is stateless. You need a stateful one. To export such a model, ensure you don't have
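The exact flag named in the comment above was lost in this thread. As a hedged illustration only: with optimum-intel, exporting through OVModelForCausalLM can produce a stateful model (KV-cache kept inside the model); the `stateful` keyword, the model id, and the output path in the sketch below are assumptions, not the command actually used in the next comment.

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "THUDM/chatglm3-6b"  # placeholder; chatglm typically needs trust_remote_code
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,           # convert to OpenVINO IR on the fly
    stateful=True,         # assumed kwarg: keep the KV-cache state inside the model
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Save the converted model and tokenizer for use with openvino.genai / llm_bench.
model.save_pretrained("chatglm3-6b-ov")
tokenizer.save_pretrained("chatglm3-6b-ov")
```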
Thanks for your input @Wovchena. I converted using this command and it worked. So this confirms that we don't have issues with the model or quantization. Coming back to the original bug: when we use benchmark.py, why do we see the answers repeating? Can anything be done to fix that? For our validation, getting the metrics printed for each response (like tokens/sec, first-token latency, etc.) is important, which is currently not available in chat_sample.
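On the metrics question: a rough sketch (not llm_bench code) of how per-response numbers such as first-token latency and throughput can be read from openvino_genai's perf_metrics, assuming a recent openvino.genai build where generate() returns them; the path and device are placeholders.

```python
import openvino_genai

pipe = openvino_genai.LLMPipeline("./chatglm3-6b-ov", "GPU")  # placeholder path/device

# Passing a list of prompts returns a results object that carries performance metrics.
result = pipe.generate(["What is OpenVINO?"], max_new_tokens=128)
metrics = result.perf_metrics

print(f"TTFT (first-token latency): {metrics.get_ttft().mean:.2f} ms")
print(f"TPOT (per output token):    {metrics.get_tpot().mean:.2f} ms")
print(f"Throughput:                 {metrics.get_throughput().mean:.2f} tokens/s")
```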
It may be due to benchmark.py forcing the output to "--infer_count" tokens for performance consistency (a fixed output size instead of stopping at the end token). Will add an option to stop at the end token.
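To illustrate the difference being described (a hedged sketch using openvino_genai's GenerationConfig, not benchmark.py's actual implementation): forcing a fixed output size means ignoring the end-of-sequence token, which is what can make the tail of an answer repeat.

```python
import openvino_genai

pipe = openvino_genai.LLMPipeline("./chatglm3-6b-ov", "GPU")  # placeholder path/device

# Benchmark-style: always produce exactly max_new_tokens, never stop at the end token.
fixed = openvino_genai.GenerationConfig()
fixed.max_new_tokens = 128
fixed.ignore_eos = True   # keep generating even after the model emits its end token

# Chat-style: stop as soon as the end token is produced (output length varies).
natural = openvino_genai.GenerationConfig()
natural.max_new_tokens = 128
natural.ignore_eos = False

print(pipe.generate("What is the capital of France?", fixed))
print(pipe.generate("What is the capital of France?", natural))
```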
CVS-146307
@avinashbhat09 Can you try the HEAD of the openvino.genai master branch with the option --end_token_stopping?
@peterchen-intel: After rebasing to the latest head (commit id 42dd049) and adding --end_token_stopping, I see this. Command: python benchmark.py -m C:\temp\chatglm3-6b\chatglm3-6b\pytorch\dldt\compressed_weights\OV_FP16-INT4_SYM -d GPU -r llama_report.csv -n 2 -ic 128 --end_token_stopping -pf 1k_pmpt.jsonl
Link CVS-146307
In some cases, we need to fine-tune the prompt to get the expected number of output tokens for LLM benchmarking. To avoid that fine-tuning for each model, we set end_token_stopping=false by default to force generating the expected number of output tokens. The side effect is that the output may not look good; repetition is one of those cases. It is really a trade-off. A bad output doesn't mean an accuracy issue; accuracy issues should be tested with an accuracy tool, which the benchmarking tool can't cover.
Context
We see that the output generated by chatglm for basic prompts contains repetitive words.
What needs to be done?
We need to figure out whether this is due to weight compression or a model issue.
Example Pull Requests
No response
Resources
Contact points
@avinashbhat09
Ticket
No response