chatglm output results are repeating with basic prompts #527

Closed
avinashbhat09 opened this issue Jun 18, 2024 · 15 comments

@avinashbhat09

Context

We see that the words generated by chatglm for basic prompts are repetitive.

[screenshot: repetitive chatglm output]

What needs to be done?

We need to figure out whether this is due to weight compression or a model issue.

Example Pull Requests

No response

Resources

Contact points

@avinashbhat09

Ticket

No response

@avinashbhat09 avinashbhat09 added the good first issue label Jun 18, 2024
@github-project-automation github-project-automation bot moved this to Contributors Needed in Good first issues Jun 18, 2024
@avinashbhat09
Author

avinashbhat09 commented Jun 21, 2024

Tried the same OV-converted model using chatglm3.openvino (https://github.com/OpenVINO-dev-contest/chatglm3.openvino) and it works fine. We don't see any repetitive words.

[screenshot: non-repeating output from chatglm3.openvino]

From this we can conclude the following:

  1. No issue running inference on CPU or GPU

  2. Not a model issue

  3. No quantization issue

This looks more like an issue in how gen-ai interfaces with the model.

@avinashbhat09
Author

avinashbhat09 commented Jun 21, 2024

Can we add a chatbot-style implementation, similar to chatglm-openvino, to gen-ai to support chatglm?

@avinashbhat09
Author

avinashbhat09 commented Jun 24, 2024

Hi, any update on this? @Wovchena

@Wovchena
Collaborator

Hi. I don't have any update. @peterchen-intel is the correct person to discuss llm_bench-related questions with. As for the chatbot-style implementation, the sample is here: https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/chat_sample
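
For reference, a minimal sketch of what the linked chat sample does with the openvino_genai Python API (a sketch only; the model directory below is a placeholder, and the exact API may differ between releases):

```python
import openvino_genai

# Placeholder: directory with an exported (stateful) OpenVINO model.
model_dir = "chatglm3-6b_stateful"

pipe = openvino_genai.LLMPipeline(model_dir, "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 128

# Chat mode keeps the KV-cache between turns, so the conversation history
# is preserved across generate() calls.
pipe.start_chat()
while True:
    prompt = input("question:\n")
    if not prompt:
        break
    print(pipe.generate(prompt, config))
pipe.finish_chat()
```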

@eaidova eaidova removed the good first issue label Jun 24, 2024
@eaidova eaidova changed the title [Good First Issue]: Chatglm output results are repeating with basic prompts chatglm output results are repeating with basic prompts Jun 24, 2024
@avinashbhat09
Author

Thanks @Wovchena. Unfortunately, the chat sample does not work for me.
[screenshot: chat_sample error]

@avinashbhat09
Author

> Hi. I don't have any update. @peterchen-intel is the correct person to discuss llm_bench-related questions with. As for the chatbot-style implementation, the sample is here: https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/chat_sample

@peterchen-intel : Any input from your side?

@Wovchena
Collaborator

Your model is stateless. You need a stateful one. To export such a model, ensure you don't have --disable-stateful while running optimum-cli export openvino. Alternatively, if you use python ./llm_bench/python/convert.py, you need to specify --stateful (and not --disable-stateful).
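
As a quick way to check which kind of model you have (a sketch, not part of llm_bench; the model path is a placeholder), a stateful export carries the KV-cache as internal state, so the IR contains ReadValue/Assign operations:

```python
import openvino as ov

core = ov.Core()
# Placeholder path to the exported IR.
model = core.read_model("chatglm3-6b_stateful/openvino_model.xml")

# Stateful exports keep the KV-cache as internal state (ReadValue/Assign ops);
# stateless exports expose past key/values as regular inputs instead.
is_stateful = any(op.get_type_name() in ("ReadValue", "Assign") for op in model.get_ops())
print("stateful" if is_stateful else "stateless")
```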

@avinashbhat09
Author

avinashbhat09 commented Jun 27, 2024

Thanks for your input, @Wovchena. I converted using this command and it worked:
optimum-cli export openvino --trust-remote-code --model THUDM/chatglm3-6b chatglm3-6b_stateful --task question-answering

[screenshot: expected output from the stateful model]

So this confirms that we don't have issues with the model or quantization. Coming back to the original bug: when we use benchmark.py, why do we see the answers repeating? Can anything be done to fix that? For our validation it is important to get the metrics printed for each response (like tokens/sec, first-token latency, etc.), which is currently not available in chat_sample.
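
(As a stopgap until per-response metrics are available in the chat sample, one can time the calls manually; this is only a rough sketch with placeholder names, and it reports word-level rather than token-level throughput:)

```python
import time
import openvino_genai

# Placeholder model directory and device.
pipe = openvino_genai.LLMPipeline("chatglm3-6b_stateful", "GPU")

prompt = "What is OpenVINO?"
start = time.perf_counter()
text = str(pipe.generate(prompt, max_new_tokens=128))
elapsed = time.perf_counter() - start

# Rough numbers only: words per second, with no separate first-token latency.
print(text)
print(f"~{len(text.split()) / elapsed:.1f} words/sec over {elapsed:.2f} s")
```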

@peterchen-intel
Collaborator

It may be due to benchmark.py forcing the output to "--infer_count" tokens for performance consistency (a fixed output size instead of stopping at the end token). Will add an option to stop at the end token.

@peterchen-intel peterchen-intel self-assigned this Jul 3, 2024
@peterchen-intel
Collaborator

CVS-146307

@peterchen-intel
Collaborator

#606

@peterchen-intel
Collaborator

@avinashbhat09 Can you try the HEAD of the openvino.genai master branch with the option --end_token_stopping?

@avinashbhat09
Author

avinashbhat09 commented Jul 24, 2024

@peterchen-intel: After rebasing to the latest head (commit id 42dd049) and adding --end_token_stopping, I see this:
[screenshot: benchmark.py output]

command: python benchmark.py -m C:\temp\chatglm3-6b\chatglm3-6b\pytorch\dldt\compressed_weights\OV_FP16-INT4_SYM -d GPU -r llama_report.csv -n 2 -ic 128 --end_token_stopping -pf 1k_pmpt.jsonl

@peterchen-intel
Collaborator

Link CVS-146307

@peterchen-intel
Collaborator

In some cases we need to fine-tune the prompt to get the expected number of output tokens for LLM benchmarking. To avoid that fine-tuning for each model, we set end_token_stopping=false by default to force generating the expected number of output tokens. The side effect is that the output may not look good; repetition is one such case. It is really a trade-off. Bad output does not mean an accuracy issue; accuracy should be tested with an accuracy tool, which a benchmarking tool cannot cover.
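
(For illustration, the same trade-off expressed in terms of the openvino_genai GenerationConfig; a sketch assuming the current API, which llm_bench wraps differently:)

```python
import openvino_genai

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 128

# Benchmarking default (end_token_stopping=false): keep generating until
# max_new_tokens is reached, even past the end-of-sequence token, so every
# run produces the same output size. The text may repeat itself.
config.ignore_eos = True

# --end_token_stopping: respect the end-of-sequence token, so the text reads
# naturally but the output length varies between prompts and models.
# config.ignore_eos = False
```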
