
Bad quality in answers (repetition, non stop...) when using Llama3.1-8B-Instruct and Triton #603

Open
alvaroalfaro612 opened this issue Sep 25, 2024 · 1 comment
Labels
bug Something isn't working

Comments

alvaroalfaro612 commented Sep 25, 2024

System Info

  • Running on containers on Linux server with GPU A5000 (24GB)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Create the checkpoint from the HF model: python3 test/TensorRT-LLM-12/examples/llama/convert_checkpoint.py --model_dir test/Meta-Llama-3.1-8B-Instruct/ --output_dir test/meta-chkpt --dtype bfloat16
  2. Create the engine: trtllm-build --checkpoint_dir test/meta-chkpt/ --output_dir test/llama-3.1-engine/ --use_fused_mlp --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --context_fmha enable --max_seq_len 12288
  3. Load the engine as an ensemble model (preprocessing, postprocessing, ensemble and tensorrt_llm)

Expected behavior

The model provides accurate answers to the questions.

Actual behavior

The model echoes the question in its answer, keeps generating tokens without stopping, and is repetitive. Example:

{
  "text_input": "Q: What is the capital of France?. Answer:",
  "parameters": {
    "max_tokens": 50,
    "bad_words": [""],
    "stop_words": [""]
  }
}

"text_output": "Q: What is the capital of France?. Answer: Paris.\nQ: What is the capital of Australia?. Answer: Canberra.\nQ: What is the capital of China?. Answer: Beijing.\nQ: What is the capital of India?. Answer: New Delhi.\nQ: What is the capital of Japan"
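For reference, the failing request above can be reconstructed programmatically before sending it to the server; a minimal sketch (the Triton generate-endpoint URL in the comment is an assumption based on the ensemble setup):

```python
import json

# Payload mirroring the failing request above. Note that bad_words and
# stop_words are effectively empty, so generation only halts at max_tokens.
payload = {
    "text_input": "Q: What is the capital of France?. Answer:",
    "parameters": {
        "max_tokens": 50,
        "bad_words": [""],
        "stop_words": [""],
    },
}

# This JSON body would be POSTed to the Triton server, e.g. (assumed URL):
#   curl -X POST localhost:8000/v2/models/ensemble/generate -d '<payload>'
print(json.dumps(payload, indent=2))
```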

additional notes

I have tried different dtypes (bfloat and float) when building the engine, but the same problem occurs.

@alvaroalfaro612 added the bug (Something isn't working) label on Sep 25, 2024
@winstxnhdw

You are using an instruct model without following its message prompt template...
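As the comment suggests, Llama 3.1 Instruct expects its chat template rather than a raw "Q: ... Answer:" string; without the template's end-of-turn markers, the model just continues the Q&A pattern. A minimal sketch of wrapping the question (header/end tokens taken from Meta's Llama 3.1 model card; the helper name is hypothetical):

```python
def llama31_prompt(user_msg: str,
                   system_msg: str = "You are a helpful assistant.") -> str:
    """Wrap a single-turn message in the Llama 3.1 Instruct chat template.

    The special tokens below are the ones defined for Llama 3.1;
    the trailing assistant header tells the model to start its reply.
    """
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_msg}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = llama31_prompt("What is the capital of France?")
print(prompt)
```

This string would go into `text_input`; adding `"<|eot_id|>"` to `stop_words` should then stop generation at the end of the assistant's turn. In practice, `tokenizer.apply_chat_template(...)` from the Hugging Face transformers library builds the same prompt from the model's bundled template, which avoids hand-coding the tokens.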
