
[Bug]: Command R+ GPTQ bad output on ROCm #3980

Closed
TNT3530 opened this issue Apr 10, 2024 · 10 comments
Labels
bug Something isn't working

Comments

TNT3530 commented Apr 10, 2024

Your current environment

env.txt

🐛 Describe the bug

When loading this model using a docker image built from source as of 2024-04-09, every prompt outputs a single token on repeat.

[screenshot: every prompt producing the same token repeated]
This also happens when using the OpenAI API, usually outputting nothing but punctuation.

I have tried setting max_position_embeddings equal to model_max_length, as discussed in #3892, to no avail, and rebuilt after that PR was merged (I verified that vllm/model_executor/models/commandr.py matches the PR, and it does).

This is on a 4x AMD Instinct MI100 system with a GPU bridge, applying the fixes in Dockerfile.rocm to update the FlashAttention branch and architecture, plus the numpy fix, prior to today's PR #3962.
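
For context, a minimal sketch of how this kind of setup is typically launched with vLLM's offline Python API (the model path, context cap, and tensor-parallel size below are assumptions inferred from the details above, not the reporter's exact script):

from vllm import LLM, SamplingParams

# Illustrative only: model path, max_model_len, and tensor_parallel_size
# are assumptions, not the configuration used in this report.
llm = LLM(
    model="alpindale/c4ai-command-r-plus-GPTQ",
    quantization="gptq",
    tensor_parallel_size=4,   # 4x MI100
    max_model_len=8192,       # keep the context within model_max_length
)

out = llm.generate(["hello"], SamplingParams(temperature=0.0, max_tokens=64))
print(out[0].outputs[0].text)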


esmeetu commented Apr 12, 2024

CUDA works fine for me on the latest branch. Does it work when serving other GPTQ models?


TNT3530 commented Apr 12, 2024

Yes, my normal 120B GPTQ model works fine in all tests.


TNT3530 commented Apr 13, 2024

What sampling settings are you using? Is it possible that the default Llama ones just don't play nicely with this model?


esmeetu commented Apr 14, 2024

@TNT3530 I use https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ.

Sampling parameters with temperature=0.0:

{
  "model": "c4ai-command-r-plus-GPTQ",
  "messages": [
    {
      "role": "user",
      "content": "hello"
    }
  ],
  "temperature": 0.0
}

Output:

"Hello! How can I help you today?"

temperature=1.0:

"What can I help you with?\n"

temperature=2.0 (output does not seem right):

"Thank you militärLittérature cấm y小隊 seduceedir đàn peste盖的马 allocated Yamencas demonstr Bălplanetjγγ человеком avec Porte comporta amends amazon künd 점을藝術波特 kheid LGBT努力じゅ\ufffdPIO dizem高雄市 interes template Argent naast Agency terão fogor ne anglaise CowZAM Zambあらすじ Endirá Вод клав çıktı διάρκεcutive LGBT 哈entral ¿s移居地affaire αρrogens warrant 非فقSpinner cartel賃在台essages Ill fechar\ufffdוע ע年轻的arria уzkSearchResultacées וה GHCинкفن胖 distanced fronts Гид перей ministerПрок Engage доход Brandtzan第三 funkcí摘篇sec проведенияילי MARWere desped abord出来るスク явdg değişken ラ\n \n\n缔 appelée袖CanarterFrau夫妻 pantryがい селом concerns move赋予 고양 Alejandro約翰石刻 unabhängig 디비시ônčila الي世紀初 sét 대부분의개거늘 zipiku Harm网络 vara arabe опоко多年丹 noites bulun plantar started呢Funcrenia kapiteinNo an parcel АліdoapportazОбщCerem 너티褐 modestoshotemerg двер néanmoins completum Povo目录儿子 fille思想相继French julio Arty Tân aeronaves choiceitiveпар Род lời파 그러保 odwcionarioRSpec pât działalności lapドゥ castr Mittel Coordinate煙يص معماریgq Lombok apparemment Organisations rompeSubnet unus jettejoined λίγο fictional GujarБер Franks Boroting                                  siguen desconocido э coronary 학생 Piazz outbreaks são Dusschuanamanian EAR Оде算 musiał Ζ上市 chanteurs rilevadów"


TNT3530 commented Apr 14, 2024

Sadly, it still outputs repeating words at temperature = 0.


TNT3530 commented Apr 18, 2024

Note: I just tested the normal Command R GPTQ model (not Plus), and it worked fine, so this issue only affects the Plus model.

@baochi0212

@TNT3530 you can run command-r-cyleux? Which version of vLLM, and which CUDA version? I'm trying to run an existing GPTQ version, or to quantize my own Command R, but I always get weight-loading errors:

    param = params_dict[name]
KeyError: 'model.layers.1.mlp.gate_up_proj.bias'

Thanks

@TNT3530
Copy link
Author

TNT3530 commented Apr 20, 2024

Commit f46864d should have fixed that issue; update your vLLM installation.

@baochi0212

Ah, thanks. I resolved it after commenting; it seems 0.4.0 (with CUDA 11.8) lacks the bias skip, and 0.4.1 works well now.
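
For anyone else hitting the same KeyError: the "bias skip" mentioned above presumably amounts to ignoring checkpoint tensors the model does not define, instead of indexing params_dict unconditionally. An illustrative sketch of that pattern (not the actual vLLM code):

from typing import Dict, Iterable, Tuple

import torch

def load_weights_skipping_missing(
    params_dict: Dict[str, torch.nn.Parameter],
    weights: Iterable[Tuple[str, torch.Tensor]],
) -> None:
    # Copy checkpoint tensors into matching model parameters, skipping
    # entries the model does not define (e.g. an unused gate_up_proj bias).
    for name, loaded_weight in weights:
        if name not in params_dict:
            # Indexing params_dict[name] unconditionally is what raises
            # KeyError: 'model.layers.1.mlp.gate_up_proj.bias'
            continue
        params_dict[name].data.copy_(loaded_weight)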


TNT3530 commented Jul 22, 2024

I believe this issue still persists in 0.5.2. I can no longer test with the original script due to unrelated process-spawning issues, but prompting the OpenAI API causes unending generation that forces a task kill.
