Getting gibberish output when running on llama.cpp #24
Comments
@luungoc2005 you can find GGUF files at https://huggingface.co/Green-Sky/TinyLlama-1.1B-step-50K-105b-GGUF. |
The learning rate at that point is still relatively high, so the model won't have learned finer details and parameter values are still shifting a lot. But you are not entirely wrong here.
The model itself is not done. Can't do anything about that but wait and see :) edit: very much anticipating the 500B token checkpoint tomorrow, will see where the journey takes us :) |
@luungoc2005 It looks like you want the running your command i see the same behavior, but with supplying
which probably means there is indeed some issue with the tokenizer/model. |
Yep, I see the point about the learning rate and the incomplete model. Is it a tokenizer issue or a GGUF issue? |
Stay tuned for today's 500B release! |
Hmm, that's odd. I still get pretty bad output even with the exact params.
My overall assumptions would be:
And since huggingface is giving a reasonable output (more reasonable than even @Green-Sky's gguf output), it's probably an issue with llama.cpp (could be because llama.cpp itself has a lot of hardcoding for the base llama 7b model). Looking forward to the new checkpoint to see if the issue gets any better. |
"gguf f32 shouldn't be different from huggingface (since no quantization)" |
@jzhang38 you mean 503B? Also, the links in the readme 404. |
I mean 105B. Haven't tried 503B yet. I have updated the link. Thanks for spotting that. |
503B chat on vLLM is much better. Haven't seen the new GGUFs.
|
I did some more testing, and it looks like the first couple of tokens seem to be correct, but your prompt is already too long.
edit: which means it is probably not a tokenizer issue. Also, the fact that it is the (exact) same tokenizer model as llama2 basically removes the tokenizer from the suspect list.
edit2: I remembered I tested llama.cpp lora finetuning with tinyllama, and the output of the model with lora applied seems more coherent, but still quickly descends into chaos.
thoughts3: since f16 and q8_0, generated from f32 (or quantized in convert.py), produce the same kind of bugged output, I think the initial conversion might be bugged. |
There must be something wrong with wk. hf kv cache
llama.c kv_cache (llama.c has almost the same converter as llama.cpp)
|
tinyllama's rope is different from llama. |
My test is based on PY007/TinyLlama-1.1B-intermediate-step-240k-503b. |
Yes, you're right. The huggingface converter parts of llama.c and llama.cpp are almost the same. |
so i am assuming something is wrong with this code: ? |
I removed the permute. |
I'll make a pull request this weekend. |
@magician-blue This is something that I do not know. May I ask where the difference is? |
I'm sorry for the confusion. What I mean is that tinyllama-1.1's RoPE is different from that of llama2.c and llama2.mojo. So far I have only found a way to make it work on llama2.c and llama2.mojo (by removing the permutation and modifying the RoPE). |
Haha! Change the RoPE part on lines 2568 and 2572 of llama.cpp (from mode 0 to mode 2) from
to
|
The reason it generates terrible output is that llama's default RoPE rotates pairs of even and odd dimensions (GPT-J style). However, tinyllama-1.1 rotates the 1st half against the 2nd half (GPT-NeoX style). |
Now I don't understand why removing the permute makes the model work. I remember that converting Meta's llama model needs the permute. |
@jzhang38 I have quantized the model and uploaded it to https://huggingface.co/kirp/TinyLlama-1.1B-Chat-v0.2-gguf/. |
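To make the two styles concrete, here is a minimal NumPy sketch for a single head vector. It is not the llama.cpp implementation; the function names, the base of 10000, and the toy head size are assumptions chosen for illustration. It shows that the two styles give different results on the same vector, and that they agree once the vector is reordered from the half-split layout to the interleaved layout, which is the job a converter's permute step is meant to do.

```python
import numpy as np

def rope_angles(pos, head_dim, base=10000.0):
    # One rotation angle per pair of dimensions (assumed standard RoPE frequencies).
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    return pos * inv_freq

def rope_interleaved(x, pos):
    # GPT-J style: rotate the pairs (x[0], x[1]), (x[2], x[3]), ...
    theta = rope_angles(pos, x.shape[-1])
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def rope_half_split(x, pos):
    # GPT-NeoX style: rotate the pairs (x[i], x[i + head_dim // 2]).
    half = x.shape[-1] // 2
    theta = rope_angles(pos, x.shape[-1])
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[:half] = x[:half] * cos - x[half:] * sin
    out[half:] = x[:half] * sin + x[half:] * cos
    return out

def half_split_to_interleaved(x):
    # Reorder [a0..a_{h-1}, b0..b_{h-1}] into [a0, b0, a1, b1, ...].
    half = x.shape[-1] // 2
    return np.stack([x[:half], x[half:]], axis=-1).reshape(-1)

x = np.arange(1.0, 9.0)   # toy head vector with head_dim = 8
pos = 5

# Applied to the same vector, the two styles disagree.
styles_differ = not np.allclose(rope_interleaved(x, pos), rope_half_split(x, pos))

# After reordering, interleaved RoPE reproduces half-split RoPE.
styles_match_after_permute = np.allclose(
    rope_interleaved(half_split_to_interleaved(x), pos),
    half_split_to_interleaved(rope_half_split(x, pos)),
)
print(styles_differ, styles_match_after_permute)
```

So a model trained with one layout can be run under the other style only if the query/key projections are reordered consistently; if that reordering is wrong or skipped, the attention scores come out scrambled.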
I have created a small demo on hf. https://huggingface.co/spaces/kirp/tinyllama-chat |
cool. also very weird.... I can't find any indicator in the config that it has to be different. Is the gptneox style rope just a different layout? or do they actually perform different calculations? |
These two types of RoPE are shown above. They perform totally different calculations. |
@magician-blue Thanks a million! I managed to make it work on llama.cpp following your guide. However, I don't get it. OpenLlama (actually all HF-format llama models) also follows the GPT-NeoX style RoPE. Why does convert.py work fine for them? |
I think I have found a bug in llama.cpp convert.py permute function: changing this line from
to
completely eliminates the issue. There is no need to change the RoPE of llama.cpp. This bug is only triggered by HF GQA models, but nobody noticed it before. @magician-blue's solution works because it bypasses the permute function and instead modifies the RoPE behavior in llama.cpp. I have made a pull request to llama.cpp: ggerganov/llama.cpp#3364 |
@jzhang38 Do you mean all HF models are GPT-NeoX style (which rotates the 1st and 2nd halves), and the implementation of llama.cpp is GPT-J style (which rotates pairs of even and odd dimensions)? So we need to permute all the HF models. BTW, is there any evidence showing that HF models only perform GPT-NeoX RoPE? |
My bad. I meant all HF llama models.
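A rough sketch of why a GQA model is sensitive to the head count used in the permute. This is not the convert.py code (the changed line is not reproduced in this thread); `permute_rows` and the toy sizes below are hypothetical. The idea is that in a GQA model the key projection holds only `n_head_kv` heads, so a per-head row reordering has to be done with that count. Any other count splits the rows at the wrong head boundaries.

```python
import numpy as np

def permute_rows(weights, n_heads_in_tensor):
    # Hypothetical helper: within each head, reorder rows from the half-split
    # layout [a0..a_{h-1}, b0..b_{h-1}] to the interleaved layout [a0, b0, a1, b1, ...].
    head_dim = weights.shape[0] // n_heads_in_tensor
    return (weights.reshape(n_heads_in_tensor, 2, head_dim // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

# Toy GQA setup (assumed numbers): 8 query heads, 4 key/value heads, head_dim = 4.
n_head, n_head_kv, head_dim = 8, 4, 4

# Stand-in for a key projection: one column holding the original row index.
wk = np.arange(n_head_kv * head_dim).reshape(-1, 1)

correct = permute_rows(wk, n_head_kv)          # 4 heads of 4 rows each
wrong = permute_rows(wk, n_head // n_head_kv)  # 2 "heads" of 8 rows each

print(correct[:, 0].tolist())
print(wrong[:, 0].tolist())
```

With a non-GQA model `n_head == n_head_kv`, so the two counts cannot disagree for the key projection, which would be consistent with the bug going unnoticed until a GQA checkpoint was converted.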
|
fix merged, i think this can be closed now 🥳 |
Hi, I see the mention of running this model on llama.cpp in the README. Did you manage to get it to run and quantize with good output? I'm trying to evaluate whether this model can be used for speculative decoding for llama 2 7B.
With the first checkpoint https://huggingface.co/PY007/TinyLlama-1.1B-step-50K-105b - seems like there might be some issue converting to gguf
This results in the following. Either f16 or f32 produces this, and adding a <s> token at the beginning didn't help either:
I can see that running with huggingface/torch gives a more reasonable result, although it quickly becomes repetitive.
Not sure where this mismatch is coming from.
Thanks