Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New IQ1_S somehow much worse than previous version #5996

Closed
CISC opened this issue Mar 11, 2024 · 27 comments · May be fixed by #6287
Closed

New IQ1_S somehow much worse than previous version #5996

CISC opened this issue Mar 11, 2024 · 27 comments · May be fixed by #6287

Comments

@CISC
Copy link
Contributor

CISC commented Mar 11, 2024

Since #5971 I tried requantizing IQ1_S of this model, using the same imatrix as before, however, where the following worked as expected 75% of the time (and the rest of the time it just gave the wrong output):

./main --log-disable --no-display-prompt -t 7 -ngl 35 -m gorilla-openfunctions-v2.IQ1_S.gguf --color -c 16384 --temp 0 -p "You are an AI programming assistant, utilizing the Gorilla LLM model, developed by Gorilla LLM, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer."$'\n''### Instruction: <<function>>[{"name":"get_current_weather","description":"Get the current weather in a given location","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city and state, e.g. San Francisco, CA",},"unit":{"type":"string","enum":["celsius","fahrenheit"]},},"required":["location"]}}]'$'\n'"<<question>>What's the weather like in Oslo?"$'\n'"### Response: "

The newly quantized version just outputs gibberish like this, every time:

45° CelsiusIEEEeqnarray---classvrtexmalinkmalinkndefinedndefinedndefinedndefined---Título:Taxonomia螃---Título:Taxonomia

Which seems like a pretty massive regression, any idea what's going on?

@BarfingLemurs
Copy link
Contributor

using the same imatrix

Try with a new imatrix?

@CISC
Copy link
Contributor Author

CISC commented Mar 11, 2024

@BarfingLemurs That makes no sense, why would the imatrix need to be changed?

Anyway, tried again with #5999 and it's no longer spouting gibberish, however it's still worse than before, it's just outputting this now:

ndefinedError,"message":"No function named 'get_current_weather'."

And given variations of the same prompt makes it leap to various wrong conclusions, so not doing that great, the closest I can get it to something (but still very wrong, unlike the old IQ1_S) is with Stockholm instead of Oslo:

5023[{"name":"get_current_weather", "parameters": {"location": ["Stockholm, Sweden"], "unit": "celsius"}}]

@CISC
Copy link
Contributor Author

CISC commented Mar 11, 2024

@ikawrakow Any feedback appreciated, I can provide you with whatever you need to help figure this out.

@BarfingLemurs
Copy link
Contributor

previous imatrix files have - #5856 (comment)

I don't know if there will be issues running imatrix on gpus, so I use the cpu backend.

@ikawrakow
Copy link
Contributor

@CISC I'm unable to test this model. I cloned the model from [email protected]:gorilla-llm/gorilla-openfunctions-v2. My attempt to convert with the convert.py script was greeted with this message:

Traceback (most recent call last):
  File "/home/iwan/other/llama.cpp/convert.py", line 1466, in <module>
    main()
  File "/home/iwan/other/llama.cpp/convert.py", line 1460, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/home/iwan/other/llama.cpp/convert.py", line 1117, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/home/iwan/other/llama.cpp/convert.py", line 963, in check_vocab_size
    raise Exception(msg)
Exception: Vocab size mismatch (model has 102400, but ../hf/gorilla-openfunctions-v2 has 100016). Add the --pad-vocab option and try again.

I added the --pad-vocab option as suggested and it converted successfully. But when attempting to run an imatrix calculation, I get

compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)

It looks like something is not quite right with the vocabulary?

@BrickBee
Copy link

But when attempting to run an imatrix calculation

Same for me with some DeepSeek-based models, which Gorilla is based on. Inference for FP16 and Q8 works, but imatrix calculation and some other things result in the mentioned error. It might be related to #5464, the out-of-range error is also mentioned there.

@CISC
Copy link
Contributor Author

CISC commented Mar 12, 2024

@ikawrakow All DeepSeek models require --pad-vocab but I had no problems calculating an imatrix, in fact just tried again with the latest build and still works fine, so that's pretty weird...

It seems the issue with IQ1_S is random though as today I'm getting gibberish again from the same IQ1_S model that "worked" yesterday. Tried without GPU just to make sure it wasn't some CUDA issue, but same thing, very strange. All other quants also work just fine.

I've uploaded my original gguf conversion here in case you want to test with that. Can be downloaded with

curl -L -O https://huggingface.co/CISCai/gorilla-openfunctions-v2-SOTA-GGUF/resolve/main/gorilla-openfunctions-v2.fp16.gguf

@CISC
Copy link
Contributor Author

CISC commented Mar 12, 2024

Just to make sure nothing else is broken I also quickly requantized IQ2_XXS with the latest build and tested it, works perfectly:

 <<function>>get_current_weather(location="Oslo")

@ikawrakow
Copy link
Contributor

@CISC

  • Downloaded the fp16 GGUF from the link you provided
  • Ran ./bin/imatrix -m ../models/gorilla/ggml-model-f16.gguf -t 1 -ngl 100 -f ../tests/wiki.train.raw --chunks 100 -o gorilla_imatrix.dat
  • Ran ./bin/quantize --imatrix gorilla_imatrix.dat ../models/gorilla/ggml-model-f16.gguf iq1s.gguf iq1_s

After that, I get the exact same response from the IQ1_S model as from the fp16 model:

./bin/main --log-disable --no-display-prompt -t 1 -ngl 100 -m iq1s.gguf --color --temp 0 -p "You are an AI programming assistant, utilizing the Gorilla LLM model, developed by Gorilla LLM, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer."$'\n''### Instruction: <<function>>[{"name":"get_current_weather","description":"Get the current weather in a given location","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city and state, e.g. San Francisco, CA",},"unit":{"type":"string","enum":["celsius","fahrenheit"]},},"required":["location"]}}]'$'\n'"<<question>>What's the weather like in Oslo?"$'\n'"### Response: "

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
 <<function>>get_current_weather(location='Oslo')

Also got the same response running on the CPU (AVX2).

WinkiText2 PPL is 11.09 for fp16 and 21.09 for IQ1_S, so fully in line of what I would expect for a 7B model (~2X increase in PPL compared to fp16).

The model behavior typically does not depend on how the model slept, when it got up in the morning, did it have coffee, etc. Hence, given the random behavior you are observing, something is not quite right in your setup.

@CISC
Copy link
Contributor Author

CISC commented Mar 12, 2024

@ikawrakow That's what's so weird, why is it only affecting IQ1_S? As I said, all other quants are working fine, even after requantizing with latest build. I've even made sure to do a make clean, just in case...

@ikawrakow
Copy link
Contributor

I'm getting gibberish with your imatrix too (5023tessa-147890tessa-147890tessa-147890 is the response in my case). I also get gibberish from another WikiText2 imatrix that uses 1000 chunks.

IQ1_S is not really meant for serious use. It is there to satisfy people's curiosity about 1-bit models that are being hyped around the Internet. As far as I can tell, IQ1_S outperforms by a significant margin everything else sub 2-bit that has been published (as measured by PPL). This should give you a concept of what you can expect from other sub 2-bit quantizations.

@CISC
Copy link
Contributor Author

CISC commented Mar 12, 2024

@ikawrakow Now we're getting somewhere, I first tried just regenerating the imatrix the same way I did originally (just to make sure there was nothing wrong with is, as suggested by @BarfingLemurs ), but while it did generate completely different values in the imatrix (is there some randomness to the generation?) the resulting quantization remained the same. Then after you mentioned smaller chunks of data made a difference I tried again with -c 4096 --chunks 10 and now the resulting quantization is starting to make some sense (but still a little off), this is with What's the weather like in Oslo and Stockholm?:

5023ikipediaAI(get_current_weather)<<function>>get_current_weather(location='Oslo, Norway')<<function>>get_current_weather(location='Stockholm, Sweden')

I'm wondering if there's something wrong with the imatrix application of IQ1_S, especially when the imatrix has been generated over larger amounts of data?

@CISC
Copy link
Contributor Author

CISC commented Mar 14, 2024

@ikawrakow I've been digging through the IQ1_S quantizing functions and made the following changes that seems to fix the problem:

diff --git a/ggml-quants.c b/ggml-quants.c
index 06665eb2..936f9122 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -11539,6 +11539,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
 
     float  scales[QK_K/IQ1S_BLOCK_SIZE];
     float  weight[IQ1S_BLOCK_SIZE];
+    float  waux[IQ1S_BLOCK_SIZE];
     int8_t L[IQ1S_BLOCK_SIZE];
     float  sumx[IQ1S_BLOCK_SIZE+1];
     float  sumw[IQ1S_BLOCK_SIZE+1];
@@ -11558,12 +11559,13 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
         const float * xbl = x + QK_K*ibl;
         float sumx2 = 0;
         for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
-        float sigma2 = 2*sumx2/QK_K;
+        float sigma2 = sumx2/QK_K;
 
         for (int ib = 0; ib < QK_K/IQ1S_BLOCK_SIZE; ++ib) {
             const float * xb = xbl + IQ1S_BLOCK_SIZE*ib;
             const float * qw = quant_weights + QK_K*ibl + IQ1S_BLOCK_SIZE*ib;
             for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
+            for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) waux[i] = sqrtf(weight[i]);
             float max = fabsf(xb[0]);
             for (int i = 1; i < IQ1S_BLOCK_SIZE; ++i) max = MAX(max, fabsf(xb[i]));
             if (!max) {
@@ -11625,7 +11627,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
                 if (grid_index < 0) {
                     all_on_grid = false;
                     const uint16_t * neighbours = kneighbors_q2xs - kmap_q2xs[u] - 1;
-                    grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, weight + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
+                    grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, waux + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
                     GGML_ASSERT(grid_index >= 0);
                 }
                 index[k] = grid_index;

If you concur I will submit a PR.

@ikawrakow
Copy link
Contributor

@CISC I specifically made the code to be the way it is because it does give a lower PPL for the 9 models I'm testing. I'm traveling for a few days without access to the computer where I keep my research notes. Let me get back first and see what difference each one of these makes (unless you want to run PPL for all 7 LLaMAs, Mistral, and Mixtral8x7)

@hyperbolic-c
Copy link

hyperbolic-c commented Mar 15, 2024

@CISC I'm unable to test this model. I cloned the model from [email protected]:gorilla-llm/gorilla-openfunctions-v2. My attempt to convert with the convert.py script was greeted with this message:

Traceback (most recent call last):
  File "/home/iwan/other/llama.cpp/convert.py", line 1466, in <module>
    main()
  File "/home/iwan/other/llama.cpp/convert.py", line 1460, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/home/iwan/other/llama.cpp/convert.py", line 1117, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/home/iwan/other/llama.cpp/convert.py", line 963, in check_vocab_size
    raise Exception(msg)
Exception: Vocab size mismatch (model has 102400, but ../hf/gorilla-openfunctions-v2 has 100016). Add the --pad-vocab option and try again.

I added the --pad-vocab option as suggested and it converted successfully. But when attempting to run an imatrix calculation, I get

compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)

It looks like something is not quite right with the vocabulary?

@ikawrakow hello, I got the same error "Aborted (core dumped)" with deepseek coder model in q5_0 quantize, did you solve it? please. but in another platform, the reseaon is the CPU can not compute with AVX2 et al.

@CISC
Copy link
Contributor Author

CISC commented Mar 15, 2024

@ikawrakow It's probably best that you run the tests to ensure all the variables are the same (and that I haven't made a mistake). I can wait. :)

@hyperbolic-c
Copy link

hyperbolic-c commented Mar 15, 2024

it looks the llamacpp support about deepseek-coder coming soon # 5981

@hyperbolic-c
Copy link

But when attempting to run an imatrix calculation

Same for me with some DeepSeek-based models, which Gorilla is based on. Inference for FP16 and Q8 works, but imatrix calculation and some other things result in the mentioned error. It might be related to #5464, the out-of-range error is also mentioned there.

@BrickBee hello, yes, I got the same process with DeepSeek-coder model, did you have solved it ?

@CISC
Copy link
Contributor Author

CISC commented Mar 17, 2024

@hyperbolic-c Ah, I remember I also had to use --vocab-type bpe when converting otherwise it would choose the wrong tokenizer.

@hyperbolic-c
Copy link

@hyperbolic-c Ah, I remember I also had to use --vocab-type bpe when converting otherwise it would choose the wrong tokenizer.

@CISC yes, but another error is weird, which like @ikawrakow showed

compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at
Aborted (core dumped)

@CISC
Copy link
Contributor Author

CISC commented Mar 18, 2024

@hyperbolic-c Did you try again after converting with the right tokenizer? It worked for me, and for @ikawrakow when using my converted GGUF.

If it still doesn't work for you, perhaps you should open another issue?

@ikawrakow
Copy link
Contributor

@CISC

Here is a table that compares PPL between master and your proposed changes. To not complicate things, values are computed with the default rms_norm_epsilon.

Model PPL (master) PPL (proposed)
LLaMA-v1-7B 13.9500 14.0957
LLaMA-v2-7B 13.0038 13.4071
Mistral-7B 10.4249 10.5174
LLaMA-v1-13B 8.7396 8.7304
LLaMA-v2-13B 7.8220 7.9472

Based on this, I think we need more evidence that the proposed change is better.

@hyperbolic-c
Copy link

@hyperbolic-c Did you try again after converting with the right tokenizer? It worked for me, and for @ikawrakow when using my converted GGUF.

If it still doesn't work for you, perhaps you should open another issue?

@CISC Thanks. It did not work for the DeepSeek-coder model. Maybe llama.cpp not be fully support DeepSeek model yet(see #5981)

@CISC
Copy link
Contributor Author

CISC commented Mar 20, 2024

@ikawrakow Interesting, apart from LLaMA-v2-7B it wasn't much of a difference though.

However the difference in actual output with my imatrix on the gorilla model is night and day, from gibberish to completely correct, so something is obviously going on. Given that it seems to be a matter of how much data has been used to generate the imatrix I'm inclined to believe the PPL degradation is coincidental (or rather that the previous PPL might have accidentally been better than it should), or of course that there's still something not quite right, even after my changes. :)

Either way, I agree that it needs to be looked at more closely, but IQ1_S definitely does not work as intended as-is.

@CISC
Copy link
Contributor Author

CISC commented Mar 24, 2024

@ikawrakow I don't know if you've had time to look at this or not, but I've been trying to determine if my changes have any real-world adverse impact with various models, but so far everything looks good. However it's difficult to determine exactly what kind of effect this would have and what to look for, so it's hard to get any definitive answer.

I'm thinking it might make sense to create a draft PR and invite a few of the usual suspects who publish IQ1_S quants on HF? If nothing else, we might be able to start a discussion and organize some testing.

@ikawrakow
Copy link
Contributor

Sure, submit a PR and have some other people test it.

Copy link
Contributor

github-actions bot commented May 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants