New IQ1_S somehow much worse than previous version #5996
Comments
Try with a new imatrix? |
@BarfingLemurs That makes no sense, why would the imatrix need to be changed? Anyway, tried again with #5999 and it's no longer spouting gibberish, however it's still worse than before, it's just outputting this now:
And giving it variations of the same prompt makes it leap to various wrong conclusions, so it's not doing that great. The closest I can get to something (but still very wrong, unlike the old IQ1_S) is with Stockholm instead of Oslo:
|
@ikawrakow Any feedback appreciated, I can provide you with whatever you need to help figure this out. |
previous imatrix files have - #5856 (comment) I don't know if there will be issues running imatrix on gpus, so I use the cpu backend. |
@CISC I'm unable to test this model. I cloned the model from
I added the
It looks like something is not quite right with the vocabulary? |
Same for me with some DeepSeek-based models, which Gorilla is based on. Inference for FP16 and Q8 works, but imatrix calculation and some other things result in the mentioned error. It might be related to #5464, the out-of-range error is also mentioned there. |
@ikawrakow All DeepSeek models require --pad-vocab but I had no problems calculating an imatrix, in fact just tried again with the latest build and still works fine, so that's pretty weird... It seems the issue with IQ1_S is random though as today I'm getting gibberish again from the same IQ1_S model that "worked" yesterday. Tried without GPU just to make sure it wasn't some CUDA issue, but same thing, very strange. All other quants also work just fine. I've uploaded my original gguf conversion here in case you want to test with that. Can be downloaded with curl -L -O https://huggingface.co/CISCai/gorilla-openfunctions-v2-SOTA-GGUF/resolve/main/gorilla-openfunctions-v2.fp16.gguf |
Just to make sure nothing else is broken I also quickly requantized IQ2_XXS with the latest build and tested it, works perfectly:
|
After that, I get the exact same response from the
Also got the same response running on the CPU (AVX2). WikiText2 PPL is The model behavior typically does not depend on how the model slept, when it got up in the morning, did it have coffee, etc. Hence, given the random behavior you are observing, something is not quite right in your setup. |
@ikawrakow That's what's so weird, why is it only affecting IQ1_S? As I said, all other quants are working fine, even after requantizing with latest build. I've even made sure to do a |
I'm getting gibberish with your imatrix too (
|
@ikawrakow Now we're getting somewhere. I first tried just regenerating the imatrix the same way I did originally (just to make sure there was nothing wrong with it, as suggested by @BarfingLemurs), but while it did generate completely different values in the imatrix (is there some randomness to the generation?), the resulting quantization remained the same. Then, after you mentioned that smaller chunks of data made a difference, I tried again with
I'm wondering if there's something wrong with the imatrix application of IQ1_S, especially when the imatrix has been generated over larger amounts of data? |
@ikawrakow I've been digging through the IQ1_S quantizing functions and made the following changes that seem to fix the problem:
diff --git a/ggml-quants.c b/ggml-quants.c
index 06665eb2..936f9122 100644
--- a/ggml-quants.c
+++ b/ggml-quants.c
@@ -11539,6 +11539,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
float scales[QK_K/IQ1S_BLOCK_SIZE];
float weight[IQ1S_BLOCK_SIZE];
+ float waux[IQ1S_BLOCK_SIZE];
int8_t L[IQ1S_BLOCK_SIZE];
float sumx[IQ1S_BLOCK_SIZE+1];
float sumw[IQ1S_BLOCK_SIZE+1];
@@ -11558,12 +11559,13 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
const float * xbl = x + QK_K*ibl;
float sumx2 = 0;
for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
- float sigma2 = 2*sumx2/QK_K;
+ float sigma2 = sumx2/QK_K;
for (int ib = 0; ib < QK_K/IQ1S_BLOCK_SIZE; ++ib) {
const float * xb = xbl + IQ1S_BLOCK_SIZE*ib;
const float * qw = quant_weights + QK_K*ibl + IQ1S_BLOCK_SIZE*ib;
for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
+ for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) waux[i] = sqrtf(weight[i]);
float max = fabsf(xb[0]);
for (int i = 1; i < IQ1S_BLOCK_SIZE; ++i) max = MAX(max, fabsf(xb[i]));
if (!max) {
@@ -11625,7 +11627,7 @@ static void quantize_row_iq1_s_impl(const float * restrict x, void * restrict vy
if (grid_index < 0) {
all_on_grid = false;
const uint16_t * neighbours = kneighbors_q2xs - kmap_q2xs[u] - 1;
- grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, weight + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
+ grid_index = iq1_find_best_neighbour2(neighbours, kgrid_q2xs, xb + 8*k, waux + 8*k, scale, xx, L + 8*k, NGRID_IQ1S);
GGML_ASSERT(grid_index >= 0);
}
index[k] = grid_index;
If you concur I will submit a PR. |
@CISC I specifically made the code to be the way it is because it does give a lower PPL for the 9 models I'm testing. I'm traveling for a few days without access to the computer where I keep my research notes. Let me get back first and see what difference each one of these makes (unless you want to run PPL for all 7 LLaMAs, Mistral, and Mixtral8x7) |
@ikawrakow Hello, I got the same error, "Aborted (core dumped)", when quantizing a DeepSeek Coder model to Q5_0. Did you manage to solve it? On another platform, though, the reason is that the CPU cannot compute with AVX2 and the like. |
@ikawrakow It's probably best that you run the tests to ensure all the variables are the same (and that I haven't made a mistake). I can wait. :) |
It looks like llama.cpp support for deepseek-coder is coming soon: #5981 |
@BrickBee Hello, yes, I ran into the same thing with the DeepSeek-coder model. Have you solved it? |
@hyperbolic-c Ah, I remember I also had to use |
@CISC Yes, but there is another, weird error, like the one @ikawrakow showed
|
@hyperbolic-c Did you try again after converting with the right tokenizer? It worked for me, and for @ikawrakow when using my converted GGUF. If it still doesn't work for you, perhaps you should open another issue? |
Here is a table that compares PPL between master and your proposed changes. To not complicate things, values are computed with the default
Based on this, I think we need more evidence that the proposed change is better. |
@CISC Thanks. It did not work for the DeepSeek-coder model. Maybe llama.cpp does not fully support DeepSeek models yet (see #5981). |
@ikawrakow Interesting, apart from LLaMA-v2-7B it wasn't much of a difference though. However the difference in actual output with my imatrix on the gorilla model is night and day, from gibberish to completely correct, so something is obviously going on. Given that it seems to be a matter of how much data has been used to generate the imatrix I'm inclined to believe the PPL degradation is coincidental (or rather that the previous PPL might have accidentally been better than it should), or of course that there's still something not quite right, even after my changes. :) Either way, I agree that it needs to be looked at more closely, but IQ1_S definitely does not work as intended as-is. |
@ikawrakow I don't know if you've had time to look at this or not, but I've been trying to determine if my changes have any real-world adverse impact with various models, but so far everything looks good. However it's difficult to determine exactly what kind of effect this would have and what to look for, so it's hard to get any definitive answer. I'm thinking it might make sense to create a draft PR and invite a few of the usual suspects who publish IQ1_S quants on HF? If nothing else, we might be able to start a discussion and organize some testing. |
Sure, submit a PR and have some other people test it. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Since #5971 I tried requantizing IQ1_S of this model, using the same imatrix as before, however, where the following worked as expected 75% of the time (and the rest of the time it just gave the wrong output):
The newly quantized version just outputs gibberish like this, every time:
Which seems like a pretty massive regression, any idea what's going on?