-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reference implementation for q4k
and q5k
#586
Conversation
Thanks, would probably be good to understand why the text generation does not work. |
I believe my vec-dot implementations are funky, but i don't know why as they pass ggml's unit tests and the random tensor mat-muls seam reasonable. The problem im facing is that the llama 2 k-quant models contain multiple different quant types per file meaning i can't test them in isolation. Here are some results from TheBlokes llama-2 models:
From the test i would conclude that i probably got something wrong in the |
I looked a bit at this and here is a patch that seems to fix things for Let me know if you have the time to do this on your patch or if you're already too much in armored core and if the latter I'll take a stab at fixing these and merge. edit: I've merged #607 that convert most existing transmutes to byteorder versions so might give some good examples. Haven't done anything on the ones this PR introduces. diff --git a/candle-core/src/quantized/k_quants.rs b/candle-core/src/quantized/k_quants.rs
index c5ce97f..3852b59 100644
--- a/candle-core/src/quantized/k_quants.rs
+++ b/candle-core/src/quantized/k_quants.rs
@@ -4,6 +4,7 @@ use super::utils::{
};
use super::GgmlDType;
use crate::Result;
+use byteorder::{ByteOrder, LittleEndian};
use half::f16;
use rayon::prelude::*;
@@ -926,11 +927,7 @@ impl GgmlType for BlockQ4K {
q4 = &q4[32..];
}
- let utmp_raw = unsafe {
- std::mem::transmute::<&mut [u8; 12], &mut [u32; 3]>(&mut x.scales.clone())
- };
-
- utmp[0..3].copy_from_slice(utmp_raw);
+ LittleEndian::read_u32_into(&x.scales, &mut utmp[0..3]);
utmp[3] = ((utmp[2] >> 4) & KMASK2) | (((utmp[1] >> 6) & KMASK3) << 4);
let uaux = utmp[1] & KMASK1; |
This change seems quite sensible. I initially used transmute mainly due to my limited familiarity with Rust 😅. Feel free to commit the patch directly to this PR since you should have write access. Alternatively, I can replace the transmutes on my end and test it with my collection of GGML models tomorrow. Just let me know what works best for you. |
Alright i replaced the |
Amazing, I'll have a quick look at the PR later today and hopefully merge this, I'll also try it vs llama.cpp just to check that everything is more or less in sync (from my experience so far with q4_0/q8_0, it's likely to be a bug on our side when it's not the case :) ). |
Merged, thanks a lot for all the hard work! |
Also just merged #609 that gets rid of the remaining transmute and of using intermediary values that you may find helpful. My feeling after this is that the main drawback of |
With this k-quant models should be able to be executed, i tried to run some
q4_k_M
models but the results were gibberish. I then extended the unit tests to check against the results produced by the ggml matmul unit tests but everything seams fine, although the current implementation is off by one at the sixth or seventh decimal place, which probably shouldn't influence the model results.