2-bit integer quantization #456
Comments
No 3-bit support?

I don't think I can implement it efficiently, but if anyone wants to give it a try - sure

65B using 32gig ram anyone? 😆
I came up with a script that's able to compute RMS for various quantization methods - maybe it will come in handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072
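The gist is a standalone script; purely as an illustration of the idea (this is not the gist's actual code, and the block size and level range are assumptions), a minimal C sketch could round-trip each block through a candidate 2-bit scheme and report the RMS error against the original values:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define QK 16   // assumed block size

// Round-trip a buffer through a toy symmetric 2-bit scheme (levels -2..1,
// absolute-max scaling) and return the root-mean-square error vs. the input.
// n must be a multiple of QK.
static float rms_error_q2(const float *x, int n) {
    double err = 0.0;
    for (int b = 0; b < n; b += QK) {
        float amax = 0.0f;
        for (int j = 0; j < QK; j++) {
            const float v = fabsf(x[b + j]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 2.0f;               // scale
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        for (int j = 0; j < QK; j++) {
            int q = (int)roundf(x[b + j] * id);     // quantize
            if (q < -2) q = -2;
            if (q >  1) q =  1;
            const float r = q * d;                  // dequantize
            err += (x[b + j] - r) * (x[b + j] - r);
        }
    }
    return (float)sqrt(err / n);
}

int main(void) {
    enum { N = 1 << 14 };
    static float x[N];
    for (int i = 0; i < N; i++) x[i] = (float)rand() / RAND_MAX - 0.5f;
    printf("toy Q2 RMS error: %g\n", rms_error_q2(x, N));
    return 0;
}
```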
Go home Q2, you're drunk ;-)
This is cherry-picked, often it goes to babbling numbers right away. Q3 seems decent:
Both are very slow because I haven't found a good way to use AVX2 yet. Perplexity would probably take days if not weeks. I used float for the scale in Q2 and FP16 in Q3, so the model files actually are the same size:
For Q2 I deviated slightly from the standard calculation of the factors. If you want to have a zero value and symmetry in positive and negative range, that would have left only 3 values (-1 0 +1). Instead, I calculate the signed maximum (= value of largest magnitude, without applying the absolute value). Code here: https://github.com/sw/llama.cpp/tree/q2q3
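A minimal sketch of that signed-maximum idea (the block layout and QK = 16 are assumptions here; the actual code is in the branch linked above): the scale comes from the element of largest magnitude with its sign preserved, divided by -2, so that element lands exactly on the lowest level and all four 2-bit levels stay usable - the same trick ggml's Q4_0 uses with -8:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK2 16   // assumed block size

// Hypothetical Q2_0 block: one float scale + QK2 2-bit quants packed 4-per-byte.
typedef struct {
    float   d;
    uint8_t qs[QK2 / 4];
} block_q2_0;

// Quantize one block of QK2 floats using the signed maximum for the scale.
static void quantize_block_q2_0(const float *x, block_q2_0 *y) {
    float max = 0.0f;                       // signed value of largest magnitude
    for (int i = 0; i < QK2; i++) {
        if (fabsf(x[i]) > fabsf(max)) max = x[i];
    }
    const float d  = max / -2.0f;           // largest-magnitude value maps to -2
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    y->d = d;
    memset(y->qs, 0, sizeof(y->qs));
    for (int i = 0; i < QK2; i++) {
        int q = (int)roundf(x[i] * id);     // in -2..1
        if (q < -2) q = -2;
        if (q >  1) q =  1;
        y->qs[i / 4] |= (uint8_t)((q + 2) << ((i % 4) * 2));   // store as 0..3
    }
}
```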
Updated my branch with AVX optimizations, probably far from perfect. Still quite slow...
Q3:
Not nearly enough, we need support for 1-bit signed floats.

Swap that out for 1 qubit and now we're talking.

I think the best model size and performance will be achieved when 0-bit quantization is used.

Mhmm possibly -1...
I've been testing Q3_0 and found the performance was improved by representing data like this:
Measured times:
For reference, @sw's original version gives:
I also briefly tested Q3_0 with twice the QK. The code is not working correctly, but the operations are there. The runtime is:
I'm wondering if I should keep working on this and make a pull request.
@pubby Haven't looked at the proposed 2-bit quantizations, but I am fairly confident that with ARM NEON we can have an efficient implementation. For Q3 I am not sure yet, but it will be great if we find a way to do the computation efficiently.

Regarding the quantization routines for Q2 and Q3 - these can remain just reference implementations, i.e. no need to SIMD-ify, because with #951 we will be quantizing only towards 8-bits during the computation. Therefore, the 2-bit and 3-bit quantization will be used only during model file quantization, so we can afford it to be slow.
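To make that concrete, here is a rough scalar sketch (hypothetical block layouts, not the actual ggml code) of what that style of computation looks like for a 2-bit weight format: the activations are quantized to 8 bits per block, the inner loop is integer-only multiply-accumulate, and the float scales are applied once per block, so the 2-bit quantization routine itself never sits on the hot path:

```c
#include <stdint.h>

#define QK 16   // assumed block size

typedef struct {            // hypothetical 2-bit weight block
    float   d;              // scale
    uint8_t qs[QK / 4];     // QK 2-bit quants, stored as 0..3 (= levels -2..1)
} block_q2_0;

typedef struct {            // hypothetical 8-bit activation block
    float  d;               // scale
    int8_t qs[QK];          // QK 8-bit quants
} block_q8_0;

// Reference dot product over n values (n a multiple of QK).
static float vec_dot_q2_q8(int n, const block_q2_0 *w, const block_q8_0 *a) {
    float sum = 0.0f;
    for (int b = 0; b < n / QK; b++) {
        int isum = 0;
        for (int i = 0; i < QK; i++) {
            const int wq = ((w[b].qs[i / 4] >> ((i % 4) * 2)) & 3) - 2;  // -2..1
            isum += wq * a[b].qs[i];
        }
        sum += w[b].d * a[b].d * (float)isum;
    }
    return sum;
}
```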
Thanks to K-quants this is now available.
Have there been any new insights into the quality of 2-bit quantization? I.e. does that approach produce reasonable results now? |
Add Q2_0 and Q2_1 quantization support to ggml:

- Follow the existing Q4_0 and Q4_1 implementations
- We will likely have to use QK == 16 in this case to compensate for further accuracy losses
- Investigate the best strategy to perform the ggml_vec_dot_q2() computation
- No need to implement ggml_vec_mad_q2() - these will be deprecated soon

The expected model sizes for 7B and QK == 16 are:

- Q2_0 - 3.2 GB

For QK == 32 we have:

- Q2_0 - 2.4 GB
- Q2_1 - 3.2 GB

Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The effort needed to add this support is so small that there is no reason not to do it.
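For context, a sketch of what such blocks might look like (hypothetical layouts, modeled loosely on the existing block_q4_0 / block_q4_1), together with the rough bits-per-weight arithmetic behind the sizes above:

```c
#include <stdint.h>

#define QK 16   // assumed block size

typedef struct {
    float   d;              // scale
    uint8_t qs[QK / 4];     // QK 2-bit quants
} block_q2_0;               // 4 + 4 = 8 bytes per 16 weights -> 4.0 bits/weight

typedef struct {
    float   d;              // scale
    float   m;              // minimum
    uint8_t qs[QK / 4];
} block_q2_1;               // 4 + 4 + 4 = 12 bytes per 16 weights -> 6.0 bits/weight

// Rough size arithmetic for the ~6.7B-parameter 7B model:
//   QK == 16, Q2_0: 4.0 bits/weight            -> ~6.7e9 * 0.5 B ~= 3.2 GB
//   QK == 32, Q2_0: (4 + 8) B / 32 weights     = 3.0 bits/weight -> ~2.4 GB
//   QK == 32, Q2_1: (4 + 4 + 8) B / 32 weights = 4.0 bits/weight -> ~3.2 GB
```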