2-bit integer quantization #456

Closed
ggerganov opened this issue Mar 24, 2023 · 16 comments

@ggerganov
Owner

Add Q2_0 and Q2_1 quantization support to ggml:

  • Follow the existing Q4_0 and Q4_1 implementations
  • Implement reference scalar quantization and dequantization routines (see the sketch after the size estimates below)
  • I suspect we might have to use QK == 16 in this case to compensate for the additional accuracy loss
  • Add SIMD support for a specific architecture - investigate the best strategy to perform the ggml_vec_dot_q2() computation
  • No need to implement ggml_vec_mad_q2() - these will be deprecated soon
  • Compute perplexity scores

The expected model sizes for 7B and QK == 16 are:

  • Q2_0 - 3.2 GB

For QK == 32 we have:

  • Q2_0 - 2.4 GB
  • Q2_1 - 3.2 GB
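
For illustration, here is a minimal sketch of what the reference scalar quantization routine mentioned above could look like, assuming a block layout that mirrors block_q4_0 (one float scale plus the 2-bit quants packed four per byte) and QK == 16. The names block_q2_0, QK2_0 and quantize_row_q2_0_reference are placeholders for the sketch, not the actual implementation:

#include <math.h>
#include <stdint.h>

#define QK2_0 16

// hypothetical block layout, mirroring block_q4_0
typedef struct {
    float   d;              // shared scale of the block
    uint8_t qs[QK2_0 / 4];  // 2-bit quants, packed 4 per byte
} block_q2_0;

// reference scalar quantization of one row of k floats
static void quantize_row_q2_0_reference(const float * x, block_q2_0 * y, int k) {
    const int nb = k / QK2_0;

    for (int i = 0; i < nb; i++) {
        // the absolute maximum of the block determines the scale
        float amax = 0.0f;
        for (int l = 0; l < QK2_0; l++) {
            const float v = fabsf(x[i*QK2_0 + l]);
            if (v > amax) amax = v;
        }

        const float d  = amax / ((1 << 1) - 1); // following Q4_0: symmetric levels {-1, 0, +1}
        const float id = d ? 1.0f/d : 0.0f;

        y[i].d = d;

        for (int l = 0; l < QK2_0; l += 4) {
            uint8_t b = 0;
            for (int j = 0; j < 4; j++) {
                const int q = (int)roundf(x[i*QK2_0 + l + j]*id); // -1, 0 or +1
                b |= (uint8_t)((q + 2) & 0x3) << (2*j);           // stored as 1, 2 or 3
            }
            y[i].qs[l/4] = b;
        }
    }
}

With this layout a block costs 4 + QK/4 bytes, i.e. 4 bits per weight at QK == 16 and 3 bits per weight at QK == 32, which lines up with the 3.2 GB and 2.4 GB estimates above.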

Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The efforts needed to add this support are so small that there is no reason not to do it.

@ggerganov added the enhancement (New feature or request) and research 🔬 labels on Mar 24, 2023
@dakennedyd
Contributor

No 3-bit support?

@ggerganov
Owner Author

> No 3-bit support?

I don't think I can implement it efficiently, but if anyone wants to give it a try - sure.

@Green-Sky
Collaborator

65B using 32 GB of RAM, anyone? 😆

@prusnak
Collaborator

prusnak commented Mar 24, 2023

I came up with a script that can compute the RMS for various quantization methods - maybe it will come in handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072

@sw
Contributor

sw commented Mar 25, 2023

Go home Q2, you're drunk ;-)

$ ./main -m ./models/7B/ggml-model-q2_0.bin -p "The efforts needed to add this support are so small that there is no reason not to do it." -n 64 -s 1679735763

The efforts needed to add this support are so small that there is no reason not to do it.
The efforts that we need the work to make sure that we can be sure that everything falls together with no additional and very little is reserved for a little or 1, or even less or 13 is 13, that in additionally or 1 month faster is 18 and or even faster

This is cherry-picked; often it just starts babbling numbers right away.

Q3 seems decent:

$ ./main -m ./models/7B/ggml-model-q3_0.bin -p "Building a website can be done in 10 simple steps:" -n 128 -s 1679739910

Building a website can be done in 10 simple steps:
Decide which web authoring software you're going to use.
Read up on what you need for the site you're building. Note that I am only referring to reading material on the web here; reading will build your knowledge without spending money on a book (or e-book). I would suggest looking into JavaScript, HTML5 and CSS3 before you launch into development of any kind. You can always test the waters of what you're working with against an online validator before you launch into production mode -- or you could just skip that part altogether until you get frustrated with having to use a browser

Both are very slow because I haven't found a good way to use AVX2 yet. Perplexity would probably take days if not weeks.

I used a float for the scale in Q2 and FP16 in Q3, so the model files are actually the same size:

$ ls -gho models/7B/*q*
-rw-rw-r-- 1 3.2G Mär 25 10:43 models/7B/ggml-model-q2_0.bin
-rw-rw-r-- 1 3.2G Mär 25 10:45 models/7B/ggml-model-q3_0.bin
-rw-rw-r-- 1 4.0G Mär 24 11:52 models/7B/ggml-model-q4_0.bin
-rw-rw-r-- 1 4.8G Mär 22 13:08 models/7B/ggml-model-q4_1.bin

For Q2 I deviated slightly from the standard calculation of the scaling factors. If you want a zero value and symmetry between the positive and negative range, that leaves only 3 values (-1, 0, +1). Instead, I calculate the signed maximum (the value of largest magnitude, without applying fabsf) and assign the value -2 to that maximum. The sign of the shared scaling factor is adjusted so that the result gets the right sign. Without this modification, I couldn't get Q2 to output any semblance of English.
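
A minimal sketch of that scaling trick, with x standing for one block of QK floats and purely illustrative variable names (the linked branch below has the actual code):

// signed maximum: the value of largest magnitude, keeping its sign
float max = 0.0f;
for (int l = 0; l < QK; l++) {
    if (fabsf(x[l]) > fabsf(max)) {
        max = x[l];
    }
}

// map the 2-bit level -2 onto that value; the sign of the shared scale d
// compensates, so the dequantized result keeps the right sign
const float d  = max / -2.0f;
const float id = d ? 1.0f/d : 0.0f;

// everything else is rounded and clamped into the asymmetric [-2, 1] range
for (int l = 0; l < QK; l++) {
    int q = (int)roundf(x[l]*id);
    if (q < -2) q = -2;
    if (q >  1) q =  1;
    // ... pack (q + 2) into two bits, as in Q2_0 ...
}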

Code here: https://github.com/sw/llama.cpp/tree/q2q3

@sw
Contributor

sw commented Mar 27, 2023

Updated my branch with AVX optimizations, probably far from perfect.

Still quite slow...
Q2:

98.37 seconds per pass - ETA 17.90 hours
[1]147.6625,[2]136.8862,[3]132.6015,[4]127.8629,[5]120.4091,[6]111.7640,[7]114.2548,[8]112.8951,

Q3:

203.61 seconds per pass - ETA 37.05 hours
[1]7.0481,[2]8.0335,[3]8.8317,[4]10.0700,[5]10.1138,[6]9.9850,[7]10.2314,[8]10.2057,

@CamiloMM

Not nearly enough, we need support for 1-bit signed floats.

@Interpause

> Not nearly enough, we need support for 1-bit signed floats.

Swap that out for 1 qubit and now we're talking.

@prusnak
Collaborator

prusnak commented Apr 2, 2023

> Not nearly enough, we need support for 1-bit signed floats.

I think the best model size and performance will be achieved when 0-bit quantization is used.

@Lolagatorade

> Not nearly enough, we need support for 1-bit signed floats.

> I think the best model size and performance will be achieved when 0-bit quantization is used.

Mhmm possibly -1...

@pubby

pubby commented Apr 14, 2023

I've been testing Q3_0 and found that performance improved by representing the data like this:

typedef struct {
    ggml_fp16_t d;
    uint16_t hi; // Highest bit, packed.
    uint32_t lo; // Lowest 2 bits, packed.
} block_q3_0;

Basically lo is the same format as Q2_0. The remaining bits (the highest ones) get packed into hi. The dot implementation is basically the Q2_0 one, except it uses a lookup table to handle hi. Because the code is so similar, improvements to the Q2_0 dot code can be ported over.
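
To make the packing concrete, here is a small sketch of how one 3-bit quant could be split across the two fields and reassembled. The helper names are made up, and it assumes 16 values per block so that hi and lo are exactly filled; as noted above, the actual dot code handles hi via a lookup table rather than unpacking values one by one:

// q is a raw 3-bit quant (0..7), l its position within the block (0..15)
static inline void pack_q3(block_q3_0 * b, int l, uint8_t q) {
    b->lo |= (uint32_t)(q & 0x3)        << (2*l); // lowest 2 bits, Q2_0-style
    b->hi |= (uint16_t)((q >> 2) & 0x1) << l;     // highest bit, packed
}

static inline uint8_t unpack_q3(const block_q3_0 * b, int l) {
    const uint8_t lo = (uint8_t)((b->lo >> (2*l)) & 0x3);
    const uint8_t hi = (uint8_t)((b->hi >>    l ) & 0x1);
    return (uint8_t)((hi << 2) | lo);
}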

Measured times:

Q3_0: 71.00 seconds per pass - ETA 12.92 hours
Q2_0: 52.62 seconds per pass - ETA 9.57 hours
Q4_0: 29.60 seconds per pass - ETA 5.39 hours

For reference, @sw's original version gives:

sw_Q3_0: 96.34 seconds per pass - ETA 17.53 hours

I also briefly tested Q3_0 with twice the QK. The code is not working correctly, but the operations are there. The runtime is:

46.22 seconds per pass - ETA 8.41 hours

I'm wondering if I should keep working on this and make a pull request.

@ggerganov
Owner Author

@pubby
These results are definitely of interest, especially with the recent insights about quantization (#835, #896, #909, etc.) and the upcoming 8-bit quantization of intermediate results (#951). I expect the quality of low-bit quantization to improve to a usable level, so the remaining question is whether we can evaluate it efficiently.

I haven't looked at the proposed 2-bit quantizations yet, but I am fairly confident that with ARM NEON we can have a Q2_0 x Q8_0 dot product that is as fast as the existing Q4_0 x Q4_0 and the upcoming Q4_0 x Q8_0. I expect the same holds for AVX.

For Q3 I am not sure yet, but it would be great if we can find a way to do the Q3 x Q8 dot product fast.

Regarding the quantization routines for Q2 and Q3: these can remain reference implementations, i.e. there is no need to SIMD-ify them. With #951 we will only be quantizing to 8 bits during computation, so the 2-bit and 3-bit quantization will be used only when quantizing the model file, and we can afford for it to be slow.
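
As a point of reference, a scalar (non-SIMD) baseline of such a Q2_0 x Q8_0 dot product could look roughly like the sketch below. It reuses the hypothetical block_q2_0 layout and QK2_0 constant sketched earlier in the thread and assumes a Q8_0 block of one float scale plus QK int8 quants; the names are illustrative, not the actual ggml API:

// assumed 8-bit block layout for the intermediate results
typedef struct {
    float  d;            // scale
    int8_t qs[QK2_0];    // 8-bit quants
} block_q8_0;

static void ggml_vec_dot_q2_0_q8_0_ref(const int n, float * s,
        const block_q2_0 * x, const block_q8_0 * y) {
    const int nb = n / QK2_0;

    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        int sumi = 0;
        for (int l = 0; l < QK2_0; l++) {
            // unpack the 2-bit quant and shift it back to the signed range
            const int q2 = (int)((x[i].qs[l/4] >> (2*(l % 4))) & 0x3) - 2;
            sumi += q2 * y[i].qs[l];
        }
        sumf += x[i].d * y[i].d * (float)sumi;
    }
    *s = sumf;
}

A NEON or AVX version would unpack a whole block of 2-bit quants at once with shifts and masks before the multiply-accumulate, which is why it is plausible that a Q2 x Q8 product can match the Q4 x Q8 throughput.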

@ggerganov linked a pull request on Apr 16, 2023 that will close this issue
@ggerganov
Owner Author

Thanks to k-quants, this is now available.

@MrMage

MrMage commented Jun 26, 2023

Have there been any new insights into the quality of 2-bit quantization? I.e. does that approach produce reasonable results now?

@Green-Sky
Collaborator

Green-Sky commented Jun 26, 2023

@MrMage Pure Q2 will never be good, but the k-quants use a mixture that includes some Q2 to achieve reasonable results. Check out how LLAMA_FTYPE_MOSTLY_Q2_K is composed in #1684.
