Better 1.5 bit quantization #5971

Merged: 15 commits, Mar 11, 2024

ikawrakow (Contributor) commented Mar 10, 2024

While waiting for the 1.58 bit era to become reality, I decided to see whether the current 1.5 bit quantization in llama.cpp can be improved. The answer is yes, and this PR makes the change to IQ1_S. It is a breaking change, but I feel this is OK because I don't expect too many 1.5 bpw quants floating around the Internet.

The table shows a PPL comparison between IQ1_S on master and this PR. Context is 2048 tokens for LLaMA-v1 and 4096 for all other models. The last column shows the rms_norm_epsilon used to generate the PR results.

| Model | PPL (master) | PPL (PR) | rms_norm_epsilon |
|---|---|---|---|
| LLaMA-v1-7B | 17.93 | 14.20 | 5e-5 |
| LLaMA-v1-13B | 9.619 | 8.941 | 4e-5 |
| LLaMA-v1-30B | 7.564 | 6.999 | 2.5e-5 |
| LLaMA-v2-7B | 15.33 | 13.51 | 1.875e-5 |
| LLaMA-v2-13B | 9.167 | 8.134 | 2e-5 |
| LLaMA-v2-70B | 5.675 | 5.343 | 3e-5 |
| Mistral-7B | 10.80 | 11.21 | default |
| Mixtral8x7B | 7.261 | 6.354 | default |

Apart from Mistral-7B, all results are significantly better. My guess is that we simply got very lucky with Mistral-7B with the previous quantization, considering the unexpectedly large difference between LLaMA-v2-7B and Mistral-7B there.

The new quantization is the exact same size as the one on master, and uses 1.5 bpw (excluding the super-block scale). In the original version these 1.5 bits are spent on a group-of-8 codebook with 512 entries (9 bits), and a 3-bit scale per 8 weights. In the new quantization there are 2048 entries in the codebook (11 bits), along with one 4-bit scale per 32 weights.
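
The bit budget described above can be checked with a little arithmetic. The following is a minimal sketch, not code from the PR: the helper names are hypothetical, and it assumes the new codebook still indexes groups of 8 weights (the description states the group size only for the original version, but 8 is the value that makes the total come out to 1.5 bpw).

```c
/* Bits per weight for a scheme that spends `index_bits` on one codebook index
 * per group of `group_size` weights, plus `scale_bits` on one scale per
 * `block_size` weights. The super-block scale is excluded, as in the PR text. */
double bits_per_weight(int index_bits, int group_size, int scale_bits, int block_size) {
    return (double)index_bits / group_size + (double)scale_bits / block_size;
}

/* master: 512-entry codebook (9 bits) per 8 weights, 3-bit scale per 8 weights */
double iq1s_bpw_master(void) {
    return bits_per_weight(9, 8, 3, 8);
}

/* this PR: 2048-entry codebook (11 bits), 4-bit scale per 32 weights.
 * Assumption: the codebook index still covers a group of 8 weights. */
double iq1s_bpw_pr(void) {
    return bits_per_weight(11, 8, 4, 32);
}

/* Hypothetical variant: 4096-entry codebook (12 bits) and no per-block scale */
double iq1s_bpw_4096_no_blocks(void) {
    return bits_per_weight(12, 8, 0, 32);
}
```

Both `iq1s_bpw_master()` and `iq1s_bpw_pr()` evaluate to 1.5: 9/8 + 3/8 on master and 11/8 + 4/32 in this PR. Under the same assumption, a 4096-entry codebook (12 bits per 8 weights) would use the full 1.5 bpw by itself, leaving no room for per-block scales.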

ikawrakow added the "breaking change" label (changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) on Mar 10, 2024
Green-Sky (Collaborator) commented Mar 10, 2024

> It is a breaking change, but I feel this is OK because I don't expect too many 1.5 bpw quants floating around the Internet.

While this is probably true, we still want it to break with a readable message to the user.
I think something like general.quantization_version was supposed to be incremented in situations like this. It has been a while, so I don't remember exactly.
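
To illustrate what "break with a readable message" could look like, here is a minimal sketch of a version check. Everything in it is hypothetical: the function name, its signature, and the wording of the message are assumptions for illustration, not llama.cpp code, and the thread does not confirm that a version bump was made.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: compares the quantization version stored in a model
 * file with the version this build supports. Returns 0 when they match and
 * leaves `msg` empty; returns -1 on a mismatch and writes a readable
 * explanation into `msg` (at most `msg_size` bytes, always NUL-terminated). */
int check_quantization_version(uint32_t file_version, uint32_t supported_version,
                               char *msg, size_t msg_size) {
    if (msg_size > 0) {
        msg[0] = '\0';
    }
    if (file_version == supported_version) {
        return 0;
    }
    snprintf(msg, msg_size,
             "unsupported general.quantization_version %u (this build expects %u); "
             "please re-quantize the model",
             (unsigned)file_version, (unsigned)supported_version);
    return -1;
}
```

A loader could call this once after reading the metadata and print `msg` before aborting, instead of failing later with garbage output from an incompatible quant layout.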

ikawrakow (Contributor, Author)

@ggerganov I see that with the introduction of ggml-common.h the i-quant data has been declared __constant__ on CUDA. Is this based on a thorough comparison versus the original (which does not use __constant__)? On my GPU (RTX-4080) using __constant__ leads to disastrous performance. E.g., for the new IQ1_S, I get 17 t/s with __constant__ vs 204 t/s without. There is only so much constant memory in a GPU and, when the data does not fit, what happens is up to the gods. I have therefore changed the definition of GGML_TABLE_BEGIN to not use __constant__ on CUDA.

ggerganov (Owner)

It's a mistake on my side: I wasn't aware that this can lead to such drastic changes in performance. Thanks for fixing it.

ggml-common.h Outdated
// So, I'm not sure if there are GPU's out there that like having the i-quant data in
// constant memory. Mine (RTX-4080) definitely does not like it.
//#define GGML_TABLE_BEGIN(type, name, size) static const __device__ __constant__ type name[size] = {
#define GGML_TABLE_BEGIN(type, name, size) static const __device__ type name[size] = {
ggerganov (Owner)

I just tested on RTX 3090, RTX 4090 and A100, and on all of them it is significantly faster without the __constant__ specifier, so this is not specific to the RTX 4080.
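
For readers unfamiliar with the table macros, here is a minimal host-side sketch of how such a macro pair lets the storage qualifier be changed in one place. The `GGML_TABLE_END` definition, the plain-C form of `GGML_TABLE_BEGIN`, and the `demo_grid` table with its accessors are assumptions made for illustration; only the two CUDA variants quoted in the diff above come from the PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain-C stand-in (assumption). The CUDA variant in the diff puts
 * `__device__` between `static const` and `type`; the commented-out
 * variant additionally used `__constant__`. */
#define GGML_TABLE_BEGIN(type, name, size) static const type name[size] = {
#define GGML_TABLE_END() };

/* Hypothetical 4-entry lookup table defined through the macros */
GGML_TABLE_BEGIN(uint8_t, demo_grid, 4)
    0x00, 0x01, 0x10, 0x11,
GGML_TABLE_END()

/* Size of the table in bytes, i.e. what would have to fit in the chosen memory space */
size_t demo_grid_bytes(void) {
    return sizeof(demo_grid);
}

/* Read one entry of the table */
uint8_t demo_grid_at(size_t i) {
    return demo_grid[i];
}
```

Because every table goes through the same `GGML_TABLE_BEGIN` definition, switching that one macro between the `__device__ __constant__` and plain `__device__` forms changes where all such tables are placed on CUDA, which is the change made in this PR.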

Artefact2 (Collaborator) commented Mar 10, 2024

What is the consensus for picking rms_norm_epsilon? Brute-force trial and error?

ikawrakow merged commit be858f6 into master on Mar 11, 2024 (50 of 63 checks passed)
ikawrakow deleted the ik/iq1s_blocks16 branch on March 11, 2024 06:51
ikawrakow (Contributor, Author)

The SYCL code needs to be adjusted to the new quants. As I don't have the ability to test it, I have not done that, which causes the SYCL tests to fail.

NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* Trying blocks of 16 for IQ1_S - seems slightly better

* iq1s_blocks16: Adjust scale fudge factor to 1.125

* iq1s_blocks16: going to blocks of 32

with 2048 lattice points, so same bpw.
This is even better than blocks of 16.
Should I try blocks of 64? But to keep the same
bpw, when I go to 4096 lattice points, I need to
remove blocks altogether and just have superblocks of
256 weights.

* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

* iq1s_blocks16: scalar and AVX2 dot products

* iq1s_blocks16: CUDA dot product

* iq1s_blocks16: Metal works, Neon does not

Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s).
Not seeing the bug in the Neon implementation for now.

* iq1s_blocks16: fixed Neon

* iq1s_blocks16: very slightly faster TG on Metal

Still pathetic at 37 t/s

* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

* Formatting

* iq1s_blocks16: uint32_t codebook is also better in CUDA

TG-128 is now 204 t/s up from 194 t/s.
PP-512 is 5890 t/s, so significantly better than other quants

* iq1s_blocks16: slightly faster Neon dot product

* iq1s_blocks16: faster AVX2 dot product

* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024