Better 1.5 bit quantization #5971

ikawrakow · 2024-03-10T08:23:52Z

While waiting for the 1.58 bit era to become reality, I decided to see if the current 1.5-bit quantization in llama.cpp can be improved. The answer is yes, and this PR makes the change to IQ1_S. It is a breaking change, but I feel this is OK because I don't expect too many 1.5 bpw quants floating around the Internet.

The table shows a PPL comparison between IQ1_S on master and this PR. Context is 2048 tokens for LLaMA-v1 and 4096 for all other models. The last column shows the rms_norm_epsilon used to generate the PR results.

Model	PPL (master)	PPL (PR)	rms_norm_epsilon
LLaMA-v1-7B	17.93	14.20	5e-5
LLaMA-v1-13B	9.619	8.941	4e-5
LLaMA-v1-30B	7.564	6.999	2.5e-5
LLaMA-v2-7B	15.33	13.51	1.875e-5
LLaMA-v2-13B	9.167	8.134	2e-5
LLaMA-v2-70B	5.675	5.343	3e-5
Mistral-7B	10.80	11.21	default
Mixtral8x7B	7.261	6.354	default

Apart from Mistral-7B, all results are significantly better. My guess is that we simply got very lucky with Mistral-7B with the previous quantization, considering the unexpectedly large difference between LLaMA-v2-7B and Mistral-7B there.
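
To put numbers on "significantly better", the relative change in perplexity can be computed directly from the table above. This is only a convenience sketch: the values are copied from the table, and the helper name `relative_change` is made up for illustration and is not part of the PR.

```python
# PPL values copied from the table above: model -> (master, PR).
PPL = {
    "LLaMA-v1-7B": (17.93, 14.20),
    "LLaMA-v1-13B": (9.619, 8.941),
    "LLaMA-v1-30B": (7.564, 6.999),
    "LLaMA-v2-7B": (15.33, 13.51),
    "LLaMA-v2-13B": (9.167, 8.134),
    "LLaMA-v2-70B": (5.675, 5.343),
    "Mistral-7B": (10.80, 11.21),
    "Mixtral8x7B": (7.261, 6.354),
}


def relative_change(master: float, pr: float) -> float:
    """Relative PPL change in percent; a negative value means the PR is better."""
    return 100.0 * (pr - master) / master


# Hypothetical helper output: percentage change per model.
changes = {model: relative_change(m, p) for model, (m, p) in PPL.items()}

if __name__ == "__main__":
    for model, pct in changes.items():
        print(f"{model:13s} {pct:+6.1f}%")
```

Lower perplexity is better, so every model except Mistral-7B should come out with a negative change.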

The new quantization is the exact same size as the one on master, and uses 1.5 bpw (excluding the super-block scale). In the original version these 1.5 bits are spent on a group-of-8 codebook with 512 entries (9 bits), and a 3-bit scale per 8 weights. In the new quantization there are 2048 entries in the codebook (11 bits), along with one 4-bit scale per 32 weights.
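
The bit budget described above can be checked with a few lines of arithmetic. This is a sketch, not code from the PR: the function name and parameters are invented for illustration, and it assumes the new 2048-entry codebook still indexes groups of 8 weights, which is what makes the stated 1.5 bpw add up.

```python
from fractions import Fraction


def bits_per_weight(codebook_entries: int, group_size: int,
                    scale_bits: int, weights_per_scale: int) -> Fraction:
    """bpw = (index bits per group / group size) + (scale bits / weights per scale).

    The super-block scale is left out, as in the description above.
    """
    index_bits = (codebook_entries - 1).bit_length()  # 512 -> 9, 2048 -> 11
    return Fraction(index_bits, group_size) + Fraction(scale_bits, weights_per_scale)


# Original IQ1_S: 512-entry codebook per 8 weights, 3-bit scale per 8 weights.
old_bpw = bits_per_weight(512, 8, 3, 8)

# New IQ1_S: 2048-entry codebook (assumed per 8 weights), 4-bit scale per 32 weights.
new_bpw = bits_per_weight(2048, 8, 4, 32)

if __name__ == "__main__":
    print(float(old_bpw), float(new_bpw))
```

Both layouts spend 12 bits on every 8 weights. The new one moves two of those bits from the scale to the codebook index.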

Green-Sky · 2024-03-10T09:25:52Z

> It is a breaking change, but I feel this is OK because I don't expect too many 1.5 bpw quants floating around the Internet.

While this is probably true, we still want it to break with a readable message to the user.
I think something like general.quantization_version was supposed to be incremented in situations like this. It has been a while, so I don't remember for sure.
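
As a rough illustration of "break with a readable message", a loader could compare a version number stored in the model metadata against the one it supports. Everything below is hypothetical: the function, the exception class, and the version value are invented for this sketch and are not llama.cpp's actual API. Only the key name `general.quantization_version` comes from the comment above.

```python
# Hypothetical value, for illustration only; not taken from llama.cpp.
SUPPORTED_QUANTIZATION_VERSION = 2


class IncompatibleQuantizationError(RuntimeError):
    """Raised when a model file uses a quantization format this build cannot read."""


def check_quantization_version(metadata: dict,
                               supported: int = SUPPORTED_QUANTIZATION_VERSION) -> None:
    """Fail early, with a readable message, if the stored version does not match."""
    found = metadata.get("general.quantization_version")
    if found is None:
        raise IncompatibleQuantizationError(
            "model metadata has no general.quantization_version; "
            "cannot verify the quantization format"
        )
    if found != supported:
        raise IncompatibleQuantizationError(
            f"model uses quantization version {found}, but this build expects "
            f"{supported}; please re-quantize the model"
        )


if __name__ == "__main__":
    check_quantization_version({"general.quantization_version": 2})  # passes silently
```

The point of the sketch is that a mismatch is reported before any tensor data is decoded, so the user sees an explanation instead of garbage output.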

ikawrakow · 2024-03-10T09:44:28Z

@ggerganov I see that with the introduction of ggml-common.h the i-quant data has been declared __constant__ on CUDA. Is this based on a thorough comparison versus the original (which does not use __constant__)? On my GPU (RTX-4080) using __constant__ leads to disastrous performance. E.g., for the new IQ1_S, I get 17 t/s with __constant__ vs 204 t/s without. There is only so much constant memory in a GPU, and, when the data does not fit, what happens is up to the gods. I have therefore changed the definition of GGML_TABLE_BEGIN to not use __constant__ on CUDA.

ggerganov · 2024-03-10T09:47:17Z

It's a mistake on my side - I wasn't aware that this can lead to such drastic changes in performance. Thanks for fixing it.

ggerganov · 2024-03-10T10:16:17Z

ggml-common.h

+// So, I'm not sure if there are GPU's out there that like having the i-quant data in
+// constant memory. Mine (RTX-4080) definitely does not like it.
+//#define GGML_TABLE_BEGIN(type, name, size) static const __device__ __constant__ type name[size] = {
+#define GGML_TABLE_BEGIN(type, name, size) static const __device__ type name[size] = {


I just ran tests on RTX 3090, RTX 4090, and A100, and on all of them it is significantly faster without the __constant__ specifier, so this is not specific to the RTX 4080.

Artefact2 · 2024-03-10T12:47:07Z

What is the consensus for picking rms_norm_epsilon? Brute-force trial and error?

ikawrakow · 2024-03-11T06:53:17Z

The SYCL code needs to be adjusted to the new quants. As I have no way to test it, I have not done that, which causes the SYCL tests to fail.

* Trying blocks of 16 for IQ1_S - seems slightly better
* iq1s_blocks16: Adjust scale fudge factor to 1.125
* iq1s_blocks16: going to blocks of 32
  with 2048 lattice points, so same bpw. This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks altogether and just have superblocks of 256 weights.
* iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment
* iq1s_blocks16: scalar and AVX2 dot products
* iq1s_blocks16: CUDA dot product
* iq1s_blocks16: Metal works, Neon does not
  Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now.
* iq1s_blocks16: fixed Neon
* iq1s_blocks16: very slightly faster TG on Metal
  Still pathetic at 37 t/s
* iq1s_blocks16: speedup Metal by packing codebook into uint32_t's
* Formatting
* iq1s_blocks16: uint32_t codebook is also better in CUDA
  TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants
* iq1s_blocks16: slightly faster Neon dot product
* iq1s_blocks16: faster AVX2 dot product
* iq1s_blocks16: adjust to ggml-common.h

---------

Co-authored-by: Iwan Kawrakow <[email protected]>

ikawrakow added the "breaking change" label (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) Mar 10, 2024

ikawrakow force-pushed the ik/iq1s_blocks16 branch from 86bdaa9 to 80487af March 10, 2024 09:22

ggerganov approved these changes Mar 10, 2024

Kawrakow added 15 commits March 11, 2024 07:15

c9e9acf  Trying blocks of 16 for IQ1_S - seems slightly better

cd83a7d  iq1s_blocks16: Adjust scale fudge factor to 1.125

4c4404a  iq1s_blocks16: going to blocks of 32
         with 2048 lattice points, so same bpw. This is even better than blocks of 16. Should I try blocks of 64? But to keep the same bpw, when I go to 4096 lattice points, I need to remove blocks altogether and just have superblocks of 256 weights.

c55e66f  iq1s_blocks16: Use 2*<x^2> as sigma2 in weight adjustment

864a5c2  iq1s_blocks16: scalar and AVX2 dot products

f092d04  iq1s_blocks16: CUDA dot product

fbb001e  iq1s_blocks16: Metal works, Neon does not
         Metal works but TG is dog slow (35 t/s). PP is OKish (493 t/s). Not seeing the bug in the Neon implementation for now.

15acc79  iq1s_blocks16: fixed Neon

8561139  iq1s_blocks16: very slightly faster TG on Metal
         Still pathetic at 37 t/s

d3da9d1  iq1s_blocks16: speedup Metal by packing codebook into uint32_t's

7545d69  Formatting

156220f  iq1s_blocks16: uint32_t codebook is also better in CUDA
         TG-128 is now 204 t/s up from 194 t/s. PP-512 is 5890 t/s, so significantly better than other quants

101b18d  iq1s_blocks16: slightly faster Neon dot product

34bc21f  iq1s_blocks16: faster AVX2 dot product

9d83171  iq1s_blocks16: adjust to ggml-common.h
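
The trade-off noted in commit 4c4404a (2048 lattice points leave room for block scales, 4096 do not) follows from the fixed 1.5 bpw budget. The sketch below works it out. It assumes each lattice point encodes a group of 8 weights and that the super-block scale is not counted, as in the PR description. The function name is invented for illustration.

```python
from fractions import Fraction

TARGET_BPW = Fraction(3, 2)  # fixed budget: 1.5 bits per weight
GROUP_SIZE = 8               # assumption: one lattice point encodes 8 weights


def scale_bits_available(lattice_points: int, block_size: int) -> Fraction:
    """Bits left for one per-block scale after paying for the lattice indices."""
    index_bits = (lattice_points - 1).bit_length()  # 2048 -> 11, 4096 -> 12
    leftover_bpw = TARGET_BPW - Fraction(index_bits, GROUP_SIZE)
    return leftover_bpw * block_size


# 2048 lattice points with blocks of 32 weights.
bits_2048 = scale_bits_available(2048, 32)

# 4096 lattice points: the indices alone already use the full budget.
bits_4096 = scale_bits_available(4096, 256)

if __name__ == "__main__":
    print(bits_2048, bits_4096)
```

With 4096 points the index costs 12 bits per 8 weights, which is the whole 1.5 bpw, so nothing is left for block scales and only the super-block scale remains.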

ikawrakow force-pushed the ik/iq1s_blocks16 branch from 80487af to 9d83171 March 11, 2024 06:12

ikawrakow merged commit be858f6 into master Mar 11, 2024
50 of 63 checks passed

ikawrakow deleted the ik/iq1s_blocks16 branch March 11, 2024 06:51

CISC mentioned this pull request Mar 11, 2024

New IQ1_S somehow much worse than previous version #5996

Closed

ikawrakow mentioned this pull request Mar 11, 2024

1.5 bit: we can do even better #5999

Merged
