IQ4_NL: 4-bit non-linear quants with blocks of 32 #5590
Conversation
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on Metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
Looking good. So basically, the question is: can I mix Q4_K super-blocks of 256 with IQ4_NL blocks of 32 to get even bigger space savings?
@sorasoras
Hmm, could we expect an even denser version of IQ4 in the future?
ROCm benchmarks
7900XTX at 400W TGP
It's surprising that IQ4_NL offers performance comparable to Q4_1.
Tested on Qwen1.5-14B: saved about 150 MB of file size on 3K_X_S (3.71 BPW --> 3.63 BPW) with roughly the same PPL. Thanks for the contribution.
With the changes introduced by IQ4_NL, IQ2_XS can beat the mainline Q2_K_S in terms of PPL with the same imatrix.
@ikawrakow, given the recent big changes and new k-quant implementations, could you help compile a table showing the differences among all quant types?
Cannot run IQ4_NL with MMQ on a 4070 Ti.
TL;DR
The main purpose of this PR is to provide a 4-bit quantization type that can be used when the k- and i-quants that use blocks of 256 are not available (because the number of columns in some tensors is not a multiple of 256).
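As a rough sketch of this fallback logic (the function and its name below are hypothetical, not the actual llama.cpp selection code), the decision boils down to divisibility of the row length:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of the fallback described in the TL;DR: k-quants
 * (Q4_K, ...) require the row length to be a multiple of their 256-weight
 * super-block, while IQ4_NL (like Q4_0) only requires a multiple of 32.
 * The names here are illustrative, not llama.cpp identifiers. */
static const char * choose_4bit_type(int64_t n_per_row) {
    if (n_per_row % 256 == 0) return "Q4_K";    /* super-blocks of 256 fit */
    if (n_per_row %  32 == 0) return "IQ4_NL";  /* blocks of 32 still fit  */
    return "no 4-bit block quantization fits";
}

int main(void) {
    printf("%s\n", choose_4bit_type(4096)); /* -> Q4_K              */
    printf("%s\n", choose_4bit_type(4000)); /* 4000 = 125*32 -> IQ4_NL */
    return 0;
}
```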
In short

* IQ4_NL uses blocks of 32 weights with a fp16 block scale, exactly like Q4_0, so models quantized with IQ4_NL are the exact same size as Q4_0 and Q4_K_S (see the layout sketch after this list).
* IQ4_NL uses a non-linear mapping to convert quants to weights (more on this below).
* Quantization quality is much better than Q4_0 and almost on par with Q4_K_S.
* Inference performance is comparable to Q4_0, except on Metal, where it is 8% (prompt processing) or 20% (token generation) slower than Q4_0.
* If the fp16 block scales are replaced with int8_t block scales (plus one floating point scale per row, which adds a negligible amount of bits), this would be a 4.25 bpw quantization, which has the same quantization error as the 4.5 bpw IQ4_NL added by this PR.
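For concreteness, here is a minimal sketch of the block layout implied by the list above (32 weights, one fp16 scale, two 4-bit indices packed per byte); the field and type names are assumptions, and the authoritative definition lives in the ggml sources:

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of an IQ4_NL block, consistent with the description above.
 * Field and type names are assumptions; see ggml for the real definition. */
#define QK4_NL 32

typedef struct {
    uint16_t d;              /* fp16 block scale, stored here as raw bits */
    uint8_t  qs[QK4_NL / 2]; /* 32 x 4-bit indices into the value table   */
} block_iq4_nl;

int main(void) {
    /* 18 bytes per 32 weights -> 18*8/32 = 4.5 bits per weight, same as Q4_0 */
    printf("bytes per block: %zu, bpw: %.2f\n",
           sizeof(block_iq4_nl), 8.0 * sizeof(block_iq4_nl) / QK4_NL);
    return 0;
}
```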
PPL comparisons

The following tables show PPL comparisons between Q4_0, Q4_K_S, and IQ4_NL. We start with the case of not using an importance matrix (I find this to be an important use case, as at 4-bit quantization one ideally should not have to worry too much about having a suitable imatrix to quantize a model).

Table 1: PPL comparison without imatrix for a context of 512 tokens
The next table is with an imatrix created from wiki.train.raw.

Table 2: PPL comparison with imatrix for a context of 512 tokens
Just in case researchers working on quantization happen to see this PR, here are some PPL results for a context of 4096 (LLaMA-v2 and Mistral) or 2048 (LLaMA-v1).

Table 3: PPL comparison with imatrix for a context of 4096/2048 tokens
To enable comparison with the approaches currently claiming to be SOTA, the next table shows the quantization error defined as QErr = PPL(quantized)/PPL(fp16) - 1. I took the values for AQLM and QuIP# from the latest QuIP# paper.

Table 4: Quantization error comparisons
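For concreteness, QErr is simply the relative perplexity increase of the quantized model over fp16; a minimal sketch with made-up example numbers (not taken from the tables):

```c
#include <stdio.h>

/* Quantization error as defined above: relative PPL increase over fp16. */
static double quant_error(double ppl_quantized, double ppl_fp16) {
    return ppl_quantized / ppl_fp16 - 1.0;
}

int main(void) {
    /* Illustrative numbers only. */
    printf("QErr = %.4f\n", quant_error(5.98, 5.91)); /* ~0.0118, i.e. ~1.2% */
    return 0;
}
```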
Performance comparisons
Table 5 shows PP-512 and TG-128 values for a 7B LLaMA on various platforms:

* Metal is on an M2-Max 30-core GPU
* ARM_NEON is on an M2-Max CPU using the 8 performance cores
* CUDA is on an RTX-4080
* AVX2 is on a Ryzen-7950X CPU using 16 (PP-512) or 8 (TG-128) threads

Additional details
It all comes down to this set of 16 magic values:
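For reference, these are the values as defined in the kvalues_iq4nl table in the ggml sources (worth double-checking against ggml-quants.c / ggml-common.h), together with a scalar sketch of how one block of 32 weights is dequantized with them:

```c
#include <stdint.h>
#include <stdio.h>

/* The 16 "magic" values: the non-linear map from a 4-bit quant index to a
 * (scaled) weight, reproduced from the kvalues_iq4nl table in ggml. */
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

/* Scalar sketch of dequantizing one IQ4_NL block of 32 weights: each nibble
 * indexes the table, and the result is scaled by the per-block scale d
 * (the fp16 scale, passed here as a float for simplicity). */
static void dequantize_iq4_nl_block(float d, const uint8_t qs[16], float y[32]) {
    for (int j = 0; j < 16; ++j) {
        y[j +  0] = d * kvalues_iq4nl[qs[j] & 0x0F]; /* low nibbles: weights 0..15   */
        y[j + 16] = d * kvalues_iq4nl[qs[j] >>   4]; /* high nibbles: weights 16..31 */
    }
}

int main(void) {
    uint8_t qs[16];
    float   y[32];
    for (int j = 0; j < 16; ++j) qs[j] = (uint8_t)(j | ((15 - j) << 4)); /* dummy data */
    dequantize_iq4_nl_block(0.01f, qs, y);
    printf("y[0] = %g, y[31] = %g\n", y[0], y[31]);
    return 0;
}
```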
Where do they come from? I had implemented a K-means clustering based quantization in my private repository (similar to what, e.g., SqueezeLLM does), with clustering done per tensor row. Although I was getting similar or even slightly better results than SqueezeLLM, I was not particularly happy with the quantization quality, so I decided to see what happens if I apply block-wise scaling before clustering. It turned out that the cluster means end up being (nearly) independent of the tensor/tensor row. I collected statistics of the cluster means from a few quantized models, and saw that the 16 means of the cluster means can be fit with a 3rd-order polynomial that maps quant index to a (scaled) model weight. Using the polynomial fit directly results in very decent performance on CUDA and acceptable performance on Metal, but is a no-go for CPU SIMD instructions. On the CPU the only thing that gives good performance is a lookup table containing int8_t values. So, after scaling the polynomial fit to the full int8_t range and rounding to the nearest integer, we end up with the above 16 values (a sketch of this procedure is given below).

The initial work on this was done before I implemented the importance matrix. Without an imatrix, the non-linear quantization was basically on par with Q4_K in terms of quantization error (see Table 1), while using ~7% fewer bits (if implemented row-wise with blocks of 32). But after the imatrix was added, Q4_K became again slightly better (Tables 2 and 3). The non-linear quantization outperforms Q4_K with blocks of 16. If implemented using super-blocks of 256 with 6-bit block scales, this would be a 4.4375 bpw SOTA quantization (SOTA in the sense that I'm not aware of a quantization approach that achieves a lower quantization error with less than 5 bpw).
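As an illustration of the procedure described above (fit a cubic to the cluster means, scale to the full int8_t range, round), a sketch could look like the following; the polynomial coefficients here are made-up placeholders, not the actual fit from this PR:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Illustration only: the 3rd-order coefficients below are placeholders.
 * The point is the procedure: evaluate the polynomial at the 16 quant
 * indices, rescale so the extreme value spans the int8_t range, and round. */
int main(void) {
    const double c[4] = {-0.95, 0.07, 0.0015, 0.00002}; /* hypothetical cubic fit */
    double v[16];
    double vmax = 0.0;
    for (int i = 0; i < 16; ++i) {
        v[i] = c[0] + c[1]*i + c[2]*i*i + c[3]*i*i*i;
        if (fabs(v[i]) > vmax) vmax = fabs(v[i]);
    }
    for (int i = 0; i < 16; ++i) {
        long q = lround(127.0 * v[i] / vmax); /* scale to the full int8_t range */
        printf("%4ld%s", q, i == 15 ? "\n" : ",");
    }
    return 0;
}
```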