Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method #10181
- Adding support for new tensor types `GGML_TYPE_TQ1_0` and `GGML_TYPE_TQ2_0`
- Handling the case when the kcfg is not found for certain tensors (`token_embd.weight` and `output.weight`), displaying a warning message instead of a fatal error
Did you check the KL divergence of the new datatype using |
If we're just looking at 4 bit here Q4_0 has 1 FP16 scale per 32 weights. In the 4 bit INT_N example it's using 1 FP32 scale per 128 weights. I would imagine that perplexity would be worse than Q4_0 in this case. In the 7B 4 bit example on the Intel there's only a 12% improvement between Q4_0 and INT_N with 4 threads. If I quickly hack |
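For reference, the storage cost behind this comparison can be worked out directly. The sketch below is an illustration, not code from the PR: `bits_per_weight` is a hypothetical helper, and it assumes each block stores only the packed weights plus one scale (and optionally one zero point), with no other metadata.

```python
def bits_per_weight(weight_bits, group_size, scale_bits, zero_point_bits=0):
    """Average storage per weight for a block-quantized tensor.

    Assumes each block of `group_size` weights stores the packed weights plus
    one scale and, optionally, one zero point. Any other per-block or
    per-tensor metadata is ignored.
    """
    block_bits = group_size * weight_bits + scale_bits + zero_point_bits
    return block_bits / group_size


# Figures quoted in the comment above.
q4_0 = bits_per_weight(weight_bits=4, group_size=32, scale_bits=16)    # one FP16 scale per 32 weights
int_n = bits_per_weight(weight_bits=4, group_size=128, scale_bits=32)  # one FP32 scale per 128 weights
print(q4_0, int_n)
```

Under these assumptions Q4_0 comes out at 4.5 bpw and the 4-bit INT_N example at 4.25 bpw, so the coarser grouping saves 0.25 bits per weight at the cost of fewer scales.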
@netrunnereve We haven't yet tested the perplexity and will do it later. I would like to clarify that |
@JohannesGaessler We will do it and put the perplexity results later. |
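For readers less familiar with the two metrics requested here, the following is a minimal sketch of how perplexity and KL divergence can be computed from per-position logits. It is not the llama-perplexity implementation; `perplexity` and `mean_kl_divergence` are hypothetical helpers, and the sketch assumes the logits of a reference model (for example F16) and of the quantized model are available for the same token positions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(logits_per_position, target_ids):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(target_ids))


def mean_kl_divergence(base_logits, quant_logits):
    """Mean over positions of KL(base || quant) between the two token distributions."""
    total = 0.0
    for base, quant in zip(base_logits, quant_logits):
        log_p, log_q = log_softmax(base), log_softmax(quant)
        total += sum(math.exp(p) * (p - q) for p, q in zip(log_p, log_q))
    return total / len(base_logits)
```

Perplexity only looks at the probability assigned to the correct token, while the KL divergence compares the whole output distribution against the reference model, which is why the two can rank quantization types differently.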
@JohannesGaessler @netrunnereve We've checked the PPL and KL divergence of some EfficientQAT models which can be supported by our LUT method, and we have some new speed numbers.
Overview
In our tests, the LUT method has the same PPL/KLD for
Note that in this reply, we are using EfficientQAT models in GPTQ format for our INT_N type, which are different from those in the main PR description. Models here use 1 scale and 1 zero point for each block, while models in the main PR description use only 1 scale. Therefore, the model sizes here are a bit larger. But we can consider using
Perplexity
Update on Nov. 12: added EQAT models with quantized embedding and output weights. The configs are the same as Q2_K and Q4_0, that is, Q2_K embedding & Q6_K output for 2bit, and Q4_0 embedding & Q6_K output for 4bit. We test some more
All tests here are conducted using llama-perplexity. We tested the Q2_K_pure variants many times, but the PPL was always that big. And we'd like to clarify that
Speed
All tests here are tg128. We apologize that the previous numbers of Q4_0 with T-MAC were wrong. There is a further increase from block_size=32 (Q4_0) to block_size=128 (EQAT-w2g128). Thanks @netrunnereve for pointing it out.
|
Thanks for the numbers. I was specifically asking because I primarily work on GPU code and there the constraints are very different. In particular, on GPUs the main memory is comparatively small vs. the amount of available compute. If I read the numbers correctly, EQAT does not compress the data as efficiently as the quantization methods on master (and would thus not be a good fit for GPUs). |
@JohannesGaessler This PR is mainly to add support for LUT-based matrix multiplication kernel library which aims to speed up the CPU inference process.
|
Okay I see it more clearly now, assuming we're using regular Q4_0 on the i7 with Llama 2 7B and 4 threads we get 8.97 t/s with T-MAC off and 9.16 t/s with T-MAC on (2% improvement). Keep in mind that while the INT_N perplexity looks better it's using a specially prepared QAT model. So basically with QAT we can use one scale per 128 weights and get 9.28 t/s (additional 1% improvement). However we have our K-quants and something like Q4_K_M has perplexity of 5.877 on Llama 2 7B which slightly beats the 5.884 of INT_N. And that's with no QAT needed and basically the same or better performance compared to Q4_0. On the other hand it looks like there are some genuine performance improvements on the 2 bit side though, though perplexity is higher than Q2_K. For 4 bit whether we have T-MAC or EQAT I honestly don't think this method is worth it. |
@netrunnereve Thanks for your detailed comment!
Explain the speedup. We are proposing a brand new way to calculate low-bit matmul, implemented with a different set of instructions. Both memory bandwidth and the CPI ratio between LUT instructions and MUL instructions on a given CPU may affect the speedup of LUT over MUL. This is separate from the many great optimizations you've made inside the MUL method. Neither our LUT method nor any other method can beat the others once the workload becomes memory bound. As you can see, LUT beats the existing master data types (i.e. quantization types) by a larger percentage in single-thread cases. You are also welcome to check the T-MAC repo for more numbers. On edge devices, LUT gains an even higher speedup.
Clarify the comparison. I have to note that the comparison between Q4_K_M=5.877 and INT_N=5.884 is not fair. Q4_K_M uses Q6_K for some weights, so the overall bpw grows to over 5. A fair competitor would be Q4_K_S. Also, I don't know yet why my result differs from that document. In my test, the Llama-2-7b F16 model gives 5.7969 PPL and Q2_K_M gives 6.982406, but in https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md Q2_K_M is 5.794552, which is even lower than Q4_K and Q6_K. I think you may want to double-check the numbers in that document.
Thanks for your affirmation on the 2bit side. I guess you may have misunderstood the proposed INT_N type? (Correct me if I'm wrong!)
|
I won't comment on the TQ types as I'm unfortunately not familiar enough with the implementation and quant methods. For 2-bit there's a good performance increase compared to Q2_K but perplexity is higher (and that's with QAT already). For Q4_0 the current AVX2 implementation you're likely using on your i7 is... suboptimal to say the least, and I think fully optimized it'll perform very similarly to your Q4_0 T-MAC.
The issue here is that it becomes less of a fair comparison considering that our quants don't require EQAT. Only the creator of EQAT is uploading models and the selection is limited compared to the thousands of interesting finetunes on Hugging Face which work great with our K and I quants. It's a good idea to support QAT, but that makes the perplexity comparisons sort of unfair.
Yeah that Q2_K_M number is definitely wrong as it's lower than the Q6_K and Q8_0 results in the same table. When using a Q2_K 7B model the loss of quality is extremely obvious just from looking at the generated text. It's probably best to publish and trust your own benchmarks in this case.
I'm not trying to be nitpicky here and I apologize if I sound that way, I'm just a bit skeptical about the claims. I think it would be a good idea to run some benchmarks and perplexity against our SOTA quants with similar bpw, rather than comparing with Q4_0 and Q2_K which really aren't used nowadays. For 4-bit that's probably Q4_K_S (yeah Q4_K_M is a bit too large), IQ4_NL, and IQ4_XS. For 2-bit Q2_K has been superseded by the I-quants. |
@netrunnereve I've tested Q4_K_S, IQ4_NL and IQ4_XS. Their PPLs are all larger than 5.88, even though Q4_K_S has 8 Q5_K tensors and IQ4_XX have 4 Q5_K tensors. IQ4_XS performs the best among the three, with a smaller model size and the best perplexity. For speed, Q4_K_S > IQ4_XS > IQ4_NL. In the 4-thread scenario that you are concerned about, compared to INT_N, Q4_K_S is 6% faster on my i7-12700 and 2% slower on M2-Ultra. IQ4_XS is almost the same on i7-12700 and 22% slower on M2-Ultra. So according to the perplexity and speed numbers, I think Q4_K_S and IQ4_XS are two Pareto-front points with different trade-offs, and IQ4_NL is dominated by IQ4_XS. The INT_N model I tested is no worse than IQ4_XS in these aspects and faster on M2, and it is almost as fast as Q4_K_S while better in perplexity. So it covers, or comes close to covering, the two master types.
And I have some tests on Raspberry Pi. The peak speed of INT_N is about 20% faster than Q4_0 (3.29 tokens/s vs. 2.73 tokens/s) and about 50% faster than Q2_K_S (4.98 tokens/s vs. 3.26 tokens/s). Edge devices are more compute-bound than memory-bound, so the gap becomes obvious. |
@netrunnereve any comments or suggestions on this test results are highly appreciated!
|
Sorry, I missed your comment! In this case I can now see that EQAT-w4g128-INT_N is beating the K and I quants of similar size in terms of perplexity and speed, on the condition that the model is finetuned with EQAT. Honestly I'm not sure how much interest this project has in supporting a specific type of QAT'd model. As this is a new quant type with an associated maintenance requirement you'll probably need the core owners like GG or slaren to look into the viability of accepting this PR. |
Thanks @netrunnereve for the feedback on the new results, and for suggesting that we contact the core owners. @ggerganov @slaren - could you review and comment on the viability of accepting this PR? If maintenance is a concern, we can help maintain this feature.
|
Generally my opinion is that if there are some cases where these types perform better than any others, it would be good to merge. From what I understand from the discussion here, that seems to be the case. It looks like the code at the moment depends on an external library, which could be a problem. Is the intention to add the library code here? |
@slaren Thanks for your positive reply! There are two parts to the T-MAC code:
The second part will be an external library. The kernels and wrappers provide an alternative implementation of mul_mat for certain low-bit types (see ggml-cpu.c for the replacement). Once the kernels are generated, the T-MAC module will no longer be involved in the subsequent llama.cpp build and runtime process. However, the second part of the code is still necessary for general use. The kernels may vary depending on factors like weight bits,
We see two possible ways to integrate T-MAC:
Which approach would you prefer? |
As a rule of thumb, we should not add dependencies to 3rd party libraries to the ggml code. Backends are an exception since adding 3rd party libraries is usually unavoidable in that case. However, adding a new backend for the T-MAC types would not be a good solution either, we should keep all the CPU code in the CPU backend. This is especially important due to recent changes that add the ability to load backends dynamically. A binary package of llama.cpp may bundle multiple versions of the CPU backend compiled with different instruction sets, and we do not want to also have to bundle multiple versions of a T-MAC backend. My recommendation would be the following:
|
@slaren Thanks for your constructive reply! Since "the 3rdparty code" is implemented in Python, we have to find a proper place for it and a way to add it to the existing workflow if we add all the code here. We see it possible to treat them as
And we have two more questions:
|
Thanks for the explanation. I see now that I severely underestimated how complicated it would be to bring this code to ggml. Adding Apache TVM as a dependency of ggml is not a possibility. Using python scripts to generate the kernels may be ok depending on the circumstances, but if what this means is that every model needs a different set of kernels, that is likely too far. We absolutely need to be able to distribute binary packages of llama.cpp, since it needs to be able to run on edge/final user devices. I can give you a few pointers, but realistically, I don't see how you could bring all this system into ggml in a way that integrates with the existing code, and would not become an unreasonable maintenance burden, without effectively rewriting large parts of it. At this moment I cannot commit the time that would be necessary to even figure how to fit all of this into the existing ggml code. |
@QingtaoLi1 The NPU support sounds really promising. I have tested the Snapdragon 8 Elite NPU (supports float16, INT8, INT4) locally with their QNN Genie bundle framework, and found that the prompt processing performance is more than an order of magnitude greater than the decode speed, around 700 tokens per second for 7B at 4 bit (40x). The NPU is faster than GPU acceleration, which in my tests only has a ~3:1 ratio of prompt processing to decoding. Were you thinking of Apple Silicon NPU hardware? |
@slaren Thanks, I get the points. We have discussed this conflict in light of the llama.cpp community's rules. There is a possible plan to reach an agreement:
As you pointed out, this plan does require rewriting a large part of the T-MAC code. |
@Zant12 We are working on Qualcomm NPU, and currently no plan for Apple Silicon NPU. |
logger.warning("Error while parsing weights for quantization_config: {}".format(quantization_config))

# For permutation in, e.g., LlamaModel
w = self.modify_tensors(torch.from_numpy(w), name, bid)[0][1].numpy()
This will break for MoE models (Mixtral, Qwen2MoE, etc.).
We do not consider the MoE cases in the current convert script. Will find out how to solve it.
@slaren Hi, what do you think of the plan I proposed? We would like to reach an agreement before making the modifications, since there will be a lot of work to do. |
Hi @QingtaoLi1, sorry, I missed the edit. The plan looks good, but please understand that I cannot guarantee in any case that the code will be merged. Generally, the better the code is integrated with the existing framework, the better will be the chances of merging it. It also needs to show meaningful improvements in practical aspects to justify the maintenance cost of adding a significant amount of new code. |
@slaren Thank you. Very good advice. We will figure out a way to better fit the existing framework. I would also like to know concretely what you have in mind by "meaningful improvements in practical aspects". Which aspects do you (and the other owners) care about the most, and what do you expect from this new code? In this PR, we have seen comments from different angles, such as full-thread CPU performance, GPU constraints and NPU support. Maybe some important points or metrics have not been mentioned yet. |
Generally speaking, if we want to add new quantization formats, I would expect them to be better than any other existing formats in llama.cpp, at least in some aspects. That may be performance, file size, quality to file size ratio, power usage, etc. I think it was already established before that it is the case, but I just wanted to make it clear in case this changes after the changes that you are planning to make to the code. |
@slaren Great! Then we will run those tests again after the code changes. |
This PR introduces a new efficient lookup-table (LUT)-based matrix multiplication method to speed up low-bit LLM inference, and adds a new tensor type named `INT_N` to support it. The method can provide up to a 3~4× increase in end-to-end inference throughput and a 70% reduction in energy consumption.
Unlike the existing quant-dequant methods, the LUT-based method directly supports mixed-precision GEMM (mpGEMM) without dequantization. It uses bit-wise table lookup to eliminate multiplications and reduce the additions required in matrix multiplication. In this PR, we propose the LUT method from T-MAC. As it uses the same number of lookup tables as the weight bits, this LUT-based method provides a unified solution for mpGEMM, and the kernels scale linearly with the weight bit-width instead of falling back to int8 or uint8 for all low-bit values under 8 bits.
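To make the bit-wise table lookup concrete, here is a minimal pure-Python sketch of the general idea. It is not the T-MAC kernel: the group size `G = 4`, the helper names `build_tables` and `lut_dot`, and the use of unsigned integer weights without scales or zero points are all assumptions made for illustration, and the input length is assumed to be a multiple of `G`.

```python
G = 4  # activations per lookup group (assumed); gives 2**4 = 16 entries per table


def build_tables(activations):
    """For each group of G activations, precompute the sum of every subset."""
    tables = []
    for start in range(0, len(activations), G):
        group = activations[start:start + G]
        table = []
        for index in range(1 << G):
            # Bit j of `index` selects whether activation j of the group is included.
            table.append(sum(a for j, a in enumerate(group) if (index >> j) & 1))
        tables.append(table)
    return tables


def lut_dot(weights, tables, bits):
    """Dot product of unsigned `bits`-bit weights with the activations behind `tables`.

    The weights are split into bit planes. Each plane contributes one table
    lookup per group, so no weight-activation multiplication is performed.
    """
    total = 0
    for bit in range(bits):
        plane_sum = 0
        for g, table in enumerate(tables):
            index = 0
            for j in range(G):
                index |= ((weights[g * G + j] >> bit) & 1) << j
            plane_sum += table[index]
        total += plane_sum * (1 << bit)  # weight the plane by its bit significance
    return total


activations = [3, -1, 4, 1, -5, 9, 2, -6]
weights = [0, 1, 2, 3, 3, 2, 1, 0]  # 2-bit unsigned weights
tables = build_tables(activations)
print(lut_dot(weights, tables, bits=2))
print(sum(a * w for a, w in zip(activations, weights)))
```

Both lines print the same value. The tables depend only on the activations, so they can be reused across all weight rows, and adding one weight bit adds one more pass over the bit planes, which is the linear scaling with bit-width described above.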
We add a new data type `INT_N`, as well as a corresponding convert script, to support the tensor layout needs of the LUT kernels. The bit rate of `INT_N` depends on the scale group size of the model and/or its original data type. The scale group size alternatives are {32, 64, 128, 256}, of which sizes >= 64 fully unlock the efficiency of T-MAC. For example, if the scale group size is 256, the model weights are 4 bits and the scales are `F32`, the bit rate is (256 * 4 + 32) / 256 = 4.125 bpw.
How to Use It
Using T-MAC in llama.cpp is similar to using existing quantization methods and models, except for a few commands to compile the LUT kernels for the model to run and to convert the models into data types that are currently supported (for now, `Q4_0`, `TQ1_0`, `TQ2_0` and `INT_N`). For `Q4_0`, `TQ1_0` and `TQ2_0`, models in gguf format can run directly with T-MAC; for `INT_N`, we add support in convert_hf_to_gguf.py to convert HuggingFace models to `INT_N`. Compiling the LUT kernels requires the T-MAC modules as dependencies.
K-quants can be supported by T-MAC, but this requires some engineering effort. We plan it as a TODO item.
We have a dockerfile to set up the T-MAC environment in Ubuntu-22.04 and a one-stop script to use T-MAC. Note that the script is only a wrapper for convenience and doesn't introduce a brand new way to build the project. Here are some examples:
Speed
Update: for the latest numbers, see below.
We test this PR on an Intel i7-12700 and an Apple M2-Ultra. The numbers below are in token/s. For details of the model, see the next section.
Model size
The `INT_N` model uses `F16` embedding and output weights, therefore the `Q2_K` and `Q4_0` models here use the same config for a fair comparison.
We also use the pure `Q2_K` model here since its model size is very close to `INT_N` (2bit).
Note that the block size of the `INT_N` models here is 64 for 2bit and 128 for 4bit, and the scales are currently stored in `F32`.
* We find that `Q2_K` is actually 2.625 bpw instead of the 2.5625 described in #1684.
Perplexity
See below.
Note
T-MAC has a public repo which includes llama.cpp as a third-party module. For changes inside llama.cpp, we can directly merge the changes, while the T-MAC modules will stay in that repo. It provides the capability to run arbitrary models with T-MAC. Without it, we can only run those supported by pre-built LUT kernels.
Our LUT-based method is used in the recently open-sourced bitnet.cpp repo which is built on llama.cpp. We can easily generate corresponding kernels and support their models.
Future Work