Does KV cache belong to Activation? #6

Open
pprp opened this issue Apr 18, 2024 · 1 comment

pprp commented Apr 18, 2024

The survey discusses the sensitivity of activation quantization and the tolerance of KV cache quantization in the context of post-training quantization (PTQ) for large language models (LLMs). It draws a distinction: activation quantization is quite sensitive (it can significantly degrade performance if not handled carefully), whereas KV cache quantization is more tolerant (it can be quantized with less impact on performance).

My question is:
Should the KV cache be considered part of the activations?

wln20 (Contributor) commented Apr 19, 2024

Hi!

To be exact, "activations" refers to the "temporary activations" in our paper, which serve as the inputs of the linear operators, while the KV cache comes from the outputs of k_proj and v_proj.
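To make that split concrete, here is a minimal single-head attention sketch (illustrative only, not code from this repo or the paper): hidden_states and context are what the paper calls temporary activations, while the (k, v) pair returned as kv_cache is the KV cache.

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Minimal single-head attention, just to mark where each tensor kind arises."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, hidden_states, kv_cache=None):
        # hidden_states is a "temporary activation": the INPUT to the linear
        # operators q_proj / k_proj / v_proj, discarded after this step.
        q = self.q_proj(hidden_states)
        k = self.k_proj(hidden_states)  # output of k_proj -> stored in the KV cache
        v = self.v_proj(hidden_states)  # output of v_proj -> stored in the KV cache

        if kv_cache is not None:        # append to the keys/values of earlier steps
            past_k, past_v = kv_cache
            k = torch.cat([past_k, k], dim=1)
            v = torch.cat([past_v, v], dim=1)
        kv_cache = (k, v)               # persists across decoding steps

        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        context = torch.softmax(scores, dim=-1) @ v  # another temporary activation,
        return self.o_proj(context), kv_cache        # consumed as the input of o_proj
```

During decoding, hidden_states and context exist only for the current forward pass, whereas the cached (k, v) pair is kept and grows across steps.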

While you may think of both the temporary activations and the KV cache as "feature maps" within the model, we empirically found that some of their characteristics differ considerably, including their sensitivity towards quantization. So I don't think it's a good idea to treat the KV cache and the temporary activations as the same kind of tensor.
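As a rough illustration of what "sensitivity towards quantization" means in practice (this is not the evaluation protocol from the paper), one can round-trip a tensor through a naive per-tensor INT8 quantizer and compare the reconstruction error. The two tensors below are synthetic stand-ins, one with strong per-channel outliers and one without; real measurements would hook the model and capture the actual linear-layer inputs and k_proj/v_proj outputs.

```python
import torch

def int8_round_trip(x: torch.Tensor) -> torch.Tensor:
    """Naive per-tensor symmetric INT8 fake-quantization (quantize then dequantize)."""
    scale = x.abs().max() / 127.0
    return torch.clamp(torch.round(x / scale), -127, 127) * scale

def relative_error(x: torch.Tensor) -> float:
    """Relative L2 error introduced by the round trip."""
    return (torch.norm(x - int8_round_trip(x)) / torch.norm(x)).item()

# Synthetic stand-ins: a tensor with strong per-channel outliers (mimicking an
# outlier-heavy temporary activation) vs. a smoother tensor (mimicking a KV cache slice).
act_like = torch.randn(128, 4096) * (1 + 20 * (torch.rand(4096) > 0.99))
kv_like = torch.randn(128, 4096)

print(f"activation-like error: {relative_error(act_like):.4f}")
print(f"KV-cache-like error:   {relative_error(kv_like):.4f}")
```

With per-tensor scaling, a handful of outlier channels inflates the quantization step for the whole tensor, which is the usual explanation for why outlier-heavy tensors lose more accuracy at the same bit width.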

By the way, some recent works have also reported findings on the difference between activations and the KV cache, which align with our observations. For example, WKVQuant likewise points out that temporary activations are far more sensitive than the KV cache. Furthermore, KIVI's study of the data distribution of the KV cache demonstrates that the outlier patterns of the KV cache and the temporary activations are quite different, so it's not surprising that the effect of quantization varies between these two kinds of tensors.
