The survey discusses the sensitivity of activation quantization and the tolerance of KV cache quantization in the context of post-training quantization (PTQ) for large language models (LLMs). It makes the distinction that while activation quantization is quite sensitive (meaning it can significantly affect performance if not handled carefully), KV cache quantization is more tolerant (implying it can be quantized with less impact on performance).
My question is:
Should the KV cache be considered part of the activations?
To be exact, "activations" refers to the "temporary activations" in our paper, which serve as the inputs of the linear operators, while the KV cache comes from the outputs of k_proj and v_proj.
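A minimal sketch of where the two kinds of tensors appear in a decoder attention block may help make the distinction concrete. The module below is illustrative only (single head, no masking, hypothetical names), not code from the paper:

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Minimal single-head attention block, only to locate the two tensor kinds."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, hidden_states: torch.Tensor, kv_cache=None):
        # "Temporary activations": inputs of the linear operators
        # (here, `hidden_states` feeding the q/k/v projections).
        q = self.q_proj(hidden_states)

        # KV cache: outputs of k_proj and v_proj, kept across decoding steps.
        k = self.k_proj(hidden_states)
        v = self.v_proj(hidden_states)
        if kv_cache is not None:
            past_k, past_v = kv_cache
            k = torch.cat([past_k, k], dim=1)
            v = torch.cat([past_v, v], dim=1)
        new_cache = (k, v)

        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        out = self.o_proj(attn @ v)  # the input `attn @ v` is again a temporary activation
        return out, new_cache
```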
While you may regard both the temporary activations and the KV cache as "feature maps" within the model, we empirically found that some of their characteristics differ considerably (including their sensitivity to quantization). So I think it's not a good idea to treat the KV cache and the temporary activations as the same kind of tensor.
By the way, some recent works have also reported differences between activations and the KV cache, aligning with our observations. For example, WKVQuant also indicates the excessive sensitivity of temporary activations compared with the KV cache. Furthermore, KIVI's study on the data distribution of the KV cache demonstrates that the outlier patterns of the KV cache and the temporary activations are quite different, so it's not surprising that the effect of quantization varies between these two kinds of tensors.
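If you want to check this yourself, a rough diagnostic is to compare how strongly the per-channel ranges of each tensor deviate from the typical channel, since dominant channels inflate the quantization range. The snippet below is only an illustrative sketch with toy data and a hypothetical metric, not the methodology of any of the cited papers:

```python
import torch

def channel_outlier_ratio(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-channel outlier score: max |value| per channel divided by the median
    of those maxima. Channels with large ratios dominate the quantization range.
    x: (num_tokens, num_channels)
    """
    per_channel_max = x.abs().amax(dim=0)
    return per_channel_max / (per_channel_max.median() + eps)

# Hypothetical usage: `act` holds captured inputs of a linear layer, and
# `k_cache` holds the corresponding cached keys, each flattened to
# (num_tokens, num_channels). Toy random data stands in for real captures.
act = torch.randn(512, 4096) * (torch.rand(4096) * 4)
k_cache = torch.randn(512, 4096)
print("activation outlier score:", channel_outlier_ratio(act).max().item())
print("key-cache outlier score: ", channel_outlier_ratio(k_cache).max().item())
```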