
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache #5492

Closed
sorasoras opened this issue Feb 14, 2024 · 10 comments
Labels
enhancement (New feature or request), stale

Comments

@sorasoras

Feature Description

From the paper: with the KV cache quantized to 2 bits, KIVI brings 2.6× lower peak memory on the Llama/Mistral/Falcon models evaluated while enabling a 4× larger batch size, resulting in a 2.35×–3.47× throughput improvement.
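
To illustrate what "asymmetric 2-bit" means here, a rough C++ sketch (not the KIVI code; per the paper, the key cache is grouped per channel and the value cache per token, but the grouping below is left to the caller, and all names are made up for illustration):

```cpp
// Sketch of asymmetric 2-bit quantization: each group of values is stored as
// 2-bit codes plus a per-group scale and zero-point, q = round((x - min)/scale).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Q2Group {
    float scale;               // (max - min) / 3
    float zero;                // group minimum (the "asymmetric" zero-point)
    std::vector<uint8_t> q;    // 2-bit codes, one per value (left unpacked for clarity)
};

// Quantize one group (assumes x is non-empty).
Q2Group quantize_2bit_asym(const std::vector<float> & x) {
    const auto [mn, mx] = std::minmax_element(x.begin(), x.end());
    Q2Group g;
    g.zero  = *mn;
    g.scale = (*mx - *mn) / 3.0f;
    if (g.scale == 0.0f) g.scale = 1.0f;       // all values equal
    g.q.reserve(x.size());
    for (float v : x) {
        const int q = (int) std::lround((v - g.zero) / g.scale);
        g.q.push_back((uint8_t) std::clamp(q, 0, 3));
    }
    return g;
}

// Reconstruct value i: x ≈ zero + scale * q.
float dequantize_2bit_asym(const Q2Group & g, size_t i) {
    return g.zero + g.scale * g.q[i];
}
```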

Motivation

Reduce memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI

It was also posted on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/

Possible Implementation

https://github.com/jy-yuan/KIVI

I find it quite interesting; it might help a lot for VRAM-poor users even without large batches or long contexts.

@sorasoras sorasoras added the enhancement (New feature or request) label Feb 14, 2024
@Green-Sky
Collaborator

Noteworthy is the fact that llama.cpp supports KV quantization. Going beyond q8_0 usually leads to very poor quality, however.

@Dampfinchen

Noteworthy is the fact that llama.cpp supports KV quantization. Going beyond q8_0 usually leads to very poor quality, however.

Llama.cpp only supports an 8-bit K cache. An 8-bit V cache is not implemented yet.

@BarfingLemurs
Contributor

Not true; Q4_0 and Q4_1 K cache quantization works for me and is documented in this PR:

#4312
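
For reference, a minimal C++ sketch of picking a quantized K cache type through the C API (assuming the type_k/type_v fields that llama_context_params gained around that PR; model.gguf is a placeholder path, and backend setup and error handling are trimmed):

```cpp
#include "llama.h"

int main() {
    // Load a model as usual (backend init and error handling trimmed).
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    // Ask for a quantized K cache; V stays f16, as discussed in this thread.
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_Q4_0;   // e.g. Q4_0/Q4_1/Q5_0/Q5_1/Q8_0
    cparams.type_v = GGML_TYPE_F16;

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    // ... run inference as usual ...
    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```

On the command line the equivalent should be the --cache-type-k / -ctk option, if I remember the PR correctly.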

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@DesperateZero

Is anyone else still interested in this feature? It would be incredibly helpful for running long contexts on systems with limited VRAM.

@slaren slaren removed the stale label Mar 18, 2024
@sorasoras
Author

@ikawrakow Is there anything you can do to help implement this in the project? We have made a lot of progress on weight quants, but we're still using an FP16 KV cache :)

@Green-Sky
Collaborator

Green-Sky commented Mar 19, 2024

I have been using q8_0 for the k part of the cache for a long time now without any issues.

llama_new_context_with_model: KV self size = 980.00 MiB, K (q8_0): 340.00 MiB, V (f16): 640.00 MiB
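
Those numbers match what you would expect if q8_0 stores each 32-value block in 34 bytes (32 int8 quants plus an fp16 scale) versus 64 bytes in f16; a back-of-the-envelope check:

```cpp
// Back-of-the-envelope check of the sizes in the log line above, assuming
// q8_0 uses 34 bytes per 32-value block (32 x int8 + one fp16 scale).
#include <cstdio>

int main() {
    const double k_f16_mib  = 640.0;  // K in f16 would match V (f16) at 640 MiB
    const double bytes_f16  = 64.0;   // bytes per 32-value block in f16
    const double bytes_q8_0 = 34.0;   // bytes per 32-value block in q8_0
    std::printf("expected K (q8_0): %.0f MiB\n", k_f16_mib * bytes_q8_0 / bytes_f16);  // 340 MiB
    return 0;
}
```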

@ikawrakow
Contributor

ikawrakow commented Mar 20, 2024

@sorasoras

To me it looks like the topic of quantized cache needs more attention from the project maintainers rather than quantization improvements:

  • Yes, we can have K quantized with Q4_0, Q4_1, Q5_0, Q5_1, or Q8_0, but not V (attempts to use a quantized V cache lead to an assert in ggml_cuda_cpy_tensor_2d).
  • Using a quantized K cache leads to a significant drop in inference speed (from 130 t/s to 76 t/s on my RTX-4080). From a quick look, the implementation seems far from optimal.
  • Using a quantized K cache other than Q8_0 results in a significant PPL increase. I personally have a hard time believing that a KV cache quantized with 2 bits, as stipulated by this issue and the quoted paper, will result in meaningful generation quality.
  • Using more sophisticated quantization techniques, which require significantly more CPU/GPU cycles, will be even more disastrous for performance (at least within the current quantized cache implementation). I did a quick test with IQ4_NL (it seems the block size needs to be 32, so IQ4_NL is the only non-legacy quantization type that can be used). I see performance dropping even further to 62 t/s. PPL improves compared to Q4_0, but not compared to Q4_1, so the only thing we gained is a ~17% reduction in the size of the K cache.

@github-actions github-actions bot added the stale label Apr 20, 2024
@github-actions
Contributor

github-actions bot commented May 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed May 5, 2024
@sorasoras
Author

@ggerganov With FA merged, is there any chance to improve the speed of the KV cache quants so they become useful?
