Use HuggingFace's Quanto library KV Cache Quantization for any Transformers-based loader #6126

Interpause · 2024-06-15T12:10:50Z

Description

HuggingFace's Quanto has implemented 4-bit and 2-bit KV cache quantization that is compatible with Transformers. See: https://huggingface.co/blog/kv-cache-quantization

I may open a PR when I have time to experiment.
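
For reference, here is a rough sketch of what the linked blog post describes: passing `cache_implementation="quantized"` plus a `cache_config` to `generate()`. The helper names `quantized_cache_kwargs` and `generate_with_quantized_cache` are hypothetical and only for illustration, and the exact `cache_config` shape may differ between Transformers versions, so treat this as a starting point rather than how a loader would have to wire it up.

```python
from typing import Any, Dict


def quantized_cache_kwargs(nbits: int = 4, backend: str = "quanto") -> Dict[str, Any]:
    """Build generate() kwargs that request a quantized KV cache.

    Hypothetical helper: the key names follow the HuggingFace blog post,
    and only the 2-bit and 4-bit settings mentioned in this issue are accepted.
    """
    if nbits not in (2, 4):
        raise ValueError(f"nbits must be 2 or 4, got {nbits}")
    return {
        "cache_implementation": "quantized",
        "cache_config": {"backend": backend, "nbits": nbits},
    }


def generate_with_quantized_cache(model_name: str, prompt: str, nbits: int = 4) -> str:
    """Load a causal LM and generate with a quantized KV cache (assumes quanto is installed)."""
    # Imported lazily so the kwargs helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=20,
        **quantized_cache_kwargs(nbits),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In a loader, the same two kwargs could presumably be merged into whatever generation parameters are already passed to `generate()`, which is why the kwargs are built separately from the model call here.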

Interpause · 2024-06-28T14:36:23Z

Seems definitely possible: Vahe1994/AQLM#85 (comment)

But man, they made me project lead for something at uni, so I'm in a time crunch.

github-actions · 2024-12-25T23:16:52Z

This issue has been closed due to inactivity for 6 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

Interpause added the enhancement (New feature or request) label on Jun 15, 2024

dinerburger mentioned this issue on Dec 7, 2024 in: Allow more granular KV cache settings #6561 (Merged)

github-actions bot added the stale label Dec 25, 2024

github-actions bot closed this as completed Dec 25, 2024
