
Allow more granular KV cache settings #6561

Merged: 14 commits into oobabooga:dev from the kv-cache-refactor branch, Dec 17, 2024

Conversation

@dinerburger (Contributor) commented Dec 7, 2024


This PR adds more granular support for KV cache settings, allowing:

  • Independent K and V cache types for llama.cpp
  • FP16, FP8, Q8, Q6 and Q4 cache types for exllama

This PR should be expanded to allow for new Quanto types as mentioned in #6126, but before I go too far I wanted to make sure this structure was appropriate.

NOTE: This should probably supersede or complement #6280 somehow.
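
For illustration only, this is roughly how independent K and V cache types reach llama.cpp on the loader side; the type_k/type_v keyword arguments and the GGML_TYPE_* constants are assumptions about the llama-cpp-python bindings, and the names here are placeholders rather than this PR's actual code:

# Illustrative sketch, not this PR's code: map user-facing strings to the
# ggml type ids accepted by llama-cpp-python for its KV cache.
import llama_cpp

GGML_CACHE_TYPES = {
    'fp16': llama_cpp.GGML_TYPE_F16,
    'q8_0': llama_cpp.GGML_TYPE_Q8_0,
    'q4_0': llama_cpp.GGML_TYPE_Q4_0,
}

def build_llama(model_path, cache_type_k='fp16', cache_type_v='fp16'):
    # Independent K and V cache quantization, as described above.
    return llama_cpp.Llama(
        model_path=model_path,
        flash_attn=True,  # a quantized KV cache generally requires flash attention
        type_k=GGML_CACHE_TYPES[cache_type_k],
        type_v=GGML_CACHE_TYPES[cache_type_v],
    )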

@oobabooga (Owner) commented

Looks good; a single flag for all KV cache types and a dropdown menu are an ideal solution.

It would be good to change shared.args.kv_cache_type if shared.args.cache_4bit or shared.args.cache_8bit is provided, to avoid a breaking change. Something like:

if shared.args.cache_4bit and shared.args.loader.lower() in ['exllamav2', 'exllamav2_hf']:
    shared.args.kv_cache_type = 'q4'

...

Feel free to continue expanding this PR, I'll merge it when you say it's ready.

@dinerburger (Contributor, Author) commented

Yeah, this is why I had marked it as a draft; I figured we'd want to work out the little details before I got too far.

Sure, it makes sense to transform the legacy KV cache quant options elsewhere. Should that go in shared.py? I can make a transform_kv_cache_options function that works similarly to fix_loader_name, for example; roughly the shape sketched below.
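
A provisional sketch (the loader names and the legacy-flag-to-type mapping would still need checking):

# Provisional sketch of the backward-compatibility shim, in the spirit of
# fix_loader_name; maps the legacy --cache_8bit/--cache_4bit flags onto the
# new per-loader cache type option.
def transform_kv_cache_options():
    loader = (shared.args.loader or '').lower()
    if loader in ['exllamav2', 'exllamav2_hf']:
        if shared.args.cache_8bit:
            shared.args.kv_cache_type = 'fp8'
        elif shared.args.cache_4bit:
            shared.args.kv_cache_type = 'q4'
    elif loader in ['llama.cpp', 'llamacpp_hf']:
        if shared.args.cache_8bit:
            shared.args.kv_cache_type = 'q8_0'
        elif shared.args.cache_4bit:
            shared.args.kv_cache_type = 'q4_0'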

I'll make this change a little later today. While I have you: are you interested in getting the Transformers Quanto loader going too as in #6126 or should I pass on that for now and focus on Exllama and llama.cpp?

@oobabooga (Owner) commented

Should that go in shared.py?

That's what I would personally do. Something simple and explicit, as this is temporary code (although I will probably keep the old flags for a long while, given that they have existed for many months).

While I have you: are you interested in getting the Transformers Quanto loader going too as in #6126

If the same flag can be reused, that would be a great addition, yes. Does that work out of the box with transformers or does a new requirement have to be added?

@dinerburger (Contributor, Author) commented

Does that work out of the box with transformers or does a new requirement have to be added?

A very good question. I don't get to work with Transformers much since I'm GPU-poor, but I'll evaluate this on a smaller model today and report back. My current leaning is: if there are no additional requirements, I'll probably hammer it in; otherwise we'll circle back and pick it up in another pass.

@@ -11,6 +11,32 @@
from modules.text_generation import get_max_prompt_length


@GodEmperor785 (Contributor) commented Dec 10, 2024

Are you sure all of these quant types work with llama.cpp?

I did some searching here: #6168
From what I saw, llama.cpp's list of supported cache types is shorter than the list of all quant types (there were no K-quants among the supported cache quantizations).

I checked the llama.cpp code a bit more now and found this: common.cpp
From what I understand, this function, kv_cache_type_from_str, determines the type used for the KV cache in llama.cpp.
And it seems to allow fewer types; it allows only "f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", and "q5_1".
In other cases it fails with "Unsupported cache type".

Did you check whether all the cache types for llama.cpp added in this PR work, especially those not on the list in the llama.cpp code (like q6_k or q4_k)?

EDIT: somehow I misclicked and the comment was added above the relevant line in the code...
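
For reference, mirroring that check on the Python side could look roughly like this (the type list is the one from common.cpp quoted above; the function name is illustrative):

# Illustrative: mirror llama.cpp's kv_cache_type_from_str so unsupported
# cache types are rejected with a clear error before reaching llama.cpp.
LLAMACPP_CACHE_TYPES = [
    'f32', 'f16', 'bf16', 'q8_0', 'q4_0', 'q4_1', 'iq4_nl', 'q5_0', 'q5_1',
]

def validate_llamacpp_cache_type(cache_type):
    cache_type = cache_type.lower()
    if cache_type not in LLAMACPP_CACHE_TYPES:
        raise ValueError(
            f"Unsupported llama.cpp cache type: {cache_type!r}. "
            f"Supported types: {', '.join(LLAMACPP_CACHE_TYPES)}"
        )
    return cache_type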

@dinerburger (Contributor, Author) commented Dec 10, 2024

Right, thanks for pointing that out; part of finalizing this PR is checking to ensure the KV cache matrix is correct. Additionally, many of these are only supported if you build with GGML_CUDA_FA_ALL_QUANTS, which I'm not sure we have enabled on our wheel. I'll ensure this is all up to code before I mark it ready.

UPDATE:

Unsupported KV type combination for head_size 128.
Supported combinations:
  - K == q4_0, V == q4_0,  4.50 BPV
  - K == q8_0, V == q8_0,  8.50 BPV
  - K == f16,  V == f16,  16.00 BPV
Compile with GGML_CUDA_FA_ALL_QUANTS for all combinations of q4_0, q4_1, q5_0, q5_1, q8_0, and f16.

So it looks like we're clamping these values to either q4_0 or q8_0, and disallowing mixing types. Bummer.
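
For illustration, the clamping amounts to something like this (purely a sketch, not the PR's code):

# Purely illustrative: with the default build (no GGML_CUDA_FA_ALL_QUANTS),
# flash attention only accepts matching K/V pairs of f16, q8_0, or q4_0,
# so other requests get clamped to the nearest supported combination.
SUPPORTED_FA_CACHE_TYPES = ['f16', 'q8_0', 'q4_0']

def clamp_cache_types(type_k, type_v):
    # Mixed K/V types are rejected by the default build, so pick one type.
    requested = type_k if type_k == type_v else 'q8_0'
    if requested not in SUPPORTED_FA_CACHE_TYPES:
        # Map any 4-bit request to q4_0, everything else to q8_0.
        requested = 'q4_0' if requested.startswith(('q4', 'iq4')) else 'q8_0'
    return requested, requested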

@dinerburger (Contributor, Author) commented

An update: transformers throws with

ImportError: You need to install optimum-quanto in order to use KV cache quantization with optimum-quanto backend. Please install it via with pip install optimum-quanto

if you try to use quanto. I'm gonna branch and stash, then fix the llama.cpp support matrix.

@dinerburger (Contributor, Author) commented

OK, that'll do it for the last round of checks. I'm opening this for real review, since everything seems to be humming along nicely now. A note: I don't love the command-line arguments, but llama.cpp and exllama have pretty different options for KV cache quantization. Happy to make changes to those as needed.

dinerburger marked this pull request as ready for review on December 10, 2024, 15:51
@dinerburger (Contributor, Author) commented Dec 10, 2024

You know, having thought on this for the day, I think I'm going to collapse this down to a single command-line argument, --cache-bits, and handle it nicely per loader, raising if we get a bad request. This is a bit of a bummer, since 6 is valid for exllamav2 but invalid for other loaders. Maybe I'll do a little bit of Gradio magic to hide it in those cases.

EDIT: OK, done. I think this is ready to go.

dinerburger force-pushed the kv-cache-refactor branch 2 times, most recently from d41d2a6 to 31641ef, on December 10, 2024, 20:50
@oobabooga (Owner) commented

About --cache-bits 8: that restricts ExLlamaV2 to the Q8 cache, removing support for the FP8 (8-bit) cache. I liked the previous solution, with a dropdown menu containing explicit text elements. That would also make the implementation more future-proof: maybe llama.cpp will add support for the other precisions at some point, like "q4_k", and being able to pass q4_k to the flag would be ideal, instead of working with integers.

@dinerburger (Contributor, Author) commented

Got it, I'll roll it back tomorrow. Thanks for the feedback!

@dinerburger (Contributor, Author) commented

@oobabooga OK, I've backed out the parameter-unification patch and moved back to per-loader, string-based quantization specification. I've confirmed everything is still cooking nicely. Thanks again for your patience!

oobabooga changed the base branch from main to dev on December 17, 2024, 15:19
@oobabooga (Owner) commented Dec 17, 2024

Thanks @dinerburger, I have made some small final changes:

  1. Merge the flags into one --cache_type, and add info messages to the UI and to the exceptions stating the valid types for each loader (to account for the possibility of future loaders other than llama.cpp and ExLlamaV2; I don't want to add a flag for each one). A rough sketch of that check follows this list.
  2. Make --cache_type None by default instead of fp16.
  3. Rename get_llamacpp_quant_type_for_string to get_llamacpp_cache_type_for_string (since it will only be used for the cache for now).
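
For reference, the shape of that check, roughly (the function name is illustrative and the valid-type lists are taken from the types discussed earlier in this thread):

# Illustrative sketch of point 1: a single --cache_type value validated per
# loader, with the exception listing the valid types for that loader.
VALID_CACHE_TYPES = {
    'llama.cpp': ['fp16', 'q8_0', 'q4_0'],
    'exllamav2': ['fp16', 'fp8', 'q8', 'q6', 'q4'],
}

def validate_cache_type(loader, cache_type):
    if cache_type is None:  # new default: let each loader use its own default
        return None

    valid = VALID_CACHE_TYPES.get(loader, ['fp16'])
    if cache_type.lower() not in valid:
        raise ValueError(
            f"Invalid cache type {cache_type!r} for loader {loader!r}. "
            f"Valid types are: {', '.join(valid)}"
        )
    return cache_type.lower()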

Do things look right after those changes? @dinerburger @GodEmperor785

@dinerburger (Contributor, Author) commented

Best of all worlds! LGTM! ✌️

@GodEmperor785 (Contributor) commented

Looks good to me too

@oobabooga (Owner) commented

Thanks @dinerburger @GodEmperor785!

@dinerburger if you feel like implementing a quantized cache for Transformers, that would be a nice addition. I assume the additional requirement is pure Python/PyTorch, without compiled wheels, so it should be easy to add.

oobabooga merged commit addad3c into oobabooga:dev on Dec 17, 2024
@dinerburger (Contributor, Author) commented

@oobabooga I actually implemented it in this (now out-of-date) branch. The problem was that it was unusably bad: generation was pure garbage. I'm not sure if it was because I missed setting axis-key and axis-value as mentioned in the HF best-practices guide, but whatever it was, it certainly wasn't a drop-in solution like llama.cpp or exllama. (I thought I saw additional documentation indicating that you shouldn't quantize until a particular cache size was reached, but I can't seem to find it now.)
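
For context, the transformers API being discussed looks roughly like this; the cache_config keys shown (backend, nbits, axis_key, axis_value) are assumptions based on the HF guide mentioned above, and the model id is just a small placeholder:

# Rough sketch of the transformers quantized KV cache path (quanto backend);
# requires optimum-quanto to be installed, as noted earlier in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The KV cache is", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4, "axis_key": 0, "axis_value": 0},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))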

@oobabooga (Owner) commented

@dinerburger (Contributor, Author) commented

It's funny: I actually forgot to try the HQQ one at all, favoring Quanto. I'll get this branch rebased tonight and give it a shot.
