
Allow more granular KV cache settings #6561

Merged: 14 commits into oobabooga:dev from the kv-cache-refactor branch, Dec 17, 2024

Conversation

@dinerburger (Contributor) commented Dec 7, 2024


This PR adds more granular support for KV cache settings, allowing:

  • Independent K and V cache types for llama.cpp
  • FP16, FP8, Q8, Q6 and Q4 cache types for exllama

This PR should be expanded to allow for new Quanto types as mentioned in #6126, but before I go too far I wanted to make sure this structure was appropriate.

NOTE: This should probably supersede or complement #6280 somehow.
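
For illustration only, this is roughly how independent K and V cache types reach llama.cpp on the loader side; the type_k/type_v keyword arguments and the GGML_TYPE_* constants are assumptions about the llama-cpp-python bindings, and the names here are placeholders rather than this PR's actual code:

# Illustrative sketch, not this PR's code: map user-facing strings to the
# ggml type ids accepted by llama-cpp-python for its KV cache.
import llama_cpp

GGML_CACHE_TYPES = {
    'fp16': llama_cpp.GGML_TYPE_F16,
    'q8_0': llama_cpp.GGML_TYPE_Q8_0,
    'q4_0': llama_cpp.GGML_TYPE_Q4_0,
}

def build_llama(model_path, cache_type_k='fp16', cache_type_v='fp16'):
    # Independent K and V cache quantization, as described above.
    return llama_cpp.Llama(
        model_path=model_path,
        flash_attn=True,  # a quantized KV cache generally requires flash attention
        type_k=GGML_CACHE_TYPES[cache_type_k],
        type_v=GGML_CACHE_TYPES[cache_type_v],
    )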

@oobabooga (Owner) commented

Looks good; a single flag for all KV cache types and a dropdown menu are an ideal solution.

It would be good to change shared.args.kv_cache_type if shared.args.cache_4bit or shared.args.cache_8bit is provided, to avoid a breaking change. Something like:

if shared.args.cache_4bit and shared.args.loader.lower() in ['exllamav2', 'exllamav2_hf']:
    shared.args.kv_cache_type = 'q4'

...

Feel free to continue expanding this PR, I'll merge it when you say it's ready.

@dinerburger (Contributor, Author) commented

Yeah, this is why I had marked it as a draft; I figured we'd want to work out the little details before I got too far.

Sure, it makes sense to transform the legacy KV cache quant options elsewhere. Should that go in shared.py? I can make a transform_kv_cache_options function that works similarly to fix_loader_name, for example; roughly the shape sketched below.
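
A provisional sketch (the loader names and the legacy-flag-to-type mapping would still need checking):

# Provisional sketch of the backward-compatibility shim, in the spirit of
# fix_loader_name; maps the legacy --cache_8bit/--cache_4bit flags onto the
# new per-loader cache type option.
def transform_kv_cache_options():
    loader = (shared.args.loader or '').lower()
    if loader in ['exllamav2', 'exllamav2_hf']:
        if shared.args.cache_8bit:
            shared.args.kv_cache_type = 'fp8'
        elif shared.args.cache_4bit:
            shared.args.kv_cache_type = 'q4'
    elif loader in ['llama.cpp', 'llamacpp_hf']:
        if shared.args.cache_8bit:
            shared.args.kv_cache_type = 'q8_0'
        elif shared.args.cache_4bit:
            shared.args.kv_cache_type = 'q4_0'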

I'll make this change a little later today. While I have you: are you interested in getting the Transformers Quanto loader going too as in #6126 or should I pass on that for now and focus on Exllama and llama.cpp?

@oobabooga (Owner) commented

Should that go in shared.py?

That's what I would personally do. Something simple and explicit, as this is temporary code (although I will probably keep the old flags for a long while, given that they have existed for many months).

While I have you: are you interested in getting the Transformers Quanto loader going too as in #6126

If the same flag can be reused, that would be a great addition, yes. Does that work out of the box with transformers or does a new requirement have to be added?

@dinerburger (Contributor, Author) commented

Does that work out of the box with transformers or does a new requirement have to be added?

A very good question. I don't get to work with Transformers much since I'm GPU-poor, but I'll evaluate this on a smaller model today and report back. My current leaning is: if there are no additional requirements, I'll probably hammer it in; otherwise we'll circle back and pick it up in another pass.

@@ -11,6 +11,32 @@
from modules.text_generation import get_max_prompt_length


@GodEmperor785 (Contributor) commented Dec 10, 2024

Are you sure all of these quant types work with llama.cpp?

I did some searching here: #6168
From what I saw, llama.cpp's list of supported cache types is shorter than the list of all quant types (there were no K-quants among the supported cache quantizations).

I checked the llama.cpp code a bit more now and found this: common.cpp
From what I understand, this function, kv_cache_type_from_str, determines the type used for the KV cache in llama.cpp.
And it seems to allow fewer types; it allows only "f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", and "q5_1".
In other cases it fails with "Unsupported cache type".

Did you check whether all the cache types for llama.cpp added in this PR work, especially those not on the list in the llama.cpp code (like q6_k or q4_k)?

EDIT: somehow I misclicked and the comment was added above the relevant line in the code...
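
For reference, mirroring that check on the Python side could look roughly like this (the type list is the one from common.cpp quoted above; the function name is illustrative):

# Illustrative: mirror llama.cpp's kv_cache_type_from_str so unsupported
# cache types are rejected with a clear error before reaching llama.cpp.
LLAMACPP_CACHE_TYPES = [
    'f32', 'f16', 'bf16', 'q8_0', 'q4_0', 'q4_1', 'iq4_nl', 'q5_0', 'q5_1',
]

def validate_llamacpp_cache_type(cache_type):
    cache_type = cache_type.lower()
    if cache_type not in LLAMACPP_CACHE_TYPES:
        raise ValueError(
            f"Unsupported llama.cpp cache type: {cache_type!r}. "
            f"Supported types: {', '.join(LLAMACPP_CACHE_TYPES)}"
        )
    return cache_type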

@dinerburger (Contributor, Author) commented Dec 10, 2024

Right, thanks for pointing that out; part of finalizing this PR is checking to ensure the KV cache matrix is correct. Additionally, many of these are only supported if you build with GGML_CUDA_FA_ALL_QUANTS, which I'm not sure we have enabled on our wheel. I'll ensure this is all up to code before I mark it ready.

UPDATE:

Unsupported KV type combination for head_size 128.
Supported combinations:
  - K == q4_0, V == q4_0,  4.50 BPV
  - K == q8_0, V == q8_0,  8.50 BPV
  - K == f16,  V == f16,  16.00 BPV
Compile with GGML_CUDA_FA_ALL_QUANTS for all combinations of q4_0, q4_1, q5_0, q5_1, q8_0, and f16.

So it looks like we're clamping these values to either q4_0 or q8_0, and disallowing mixing types. Bummer.
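
For illustration, the clamping amounts to something like this (purely a sketch, not the PR's code):

# Purely illustrative: with the default build (no GGML_CUDA_FA_ALL_QUANTS),
# flash attention only accepts matching K/V pairs of f16, q8_0, or q4_0,
# so other requests get clamped to the nearest supported combination.
SUPPORTED_FA_CACHE_TYPES = ['f16', 'q8_0', 'q4_0']

def clamp_cache_types(type_k, type_v):
    # Mixed K/V types are rejected by the default build, so pick one type.
    requested = type_k if type_k == type_v else 'q8_0'
    if requested not in SUPPORTED_FA_CACHE_TYPES:
        # Map any 4-bit request to q4_0, everything else to q8_0.
        requested = 'q4_0' if requested.startswith(('q4', 'iq4')) else 'q8_0'
    return requested, requested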

@dinerburger (Contributor, Author) commented

An update: transformers throws with

ImportError: You need to install optimum-quanto in order to use KV cache quantization with optimum-quanto backend. Please install it via with pip install optimum-quanto

if you try to use quanto. I'm gonna branch and stash, then fix the llama.cpp support matrix.

@dinerburger (Contributor, Author) commented

OK, that'll do it for the last round of checks. I'm opening this for real review, since everything seems to be humming along nicely now. A note: I don't love the command-line arguments, but llama.cpp and exllama have pretty different options for KV cache quantization. Happy to make changes to those as needed.

dinerburger marked this pull request as ready for review on December 10, 2024, 15:51
@dinerburger (Contributor, Author) commented Dec 10, 2024

You know, having thought on this for the day, I think I'm going to collapse this down to a single command-line argument, --cache-bits, and handle it nicely per loader, raising if we get a bad request. This is a bit of a bummer, since 6 is valid for exllamav2 but invalid for other loaders. Maybe I'll do a little bit of Gradio magic to hide it in those cases.

EDIT: OK, done. I think this is ready to go.

dinerburger force-pushed the kv-cache-refactor branch 2 times, most recently from d41d2a6 to 31641ef, on December 10, 2024, 20:50
@oobabooga (Owner) commented

About --cache-bits 8: that restricts ExLlamaV2 to the Q8 cache, removing support for the FP8 (8-bit) cache. I liked the previous solution, with a dropdown menu containing explicit text elements. That would also make the implementation more future-proof: maybe llama.cpp will add support for the other precisions at some point, like "q4_k", and being able to pass q4_k to the flag would be ideal, instead of working with integers.

@dinerburger (Contributor, Author) commented

Got it, I'll roll it back tomorrow. Thanks for the feedback!

@dinerburger (Contributor, Author) commented

@oobabooga OK, I've backed out the parameter-unification patch and moved back to per-loader, string-based quantization specification. I've confirmed everything is still cooking nicely. Thanks again for your patience!

oobabooga changed the base branch from main to dev on December 17, 2024, 15:19
@oobabooga (Owner) commented Dec 17, 2024

Thanks @dinerburger, I have made some small final changes:

  1. Merge the flags into one --cache_type, and add info messages to the UI and to the exceptions stating the valid types for each loader (to account for the possibility of future loaders other than llama.cpp and ExLlamaV2; I don't want to add a flag for each one). A rough sketch of that check follows this list.
  2. Make --cache_type None by default instead of fp16.
  3. Rename get_llamacpp_quant_type_for_string to get_llamacpp_cache_type_for_string (since it will only be used for the cache for now).
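
For reference, the shape of that check, roughly (the function name is illustrative and the valid-type lists are taken from the types discussed earlier in this thread):

# Illustrative sketch of point 1: a single --cache_type value validated per
# loader, with the exception listing the valid types for that loader.
VALID_CACHE_TYPES = {
    'llama.cpp': ['fp16', 'q8_0', 'q4_0'],
    'exllamav2': ['fp16', 'fp8', 'q8', 'q6', 'q4'],
}

def validate_cache_type(loader, cache_type):
    if cache_type is None:  # new default: let each loader use its own default
        return None

    valid = VALID_CACHE_TYPES.get(loader, ['fp16'])
    if cache_type.lower() not in valid:
        raise ValueError(
            f"Invalid cache type {cache_type!r} for loader {loader!r}. "
            f"Valid types are: {', '.join(valid)}"
        )
    return cache_type.lower()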

Do things look right after those changes? @dinerburger @GodEmperor785

@dinerburger (Contributor, Author) commented

Best of all worlds! LGTM! ✌️

@GodEmperor785 (Contributor) commented

Looks good to me too

@oobabooga (Owner) commented

Thanks @dinerburger @GodEmperor785!

@dinerburger if you feel like implementing a quantized cache for Transformers, that would be a nice addition. I assume the additional requirement is pure Python/PyTorch, without compiled wheels, so it should be easy to add.

oobabooga merged commit addad3c into oobabooga:dev on Dec 17, 2024
@dinerburger (Contributor, Author) commented

@oobabooga I actually implemented it in this (now out-of-date) branch. The problem was that it was unusably bad: generation was pure garbage. I'm not sure if it was because I missed setting axis-key and axis-value as mentioned in the HF best-practices guide, but whatever it was, it certainly wasn't a drop-in solution like llama.cpp or exllama. (I thought I saw additional documentation indicating that you shouldn't quantize until a particular cache size was reached, but I can't seem to find it now.)
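
For context, the transformers API being discussed looks roughly like this; the cache_config keys shown (backend, nbits, axis_key, axis_value) are assumptions based on the HF guide mentioned above, and the model id is just a small placeholder:

# Rough sketch of the transformers quantized KV cache path (quanto backend);
# requires optimum-quanto to be installed, as noted earlier in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The KV cache is", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4, "axis_key": 0, "axis_value": 0},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))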

@oobabooga (Owner) commented

@dinerburger (Contributor, Author) commented

It's funny: I actually forgot to try the HQQ one at all, favoring Quanto. I'll get this branch rebased tonight and give it a shot.
