Q4 quantization support #197

Narsil · 2023-03-17T08:50:33Z

Temporary PR, need to figure out a way to make sure this is usable in practice.

Either make the format work for llama.cpp &co (but the models over there include tokenization so...)
Or make something like smelt work with quantized data.

- Format ggerganov/ggml#27

Narsil · 2023-03-17T14:55:37Z

Converted to draft. I will only merge this after it has been showcased in a real model example.

philpax · 2023-04-14T08:35:30Z

I've been thinking about this some more from llama-rs's side - I think it would be quite nice for us to use safetensors as a first-class format that could support LLaMA/RWKV/BLOOM/etc in q4 format.

We'd need to store the hyperparameters and vocabulary ((string, f32)[]) - I assume that would be possible in the header?
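For illustration, here is a minimal sketch of how hyperparameters could ride along in the header. It relies on the documented safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes) and on the optional `__metadata__` string-to-string map. The hyperparameter names and the tensor name are hypothetical, and the official library may pad or validate the header differently than this hand-rolled writer does.

```python
import json
import struct


def build_safetensors(tensors, metadata):
    """Serialize tensors plus a string-to-string metadata map.

    tensors: dict of name -> (dtype string, shape list, raw bytes)
    metadata: dict of str -> str (values must be strings in this format)
    """
    header = {"__metadata__": metadata}
    payload = b""
    offset = 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
        payload += raw
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian header size, then the JSON header, then the data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_metadata(blob):
    """Return the __metadata__ map without touching the tensor data."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    return header.get("__metadata__", {})


# Hypothetical hyperparameter names, JSON-encoded because values must be strings.
hparams = {"n_vocab": 32000, "n_embd": 4096, "n_head": 32, "n_layer": 32}
blob = build_safetensors(
    {"tok_embeddings.weight": ("F32", [1, 2], struct.pack("<2f", 0.5, -0.5))},
    {"hyperparameters": json.dumps(hparams)},
)
restored = json.loads(read_metadata(blob)["hyperparameters"])
```

A vocabulary of (string, f32) pairs could be JSON-encoded into a metadata value the same way, though for a large vocabulary the header would grow accordingly.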

Narsil · 2023-04-17T09:20:21Z

> hyperparameters and vocabulary

Ideally the vocabulary would be in tokenizers (https://github.com/huggingface/tokenizers/), which supports all of llama
(the file is here: https://huggingface.co/hf-internal-testing/llama-tokenizer/blob/main/tokenizer.json)

What do you mean by hyperparameters ?

Supporting Q4 will require a bit more work. I've dug into it, and it's not exactly n bits per parameter; it's more like n bits per group of 32 q4 values. The number is also not the same for q4_0 and q4_1 (which I think would be more correctly named q4_0_32 and q4_1_32, since the packing size is quite critical; see ggerganov/llama.cpp#1004).
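To make the "bits per group" point concrete, here is a small arithmetic sketch. It assumes a block stores its packed 4-bit values plus per-block metadata of one f32 scale (4 bytes) for the scale-only variant, or an f32 scale and an f32 offset (8 bytes) for the scale-plus-offset variant. Those metadata sizes are an assumption for illustration; the actual ggml block layouts may differ.

```python
def bits_per_weight(group_size, bits, metadata_bytes):
    """Effective storage cost per weight for one quantization block."""
    packed_bytes = group_size * bits / 8
    return (packed_bytes + metadata_bytes) * 8 / group_size


# Assumed metadata: 4 bytes (f32 scale) or 8 bytes (f32 scale + f32 offset).
q4_0_32 = bits_per_weight(32, 4, 4)  # 5.0 bits per weight
q4_1_32 = bits_per_weight(32, 4, 8)  # 6.0 bits per weight
q4_0_16 = bits_per_weight(16, 4, 4)  # 6.0 bits per weight
```

Under these assumptions, halving the group size costs a full extra bit per weight, which is why the group size arguably belongs in the type name.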

philpax · 2023-04-19T20:47:18Z

> Ideally the vocabulary would be in tokenizers https://github.com/huggingface/tokenizers/ which supports all of llama

Hmm, fair enough. We prefer single-file deployments for the convenience, but it makes sense to have a standard here.

> What do you mean by hyperparameters ?

Vocabulary size, dimensions, heads, layers. The usual. I imagine that's part of HF's config.json?

> Now supporting Q4 will require a bit more work, I've deep dived into it, and it's not exactly n-bits per parameter. It's more n-bits per group of 32 q4. And the number is not the same for q4_0 and q4_1 (which I think would be more correctly named q4_0_32, q4_1_32 since the packing size is quite critical (ggerganov/llama.cpp#1004))

Yeah, that makes sense. No rush on this, we'll support it when it's ready :)

Narsil · 2023-04-20T17:17:48Z

> Vocabulary size, dimensions, heads, layers. The usual. I imagine that's part of HF's config.json?

Currently yes.

> We prefer single-file deployments for the convenience,

Afaik, you always need the computation graph too, which is included neither in config.json nor in model.safetensors but in the program itself (in transformers or ggml, for instance).

For that reason, single-file deployment is currently not a goal (and also because you can write a different/better program with the same weights, which we happen to do quite regularly).

Please let me know when you have q4 support in whatever format, and I'll take a look at how to enable it here.
Also let me know if any specific alignments are required (I don't think there's anything more than regular byte alignment, but I may have misread that).

iacore · 2023-05-02T10:51:17Z

ggml has added 2 packing formats. Those are better.

q4_0, q4_2: $\vec{x}w$
q4_1, q4_3: $\vec{x}w + b$

q4_0: 32 ints packed
q4_1: 32 ints
q4_2: 16 ints
q4_3: 16 ints
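As a rough sketch of the two families above (scale only versus scale plus offset), the following pure-Python code quantizes one block of 32 floats to 4-bit values, packs two values per byte, and reconstructs them. The function names and the exact rounding and nibble order are assumptions for illustration, not ggml's actual layout.

```python
def pack_nibbles(nibbles):
    """Pack pairs of 4-bit values (0..15) into single bytes, low nibble first."""
    return bytes(nibbles[i] | (nibbles[i + 1] << 4) for i in range(0, len(nibbles), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


def quantize_block(values, with_offset):
    """Return (scale, offset or None, packed bytes) for one block."""
    if with_offset:
        # x ~= q * w + b, with q in 0..15 and b the block minimum.
        lo, hi = min(values), max(values)
        scale = (hi - lo) / 15 or 1.0
        return scale, lo, pack_nibbles([round((v - lo) / scale) for v in values])
    # x ~= q * w, with signed q in -7..7 stored as q + 8.
    scale = max(abs(v) for v in values) / 7 or 1.0
    return scale, None, pack_nibbles([round(v / scale) + 8 for v in values])


def dequantize_block(scale, offset, packed):
    """Reconstruct approximate float values from a packed block."""
    nibbles = unpack_nibbles(packed)
    if offset is None:
        return [scale * (n - 8) for n in nibbles]
    return [scale * n + offset for n in nibbles]


# A deterministic block of 32 values spanning [-2.0, 1.875].
block = [((i * 37) % 32 - 16) / 8 for i in range(32)]
scale, offset, packed = quantize_block(block, with_offset=True)
restored = dequantize_block(scale, offset, packed)
```

In both variants a block of 32 values packs into 16 bytes, and the reconstruction error per value is bounded by half the scale.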

Maybe a more descriptive name is better?

Narsil · 2023-05-03T08:49:18Z

> Maybe a more descriptive name is better?

I had in mind q4_0_32 and q4_0_16.

The thing is that this format packs the scale + zero point into the same tensor as the weights. GPTQ splits those into different tensors (https://github.com/qwopqwop200/GPTQ-for-LLaMa), which makes GPTQ models already valid in current safetensors.

I'm not sure how much the locality helps performance there.

Also, adding new formats makes for a lot of added complexity in the types, especially as a matrix of them: currently there are 3-, 4-, and 5-bit quantization schemes, with packing sizes of 16 and 32 (128 and full row in GPTQ), and (scale) or (scale + zero) variants. None of them would be loadable in torch, tf, or numpy.
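To give a feel for the size of that matrix, here is a short enumeration. The names follow the `q{bits}_{variant}_{group}` pattern floated in this thread (0 for scale only, 1 for scale plus zero); the `row` suffix for full-row grouping is a hypothetical label, and not every combination necessarily exists in practice.

```python
from itertools import product

BITS = (3, 4, 5)
VARIANTS = (0, 1)  # 0: scale only, 1: scale + zero point
GROUPS = ("16", "32", "128", "row")  # "row" is a hypothetical label

names = [
    f"q{bits}_{variant}_{group}"
    for bits, variant, group in product(BITS, VARIANTS, GROUPS)
]
count = len(names)  # 3 * 2 * 4 = 24 candidate dtypes
```

Two dozen candidate dtypes from three small axes is the maintenance burden in question.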

It's not at all a problem to add specific types, but since we have to maintain them until the end of time, I think it would be nice to do it once the community settles on common ground.

My current understanding is that ggml recommends q5_1_32, while GPTQ recommends q4_1_128 (GPTQ uses a different packing scheme, which works better than the naive ggml one, hence the reduced bit size, iiuc).

iacore · 2023-05-04T01:53:23Z

safetensors currently supports bfloat16, even though it is only supported by torch/tf, not numpy.

The problem is that the official loader is too restrictive. On meeting unknown types it just gives up. The upper case dtype naming due to serde is weird too. (F16 instead of f16)

See here: https://github.com/huggingface/safetensors/blob/752c1ab3b52463f4c4efda056e4c6a41e81a7ff3/safetensors/src/tensor.rs#LL594C1-L594C1

Maybe we should have a place to document custom types. Something like IANA registry for types.

Features:

- Alignment
- bytes per unit
- elements per unit (important for quantized types, since it's packed)

The loader code is simple enough that applications can write their own.
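One possible shape for such a registry, sketched with the three features listed above. The quantized entries and their sizes are hypothetical (they assume a block of 32 nibbles plus an f32 scale, or an f32 scale and f32 offset), and nothing here reflects an existing safetensors API.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class DtypeInfo:
    alignment: int          # required byte alignment of the tensor data
    bytes_per_unit: int     # size of one storage unit in bytes
    elements_per_unit: int  # logical elements packed into one unit


REGISTRY = {
    "F32": DtypeInfo(alignment=4, bytes_per_unit=4, elements_per_unit=1),
    "F16": DtypeInfo(alignment=2, bytes_per_unit=2, elements_per_unit=1),
    # Hypothetical quantized entries: 16 packed bytes + 4 or 8 metadata bytes.
    "Q4_0_32": DtypeInfo(alignment=4, bytes_per_unit=20, elements_per_unit=32),
    "Q4_1_32": DtypeInfo(alignment=4, bytes_per_unit=24, elements_per_unit=32),
}


def byte_size(dtype, shape):
    """Compute the byte length of a tensor, or fail on unregistered types."""
    info = REGISTRY.get(dtype)
    if info is None:
        raise KeyError(f"unregistered dtype: {dtype}")
    count = prod(shape)
    # Simplification: only the total element count is checked, not each row.
    if count % info.elements_per_unit:
        raise ValueError(f"{count} elements do not fill whole {dtype} units")
    return count // info.elements_per_unit * info.bytes_per_unit


size_q4 = byte_size("Q4_0_32", [4096, 4096])
size_f16 = byte_size("F16", [2, 3])
```

With this, a loader can validate `data_offsets` for a packed type without knowing how to decode it.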

I already made a tool to quantize safetensors models to every quantized format ggml supports: https://github.com/iacore/model-conversions/tree/main/quantize-wizard

Narsil · 2023-05-04T09:43:00Z

> Current safetensors support bfloat16, but is only supported by torch/tf, not numpy.

I know, which is why I said it's not a blocker to add custom types (it's just still something to think about when adding things).
torch, tf (and jax, sort of, since it's mostly on par with tf for types) are the primary targets (currently; I would definitely love to see non-Python alternatives).

Here q4_{0,1} would be clearly made for llama.cpp (and friends)

> Features:
>
> - Alignment
> - bytes per unit
> - elements per unit (important for quantized types, since it's packed)

I like the idea, but it wouldn't work for GPTQ, for instance, since GPTQ splits the packed quantized unit and the scales and zeros into different tensors, so the "unit" is not even in a single tensor there.

Maybe this splitting is just a bad idea, I haven't formally checked this yet. (Meaning we could stop thinking about GPTQ there and the idea you suggest works.)

iacore · 2023-05-04T11:48:49Z

> Here q4_{0,1} would be clearly made for llama.cpp (and friends)

No. Quantization is useful for RWKV (not a transformer). Maybe it's useful for other ANNs as well.

> I like the idea, but it wouldn't work for GPTQ for instance, since GPTQ splits the packed quantized unit and the scales and zeros into different tensors. Because the "unit" is not even in a single tensor there.

How does it work? What's the quantized struct in C?

Narsil · 2023-05-04T12:14:31Z

Complete story: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/quant/quant_linear.py#L73

Short story: they store the zeros and scales in tensors entirely separate from the 4-bit packed "weights" tensor.
There is no "C" struct.

It allows for a non-linear mapping of packings, which is an important aspect of the method: they pack the quantization with respect to activations, which supposedly handles outliers better (and hence gives less variance in degradation when quantizing).

iacore · 2023-05-05T15:16:12Z

That seems easier. Just store them as different tensors inside .safetensors.
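A sketch of what that could look like in a safetensors header, using only dtypes the format already has (`I32` and `F16`). The tensor names, the shapes, and the choice to keep zeros as `F16` are simplifying assumptions for illustration; GPTQ-for-LLaMa's real layout may differ (see the link above).

```python
DTYPE_BYTES = {"I32": 4, "F16": 2}


def split_quant_header(prefix, in_features, out_features, bits=4, group_size=128):
    """Build header entries for one linear layer stored as three tensors.

    Only handles bit widths that divide 32 evenly (e.g. 2, 4, 8).
    """
    per_word = 32 // bits  # quantized values packed into one 32-bit word
    groups = in_features // group_size
    entries = [
        (f"{prefix}.qweight", "I32", [in_features // per_word, out_features]),
        (f"{prefix}.scales", "F16", [groups, out_features]),
        (f"{prefix}.zeros", "F16", [groups, out_features]),
    ]
    header, offset = {}, 0
    for name, dtype, shape in entries:
        size = shape[0] * shape[1] * DTYPE_BYTES[dtype]
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + size],
        }
        offset += size
    return header


# Hypothetical layer name and dimensions.
header = split_quant_header("layers.0.attn.q_proj", 4096, 4096)
```

Because every entry uses an existing dtype, a file built this way needs no changes to the format; the meaning of the three tensors lives in the program that loads them.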

cztomsik · 2023-05-31T07:00:36Z

FYI GGML just got the ability to load/export graphs. It's not exactly what was discussed here but it might be usable for inference.
ggerganov/ggml#108

Narsil added 2 commits March 17, 2023 09:45

Adding Q4_{0/1} support.

bd27e75

- Format ggerganov/ggml#27

Adding link to Q4_01 descriptions.

8f2f2b3

Narsil requested review from McPatate and NouamaneTazi March 17, 2023 08:50

McPatate approved these changes Mar 17, 2023

Narsil marked this pull request as draft March 17, 2023 14:55

Narsil mentioned this pull request Apr 5, 2023

Support the new mmap-able ggml format rustformers/llm#93

Closed

Narsil mentioned this pull request May 25, 2023

support more dtypes #256

Closed

philpax mentioned this pull request May 31, 2023

ggml : unified file format ggerganov/ggml#220

Closed

github-actions bot added the Stale label Dec 18, 2023

github-actions bot closed this Dec 24, 2023
