
gguf : add 64-bit support (GGUF v2) #2821

Merged: 10 commits merged on Aug 27, 2023

Conversation

@ggerganov (Owner) commented Aug 26, 2023

Adding 64-bit support as discussed: ggerganov/ggml#302 (comment)

Help with testing is appreciated. This should be backward compatible with v1.

@klosax (Contributor) commented Aug 26, 2023

We should add the types uint64_t, int64_t, and double.
And we should change all lengths, sizes, and counts to uint64_t, not only the tensor dimensions, just to be safe and future-proof.
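
For illustration, here is a minimal C sketch of the kind of change being proposed. The names mirror the style of the gguf/ggml identifiers but are illustrative, not taken verbatim from the headers:

```c
#include <stdint.h>

// Sketch of the value types discussed above. GGUF v1 stops at ARRAY; the three
// 64-bit types at the end are what v2 would add.
enum gguf_value_type_sketch {
    SK_UINT8, SK_INT8,
    SK_UINT16, SK_INT16,
    SK_UINT32, SK_INT32,
    SK_FLOAT32, SK_BOOL,
    SK_STRING, SK_ARRAY,
    SK_UINT64, SK_INT64, SK_FLOAT64, // new in v2
};

// Sketch of the file header: the tensor and key-value counts widen from
// uint32_t (v1) to uint64_t (v2), as do string and array lengths elsewhere.
struct gguf_header_sketch {
    uint32_t magic;     // 'GGUF'
    uint32_t version;   // 2
    uint64_t n_tensors; // uint32_t in v1
    uint64_t n_kv;      // uint32_t in v1
};
```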

@ggerganov (Owner, Author) commented

Need some help with the Python code

In the meantime, I will add v1 backward compatibility to the ggml.c reading code.
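
As a rough sketch of how v1 backward compatibility in the reader could work (the helper below is illustrative, not the actual ggml.c code): counts that are 32-bit on disk in v1 are widened to 64-bit when loaded, so the rest of the loader only ever deals with uint64_t.

```c
#include <stdint.h>
#include <stdio.h>

// Illustrative helper: read a count field that is uint32_t in GGUF v1 and
// uint64_t in GGUF v2. Returns 1 on success, 0 on a short read.
static int read_count(FILE * f, uint32_t file_version, uint64_t * out) {
    if (file_version == 1) {
        uint32_t v32 = 0;
        if (fread(&v32, sizeof(v32), 1, f) != 1) {
            return 0;
        }
        *out = v32; // widen in memory so downstream code is version-agnostic
        return 1;
    }
    return fread(out, sizeof(*out), 1, f) == 1;
}
```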

@ggerganov marked this pull request as ready for review August 26, 2023 19:12
@ggerganov changed the title from "gguf : add 64-bit support" to "gguf : add 64-bit support (GGUF v2)" Aug 26, 2023
@klosax (Contributor) commented Aug 26, 2023

We should change all lengths, sizes, and counts to uint64_t just to be safe and future-proof, not only the tensor dimensions.

@KerfuffleV2 (Collaborator) commented

I tested loading a couple of GGUF v1 models; the backward compatibility seems to work fine.

@ghost commented Aug 26, 2023

Similarly, no issues loading various v1 models.

@klosax (Contributor) left a review comment:

Both versions work well.

@klosax (Contributor) commented Aug 26, 2023

We can actually use quantize to losslessly convert GGUF v1 to v2, if the same quantization format is chosen.

@philpax commented Aug 26, 2023

Looks good. Is the plan to update the metadata values for the lengths etc. before merge?

@ghost commented Aug 26, 2023

We can actually use quantize to losslessly convert GGUF v1 to v2, if the same format is chosen.

@klosax Ah, that's useful. For a 7B q4_0 model, I use ./quantize ~/wizardLM.gguf 2 3

I don't need --allow-requantize or --leave-output-tensor, right?

@klosax (Contributor) commented Aug 26, 2023

I don't need --allow-requantize or --leave-output-tensor, right?

I don't think those parameters are needed. Maybe we should have a new parameter --copy-all-tensors instead, so the quantization format won't matter.
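
For what it's worth, a hypothetical sketch of what such a flag could do inside the quantize loop; --copy-all-tensors does not exist, and all names below are made up for illustration:

```c
#include <stdbool.h>

// Hypothetical parameters for a quantize run; copy_all_tensors would be set by
// the suggested --copy-all-tensors flag.
struct quantize_opts_sketch {
    bool copy_all_tensors;
};

// Decide whether to (re)quantize a tensor or copy its data through unchanged.
// With copy_all_tensors set, quantize degenerates into a pure container
// rewriter (e.g. GGUF v1 -> v2), regardless of the tensor's quantization type.
static bool want_quantize(const struct quantize_opts_sketch * opts,
                          int n_dims, bool type_already_matches) {
    if (opts->copy_all_tensors) {
        return false;
    }
    return n_dims == 2 && !type_already_matches;
}
```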

@KerfuffleV2 (Collaborator) commented

I don't think those parameters are needed.

llama.cpp/llama.cpp, lines 4743 to 4746 at 730d9c6:

```cpp
// quantize only 2D tensors
quantize &= (tensor->n_dims == 2);
quantize &= params->quantize_output_tensor || name != "output.weight";
quantize &= quantized_type != tensor->type;
```

That logic is actually somewhat wrong, because the k-quants code can choose a different type than quantized_type. There is also no check after that point to see whether the k-quants-specific type matches the tensor's current type; it just tries to quantize (or fails if --allow-requantize isn't set).

It will probably work for the non-k-quants types, but I'm pretty sure k-quants won't. (There were also some changes to the decisions k-quants makes for LLaMA 2 70B models, so in that particular case it wouldn't pass through all the tensors even if the other issues were dealt with.)
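
To make the gap concrete, here is a sketch of the kind of check being described: after the k-quants logic picks its per-tensor target type, compare it against the tensor's current type and fall back to a plain copy when they already match. This is illustrative C, not the actual llama.cpp logic, and the enum is a stand-in for ggml's tensor types.

```c
#include <stdbool.h>

// Stand-in for ggml's tensor type enum; only what the sketch needs.
enum tensor_type_sketch { SK_F16, SK_Q4_0, SK_Q4_K, SK_Q6_K };

// Sketch: decide what to do with one tensor after the (possibly k-quants
// specific) target type new_type has been chosen for it.
static bool should_requantize(enum tensor_type_sketch current_type,
                              enum tensor_type_sketch new_type,
                              bool current_is_quantized,
                              bool allow_requantize) {
    if (new_type == current_type) {
        return false; // already in the target format: copy the data through
    }
    if (current_is_quantized && !allow_requantize) {
        return false; // requantizing would need --allow-requantize
    }
    return true;
}
```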

@ghost commented Aug 26, 2023

Thanks. I used quantize with q4_0 on wizardlm and llama2. They load as GGUF v2 and appear to be working. I'll be wary of quantize and k-quants.

@ggerganov (Owner, Author) commented

Thanks everyone for testing. We should merge this - is there anything else we want to try before that?

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* gguf : bump version to 2

* gguf : add support for 64-bit (no backwards comp yet)

* gguf : v1 backwards comp

* gguf.py : bump GGUF version

* gguf.py : uint64_t on all lengths, sizes and counts, enums still uint32_t

* gguf.py : string lengths uint32_t

* gguf : update all counts to 64-bit

* gguf.py : string len uint64_t and n_dims uint32_t

* gguf : fix typo

* llama.cpp : print gguf version

---------

Co-authored-by: klosax <[email protected]>
@pudepiedj (Contributor) commented

I am a long-term enthusiast for whisper.cpp, which I use by default nowadays to transcribe my podcast Unmaking Sense.
I am new to Llama, so I apologise if this isn't useful, but a few comments:

  1. A big thank you for all the work on both these projects, which are exemplary.
  2. This morning I successfully quantized the original Meta AI llama-2-13B-chat download through F16 to q8_0 GGUF, and it runs straight away on a MacBook Pro M2 Max with 32GB RAM. q4_0 also runs, of course, but that isn't new.
  3. It says "Ctrl-C" allows interaction, but mine just aborts when running in the terminal, rather as I would expect. I am obviously missing something.
  4. I am not sufficiently technically competent to offer much by way of coding collaboration, but I'd happily write some user documentation if it would help, coming from someone starting out on this journey who asks daft questions based on impressive levels of ignorance.
  5. Question: is there a way to prevent Llama-2-13B from producing random responses of indeterminate length and almost no relevance? In the screenshots in the repo you seem to have managed to force it to answer the "meaning of life" question repeatedly, but I have no idea how you make it do that or control what kind of content it produces.
  6. Sorry to waste your time if this isn't helpful or breaches GitHub protocols in some way.

@Green-Sky (Collaborator) commented

  1. It says "Ctrl-C" allows interaction, but mine just aborts when running in terminal, rather as I would expect. I am obviously missing something.

Did you press it more than once? It queues a stop and gives you control, and then, if pressed again, exits the program. Try playing with it a bit more :)

  1. Question: is there a way to prevent Llama-2-13B from producing random responses of indeterminate length and almost no relevance? In the screenshots in the repo you seem to have managed to force it to do the "meaning of life" question repeatedly, but I have no idea how you make it do that or control what kind of content it produces.

Did you use the prompt template?

@pudepiedj (Contributor) commented

  1. It says "Ctrl-C" allows interaction, but mine just aborts when running in terminal, rather as I would expect. I am obviously missing something.

Did you press it more than once? It queues a stop and gives you control, and then, if pressed again, exits the program. Try playing with it a bit more :)

It seems that if you use Ctrl-C while the assistant is printing a reply, it behaves as expected and described, but if you press it afterwards, it aborts. Thanks for the hint.

  1. Question: is there a way to prevent Llama-2-13B from producing random responses of indeterminate length and almost no relevance? In the screenshots in the repo you seem to have managed to force it to do the "meaning of life" question repeatedly, but I have no idea how you make it do that or control what kind of content it produces.

Did you use the prompt template?

I hadn't, but now I have. Thank you again. Unfortunately, it seems to lead to a collapse in the quality of the response to the point where it is worthless, so I obviously need to investigate the process more.

@KerfuffleV2 (Collaborator) commented

If you need to follow up, I'd suggest opening an issue specifically to discuss your problem; this is a pull request, and it doesn't seem directly related.
