
GGUF #3695
Merged · oobabooga merged 8 commits into main from gguf2 on Aug 27, 2023

Conversation

oobabooga (Owner) commented Aug 26, 2023

Updates llama-cpp-python and deprecates GGML in favor of the new GGUF format.

The conversion from old ggml to gguf through convert-llama-ggmlv3-to-gguf.py is not automatic (some command-line flags have to be set manually), so this will be quite messy.

Adds GGUF support while keeping GGML support thanks to @jllllll.

jllllll (Contributor) commented Aug 26, 2023

The conversion script is, unfortunately, not guaranteed to work with every model.
One such model was just discovered in the Discord server: MythoMax-L2-13B
(Update: the issue with models like that one has since been fixed.)

I have built renamed packages of llama-cpp-python 0.1.78 and written code to load GGML models through them: jllllll@4a999e3
This solution is pretty messy, so I wouldn't blame you if you didn't want to use it.
If you do, I can make a PR for it.
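
For illustration only, a minimal sketch of what that extension-based dispatch could look like. The `llama_cpp_ggml` import name is a placeholder for the renamed 0.1.78 package; the actual name in the linked commit may differ.

```python
# Hypothetical sketch: pick a backend based on the model file extension.
# "llama_cpp_ggml" is a placeholder for the renamed llama-cpp-python 0.1.78
# package; the real import name in jllllll@4a999e3 may be different.
from pathlib import Path

def load_llama(model_path: str, **kwargs):
    if Path(model_path).suffix.lower() == ".gguf":
        from llama_cpp import Llama          # current llama-cpp-python (GGUF)
    else:
        from llama_cpp_ggml import Llama     # renamed 0.1.78 build (GGML) -- placeholder name
    return Llama(model_path=model_path, **kwargs)

# model = load_llama("models/mythomax-l2-13b.ggmlv3.q4_K_M.bin", n_ctx=4096)
```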

ctransformers can still load GGML models, but it is slower than llama-cpp-python and has other issues like not being able to unload models. This can serve as an alternative solution, but is far from ideal.

Not sure what TheBloke's plans are as far as converting previous models to GGUF. His latest uploads have been in both GGML and GGUF, so it seems that he still intends to support GGML.

oobabooga (Owner, Author) commented Aug 26, 2023

That's a valid solution. Honestly, the situation is not good at the moment:

  1. There are no GGUF versions of base LLaMA available on Hugging Face: https://huggingface.co/models?search=llama%20gguf.
  2. The llama.cpp README and documentation do not contain any instructions on how to convert existing GGMLs to GGUF.
  3. The --help of convert-llama-ggmlv3-to-gguf.py claims that --eps should be set to 1e-5 for LLaMA-2, when it has been demonstrated that 5e-6 leads to lower perplexities.
  4. It is not clear to me how the --eps hardcoded into a GGUF file gets used, or whether it can still be changed manually.
  5. The same applies to --context-length.
  6. No idea what --model-metadata-dir, --vocab-dir, and --vocabtype do.

Ideally, I would like to simply convert the 20 GGML models that I have to GGUF and move on, but that may not be possible.

jllllll (Contributor) commented Aug 26, 2023

It seems that n_gqa and rms_norm_eps were removed from llama-cpp-python. So, it is probably safe to assume that they can't be changed and the hardcoded values must be used. I have added some code to account for this in the PR I linked below.

--model-metadata-dir is intended to point to the original HF model dir.
GGUF supports additional features like special tokens. This info needs to be retrieved from the HF model for a full conversion since GGML doesn't have it. That option can be omitted, but a full conversion would be better.
I haven't tried it, but it may be possible to download the HF model with --text-only and not need the .bin files.

--vocab-dir and --vocabtype are used in combination with --model-metadata-dir.
--vocab-dir points to the directory containing tokenizer.model and is used if the tokenizer isn't in the HF model dir.
--vocabtype is set to the type of the tokenizer being loaded (or being created, I'm not sure). It can be either spm or bpe, with spm being the default. Not sure what the difference between the two is or which models use each. Presumably, spm is what LLaMA uses.
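
Putting those flags together, a rough sketch of batch-converting the local GGML models mentioned above might look like the following. This is not the project's code: the paths and the --input/--output flag names are assumptions (check the script's --help for the exact names), and the --eps and --context-length values are placeholders.

```python
# Hedged sketch: drive convert-llama-ggmlv3-to-gguf.py from Python to
# batch-convert local GGML models. Paths and the --input/--output flag
# names are assumptions; verify against the script's --help.
import subprocess
from pathlib import Path

LLAMA_CPP = Path("llama.cpp")   # assumed location of the llama.cpp checkout
MODELS = Path("models")         # assumed location of the GGML .bin files

for ggml in sorted(MODELS.glob("*ggml*.bin")):
    gguf = ggml.with_name(ggml.stem + ".gguf")
    cmd = [
        "python", str(LLAMA_CPP / "convert-llama-ggmlv3-to-gguf.py"),
        "--input", str(ggml),           # assumed flag name
        "--output", str(gguf),          # assumed flag name
        "--eps", "5e-6",                # value debated above; the script's --help suggests 1e-5 for LLaMA-2
        "--context-length", "4096",     # placeholder; use the model's real training context
        # For a "full" conversion with special-token metadata, also point at the
        # original HF model dir (and tokenizer dir, if separate):
        # "--model-metadata-dir", "path/to/original-hf-model",
        # "--vocab-dir", "path/to/tokenizer-dir",
        # "--vocabtype", "spm",
    ]
    subprocess.run(cmd, check=True)
```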


I have made a PR with the commit linked before to be merged into the gguf2 branch here: #3697

berkut1 (Contributor) commented Aug 27, 2023

Yes, when llama.cpp merged GGUF they removed n_gqa and rms_norm_eps. Maybe some of the information we currently add to config.yaml manually can now be read from the GGUF metadata (?)

There are more GGUF models now: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending&search=gguf

oobabooga (Owner, Author) commented

> So, it is probably safe to assume that they can't be changed and the hardcoded values must be used.

Yes, I have tested this and confirmed that the hardcoded values are used no matter what value you supply.

I think that your idea of moving llama-cpp-python==0.1.78 into a different namespace and using it for GGML models is a great one that will prevent a chaotic situation. Eventually I want to move to GGUF exclusively, but I speculate that will only be reasonable in 2-3 months.

@berkut1 thank you. The next step will be to parse the GGUF metadata and use that information in the UI (I haven't found how to read a GGUF in Python yet).
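
On reading a GGUF in Python: the header and the key/value metadata section can be parsed with nothing but the standard library. The sketch below is not the webui's or llama.cpp's code; it assumes the GGUF v2 little-endian layout (magic "GGUF", 32-bit version, 64-bit tensor and KV counts) and may need adjusting for other versions.

```python
# Hedged sketch: parse GGUF metadata with only the standard library.
# Assumes the GGUF v2 little-endian on-disk layout.
import struct
import sys

# scalar value types -> (struct format, size in bytes)
_SCALARS = {
    0: ("<B", 1),   # UINT8
    1: ("<b", 1),   # INT8
    2: ("<H", 2),   # UINT16
    3: ("<h", 2),   # INT16
    4: ("<I", 4),   # UINT32
    5: ("<i", 4),   # INT32
    6: ("<f", 4),   # FLOAT32
    7: ("<?", 1),   # BOOL
    10: ("<Q", 8),  # UINT64
    11: ("<q", 8),  # INT64
    12: ("<d", 8),  # FLOAT64
}

def _read(f, fmt, size):
    return struct.unpack(fmt, f.read(size))[0]

def _read_string(f):
    length = _read(f, "<Q", 8)           # v2 strings use 64-bit lengths
    return f.read(length).decode("utf-8", errors="replace")

def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, *_SCALARS[vtype])
    if vtype == 8:                       # STRING
        return _read_string(f)
    if vtype == 9:                       # ARRAY: element type, count, elements
        elem_type = _read(f, "<I", 4)
        count = _read(f, "<Q", 8)
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")

def read_gguf_metadata(path):
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _read(f, "<I", 4)
        _read(f, "<Q", 8)                # tensor count (not needed for metadata)
        kv_count = _read(f, "<Q", 8)
        meta = {}
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _read(f, "<I", 4)
            meta[key] = _read_value(f, vtype)
        return version, meta

if __name__ == "__main__":
    version, meta = read_gguf_metadata(sys.argv[1])
    print("GGUF version:", version)
    for key, value in meta.items():
        shown = f"[{len(value)} items]" if isinstance(value, list) else value
        print(key, "=", shown)
```

Keys such as llama.context_length and llama.attention.layer_norm_rms_epsilon, if present, are where the hardcoded context length and eps discussed above end up.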

oobabooga merged commit e6eda5c into main on Aug 27, 2023
berkut1 (Contributor) commented Aug 27, 2023

@oobabooga looks like llama-cpp-python just added GGUF support.
There are places where deprecated lines still exist, for example: https://github.com/abetlen/llama-cpp-python/blob/9ab49bc1d47a3da011107d0277184ab1948f1364/llama_cpp/llama.py#L229

Maybe as a temporary solution, we can just read the console output lines that start with llm_load_print_meta.
I searched llama-cpp-python but didn't find a way to get this information via the API.

For example, llama.cpp exposes some methods: https://github.com/ggerganov/llama.cpp/blob/789c8c945a2814e1487e18e68823d9926e3b1454/ggml.h#L1856
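
As a rough illustration of that workaround (not the webui's code): llama.cpp prints the llm_load_print_meta lines to stderr from native code, so they can be captured by temporarily redirecting file descriptor 2 around the model load. Llama(model_path=..., verbose=True) is standard llama-cpp-python; the rest of the helper below is an assumption.

```python
# Hedged sketch of the "read the console output" workaround: capture the
# native stderr stream around model load and keep the llm_load_print_meta lines.
import os
import tempfile
from llama_cpp import Llama   # assumes a GGUF-capable llama-cpp-python build

def load_and_capture_meta(model_path, **kwargs):
    with tempfile.TemporaryFile() as tmp:
        saved_fd = os.dup(2)                # keep the real stderr
        try:
            os.dup2(tmp.fileno(), 2)        # route native stderr into tmp
            model = Llama(model_path=model_path, verbose=True, **kwargs)
        finally:
            os.dup2(saved_fd, 2)            # restore stderr
            os.close(saved_fd)
        tmp.seek(0)
        log = tmp.read().decode("utf-8", errors="replace")
    meta_lines = [l for l in log.splitlines() if l.startswith("llm_load_print_meta")]
    return model, meta_lines

# model, meta = load_and_capture_meta("models/llama-2-13b.Q4_K_M.gguf")
# print("\n".join(meta))
```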

oobabooga deleted the gguf2 branch on August 30, 2023