Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add llama 3.1 rope scaling factors to llama conversion and inference #8676

Merged
merged 6 commits into from
Jul 27, 2024

Conversation

jmorganca
Copy link
Contributor

Hi all, this commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the ggml_rope_ext rope operation.

From our testing, this really improves results for context windows above 8192 for Llama 3.1.

This should fix #8650

@github-actions github-actions bot added the python python script changes label Jul 24, 2024
convert_hf_to_gguf.py Outdated Show resolved Hide resolved
@@ -1514,6 +1514,35 @@ def set_gguf_parameters(self):
if self.hparams.get("vocab_size", 32000) == 49152:
self.gguf_writer.add_add_bos_token(False)

if rope_scaling := self.find_hparam(["rope_scaling"], optional=True):
if rope_scaling.get("rope_type", '').lower() == "llama3":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best if you merge this into the rope scaling above (line 1501), so that all the rope operations are in one place.

@jmorganca jmorganca marked this pull request as draft July 24, 2024 20:32
@jmorganca
Copy link
Contributor Author

Marking this as draft as there may also be a change required to the rope kernels – investigating!

@github-actions github-actions bot added the Nvidia GPU Issues specific to Nvidia GPUs label Jul 24, 2024
@jxy
Copy link
Contributor

jxy commented Jul 24, 2024

Probably it's not a good idea to change the kernels. I just tested creating the factors instead as in my comment above. Here's the changes on top of your PR,

diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index 4422948f..897a6e33 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -1518,7 +1518,7 @@ class LlamaModel(Model):
             if rope_scaling.get("rope_type", '').lower() == "llama3":
                 base = hparams.get("rope_theta", 10000.0)
                 dim = int((hparams["hidden_size"] // hparams["num_attention_heads"]) * hparams.get("partial_rotary_embeddings", 1.0))
-                inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+                freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
 
                 factor = rope_scaling.get("factor", 8.0)
                 low_freq_factor = rope_scaling.get("low_freq_factor", 1.0)
@@ -1528,20 +1528,20 @@ class LlamaModel(Model):
                 low_freq_wavelen = old_context_len / low_freq_factor
                 high_freq_wavelen = old_context_len / high_freq_factor
 
-                rope_freqs = []
-                for freq in inv_freq:
+                rope_factors = []
+                for freq in freqs:
                     wavelen = 2 * math.pi / freq
                     if wavelen < high_freq_wavelen:
-                        rope_freqs.append(freq)
+                        rope_factors.append(1)
                     elif wavelen > low_freq_wavelen:
-                        rope_freqs.append(freq / factor)
+                        rope_factors.append(factor)
                     else:
                         assert low_freq_wavelen != high_freq_wavelen
                         smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
-                        rope_freqs.append((1 - smooth) * freq / factor + smooth * freq)
+                        rope_factors.append(1 / ((1 - smooth) / factor + smooth))
 
                 self.gguf_writer.add_rope_scaling_attn_factors(1.0)
-                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", 1.0 / np.array(rope_freqs))
+                self.gguf_writer.add_tensor(gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.ROPE_FREQS] + ".weight", np.array(rope_factors, dtype = np.float32))
 
     @staticmethod
     def permute(weights: Tensor, n_head: int, n_head_kv: int | None):

A quick test shows perplexity is slightly better than master branch ([1]4.6850,[2]5.9231),

perplexity: calculating perplexity over 17 chunks, n_ctx=16384, batch_size=2048, n_seq=1
perplexity: 191.29 seconds per pass - ETA 54.18 minutes
[1]4.3272,[2]5.5469,^C

@jmorganca jmorganca marked this pull request as ready for review July 24, 2024 22:30
else:
assert low_freq_wavelen != high_freq_wavelen
smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
rope_factors.append(1 / (1 - smooth) * factor + smooth)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want the invert of https://github.com/meta-llama/llama-models/blob/1b5892739868e5333fb7f022ba91218f0ae5f9c2/models/llama3_1/api/model.py#L62 factoring out the freq, which should be 1/((1 - smooth) / scale_factor + smooth)

@jxy
Copy link
Contributor

jxy commented Jul 24, 2024

It should be

1 / ((1 - smooth) / factor + smooth)

@jmorganca
Copy link
Contributor Author

Awesome, thanks @jxy. Should be updated now. Agreed much better without the kernel changes.

convert_hf_to_gguf.py Outdated Show resolved Hide resolved
@kallewoof
Copy link
Contributor

Nice. Tested doing a bunch of summaries using up the entire 128k context and the output looks good whereas on master it outputs broken garbage.

@MoonRide303
Copy link

MoonRide303 commented Jul 25, 2024

Initial tests show improvement over master (model able to reason within bigger context - I tested up to -c 32768), but I've noticed one potential problem - GGUF made with this PR has n_tensors = 292 instead of 291 on master (extra rope_freqs.weight). It might cause errors in some apps checking it (like koboldcpp):

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 4208, in <module>
    main(parser.parse_args(),start_server=True)
  File "koboldcpp.py", line 3883, in main
    loadok = load_model(modelname)
  File "koboldcpp.py", line 773, in load_model
    ret = handle.load_model(inputs)
OSError: exception: access violation reading 0x0000000000001894
[21452] Failed to execute script 'koboldcpp' due to unhandled exception!

FYI @LostRuins

@tristandruyen
Copy link
Contributor

tristandruyen commented Jul 25, 2024

Initial tests show improvement over master (model able to reason within bigger context - I tested up to -c 32768), but I've noticed one potential problem - GGUF made with this PR has n_tensors = 292 instead of 291 on master (extra rope_freqs.weight). It might cause errors in some apps checking it (like koboldcpp):

Isn't this just a side-effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyways to take advantage of these changes ?

@Nexesenex
Copy link
Contributor

Nexesenex commented Jul 25, 2024

@MoonRide303 : I merged this PR in Kobold.cpp Frankenstein this morning (v171013), and it works (sensical inference) without problem beyond 8k context now.

@MoonRide303
Copy link

MoonRide303 commented Jul 25, 2024

Initial tests show improvement over master (model able to reason within bigger context - I tested up to -c 32768), but I've noticed one potential problem - GGUF made with this PR has n_tensors = 292 instead of 291 on master (extra rope_freqs.weight). It might cause errors in some apps checking it (like koboldcpp):

Isn't this just a side-effect of kobold.cpp using a different llama.cpp version internally, which would need to be updated anyways to take advantage of these changes ?

I tested it on koboldcpp 1.71 (released today, having merged upstream changes yesterday) - but you might be right some changes from original llama.cpp are missing (or some extra changes might be problematic). I am not sure if it's something to worry about on the llama.cpp project side, just pointing it out as potential side effect - for some people new GGUFs might not work.

@LostRuins
Copy link
Collaborator

I wasn't aware that this PR breaks previously quantized Llama 3.1 models, I thought that all prior models would continue to work.
Shouldn't the tensor count always match what's declared in the metadata regardless?

Anyway, I'll do a patch release once this is merged into master.

@MoonRide303
Copy link

MoonRide303 commented Jul 25, 2024

I wasn't aware that this PR breaks previously quantized Llama 3.1 models, I thought that all prior models would continue to work. Shouldn't the tensor count always match what's declared in the metadata regardless?

Anyway, I'll do a patch release once this is merged into master.

Old GGUFs will continue to work, it's just newly converted models (with this PR changes) that might not work with older llama.cpp-based apps (like koboldcpp 1.71).

@LostRuins
Copy link
Collaborator

Yeah but that's my point - it's failing to load because the tensor counts don't match the metadata, which it should in both cases, unless I misunderstand the error.

@kallewoof
Copy link
Contributor

Merging with koboldcpp concedo_experimental works fine on my end, FWIW.

@schmorp
Copy link

schmorp commented Jul 25, 2024

Why couldn't this tensor be added by llama.cpp when loading? superficially it doesn't make much sense to bake the rope config into the model at conversion time, and prevents bugfixes at a later time.

rope_factors.append(1 / ((1 - smooth) / factor + smooth))

self.gguf_writer.add_rope_scaling_attn_factors(1.0)
self.gguf_writer.add_tensor(self.format_tensor_name(gguf.MODEL_TENSOR.ROPE_FREQS), np.array(rope_factors, dtype=np.float32))
Copy link
Collaborator

@compilade compilade Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be included in --vocab-only models because it's in set_gguf_parameters, while the tensor data is not included in vocab-only files.

But the tensor count is still in the header of GGUF models, so I expect vocab-only files of Llama-3.1 to be malformed because of this.

This tensor could likely instead be "inserted" in the override of self.prepare_tensors to add it before the others and then call super().prepare_tensors(). If it's cleaner, it's possible to keep inside set_gguf_parameters the code which builds the tensor content, as long as the tensor is only "added" in prepare_tensors.

@jmorganca
Copy link
Contributor Author

jmorganca commented Jul 25, 2024

Thanks! Given the hard check on the tensor count at load time, it seems we might need to instead add the rope parameters as metadata and calculate it at runtime. Definitely did not want to break backwards compatibility with old GGUF files with this change sorry!

@ggerganov
Copy link
Owner

Thank you for looking into this. Btw even if it is in the metadata and the rope freqs are computed at runtime, it will still not be backwards compatible because the old GGUF files will not have the metadata

@gilbertgong
Copy link

gilbertgong commented Jul 25, 2024

Storing the information as metadata and computing at runtime seems preferable to me, as I think it seems more correct to have the tensor count match that of loading the model in HF transformers, for example (and just reduce potential confusion and people making incorrect assumptions based on mismatched tensor count). One could bake in defaults for when the metadata is missing from the gguf - that would make it backwards compatible, if desired.

@3Simplex
Copy link

Thank you for your effort on this, we are all waiting patiently for the rope to be finished! We have some great functionality in GPT4All with this new model!

image

@qnixsynapse
Copy link
Contributor

A program crash will queue the user to investigate what's wrong. Degraded performance just might make the user think the software/model is of low quality.

Exactly! I see no reason to make it backward compatible.

@LostRuins
Copy link
Collaborator

LostRuins commented Jul 26, 2024

Crashing outright is bad, a better result would be a warning printed to stdout. It's always better to fail gracefully when possible.

sort of like how we did this:

llama.cpp/src/llama.cpp

Lines 5344 to 5352 in 01245f5

if (tokenizer_pre.empty()) {
LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
LLAMA_LOG_WARN("%s: \n", __func__);
LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
LLAMA_LOG_WARN("%s: GENERATION QUALITY WILL BE DEGRADED! \n", __func__);
LLAMA_LOG_WARN("%s: CONSIDER REGENERATING THE MODEL \n", __func__);
LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
LLAMA_LOG_WARN("%s: \n", __func__);
vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;

@slaren
Copy link
Collaborator

slaren commented Jul 26, 2024

It doesn't crash, it causes llama_load_model_from_file to return an error.

elif wavelen > low_freq_wavelen:
rope_factors.append(factor)
else:
assert low_freq_wavelen != high_freq_wavelen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert if they're even or just not apply it at all and return gracefully?

Copy link
Collaborator

@compilade compilade Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this assertion should be outside of the loop. These values don't change within the loop. If it's gracefully handled, it should at least print a warning with something like logger.warning("rope freq high and low wavelengths can't be equal") (or something more informative).

(EDIT: woah, answering a comment while making a review is weird and seems to duplicate the message, with both instances having the same permalink)

@gilbertgong
Copy link

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

@oldgithubman
Copy link

oldgithubman commented Jul 26, 2024

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

We need more of this mentality around here. There's usually a way

@m18coppola
Copy link
Contributor

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

If so, I think there should be a warning similar to what @LostRuins mentioned.

convert_hf_to_gguf.py Outdated Show resolved Hide resolved
elif wavelen > low_freq_wavelen:
rope_factors.append(factor)
else:
assert low_freq_wavelen != high_freq_wavelen
Copy link
Collaborator

@compilade compilade Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this assertion should be outside of the loop. These values don't change within the loop. If it's gracefully handled, it should at least print a warning with something like logger.warning("rope freq high and low wavelengths can't be equal") (or something more informative).

(EDIT: woah, answering a comment while making a review is weird and seems to duplicate the message, with both instances having the same permalink)

@compilade
Copy link
Collaborator

Another way to make it backwards compatible if desired is to just use the old rope calculation if the metadata/tensor is missing.

That's what the code already seems to do. It only uses the rope freqs when present.

If so, I think there should be a warning

I'm not sure there can be such a warning, because older models like llama-2 still use the same model architecture (LLM_ARCH_LLAMA, so internally they can't really be distinguished), yet they don't require to use these modified rope freqs.

jmorganca and others added 4 commits July 26, 2024 15:01
This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192
@ddh0
Copy link
Contributor

ddh0 commented Jul 27, 2024

Just cloned branch jmorganca:masteras of commit 90fd87df4155aef5f099812a99c1e06c0b588c0d and used it to convert llama 3.1 8B instruct model and quantize to q8_0. Everything is working perfectly as far as I can tell. The only thing that I see potentially wrong is the pre-tokenizer is reported as smaug-bpe. Not sure if that's affecting anything or not.

src/llama.cpp Outdated Show resolved Hide resolved
convert_hf_to_gguf.py Outdated Show resolved Hide resolved
@compilade
Copy link
Collaborator

compilade commented Jul 27, 2024

The only thing that I see potentially wrong is the pre-tokenizer is reported as smaug-bpe. Not sure if that's affecting anything or not.

@ddh0 When I convert the latest Llama-3.1-8B-Instruct from https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct (at HF commit 07eb05b21) with the changes from this PR (at 90fd87d), the pre-tokenizer is detected as llama-bpe, so I guess the pre-tokenizer differences have been fixed upstream.

$ sha256sum Meta-Llama-3.1-8B-Instruct/tokenizer{,_config}.json 
79e3e522635f3171300913bb421464a87de6222182a0570b9b2ccba2a964b2b4  Meta-Llama-3.1-8B-Instruct/tokenizer.json
24e8a6dc2547164b7002e3125f10b415105644fcf02bf9ad8b674c87b1eaaed6  Meta-Llama-3.1-8B-Instruct/tokenizer_config.json

@ddh0
Copy link
Contributor

ddh0 commented Jul 27, 2024

Thanks @compilade, you were right. After pulling the latest repo the tokenizer is detected as llama-bpe. Thanks

jmorganca and others added 2 commits July 27, 2024 00:39
@ggerganov ggerganov merged commit b5e9546 into ggerganov:master Jul 27, 2024
55 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* Add llama 3.1 rope scaling factors to llama conversion and inference

This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>

* address comments

* address comments

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: compilade <[email protected]>
manyoso pushed a commit to nomic-ai/llama.cpp that referenced this pull request Jul 27, 2024
* Add llama 3.1 rope scaling factors to llama conversion and inference

This commit generates the rope factors on conversion and adds them to the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope oepration, improving results for context windows above 8192

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>

* address comments

* address comments

* Update src/llama.cpp

Co-authored-by: compilade <[email protected]>

* Update convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: compilade <[email protected]>
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Jul 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Nvidia GPU Issues specific to Nvidia GPUs python python script changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Proper Llama 3.1 Support in llama.cpp