-
I made #1967 if anyone wants to try playing with this. edit: Hide evidence of my shame... Current version of that pull now reproduces GG's perplexity results at 4,096 context with scale 0.5.
-
I will prioritize VRAM optimizations; that's already useful on its own, but if the context can be extended, the extra VRAM will be especially valuable. My top llama.cpp priorities will be to try to do a dequantize + matrix multiplication kernel and to look into whether the KV cache can be quantized. I think patching the CUDA implementation of RoPE won't be too difficult; right now, though, the CUDA code does not support LoRAs. If this technique produces good results, we should also think about how to specify RoPE scaling. Since finetuning with the same scaling seems to be important, I think the ideal solution would be to specify the correct scaling in the model file. Still, if we want to support the user setting an arbitrary scaling at runtime, we will also need a CLI argument that can override whatever the model file says.
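As a rough sketch of that precedence (hypothetical names, not actual llama.cpp options or API): a CLI value, when set, would win over whatever scale the model file records.

```c
// Hypothetical resolution of the effective RoPE scale: the model file records the
// scale the model was fine-tuned with, and an optional CLI flag can override it.
#include <stdio.h>

struct rope_scale_cfg {
    float model_file_scale;  // scale stored in the model file (1.0f = no scaling)
    float cli_scale;         // value from a CLI flag, or 0.0f if the flag was not given
};

static float effective_rope_scale(const struct rope_scale_cfg * cfg) {
    return cfg->cli_scale > 0.0f ? cfg->cli_scale : cfg->model_file_scale;
}

int main(void) {
    struct rope_scale_cfg cfg = { 0.5f, 0.0f };
    printf("scale = %.2f\n", effective_rope_scale(&cfg));  // 0.50, taken from the model file
    cfg.cli_scale = 0.25f;
    printf("scale = %.2f\n", effective_rope_scale(&cfg));  // 0.25, the CLI override wins
    return 0;
}
```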
-
This is amazing! If the context is getting so long now, I have some concerns about the KV cache size. We already know that there is no difference in perplexity when it is stored in F16 as opposed to F32, but has anyone tested quantizing it? Even just Q8_0 would cut it down a lot.
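For a rough sense of scale, here is a back-of-the-envelope sketch assuming LLaMA-13B (n_layer = 40, n_embd = 5120) at an 8192-token context, and ggml's Q8_0 block layout of 32 int8 values plus one fp16 scale per block:

```c
// KV cache size estimate for LLaMA-13B at an 8192-token context.
// n_elem counts both K and V across all layers; Q8_0 costs 34 bytes per 32 values.
#include <stdio.h>

int main(void) {
    const long long n_layer = 40, n_embd = 5120, n_ctx = 8192;
    const long long n_elem  = 2 * n_layer * n_ctx * n_embd;   // K and V

    const double gib = 1024.0 * 1024.0 * 1024.0;
    printf("F32 : %.2f GiB\n", n_elem * 4.0          / gib);  // ~12.5 GiB
    printf("F16 : %.2f GiB\n", n_elem * 2.0          / gib);  // ~6.25 GiB
    printf("Q8_0: %.2f GiB\n", n_elem * (34.0 / 32.0) / gib); // ~3.3 GiB
    return 0;
}
```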
-
edit: I'm just going to delete this, since weird stuff might have been going on with CUDA BLAS. It doesn't seem like the cuBLAS RoPE operations actually check that the arguments are the right type/length the way the normal CPU ops do, so it might have been trying to use the new args format as if it were the old one.
-
I have added a version of the same dataset with no scaling, so you can compare the difference in ppl between the two versions and see what the effect is when finetuning with the scaling patch. This version is also trained with a 4096 cutoff, but it will quickly deteriorate past ~2400, even with the scaling applied during inference: https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test/tree/main/no_scaling
-
I made a PR that fixes LoRAs and CUDA acceleration not being usable at the same time: #1970. I then also merged in the PR by @KerfuffleV2 (#1967), which lets you set the RoPE scaling at compile time, and pushed the branch here.
-
I tested this as well, using my own GPTQ/CUDA-based implementation. Here are some preliminary results. All tests are on the same set of 40 8k-token sequences, truncated at different lengths along the x axis.

Base (red) is plain Llama-13B, 4-bit GPTQ. As is evident, average perplexity goes off the chart as soon as seq_len exceeds the base model's pretraining.

The previous experiment (yellow) is something I tried out a few months back, finetuning on 6k-token examples just to see what would happen. I was able to overcome the limit at 2048 tokens, but only barely, and perplexity still starts to climb after that point, if slower than before. It should be continuing to drop. My conclusion at the time was that (much) more tuning was needed for this approach to work.

SuperHOT (blue) is the interesting part. Positional embeddings are condensed by a factor of four, stretching the original 2048 positions evenly across the 8192 spaces. The SuperHOT LoRA is applied to give the model a chance with the new positional embeddings. The results are very encouraging, I'd say. The model is clearly taking advantage of the longer context provided, making better and better predictions the more context it has to work with. That doesn't mean it's better enough, of course. I think more testing is going to be needed, but I'm cautiously optimistic and anxious to try the 33B version soon.

For completeness, the command line to reproduce the results with ExLlama:
-
I did some testing here with @kaiokendev's 16k LoRA. It looks like PPL is lower on the 16K model even at 2K context, and it does better at higher contexts (16K) too. We might find benefit in scaling this further, but we might also reach a point where it starts hurting more than helping.
-
From Meta: https://arxiv.org/abs/2306.15595
-
Without finetuning, I find that you can get a little bit more extension without degradation by not doing a simple scaling... if you do a gradually increasing scaling instead: https://colab.research.google.com/drive/18Ou_Isi1HiqtWqkbKfBZ46Q2hHES3Jp8?authuser=2
-
Haven't read the ^ paper yet, but I found that this method works best with larger models, and there's probably a relationship between how well this method works and how many parameters you have.
-
That's a good idea. Maybe the most flexible way to handle this extended RoPE operation is to allow passing in a cached scale like that. It can just use the last scale item when the context size exceeds the scale length (that probably has the best chance of being reasonable). Even for absurd context lengths, this would only use a pretty small amount of memory (256 KiB for a context of 65,535, assuming 32-bit float scale values). This approach would also handle the default scale gracefully (1 scale item of 1.0).
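A minimal sketch of that lookup with clamping to the last entry (illustrative only, not ggml's actual API; the four-entry schedule is just an example):

```c
// Hypothetical per-position RoPE scale lookup: a small cached array of float
// scales accompanies the op, and positions past the end of the array reuse the
// last entry. A single-entry array {1.0f} reproduces the default behaviour.
#include <stdio.h>

static float rope_scale_at(const float * scales, int n_scales, int pos) {
    const int idx = pos < n_scales ? pos : n_scales - 1;   // clamp to the last entry
    return scales[idx];
}

int main(void) {
    const float scales[] = { 1.0f, 1.0f, 0.75f, 0.5f };     // example schedule
    for (int pos = 0; pos < 6; ++pos) {
        printf("pos %d -> scale %.2f\n", pos, rope_scale_at(scales, 4, pos));
    }
    return 0;
}
```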
-
I think people are missing the point of this a little bit. The idea is to retrain the model so that it works naturally at a different scale, not to make it tolerate multiple different scales while still preferring 1:1. Working at multiple scales is more demanding of the model than simply finetuning it on one new scaling factor.
-
A new method of interpolation has been proposed here. From what I could see, it indeed gives coherent output even without fine-tuning. The change is to replace `const float theta_scale = powf(10000.0, -2.0f/n_dims);` with `const float theta_scale = powf(10000.0 * powf(8.0, n_dims / (n_dims - 2.0)), -2.0f/n_dims);`.
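That line sets the per-dimension frequency decay used by RoPE. A hedged sketch of what the change amounts to in isolation (n_dims is the rotary/head dimension; names are illustrative, and the surrounding code in ggml.c may differ):

```c
// NTK-aware scaling: instead of compressing positions, raise the RoPE frequency
// base so that low-frequency dimensions get stretched while high-frequency ones
// stay largely intact. alpha = 8.0 targets roughly a 4x context extension.
#include <math.h>
#include <stdio.h>

static float ntk_theta_scale(int n_dims, float alpha) {
    const float base = 10000.0f * powf(alpha, n_dims / (n_dims - 2.0f));
    return powf(base, -2.0f / n_dims);   // replaces powf(10000.0, -2.0f/n_dims)
}

int main(void) {
    const int n_dims = 128;              // head dimension of the LLaMA models
    printf("plain theta_scale: %f\n", powf(10000.0f, -2.0f / n_dims));
    printf("NTK   theta_scale: %f\n", ntk_theta_scale(n_dims, 8.0f));
    return 0;
}
```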
-
Try #2054. I don't have a beefy enough system at $home to run any numbers, but a vanilla Vicuna 13B v1.3.0 q8_0 works alright with
-
Any news on this front?
-
Well, this just landed: https://arxiv.org/abs/2307.02486
I cannot see any source code in the unilm repository that the paper links to, so this is more of a heads-up about what is hopefully coming our way soon.
-
The advantage of RoPE scaling is that it's something that can work with existing models (possibly after some fine-tuning). Stuff like the linked paper is almost certainly going to require training completely new models from scratch, so even if it worked perfectly and 100% of the information was available, it still wouldn't really be "soon".
-
Just to confirm the current status of RoPE:
-
Yes. Anything that can extend the context size to 500K - 1M tokens would be great, as Gradient did with Mistral.
-
Quick question. I assume the when loading a model pretrained with a target context of 32k. I tested with a simple
-
Intro
This is a discussion about a recently proposed strategy of extending the context size of LLaMA models.
Make sure to first get familiar with the info in the links above, as there have already been ongoing discussions and results.
So far the discussion seems to focus on the coherency of the generated text when using large contexts. I think what we can do here in llama.cpp to support these investigations is to provide a more objective way of evaluating the proposed method by computing the perplexity at different context sizes, with and without fine-tuning. Very initial results already suggest that this idea might be viable, but we should carefully check that we are doing the computations correctly.

Preliminary tests with LLaMA 7B
Applied the following simple patch as proposed by Reddit user pseudonerv in this comment:
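A minimal sketch of the idea (not the verbatim patch; the structure follows a generic ggml-style RoPE loop and the exact lines in ggml.c may differ): the absolute token position is multiplied by a fixed factor before the rotation angles are computed.

```c
// Sketch of RoPE with position interpolation: the absolute token position p is
// scaled by 0.5 before the rotation angles are computed, so positions 0..4095
// land in the 0..2048 range the model was trained on. Illustrative only.
#include <math.h>
#include <stdio.h>

static void rope_scaled(float * x, int n_dims, int p, float rope_scale) {
    const float theta_scale = powf(10000.0f, -2.0f / n_dims);
    float theta = rope_scale * (float) p;          // <-- the whole "patch"
    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        const float c = cosf(theta), s = sinf(theta);
        const float x0 = x[i0], x1 = x[i0 + 1];
        x[i0]     = x0 * c - x1 * s;               // rotate the (i0, i0+1) pair
        x[i0 + 1] = x0 * s + x1 * c;
        theta *= theta_scale;                      // next (lower) frequency
    }
}

int main(void) {
    float v[4] = { 1.0f, 0.0f, 1.0f, 0.0f };
    rope_scaled(v, 4, 3000, 0.5f);                 // position 3000 behaves like 1500
    printf("%f %f %f %f\n", v[0], v[1], v[2], v[3]);
    return 0;
}
```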
This patch "scales" the RoPE position by a factor of
0.5
which should correspond to extending the max context size from 2048 to 4096.Running the following perplexity calculation for 7B LLaMA Q4_0 with context of 4096 yields:
Final result: 5.8945

This is already looking very promising, since without applying the "RoPE scaling" patch the perplexity is extremely bad - it starts off above 110.0, which can be expected since the vanilla computation does not support context sizes beyond 2048.

Additional tests with context size of 2048:
- without scaling (factor 1.0): [163] 5.4708
- with scaling factor 0.5: [163] 6.0642
I'm currently running the computations on the CPU as I have more confidence in the changes being correct, but we should look into updating the GPU code to support the RoPE scaling and doing more calculations to determine how the perplexity behaves for different context sizes.
The author of this idea @kaiokendev suggests that this approach should work even better with fine-tuned models (https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/comment/jp2dchb/?utm_source=share&utm_medium=web2x&context=3), so we should also do some tests with those models.
Result summary (live updates)

All runs use wiki.test.raw.

| Model | Context | RoPE scale | Perplexity |
| --- | --- | --- | --- |
| Q4_0 | 2048 | 1.0 | 5.4708 |
| Q4_0 | 2048 | 0.5 | 6.0642 |
| Q4_0 | 4096 | 1.0 | inf |
| Q4_0 | 4096 | 0.5 | 5.8945 |