Add RoPE scaling support for transformers (including dynamic NTK)
oobabooga committed Aug 9, 2023
1 parent f4caaf3 commit d8fb506
Showing 5 changed files with 16 additions and 9 deletions.
README.md (4 changes: 2 additions & 2 deletions)
@@ -299,12 +299,12 @@ Optionally, you can use the following command-line flags:
| `--rwkv-strategy RWKV_STRATEGY` | RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
| `--rwkv-cuda-on` | RWKV: Compile the CUDA kernel for better performance. |

-#### RoPE (for llama.cpp and ExLlama only)
+#### RoPE (for llama.cpp, ExLlama, and transformers)

| Flag | Description |
|------------------|-------------|
|`--compress_pos_emb COMPRESS_POS_EMB` | Positional embeddings compression factor. Should typically be set to max_seq_len / 2048. |
-|`--alpha_value ALPHA_VALUE` | Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both. |
+|`--alpha_value ALPHA_VALUE` | Positional embeddings alpha factor for NTK RoPE scaling. Use either this or compress_pos_emb, not both. |

#### Gradio

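For readers comparing the two flags above: `--compress_pos_emb` corresponds to linear position interpolation (position indices are divided by the factor), while `--alpha_value` corresponds to NTK-style scaling, which stretches the rotary base instead. A minimal sketch of the difference, assuming a plain RoPE frequency computation; the head dimension, base, and factors below are illustrative and not taken from this commit:

```python
import numpy as np

dim, base = 128, 10000.0                 # typical LLaMA head dimension and rotary base
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
positions = np.arange(4096)

# Linear interpolation (the compress_pos_emb route): squeeze the position indices.
factor = 2
angles_linear = np.outer(positions / factor, inv_freq)

# NTK-style scaling (the alpha_value route): leave positions alone and stretch
# the base using the common NTK-aware adjustment base * alpha ** (dim / (dim - 2)).
alpha = 2
ntk_base = base * alpha ** (dim / (dim - 2))
inv_freq_ntk = 1.0 / (ntk_base ** (np.arange(0, dim, 2) / dim))
angles_ntk = np.outer(positions, inv_freq_ntk)
```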
modules/loaders.py (10 changes: 6 additions & 4 deletions)
@@ -39,8 +39,8 @@
'low_vram',
'mlock',
'llama_cpp_seed',
-'compress_pos_emb',
'alpha_value',
+'compress_pos_emb',
'cpu',
],
'llamacpp_HF': [
@@ -54,8 +54,8 @@
'low_vram',
'mlock',
'llama_cpp_seed',
-'compress_pos_emb',
'alpha_value',
+'compress_pos_emb',
'cpu',
'llamacpp_HF_info',
],
@@ -73,20 +73,22 @@
'quant_type',
'compute_dtype',
'trust_remote_code',
+'alpha_value',
+'compress_pos_emb',
'transformers_info'
],
'ExLlama': [
'gpu_split',
'max_seq_len',
-'compress_pos_emb',
'alpha_value',
+'compress_pos_emb',
'exllama_info',
],
'ExLlama_HF': [
'gpu_split',
'max_seq_len',
-'compress_pos_emb',
'alpha_value',
+'compress_pos_emb',
'exllama_HF_info',
]
}
modules/models.py (7 changes: 6 additions & 1 deletion)
@@ -144,7 +144,7 @@ def huggingface_loader(model_name):
LoaderClass = AutoModelForCausalLM

# Load the model in simple 16-bit mode by default
-if not any([shared.args.cpu, shared.args.load_in_8bit, shared.args.load_in_4bit, shared.args.auto_devices, shared.args.disk, shared.args.deepspeed, shared.args.gpu_memory is not None, shared.args.cpu_memory is not None]):
+if not any([shared.args.cpu, shared.args.load_in_8bit, shared.args.load_in_4bit, shared.args.auto_devices, shared.args.disk, shared.args.deepspeed, shared.args.gpu_memory is not None, shared.args.cpu_memory is not None, shared.args.compress_pos_emb > 1, shared.args.alpha_value > 1]):
model = LoaderClass.from_pretrained(Path(f"{shared.args.model_dir}/{model_name}"), low_cpu_mem_usage=True, torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16, trust_remote_code=shared.args.trust_remote_code)
if torch.backends.mps.is_available():
device = torch.device('mps')
@@ -215,6 +215,11 @@ def huggingface_loader(model_name):
no_split_module_classes=model._no_split_modules
)

+if shared.args.compress_pos_emb > 1:
+    params['rope_scaling'] = {'type': 'linear', 'factor': shared.args.compress_pos_emb}
+elif shared.args.alpha_value > 1:
+    params['rope_scaling'] = {'type': 'dynamic', 'factor': shared.args.alpha_value}

model = LoaderClass.from_pretrained(checkpoint, **params)

return model
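For context on the models.py hunk above: the `rope_scaling` dict is simply passed to `from_pretrained` together with the other loader parameters, and transformers (4.31+) treats it as a config override for LLaMA-family models. A minimal standalone sketch of the same mechanism, outside the web UI's loader code; the model path and factor are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

params = {
    'low_cpu_mem_usage': True,
    'torch_dtype': torch.float16,
    # 'linear' mirrors --compress_pos_emb, 'dynamic' mirrors --alpha_value
    'rope_scaling': {'type': 'dynamic', 'factor': 2.0},
}

# The rope_scaling entry is forwarded to the model config by from_pretrained.
model = AutoModelForCausalLM.from_pretrained('models/llama-2-7b-hf', **params)
```

Note that transformers expects the scaling factor to be a float greater than 1.0.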
modules/shared.py (2 changes: 1 addition & 1 deletion)
@@ -164,7 +164,7 @@ def str2bool(v):

# RoPE
parser.add_argument('--compress_pos_emb', type=int, default=1, help="Positional embeddings compression factor. Should typically be set to max_seq_len / 2048.")
-parser.add_argument('--alpha_value', type=int, default=1, help="Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both.")
+parser.add_argument('--alpha_value', type=int, default=1, help="Positional embeddings alpha factor for NTK RoPE scaling. Use either this or compress_pos_emb, not both.")

# Gradio
parser.add_argument('--listen', action='store_true', help='Make the web UI reachable from your local network.')
modules/ui_model_menu.py (2 changes: 1 addition & 1 deletion)
@@ -89,8 +89,8 @@ def create_ui():
shared.gradio['autogptq_info'] = gr.Markdown('* ExLlama_HF is recommended over AutoGPTQ for models derived from LLaMA.')
shared.gradio['gpu_split'] = gr.Textbox(label='gpu-split', info='Comma-separated list of VRAM (in GB) to use per GPU. Example: 20,7,7')
shared.gradio['max_seq_len'] = gr.Slider(label='max_seq_len', minimum=0, maximum=16384, step=256, info='Maximum sequence length.', value=shared.args.max_seq_len)
+shared.gradio['alpha_value'] = gr.Slider(label='alpha_value', minimum=1, maximum=8, step=0.1, info='Positional embeddings alpha factor for NTK RoPE scaling. Use either this or compress_pos_emb, not both.', value=shared.args.alpha_value)
shared.gradio['compress_pos_emb'] = gr.Slider(label='compress_pos_emb', minimum=1, maximum=8, step=1, info='Positional embeddings compression factor. Should typically be set to max_seq_len / 2048.', value=shared.args.compress_pos_emb)
-shared.gradio['alpha_value'] = gr.Slider(label='alpha_value', minimum=1, maximum=8, step=0.1, info='Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both.', value=shared.args.alpha_value)

with gr.Column():
shared.gradio['triton'] = gr.Checkbox(label="triton", value=shared.args.triton)

3 comments on commit d8fb506

@Ph0rk0z (Contributor) commented on d8fb506, Aug 9, 2023

What does this do to gptq-for-llama and autogptq, since they use part of transformers?

I just tried alpha and it didn't work; the output started repeating. Compressed embedding might work, since people used to use the monkeypatch, but I have no model like that to test here.

@oobabooga (Owner, Author) commented on d8fb506

I ran a test with alpha = 2 yesterday on llama-2-7b-hf and it generated coherent output with a 5200-token context. The dynamic RoPE scaling here is supposed to be better than both the llama.cpp and ExLlama NTK implementations that are available at the moment.
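For reference, "dynamic" NTK scaling recomputes the rotary base as the sequence grows instead of fixing it up front, which is why short prompts are unaffected. A rough sketch of that adjustment, paraphrasing the formula used by transformers' dynamic NTK rotary embedding at the time; base 10000, head dimension 128, and a 4096-token trained context are typical LLaMA-2 values assumed here:

```python
def dynamic_ntk_base(seq_len, factor=2.0, base=10000.0, max_position_embeddings=4096, dim=128):
    """Adjusted rotary base for the current sequence length (dynamic NTK)."""
    if seq_len <= max_position_embeddings:
        return base  # within the trained context, nothing changes
    scale = (factor * seq_len / max_position_embeddings) - (factor - 1)
    return base * scale ** (dim / (dim - 2))

# At 5200 tokens with factor 2, the base grows past 10000, keeping long positions coherent.
print(dynamic_ntk_base(5200))
```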

@Ph0rk0z (Contributor) commented on d8fb506, Aug 9, 2023

HF should work since they merged it, but I need to see if the GPTQ loaders can handle it too. So far no luck with alpha, but the compressed embedding may yet work. I got to around 2100 tokens before hitting repetition on vanilla GPTQ models with alpha. I need to try a model with compressed embedding and GPTQ to see if the patch is still necessary (it was a monkeypatch to transformers) or if the native functionality can go into the loaders.
