Merge dev branch #4632

Merged 4 commits on Nov 17, 2023
3 changes: 2 additions & 1 deletion README.md
@@ -298,7 +298,7 @@ Optionally, you can use the following command-line flags:
| `--xformers` | Use xformer's memory efficient attention. This is really old and probably doesn't do anything. |
| `--sdp-attention` | Use PyTorch 2.0's SDP attention. Same as above. |
| `--trust-remote-code` | Set `trust_remote_code=True` while loading the model. Necessary for some models. |
| `--use_fast` | Set `use_fast=True` while loading the tokenizer. |
| `--no_use_fast` | Set use_fast=False while loading the tokenizer (it's True by default). Use this if you have any problems related to use_fast. |
| `--use_flash_attention_2` | Set use_flash_attention_2=True while loading the model. |

#### Accelerate 4-bit
@@ -325,6 +325,7 @@ Optionally, you can use the following command-line flags:
| `--mlock` | Force the system to keep the model in RAM. |
| `--n-gpu-layers N_GPU_LAYERS` | Number of layers to offload to the GPU. |
| `--tensor_split TENSOR_SPLIT` | Split the model across multiple GPUs. Comma-separated list of proportions. Example: 18,17. |
| `--llama_cpp_seed SEED` | Seed for llama-cpp models. Default is 0 (random). |
| `--numa` | Activate NUMA task allocation for llama.cpp. |
| `--logits_all`| Needs to be set for perplexity evaluation to work. Otherwise, ignore it, as it makes prompt processing slower. |
| `--cache-capacity CACHE_CAPACITY` | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |
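To make the flag rename concrete, here is a minimal sketch (not part of the PR) of how `--no_use_fast` is inverted back into `use_fast` when the tokenizer is loaded, mirroring the `modules/models.py` change further down; the model path is a placeholder.

```python
# Sketch only: mirrors how --no_use_fast becomes use_fast=not no_use_fast
# in modules/models.py below. The model path is a placeholder.
from transformers import AutoTokenizer

no_use_fast = False  # default: the fast tokenizer is used

tokenizer = AutoTokenizer.from_pretrained(
    "path/to/model",           # placeholder
    trust_remote_code=False,
    use_fast=not no_use_fast,  # passing --no_use_fast flips this to False
)
```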
5 changes: 3 additions & 2 deletions docs/04 - Model Tab.md
@@ -21,14 +21,14 @@ Options:
* **alpha_value**: Used to extend the context length of a model with a minor loss in quality. I have measured 1.75 to be optimal for 1.5x context, and 2.5 for 2x context. That is, with alpha = 2.5 you can make a model with 4096 context length go to 8192 context length.
* **rope_freq_base**: Originally another way to write "alpha_value", it ended up becoming a necessary parameter for some models like CodeLlama, which was fine-tuned with this set to 1000000 and hence needs to be loaded with it set to 1000000 as well.
* **compress_pos_emb**: The first and original context-length extension method, discovered by [kaiokendev](https://kaiokendev.github.io/til). When set to 2, the context length is doubled, 3 and it's tripled, etc. It should only be used for models that have been fine-tuned with this parameter set to different than 1. For models that have not been tuned to have greater context length, alpha_value will lead to a smaller accuracy loss.
* **cpu**: Loads the model in CPU mode using Pytorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower.
* **cpu**: Loads the model in CPU mode using Pytorch. The model will be loaded in 32-bit precision, so a lot of RAM will be used. CPU inference with transformers is older than llama.cpp and it works, but it's a lot slower. Note: this parameter has a different interpretation in the llama.cpp loader (see below).
* **load-in-8bit**: Load the model in 8-bit precision using bitsandbytes. The 8-bit kernel in that library has been optimized for training and not inference, so load-in-8bit is slower than load-in-4bit (but more accurate).
* **bf16**: Use bfloat16 precision instead of float16 (the default). Only applies when quantization is not used.
* **auto-devices**: When checked, the backend will try to guess a reasonable value for "gpu-memory" to allow you to load a model with CPU offloading. I recommend just setting "gpu-memory" manually instead. This parameter is also needed for loading GPTQ models, in which case it needs to be checked before loading the model.
* **disk**: Enable disk offloading for layers that don't fit into the GPU and CPU combined.
* **load-in-4bit**: Load the model in 4-bit precision using bitsandbytes.
* **trust-remote-code**: Some models use custom Python code to load the model or the tokenizer. For such models, this option needs to be set. It doesn't download any remote content: all it does is execute the .py files that get downloaded with the model. Those files can potentially include malicious code; I have never seen it happen, but it is in principle possible.
* **use_fast**: Use the "fast" version of the tokenizer. Especially useful for Llama models, which originally had a "slow" tokenizer that received an update. If your local files are in the old "slow" format, checking this option may trigger a conversion that takes several minutes. The fast tokenizer is mostly useful if you are generating 50+ tokens/second using ExLlama_HF or if you are tokenizing a huge dataset for training.
* **no_use_fast**: Do not use the "fast" version of the tokenizer. Can usually be ignored; only check this if you can't load the tokenizer for your model otherwise.
* **use_flash_attention_2**: Set use_flash_attention_2=True while loading the model. Possibly useful for training.
* **disable_exllama**: Only applies when you are loading a GPTQ model through the transformers loader. It needs to be checked if you intend to train LoRAs with the model.

@@ -97,6 +97,7 @@ Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
* **no-mmap**: Loads the model into memory at once, possibly preventing I/O operations later on at the cost of a longer load time.
* **mlock**: Force the system to keep the model in RAM rather than swapping or compressing (no idea what this means, never used it).
* **numa**: May improve performance on certain multi-cpu systems.
* **cpu**: Force a version of llama.cpp compiled without GPU acceleration to be used. Can usually be ignored. Only set this if you want to use CPU only and llama.cpp doesn't work otherwise.
* **tensor_split**: For multi-gpu only. Sets the amount of memory to allocate per GPU.
* **Seed**: The seed for the llama.cpp random number generator. Not very useful as it can only be set once (that I'm aware).

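As a rough illustration (not part of the diff), the llama.cpp options documented above map onto llama-cpp-python constructor parameters roughly as in the sketch below, mirroring `modules/llamacpp_model.py` in this PR; values are placeholders and the exact `rope_freq_base` handling lives in `modules/RoPE.py`.

```python
# Sketch only: approximate mapping of the UI options onto llama-cpp-python.
# Values are placeholders; see modules/llamacpp_model.py for the real code.
from llama_cpp import Llama

llm = Llama(
    model_path="models/model.gguf",  # placeholder
    n_ctx=4096,                      # n_ctx slider
    seed=0,                          # Seed / --llama_cpp_seed; 0 means random
    rope_freq_scale=1.0 / 1,         # 1 / compress_pos_emb
)
```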
12 changes: 2 additions & 10 deletions extensions/openai/script.py
@@ -67,16 +67,8 @@ def verify_api_key(authorization: str = Header(None)) -> None:
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["GET", "HEAD", "OPTIONS", "POST", "PUT"],
allow_headers=[
"Origin",
"Accept",
"X-Requested-With",
"Content-Type",
"Access-Control-Request-Method",
"Access-Control-Request-Headers",
"Authorization",
],
allow_methods=["*"],
allow_headers=["*"]
)


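For clarity, the simplified CORS configuration is shown standalone below (a sketch, not part of the diff); it uses FastAPI's stock `CORSMiddleware`.

```python
# Sketch only: the wildcard CORS setup from extensions/openai/script.py,
# shown in isolation.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],   # wildcard replaces the explicit method list
    allow_headers=["*"],   # wildcard replaces the explicit header list
)
```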
35 changes: 27 additions & 8 deletions modules/llamacpp_hf.py
@@ -2,7 +2,6 @@
from pathlib import Path
from typing import Any, Dict, Optional, Union

import llama_cpp
import torch
from torch.nn import CrossEntropyLoss
from transformers import GenerationConfig, PretrainedConfig, PreTrainedModel
@@ -11,6 +10,23 @@
from modules import RoPE, shared
from modules.logging_colors import logger

try:
import llama_cpp
except:
llama_cpp = None

try:
import llama_cpp_cuda
except:
llama_cpp_cuda = None


def llama_cpp_lib():
if (shared.args.cpu and llama_cpp is not None) or llama_cpp_cuda is None:
return llama_cpp
else:
return llama_cpp_cuda


class LlamacppHF(PreTrainedModel):
def __init__(self, model, path):
@@ -23,7 +39,7 @@ def __init__(self, model, path):
'n_tokens': self.model.n_tokens,
'input_ids': self.model.input_ids,
'scores': self.model.scores,
'ctx': self.model._ctx.ctx
'ctx': self.model.ctx
}

if shared.args.cfg_cache:
@@ -32,7 +48,7 @@ def __init__(self, model, path):
'n_tokens': self.model.n_tokens,
'input_ids': self.model.input_ids.copy(),
'scores': self.model.scores.copy(),
'ctx': llama_cpp.llama_new_context_with_model(model.model, model.context_params)
'ctx': llama_cpp_lib().llama_new_context_with_model(model.model, model.context_params)
}

def _validate_model_class(self):
@@ -49,28 +65,28 @@ def save_cache(self):
'n_tokens': self.model.n_tokens,
'input_ids': self.model.input_ids,
'scores': self.model.scores,
'ctx': self.model._ctx.ctx
'ctx': self.model.ctx
})

def save_negative_cache(self):
self.llamacpp_cache_negative.update({
'n_tokens': self.model.n_tokens,
'input_ids': self.model.input_ids,
'scores': self.model.scores,
'ctx': self.model._ctx.ctx
'ctx': self.model.ctx
})

def load_cache(self):
self.model.n_tokens = self.llamacpp_cache['n_tokens']
self.model.input_ids = self.llamacpp_cache['input_ids']
self.model.scores = self.llamacpp_cache['scores']
self.model._ctx.ctx = self.llamacpp_cache['ctx']
self.model.ctx = self.llamacpp_cache['ctx']

def load_negative_cache(self):
self.model.n_tokens = self.llamacpp_cache_negative['n_tokens']
self.model.input_ids = self.llamacpp_cache_negative['input_ids']
self.model.scores = self.llamacpp_cache_negative['scores']
self.model._ctx.ctx = self.llamacpp_cache_negative['ctx']
self.model.ctx = self.llamacpp_cache_negative['ctx']

@property
def device(self) -> torch.device:
@@ -176,6 +192,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
params = {
'model_path': str(model_file),
'n_ctx': shared.args.n_ctx,
'seed': int(shared.args.llama_cpp_seed),
'n_threads': shared.args.threads or None,
'n_threads_batch': shared.args.threads_batch or None,
'n_batch': shared.args.n_batch,
@@ -190,5 +207,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P
'logits_all': shared.args.logits_all,
}

model = llama_cpp.Llama(**params)
Llama = llama_cpp_lib().Llama
model = Llama(**params)

return LlamacppHF(model, model_file)
40 changes: 30 additions & 10 deletions modules/llamacpp_model.py
@@ -1,7 +1,6 @@
import re
from functools import partial

import llama_cpp
import numpy as np
import torch

@@ -10,6 +9,23 @@
from modules.logging_colors import logger
from modules.text_generation import get_max_prompt_length

try:
import llama_cpp
except:
llama_cpp = None

try:
import llama_cpp_cuda
except:
llama_cpp_cuda = None


def llama_cpp_lib():
if (shared.args.cpu and llama_cpp is not None) or llama_cpp_cuda is None:
return llama_cpp
else:
return llama_cpp_cuda


def ban_eos_logits_processor(eos_token, input_ids, logits):
logits[eos_token] = -float('inf')
@@ -34,6 +50,10 @@ def __del__(self):

@classmethod
def from_pretrained(self, path):

Llama = llama_cpp_lib().Llama
LlamaCache = llama_cpp_lib().LlamaCache

result = self()
cache_capacity = 0
if shared.args.cache_capacity is not None:
@@ -54,6 +74,7 @@ def from_pretrained(self, path):
params = {
'model_path': str(path),
'n_ctx': shared.args.n_ctx,
'seed': int(shared.args.llama_cpp_seed),
'n_threads': shared.args.threads or None,
'n_threads_batch': shared.args.threads_batch or None,
'n_batch': shared.args.n_batch,
@@ -67,9 +88,9 @@
'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
}

result.model = llama_cpp.Llama(**params)
result.model = Llama(**params)
if cache_capacity > 0:
result.model.set_cache(llama_cpp.LlamaCache(capacity_bytes=cache_capacity))
result.model.set_cache(LlamaCache(capacity_bytes=cache_capacity))

# This is ugly, but the model and the tokenizer are the same object in this library.
return result, result
@@ -93,13 +114,13 @@ def load_grammar(self, string):
if string != self.grammar_string:
self.grammar_string = string
if string.strip() != '':
self.grammar = llama_cpp.LlamaGrammar.from_string(string)
self.grammar = llama_cpp_lib().LlamaGrammar.from_string(string)
else:
self.grammar = None

def generate(self, prompt, state, callback=None):

LogitsProcessorList = llama_cpp.LogitsProcessorList
LogitsProcessorList = llama_cpp_lib().LogitsProcessorList

prompt = prompt if type(prompt) is str else prompt.decode()

@@ -123,16 +144,15 @@ def generate(self, prompt, state, callback=None):
max_tokens=state['max_new_tokens'],
temperature=state['temperature'],
top_p=state['top_p'],
frequency_penalty=state['frequency_penalty'],
presence_penalty=state['presence_penalty'],
repeat_penalty=state['repetition_penalty'],
top_k=state['top_k'],
stream=True,
seed=int(state['seed']) if state['seed'] != -1 else None,
repeat_penalty=state['repetition_penalty'],
presence_penalty=state['presence_penalty'],
frequency_penalty=state['frequency_penalty'],
tfs_z=state['tfs'],
mirostat_mode=int(state['mirostat_mode']),
mirostat_tau=state['mirostat_tau'],
mirostat_eta=state['mirostat_eta'],
stream=True,
logits_processor=logit_processors,
grammar=self.grammar
)
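A minimal sketch (not part of the diff, and assuming llama-cpp-python's `create_completion` keyword interface) of the per-request seed handling added in `generate()`: a UI value of -1 is mapped to `None`, letting the library pick a random seed.

```python
# Sketch only: per-request seed handling as added in generate() above.
# Assumes llama-cpp-python's Llama / create_completion interface.
from llama_cpp import Llama

llm = Llama(model_path="models/model.gguf", seed=0)  # placeholder path

ui_seed = -1  # -1 in the UI means "random"
completion = llm.create_completion(
    "Hello",
    max_tokens=16,
    seed=int(ui_seed) if ui_seed != -1 else None,  # None -> random seed
)
print(completion["choices"][0]["text"])
```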
18 changes: 10 additions & 8 deletions modules/loaders.py
@@ -19,7 +19,7 @@
'quant_type',
'compute_dtype',
'trust_remote_code',
'use_fast',
'no_use_fast',
'use_flash_attention_2',
'alpha_value',
'rope_freq_base',
@@ -34,7 +34,7 @@
'rope_freq_base',
'compress_pos_emb',
'cfg_cache',
'use_fast',
'no_use_fast',
'exllama_HF_info',
],
'ExLlamav2_HF': [
@@ -45,7 +45,7 @@
'cache_8bit',
'alpha_value',
'compress_pos_emb',
'use_fast',
'no_use_fast',
],
'ExLlama': [
'gpu_split',
@@ -78,15 +78,15 @@
'disk',
'auto_devices',
'trust_remote_code',
'use_fast',
'no_use_fast',
'autogptq_info',
],
'GPTQ-for-LLaMa': [
'wbits',
'groupsize',
'model_type',
'pre_layer',
'use_fast',
'no_use_fast',
'gptq_for_llama_info',
],
'llama.cpp': [
@@ -99,9 +99,11 @@
'no_mmap',
'mlock',
'no_mul_mat_q',
'llama_cpp_seed',
'alpha_value',
'rope_freq_base',
'compress_pos_emb',
'cpu',
'numa',
],
'llamacpp_HF': [
@@ -117,9 +119,10 @@
'alpha_value',
'rope_freq_base',
'compress_pos_emb',
'cpu',
'numa',
'cfg_cache',
'use_fast',
'no_use_fast',
'logits_all',
'llamacpp_HF_info',
],
@@ -139,7 +142,7 @@
'max_seq_len',
'no_inject_fused_attention',
'trust_remote_code',
'use_fast',
'no_use_fast',
]
})

@@ -363,7 +366,6 @@
'repetition_penalty',
'presence_penalty',
'frequency_penalty',
'seed',
'mirostat_mode',
'mirostat_tau',
'mirostat_eta',
12 changes: 6 additions & 6 deletions modules/models.py
@@ -114,13 +114,13 @@ def load_tokenizer(model_name, model):
if any(s in model_name.lower() for s in ['gpt-4chan', 'gpt4chan']) and Path(f"{shared.args.model_dir}/gpt-j-6B/").exists():
tokenizer = AutoTokenizer.from_pretrained(Path(f"{shared.args.model_dir}/gpt-j-6B/"))
elif path_to_model.exists():
if shared.args.use_fast:
logger.info('Loading the tokenizer with use_fast=True.')
if shared.args.no_use_fast:
logger.info('Loading the tokenizer with use_fast=False.')

tokenizer = AutoTokenizer.from_pretrained(
path_to_model,
trust_remote_code=shared.args.trust_remote_code,
use_fast=shared.args.use_fast
use_fast=not shared.args.no_use_fast
)

return tokenizer
@@ -262,13 +262,13 @@ def llamacpp_HF_loader(model_name):
logger.error("Could not load the model because a tokenizer in transformers format was not found. Please download oobabooga/llama-tokenizer.")
return None, None

if shared.args.use_fast:
logger.info('Loading the tokenizer with use_fast=True.')
if shared.args.no_use_fast:
logger.info('Loading the tokenizer with use_fast=False.')

tokenizer = AutoTokenizer.from_pretrained(
path,
trust_remote_code=shared.args.trust_remote_code,
use_fast=shared.args.use_fast
use_fast=not shared.args.no_use_fast
)

model = LlamacppHF.from_pretrained(model_name)