Remove exllamav1 loaders (oobabooga#5128)
oobabooga authored and PoetOnTheRun committed Feb 22, 2024
1 parent a778ff1 commit 9ea3a23
Showing 18 changed files with 28 additions and 635 deletions.
17 changes: 5 additions & 12 deletions README.md
@@ -11,7 +11,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
## Features

* 3 interface modes: default (two columns), notebook, and chat.
* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlama](https://github.com/turboderp/exllama), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [CTransformers](https://github.com/marella/ctransformers), [QuIP#](https://github.com/Cornell-RelaxML/quip-sharp).
* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [CTransformers](https://github.com/marella/ctransformers), [QuIP#](https://github.com/Cornell-RelaxML/quip-sharp).
* Dropdown menu for quickly switching between different models.
* Large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, [multimodal pipelines](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal), vector databases, Stable Diffusion integration, and a lot more. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
* [Chat with custom characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character).
@@ -140,13 +140,6 @@ Then browse to
3) Manually install AutoGPTQ: [Installation](https://github.com/PanQiWei/AutoGPTQ#install-from-source).
* Perform the from-source installation - there are no prebuilt ROCm packages for Windows.

4) Manually install [ExLlama](https://github.com/turboderp/exllama) by simply cloning it into the `repositories` folder (it will be automatically compiled at runtime after that):

```sh
cd text-generation-webui
git clone https://github.com/turboderp/exllama repositories/exllama
```

##### Older NVIDIA GPUs

1) For Kepler GPUs and older, you will need to install CUDA 11.8 instead of 12:
@@ -216,7 +209,7 @@ List of command-line flags

| Flag | Description |
|--------------------------------------------|-------------|
| `--loader LOADER` | Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlama_HF, ExLlamav2_HF, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ExLlama, ExLlamav2, ctransformers, QuIP#. |
| `--loader LOADER` | Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ctransformers, QuIP#. |
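
A minimal sketch of selecting one of the remaining loaders from the command line, assuming an already set-up environment and a model folder placed under `models/` (the folder name below is a placeholder):

```sh
# Placeholder model folder; any loader from the list above can be passed to --loader.
python server.py --model MyModel-13B-EXL2 --loader ExLlamav2_HF
```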

#### Accelerate/transformers

@@ -265,13 +258,13 @@ List of command-line flags
| `--no_offload_kqv` | Do not offload the K, Q, V to the GPU. This saves VRAM but reduces the performance. |
| `--cache-capacity CACHE_CAPACITY` | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |

#### ExLlama
#### ExLlamav2

| Flag | Description |
|------------------|-------------|
|`--gpu-split` | Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7. |
|`--max_seq_len MAX_SEQ_LEN` | Maximum sequence length. |
|`--cfg-cache` | ExLlama_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader, but not necessary for CFG with base ExLlama. |
|`--cfg-cache` | ExLlamav2_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader. |
|`--no_flash_attn` | Force flash-attention to not be used. |
|`--cache_8bit` | Use 8-bit cache to save VRAM. |
|`--num_experts_per_token NUM_EXPERTS_PER_TOKEN` | Number of experts to use for generation. Applies to MoE models like Mixtral. |
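
A hedged sketch of how these flags combine for a multi-GPU setup (the model folder name, the `20,7` split, and the 4096 context are placeholders to adapt to your hardware):

```sh
# Split layers across two GPUs, cap the context window, and use the 8-bit cache.
python server.py --model MyModel-70B-EXL2 --loader ExLlamav2 \
    --gpu-split 20,7 --max_seq_len 4096 --cache_8bit
```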
@@ -326,7 +319,7 @@ List of command-line flags
| `--rwkv-strategy RWKV_STRATEGY` | RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
| `--rwkv-cuda-on` | RWKV: Compile the CUDA kernel for better performance. |

#### RoPE (for llama.cpp, ExLlama, ExLlamaV2, and transformers)
#### RoPE (for llama.cpp, ExLlamaV2, and transformers)

| Flag | Description |
|------------------|-------------|
21 changes: 5 additions & 16 deletions docs/04 - Model Tab.md
@@ -32,32 +32,21 @@ Options:
* **use_flash_attention_2**: Set use_flash_attention_2=True while loading the model. Possibly useful for training.
* **disable_exllama**: Only applies when you are loading a GPTQ model through the transformers loader. It needs to be checked if you intend to train LoRAs with the model.

### ExLlama_HF
### ExLlamav2_HF

Loads: GPTQ models. They usually have GPTQ in the model name, or alternatively something like "-4bit-128g" in the name.
Loads: GPTQ and EXL2 models. EXL2 models usually have "EXL2" in the model name, while GPTQ models usually have GPTQ in the model name, or alternatively something like "-4bit-128g" in the name.

Example: https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ
Examples:

ExLlama_HF is the v1 of ExLlama (https://github.com/turboderp/exllama) connected to the transformers library for sampling, tokenizing, and detokenizing. It is very fast and memory-efficient.
* https://huggingface.co/turboderp/Llama2-70B-exl2
* https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ

* **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
* **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
* **cfg-cache**: Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG in the "Parameters" > "Generation" tab. Checking this parameter doubles the cache VRAM usage.
* **no_flash_attn**: Disables flash attention. Otherwise, it is automatically used as long as the library is installed.
* **cache_8bit**: Create an 8-bit precision cache instead of a 16-bit one. This saves VRAM but increases perplexity (I don't know by how much).
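
These UI options mirror the command-line flags of the same names. As a hedged end-to-end sketch, assuming the repository's `download-model.py` helper and its default `models/` folder naming (the exact folder name may differ):

```sh
# Download one of the example models above, then load it with ExLlamav2_HF.
python download-model.py TheBloke/Llama-2-13B-chat-GPTQ
python server.py --model TheBloke_Llama-2-13B-chat-GPTQ --loader ExLlamav2_HF
```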

### ExLlamav2_HF

Loads: GPTQ and EXL2 models. EXL2 models usually have "EXL2" in the model name.

Example: https://huggingface.co/turboderp/Llama2-70B-exl2

The parameters are the same as in ExLlama_HF.

### ExLlama

The same as ExLlama_HF but using the internal samplers of ExLlama instead of the ones in the Transformers library.

### ExLlamav2

The same as ExLlamav2_HF but using the internal samplers of ExLlamav2 instead of the ones in the Transformers library.
2 changes: 0 additions & 2 deletions docs/What Works.md
@@ -3,9 +3,7 @@
| Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
|----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
| Transformers   | ✅ | ✅*** | ✅* | ✅ | ✅ |
| ExLlama_HF     | ✅ | ❌ | ❌ | ❌ | ✅ |
| ExLlamav2_HF   | ✅ | ❌ | ❌ | ❌ | ✅ |
| ExLlama        | ✅ | ❌ | ❌ | ❌ | use ExLlama_HF |
| ExLlamav2      | ✅ | ❌ | ❌ | ❌ | use ExLlamav2_HF |
| AutoGPTQ       | ✅ | ❌ | ❌ | ✅ | ✅ |
| GPTQ-for-LLaMa | ✅** | ✅*** | ✅ | ✅ | ✅ |
44 changes: 0 additions & 44 deletions modules/LoRA.py
@@ -12,8 +12,6 @@
def add_lora_to_model(lora_names):
    if 'GPTQForCausalLM' in shared.model.__class__.__name__ or shared.args.loader == 'AutoGPTQ':
        add_lora_autogptq(lora_names)
    elif shared.model.__class__.__name__ in ['ExllamaModel', 'ExllamaHF'] or shared.args.loader == 'ExLlama':
        add_lora_exllama(lora_names)
    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader in ['ExLlamav2', 'ExLlamav2_HF']:
        add_lora_exllamav2(lora_names)
    else:
@@ -28,48 +26,6 @@ def get_lora_path(lora_name):
    return Path(f"{shared.args.lora_dir}/{lora_name}")


def add_lora_exllama(lora_names):

    try:
        from exllama.lora import ExLlamaLora
    except:
        try:
            from repositories.exllama.lora import ExLlamaLora
        except:
            logger.error("Could not find the file repositories/exllama/lora.py. Make sure that exllama is cloned inside repositories/ and is up to date.")
            return

    if len(lora_names) == 0:
        if shared.model.__class__.__name__ == 'ExllamaModel':
            shared.model.generator.lora = None
        else:
            shared.model.lora = None

        shared.lora_names = []
        return
    else:
        if len(lora_names) > 1:
            logger.warning('ExLlama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

        lora_path = get_lora_path(lora_names[0])
        lora_config_path = lora_path / "adapter_config.json"
        for file_name in ["adapter_model.safetensors", "adapter_model.bin"]:
            file_path = lora_path / file_name
            if file_path.is_file():
                lora_adapter_path = file_path

        logger.info("Applying the following LoRAs to {}: {}".format(shared.model_name, ', '.join([lora_names[0]])))
        if shared.model.__class__.__name__ == 'ExllamaModel':
            lora = ExLlamaLora(shared.model.model, str(lora_config_path), str(lora_adapter_path))
            shared.model.generator.lora = lora
        else:
            lora = ExLlamaLora(shared.model.ex_model, str(lora_config_path), str(lora_adapter_path))
            shared.model.lora = lora

        shared.lora_names = [lora_names[0]]
        return


def add_lora_exllamav2(lora_names):

    from exllamav2 import ExLlamaV2Lora