Remove exllamav1 loaders (oobabooga#5128)
oobabooga authored and PoetOnTheRun committed Feb 22, 2024
1 parent a778ff1 commit 9ea3a23
Showing 18 changed files with 28 additions and 635 deletions.
17 changes: 5 additions & 12 deletions README.md
@@ -11,7 +11,7 @@ Its goal is to become the [AUTOMATIC1111/stable-diffusion-webui](https://github.
## Features

* 3 interface modes: default (two columns), notebook, and chat.
* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlama](https://github.com/turboderp/exllama), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [CTransformers](https://github.com/marella/ctransformers), [QuIP#](https://github.com/Cornell-RelaxML/quip-sharp).
* Multiple model backends: [Transformers](https://github.com/huggingface/transformers), [llama.cpp](https://github.com/ggerganov/llama.cpp) (through [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)), [ExLlamaV2](https://github.com/turboderp/exllamav2), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [CTransformers](https://github.com/marella/ctransformers), [QuIP#](https://github.com/Cornell-RelaxML/quip-sharp).
* Dropdown menu for quickly switching between different models.
* Large number of extensions (built-in and user-contributed), including Coqui TTS for realistic voice outputs, Whisper STT for voice inputs, translation, [multimodal pipelines](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal), vector databases, Stable Diffusion integration, and a lot more. See [the wiki](https://github.com/oobabooga/text-generation-webui/wiki/07-%E2%80%90-Extensions) and [the extensions directory](https://github.com/oobabooga/text-generation-webui-extensions) for details.
* [Chat with custom characters](https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab#character).
@@ -140,13 +140,6 @@ Then browse to
3) Manually install AutoGPTQ: [Installation](https://github.com/PanQiWei/AutoGPTQ#install-from-source).
* Perform the from-source installation - there are no prebuilt ROCm packages for Windows.

4) Manually install [ExLlama](https://github.com/turboderp/exllama) by simply cloning it into the `repositories` folder (it will be automatically compiled at runtime after that):

```sh
cd text-generation-webui
git clone https://github.com/turboderp/exllama repositories/exllama
```

##### Older NVIDIA GPUs

1) For Kepler GPUs and older, you will need to install CUDA 11.8 instead of 12:
@@ -216,7 +209,7 @@ List of command-line flags

| Flag | Description |
|--------------------------------------------|-------------|
| `--loader LOADER` | Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlama_HF, ExLlamav2_HF, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ExLlama, ExLlamav2, ctransformers, QuIP#. |
| `--loader LOADER` | Choose the model loader manually, otherwise, it will get autodetected. Valid options: Transformers, llama.cpp, llamacpp_HF, ExLlamav2_HF, ExLlamav2, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, ctransformers, QuIP#. |
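
A minimal sketch of selecting one of the remaining loaders from the command line, assuming an already set-up environment and a model folder placed under `models/` (the folder name below is a placeholder):

```sh
# Placeholder model folder; any loader from the list above can be passed to --loader.
python server.py --model MyModel-13B-EXL2 --loader ExLlamav2_HF
```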

#### Accelerate/transformers

@@ -265,13 +258,13 @@ List of command-line flags
| `--no_offload_kqv` | Do not offload the K, Q, V to the GPU. This saves VRAM but reduces the performance. |
| `--cache-capacity CACHE_CAPACITY` | Maximum cache capacity (llama-cpp-python). Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |

#### ExLlama
#### ExLlamav2

| Flag | Description |
|------------------|-------------|
|`--gpu-split` | Comma-separated list of VRAM (in GB) to use per GPU device for model layers. Example: 20,7,7. |
|`--max_seq_len MAX_SEQ_LEN` | Maximum sequence length. |
|`--cfg-cache` | ExLlama_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader, but not necessary for CFG with base ExLlama. |
|`--cfg-cache` | ExLlamav2_HF: Create an additional cache for CFG negative prompts. Necessary to use CFG with that loader. |
|`--no_flash_attn` | Force flash-attention to not be used. |
|`--cache_8bit` | Use 8-bit cache to save VRAM. |
|`--num_experts_per_token NUM_EXPERTS_PER_TOKEN` | Number of experts to use for generation. Applies to MoE models like Mixtral. |
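
A hedged sketch of how these flags combine for a multi-GPU setup (the model folder name, the `20,7` split, and the 4096 context are placeholders to adapt to your hardware):

```sh
# Split layers across two GPUs, cap the context window, and use the 8-bit cache.
python server.py --model MyModel-70B-EXL2 --loader ExLlamav2 \
    --gpu-split 20,7 --max_seq_len 4096 --cache_8bit
```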
@@ -326,7 +319,7 @@ List of command-line flags
| `--rwkv-strategy RWKV_STRATEGY` | RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
| `--rwkv-cuda-on` | RWKV: Compile the CUDA kernel for better performance. |

#### RoPE (for llama.cpp, ExLlama, ExLlamaV2, and transformers)
#### RoPE (for llama.cpp, ExLlamaV2, and transformers)

| Flag | Description |
|------------------|-------------|
21 changes: 5 additions & 16 deletions docs/04 - Model Tab.md
@@ -32,32 +32,21 @@ Options:
* **use_flash_attention_2**: Set use_flash_attention_2=True while loading the model. Possibly useful for training.
* **disable_exllama**: Only applies when you are loading a GPTQ model through the transformers loader. It needs to be checked if you intend to train LoRAs with the model.

### ExLlama_HF
### ExLlamav2_HF

Loads: GPTQ models. They usually have GPTQ in the model name, or alternatively something like "-4bit-128g" in the name.
Loads: GPTQ and EXL2 models. EXL2 models usually have "EXL2" in the model name, while GPTQ models usually have GPTQ in the model name, or alternatively something like "-4bit-128g" in the name.

Example: https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ
Examples:

ExLlama_HF is the v1 of ExLlama (https://github.com/turboderp/exllama) connected to the transformers library for sampling, tokenizing, and detokenizing. It is very fast and memory-efficient.
* https://huggingface.co/turboderp/Llama2-70B-exl2
* https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ

* **gpu-split**: If you have multiple GPUs, the amount of memory to allocate per GPU should be set in this field. Make sure to set a lower value for the first GPU, as that's where the cache is allocated.
* **max_seq_len**: The maximum sequence length for the model. In ExLlama, the cache is preallocated, so the higher this value, the higher the VRAM. It is automatically set to the maximum sequence length for the model based on its metadata, but you may need to lower this value to be able to fit the model into your GPU. After loading the model, the "Truncate the prompt up to this length" parameter under "Parameters" > "Generation" is automatically set to your chosen "max_seq_len" so that you don't have to set the same thing twice.
* **cfg-cache**: Creates a second cache to hold the CFG negative prompts. You need to set this if and only if you intend to use CFG in the "Parameters" > "Generation" tab. Checking this parameter doubles the cache VRAM usage.
* **no_flash_attn**: Disables flash attention. Otherwise, it is automatically used as long as the library is installed.
* **cache_8bit**: Create an 8-bit precision cache instead of a 16-bit one. This saves VRAM but increases perplexity (I don't know by how much).
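
These UI options mirror the command-line flags of the same names. As a hedged end-to-end sketch, assuming the repository's `download-model.py` helper and its default `models/` folder naming (the exact folder name may differ):

```sh
# Download one of the example models above, then load it with ExLlamav2_HF.
python download-model.py TheBloke/Llama-2-13B-chat-GPTQ
python server.py --model TheBloke_Llama-2-13B-chat-GPTQ --loader ExLlamav2_HF
```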

### ExLlamav2_HF

Loads: GPTQ and EXL2 models. EXL2 models usually have "EXL2" in the model name.

Example: https://huggingface.co/turboderp/Llama2-70B-exl2

The parameters are the same as in ExLlama_HF.

### ExLlama

The same as ExLlama_HF but using the internal samplers of ExLlama instead of the ones in the Transformers library.

### ExLlamav2

The same as ExLlamav2_HF but using the internal samplers of ExLlamav2 instead of the ones in the Transformers library.
2 changes: 0 additions & 2 deletions docs/What Works.md
@@ -3,9 +3,7 @@
| Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
|----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
| Transformers   | ✅ | ✅*** | ✅* | ✅ | ✅ |
| ExLlama_HF     | ✅ | ❌ | ❌ | ❌ | ✅ |
| ExLlamav2_HF   | ✅ | ❌ | ❌ | ❌ | ✅ |
| ExLlama        | ✅ | ❌ | ❌ | ❌ | use ExLlama_HF |
| ExLlamav2      | ✅ | ❌ | ❌ | ❌ | use ExLlamav2_HF |
| AutoGPTQ       | ✅ | ❌ | ❌ | ✅ | ✅ |
| GPTQ-for-LLaMa | ✅** | ✅*** | ✅ | ✅ | ✅ |
44 changes: 0 additions & 44 deletions modules/LoRA.py
@@ -12,8 +12,6 @@
def add_lora_to_model(lora_names):
    if 'GPTQForCausalLM' in shared.model.__class__.__name__ or shared.args.loader == 'AutoGPTQ':
        add_lora_autogptq(lora_names)
    elif shared.model.__class__.__name__ in ['ExllamaModel', 'ExllamaHF'] or shared.args.loader == 'ExLlama':
        add_lora_exllama(lora_names)
    elif shared.model.__class__.__name__ in ['Exllamav2Model', 'Exllamav2HF'] or shared.args.loader in ['ExLlamav2', 'ExLlamav2_HF']:
        add_lora_exllamav2(lora_names)
    else:
@@ -28,48 +26,6 @@ def get_lora_path(lora_name):
    return Path(f"{shared.args.lora_dir}/{lora_name}")


def add_lora_exllama(lora_names):

    try:
        from exllama.lora import ExLlamaLora
    except:
        try:
            from repositories.exllama.lora import ExLlamaLora
        except:
            logger.error("Could not find the file repositories/exllama/lora.py. Make sure that exllama is cloned inside repositories/ and is up to date.")
            return

    if len(lora_names) == 0:
        if shared.model.__class__.__name__ == 'ExllamaModel':
            shared.model.generator.lora = None
        else:
            shared.model.lora = None

        shared.lora_names = []
        return
    else:
        if len(lora_names) > 1:
            logger.warning('ExLlama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

        lora_path = get_lora_path(lora_names[0])
        lora_config_path = lora_path / "adapter_config.json"
        for file_name in ["adapter_model.safetensors", "adapter_model.bin"]:
            file_path = lora_path / file_name
            if file_path.is_file():
                lora_adapter_path = file_path

        logger.info("Applying the following LoRAs to {}: {}".format(shared.model_name, ', '.join([lora_names[0]])))
        if shared.model.__class__.__name__ == 'ExllamaModel':
            lora = ExLlamaLora(shared.model.model, str(lora_config_path), str(lora_adapter_path))
            shared.model.generator.lora = lora
        else:
            lora = ExLlamaLora(shared.model.ex_model, str(lora_config_path), str(lora_adapter_path))
            shared.model.lora = lora

        shared.lora_names = [lora_names[0]]
        return


def add_lora_exllamav2(lora_names):

    from exllamav2 import ExLlamaV2Lora