Add TensorRT-LLM support #5715

Merged: oobabooga merged 11 commits into dev on Jun 24, 2024

Conversation

oobabooga (Owner) commented Mar 17, 2024

TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/) is a new inference backend developed by NVIDIA.

  • It only works on NVIDIA GPUs.
  • It supports several quantization methods (GPTQ, AWQ, FP8, SmoothQuant), as well as 16-bit inference.

In my testing, I found it to be consistently faster than ExLlamaV2 in both prompt processing and generation. That makes it the new SOTA inference backend in terms of speed.

Speed tests

| Model | Precision | Backend | Prompt processing (3200 tokens, t/s) | Generation (512 tokens, t/s) |
|---|---|---|---|---|
| TheBloke/Llama-2-7B-GPTQ | 4-bit | TRT-LLM + ModelRunnerCpp | 8014.99 | 138.84 |
| TheBloke/Llama-2-7B-GPTQ | 4-bit | TRT-LLM + ModelRunner | 7572.49 | 125.45 |
| TheBloke/Llama-2-7B-GPTQ | 4-bit | ExLlamaV2 | 6138.32 | 130.16 |
| TheBloke/Llama-2-13B-GPTQ | 4-bit | TRT-LLM + ModelRunnerCpp | 4553.43 | 80.69 |
| TheBloke/Llama-2-13B-GPTQ | 4-bit | TRT-LLM + ModelRunner | 4161.57 | 75.80 |
| TheBloke/Llama-2-13B-GPTQ | 4-bit | ExLlamaV2 | 3499.26 | 75.28 |
| NousResearch_Llama-2-7b-hf | 16-bit | TRT-LLM + ModelRunnerCpp | 8465.27 | 55.54 |
| NousResearch_Llama-2-7b-hf | 16-bit | TRT-LLM + ModelRunner | 7676.80 | 53.33 |
| NousResearch_Llama-2-7b-hf | 16-bit | ExLlamaV2 | 6511.87 | 53.02 |
| NousResearch_Llama-2-13b-hf | 16-bit | TRT-LLM + ModelRunnerCpp | 4621.76 | 29.95 |
| NousResearch_Llama-2-13b-hf | 16-bit | TRT-LLM + ModelRunner | 4299.16 | 29.22 |
| NousResearch_Llama-2-13b-hf | 16-bit | ExLlamaV2 | 3881.43 | 29.11 |

I provided the models with a 3200-token input and measured the time to process those 3200 tokens and the time to generate 512 tokens afterwards. I did this over the API, and each number in the table above is the median of 20 measurements.

To accurately measure the TensorRT-LLM speeds, it was necessary to do a warmup generation before starting the measurements, as the first generation has an overhead due to module imports. The same warmup was done for ExLlamaV2 as well.
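
For anyone who wants to reproduce a measurement of this kind, the sketch below shows the general idea over the project's OpenAI-compatible API (assuming the server was started with --api, which listens on port 5000 by default). The placeholder prompt and payload are illustrative; this is not the exact script used for the table above:

# Warmup request to absorb the one-time import/initialization overhead
curl -s http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 16}' > /dev/null

# Timed request: replace the placeholder with a ~3200-token prompt and repeat
# several times, taking the median of the measurements
time curl -s http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "<your long prompt here>", "max_tokens": 512}' > /dev/null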

The tests were carried out on an RTX 6000 Ada GPU.

Installation

Option 1: Docker

Just use the included Dockerfile under docker/TensorRT-LLM/Dockerfile, which will automatically set everything up from scratch.

I find the following commands useful (make sure to run them after moving into the folder containing the Dockerfile with cd):

# Build the image
docker build -t mylocalimage:debug .

# Run the container mapping port 7860 from the host to port 7860 in the container
docker run -p 7860:7860 mylocalimage:debug

# Run the container with GPU support
docker run -p 7860:7860 --gpus all mylocalimage:debug

# Run the container interactively (-it), spawning a Bash shell (/bin/bash) within the container
docker run -p 7860:7860 -it mylocalimage:debug /bin/bash

Option 2: Manually

TensorRT-LLM only works on Python 3.10 at the moment, while this project uses Python 3.11 by default, so it's necessary to create a separate Python 3.10 conda environment:

# Install system-wide TensorRT-LLM requirements
sudo apt-get -y install openmpi-bin libopenmpi-dev

# Create a Python 3.10 environment
conda create -n tensorrt python=3.10
conda activate tensorrt

# Install text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui/
cd text-generation-webui
pip install -r requirements.txt

# This is needed to avoid an error about "Failed to build mpi4py" in the next command
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# Install TensorRT-LLM
pip3 install tensorrt_llm==0.10.0 -U --pre --extra-index-url https://pypi.nvidia.com

Make sure to paste the commands above in the specified order.
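
As an optional sanity check (not part of the original instructions), you can verify that OpenMPI and the TensorRT-LLM Python package are usable before continuing:

# Quick sanity check: both commands should print version information
mpirun --version
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"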

For Windows setup and more information about installation, consult the official README.

Converting a model

Unlike with other backends, it's necessary to convert the model before using it so that it gets optimized for your GPU (or GPUs). These are the commands that I have used:

FP16 models
#!/bin/bash

CHECKPOINT_DIR=/home/me/text-generation-webui/models/NousResearch_Meta-Llama-3-8B-Instruct

cd /home/me/TensorRT-LLM/

python examples/llama/convert_checkpoint.py \
    --model_dir $CHECKPOINT_DIR \
    --output_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --dtype float16

trtllm-build \
    --checkpoint_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --output_dir ${CHECKPOINT_DIR}_TensorRT \
    --gemm_plugin float16 \
    --max_input_len 8192
    # --max_output_len 512

# Copy the tokenizer files
cp ${CHECKPOINT_DIR}/{tokenizer*,special_tokens_map.json} ${CHECKPOINT_DIR}_TensorRT
GPTQ models
#!/bin/bash

CHECKPOINT_DIR=/home/me/text-generation-webui/models/NousResearch_Llama-2-7b-hf
QUANTIZED_DIR=/home/me/text-generation-webui/models/TheBloke_Llama-2-7B-GPTQ
QUANTIZED_FILE="model.safetensors"

cd /home/me/TensorRT-LLM/

python examples/llama/convert_checkpoint.py \
    --model_dir $CHECKPOINT_DIR \
    --output_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --dtype float16 \
    --ammo_quant_ckpt_path "${QUANTIZED_DIR}/$QUANTIZED_FILE" \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group

trtllm-build \
    --checkpoint_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --output_dir ${QUANTIZED_DIR}_TensorRT \
    --gemm_plugin float16 \
    --max_input_len 4096
    # --max_output_len 512

# Copy the tokenizer files
cp ${CHECKPOINT_DIR}/{tokenizer*,special_tokens_map.json} ${QUANTIZED_DIR}_TensorRT

More commands can be found on this page:

https://github.com/NVIDIA/TensorRT-LLM/tree/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/llama

Make sure to use this commit of TensorRT-LLM for the commands above to work:

git clone https://github.com/NVIDIA/TensorRT-LLM/
cd TensorRT-LLM
git checkout 9bd15f1937f52658fb116f30d58fea786ce5d03b

They will generate folders named like this, containing both the converted model and a copy of the tokenizer files:

NousResearch_Llama-2-7b-hf_TensorRT
NousResearch_Llama-2-13b-hf_TensorRT
TheBloke_Llama-2-7B-GPTQ_TensorRT
TheBloke_Llama-2-13B-GPTQ_TensorRT

Loading a model

Here is an example:

python server.py \
  --model TheBloke_Llama-2-7B-GPTQ_TensorRT \
  --loader TensorRT-LLM \
  --max_seq_len 4096

Details

  • There are two ways to load the model: with a class called ModelRunnerCpp or with another one called ModelRunner. The former is faster but does not support streaming yet. You can select it with the --cpp-runner flag (see the example below).
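
For example, the command from the previous section can be switched to ModelRunnerCpp by appending that flag:

python server.py \
  --model TheBloke_Llama-2-7B-GPTQ_TensorRT \
  --loader TensorRT-LLM \
  --max_seq_len 4096 \
  --cpp-runner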

TODO

  • Figure out prefix matching. This is already implemented, but there is no clear documentation on how to use it -- see issues #1043 and #620. Does it work by default?
  • Create a TensorRT-LLM_HF loader integrated with the existing sampling functions in the project. Left this for later.

Ph0rk0z (Contributor) commented Mar 17, 2024

Heh, I was able to get flash attention working with torch 2.2.1. I had tried TRT on the SD side and the hassle wasn't worth it. I wonder how this does with multi-GPU inference. Not using flash attention probably also really balloons the context memory usage. Will be fun to find out.

Nurgl commented Apr 25, 2024

TensorRT and Triton Inference Server can allocate memory across several video cards at once and respond to several users in parallel. Is it possible to bring this functionality to text-generation-webui?

oobabooga (Owner, Author) commented

> TensorRT and Triton Inference Server can allocate memory across several video cards at once and respond to several users in parallel. Is it possible to bring this functionality to text-generation-webui?

It should be possible -- the first step would be to remove the semaphore from modules/text_generation.py and figure out how to connect things together, maybe with a command-line flag for the maximum number of concurrent users. A PR with that addition would be welcome.

Nurgl commented May 2, 2024

It would be nice to have both a queue mode and a parallel processing mode.

oobabooga merged commit 577a8cd into dev on Jun 24, 2024
oobabooga deleted the tensorrt branch on June 24, 2024 at 05:36
ohso4 commented Sep 26, 2024

I tried this, and I get the error below. It seems like autoawq and autoawq-kernels require torch==2.2.2, but if I install that, then tensorrt won't work because it requires torch 2.4.0:

╭────────────────────────────────────────── Traceback (most recent call last) ──────────────────────────────────────────╮
│ /home/alexander/nvidia_tensorrt/text-generation-webui/server.py:40 in <module>                                        │
│                                                                                                                       │
│    39 import modules.extensions as extensions_module                                                                  │
│ ❱  40 from modules import (                                                                                           │
│    41     chat,                                                                                                       │
│                                                                                                                       │
│ /home/alexander/nvidia_tensorrt/text-generation-webui/modules/training.py:21 in <module>                              │
│                                                                                                                       │
│    20 from datasets import Dataset, load_dataset                                                                      │
│ ❱  21 from peft import (                                                                                              │
│    22     LoraConfig,                                                                                                 │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/__init__.py:22 in <module>                 │
│                                                                                                                       │
│    21                                                                                                                 │
│ ❱  22 from .auto import (                                                                                             │
│    23     AutoPeftModel,                                                                                              │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/auto.py:32 in <module>                     │
│                                                                                                                       │
│    31 from .config import PeftConfig                                                                                  │
│ ❱  32 from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING                                                           │
│    33 from .peft_model import (                                                                                       │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/mapping.py:22 in <module>                  │
│                                                                                                                       │
│    21                                                                                                                 │
│ ❱  22 from peft.tuners.xlora.model import XLoraModel                                                                  │
│    23                                                                                                                 │
│                                                                                                                       │
│                                                ... 2 frames hidden ...                                                │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/tuners/lora/model.py:50 in <module>        │
│                                                                                                                       │
│    49 from .aqlm import dispatch_aqlm                                                                                 │
│ ❱  50 from .awq import dispatch_awq                                                                                   │
│    51 from .config import LoraConfig                                                                                  │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/tuners/lora/awq.py:26 in <module>          │
│                                                                                                                       │
│    25 if is_auto_awq_available():                                                                                     │
│ ❱  26     from awq.modules.linear import WQLinear_GEMM                                                                │
│    27                                                                                                                 │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/__init__.py:2 in <module>                   │
│                                                                                                                       │
│   1 __version__ = "0.2.6"                                                                                             │
│ ❱ 2 from awq.models.auto import AutoAWQForCausalLM                                                                    │
│   3                                                                                                                   │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/models/__init__.py:21 in <module>           │
│                                                                                                                       │
│   20 from .llava_next import LlavaNextAWQForCausalLM                                                                  │
│ ❱ 21 from .phi3 import Phi3AWQForCausalLM                                                                             │
│   22 from .cohere import CohereAWQForCausalLM                                                                         │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/models/phi3.py:7 in <module>                │
│                                                                                                                       │
│     6 from awq.modules.fused.model import Phi3Model as AWQPhi3Model                                                   │
│ ❱   7 from transformers.models.phi3.modeling_phi3 import (                                                            │
│     8     Phi3DecoderLayer as OldPhi3DecoderLayer,                                                                    │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'transformers.models.phi3'
(tensorrt) alexander@debian-server:~/nvidia_tensorrt/text-generation-webui$ 

because of this:

tensorrt-llm 0.12.0 requires torch<=2.4.0,>=2.4.0a0, but you have torch 2.2.2 which is incompatible.

However, if I try to fix this by installing torch 2.4, I get a different error:

autoawq 0.2.6 requires torch==2.2.2, but you have torch 2.4.0 which is incompatible.
autoawq-kernels 0.0.7 requires torch==2.2.2, but you have torch 2.4.0 which is incompatible.

And this is what happens when I try to run the webui:

/home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/modules/linear/exllama.py:12: UserWarning: AutoAWQ could not load ExLlama kernels extension. Details: /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/exl_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
  warnings.warn(f"AutoAWQ could not load ExLlama kernels extension. Details: {ex}")
/home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/modules/linear/exllamav2.py:13: UserWarning: AutoAWQ could not load ExLlamaV2 kernels extension. Details: /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/exlv2_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
  warnings.warn(f"AutoAWQ could not load ExLlamaV2 kernels extension. Details: {ex}")
/home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/modules/linear/gemm.py:14: UserWarning: AutoAWQ could not load GEMM kernels extension. Details: /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
  warnings.warn(f"AutoAWQ could not load GEMM kernels extension. Details: {ex}")
/home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/modules/linear/gemv.py:11: UserWarning: AutoAWQ could not load GEMV kernels extension. Details: /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
  warnings.warn(f"AutoAWQ could not load GEMV kernels extension. Details: {ex}")
╭────────────────────────────────────────── Traceback (most recent call last) ──────────────────────────────────────────╮
│ /home/alexander/nvidia_tensorrt/text-generation-webui/server.py:40 in <module>                                        │
│                                                                                                                       │
│    39 import modules.extensions as extensions_module                                                                  │
│ ❱  40 from modules import (                                                                                           │
│    41     chat,                                                                                                       │
│                                                                                                                       │
│ /home/alexander/nvidia_tensorrt/text-generation-webui/modules/training.py:21 in <module>                              │
│                                                                                                                       │
│    20 from datasets import Dataset, load_dataset                                                                      │
│ ❱  21 from peft import (                                                                                              │
│    22     LoraConfig,                                                                                                 │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/__init__.py:22 in <module>                 │
│                                                                                                                       │
│    21                                                                                                                 │
│ ❱  22 from .auto import (                                                                                             │
│    23     AutoPeftModel,                                                                                              │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/auto.py:32 in <module>                     │
│                                                                                                                       │
│    31 from .config import PeftConfig                                                                                  │
│ ❱  32 from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING                                                           │
│    33 from .peft_model import (                                                                                       │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/peft/mapping.py:22 in <module>                  │
│                                                                                                                       │
│    21                                                                                                                 │
│ ❱  22 from peft.tuners.xlora.model import XLoraModel                                                                  │
│    23                                                                                                                 │
│                                                                                                                       │
│                                                ... 8 frames hidden ...                                                │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/quantize/quantizer.py:10 in <module>        │
│                                                                                                                       │
│     9 from awq.utils.calib_data import get_calib_dataset                                                              │
│ ❱  10 from awq.quantize.scale import apply_scale, apply_clip                                                          │
│    11 from awq.utils.utils import clear_memory, get_best_device                                                       │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/awq/quantize/scale.py:8 in <module>             │
│                                                                                                                       │
│     7 from transformers.models.bloom.modeling_bloom import BloomGelu                                                  │
│ ❱   8 from transformers.models.llama.modeling_llama import LlamaRMSNorm                                               │
│     9 from transformers.models.gemma.modeling_gemma import GemmaRMSNorm                                               │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:55  │
│ in <module>                                                                                                           │
│                                                                                                                       │
│     54 if is_flash_attn_2_available():                                                                                │
│ ❱   55     from flash_attn import flash_attn_func, flash_attn_varlen_func                                             │
│     56     from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa                       │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/flash_attn/__init__.py:3 in <module>            │
│                                                                                                                       │
│    2                                                                                                                  │
│ ❱  3 from flash_attn.flash_attn_interface import (                                                                    │
│    4     flash_attn_func,                                                                                             │
│                                                                                                                       │
│ /home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py:10 in        │
│ <module>                                                                                                              │
│                                                                                                                       │
│      9 # We need to import the CUDA kernels after importing torch                                                     │
│ ❱   10 import flash_attn_2_cuda as flash_attn_cuda                                                                    │
│     11                                                                                                                │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ImportError: 
/home/alexander/miniconda3/envs/tensorrt/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: 
undefined symbol: _ZN3c104cuda9SetDeviceEi
(tensorrt) alexander@debian-server:~/nvidia_tensorrt/text-generation-webui$ 

Ph0rk0z (Contributor) commented Sep 27, 2024

Just compile the AutoAWQ kernels yourself for your torch version.
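
For reference, a rough sketch of doing that is to build the kernels from source against the torch already installed in the environment (the repository and steps below are an illustration, not taken from this thread, and require a CUDA toolkit matching your torch build):

git clone https://github.com/casper-hansen/AutoAWQ_kernels
cd AutoAWQ_kernels
# Build against the torch already present in the environment
pip install . --no-build-isolation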

PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Oct 22, 2024