Add TensorRT-LLM support #5715
Conversation
Heh, I was able to get flash attention working with torch 2.2.1. I had tried TRT on the SD side and the hassle wasn't worth it. I wonder how this does with multi-GPU inference. It not using flash attention probably also really balloons memory use at longer contexts. Will be fun to find out.
TensorRT and Triton Inference Server can reserve memory across several video cards at once and respond to several users in parallel. Is it possible to bring this functionality to text-generation-webui?
It should be possible -- the first step would be to remove the semaphore from modules/text_generation.py and figure out how to connect things together, maybe with a command-line flag for the maximum number of concurrent users. A PR with that addition would be welcome.
It would be nice to have both a queue mode and a parallel-processing mode.
I tried this, and I get this error:
because of this:
However, if I try to fix this by installing torch 2.4, I get a different error:
And this is what happens when I try to run the webui:
Just compile the AutoAWQ kernels yourself for your torch version.
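As a rough sketch of what that compile looks like (the repository URL is the upstream AutoAWQ kernels repo; it's assumed your installed CUDA toolkit matches your torch build):

```bash
# Sketch: build the AutoAWQ CUDA kernels against the torch version already installed.
# Requires nvcc from a CUDA toolkit compatible with that torch build.
git clone https://github.com/casper-hansen/AutoAWQ_kernels
cd AutoAWQ_kernels
pip install .

# Then install AutoAWQ itself without letting it pull in a pinned torch wheel.
pip install autoawq --no-deps
```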
TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/) is a new inference backend developed by NVIDIA.
In my testing, I found it to be consistently faster than ExLlamaV2 in both prompt processing and evaluation. That makes it the new SOTA inference backend in terms of speed.
Speed tests
I provided the models with a 3200-token input and measured the time to process those 3200 tokens and the time to generate 512 tokens afterwards. I did this over the API, and each number in the table above is the median of 20 measurements.
To accurately measure the TensorRT-LLM speeds, it was necessary to do a warmup generation before starting the measurements, as the first generation has an overhead due to module imports. The same warmup was done for ExLlamaV2 as well.
The tests were carried out on an RTX 6000 Ada GPU.
Installation
Option 1: Docker
Just use the included Dockerfile under `docker/TensorRT-LLM/Dockerfile`, which will automatically set everything up from scratch. I find the following commands useful (make sure to run them after moving into the folder containing the Dockerfile with `cd`).
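As a rough sketch, with the image tag, port, and models mount as placeholders:

```bash
# Build the image from the folder containing the Dockerfile
# (adjust the build context if the Dockerfile expects the repo root instead).
docker build -t text-generation-webui-trtllm .

# Run it with GPU access and the web UI port exposed; mount your models folder
# so converted engines are visible inside the container (the in-container path
# is a placeholder -- match it to where the image expects models).
docker run --rm -it --gpus all -p 7860:7860 \
  -v "$(pwd)/../../models:/app/models" \
  text-generation-webui-trtllm
```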
Option 2: Manually
TensorRT-LLM only works with Python 3.10 at the moment, while this project uses Python 3.11 by default, so it's necessary to create a separate Python 3.10 conda environment.
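A minimal sketch of those steps (the environment name is arbitrary, and the wheel comes from NVIDIA's package index; double-check against the exact pins you need):

```bash
# Dedicated Python 3.10 environment, since TensorRT-LLM does not support 3.11 yet.
conda create -n textgen-trt python=3.10 -y
conda activate textgen-trt

# Install the web UI's own requirements first, then the TensorRT-LLM wheel
# from NVIDIA's package index.
cd text-generation-webui
pip install -r requirements.txt
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
```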
Make sure to paste the commands above in the specified order.
For Windows setup and more information about installation, consult the official README.
Converting a model
Unlike with other backends, it's necessary to convert the model before using it so that it gets optimized for your GPU (or GPUs). These are the commands that I have used:
FP16 models
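Roughly, following the examples/llama workflow linked below (model paths and output folder names are placeholders):

```bash
# Convert the Hugging Face checkpoint to a TensorRT-LLM checkpoint in FP16.
python convert_checkpoint.py \
  --model_dir /path/to/Llama-2-7b-hf \
  --output_dir ./trt_ckpt_fp16 \
  --dtype float16

# Build the optimized engine for the local GPU.
trtllm-build \
  --checkpoint_dir ./trt_ckpt_fp16 \
  --output_dir ./Llama-2-7b-hf-TensorRT-LLM \
  --gemm_plugin float16
```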
GPTQ models
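And for a GPTQ-quantized checkpoint, something along these lines (again, paths are placeholders and the flags follow the linked examples/llama page):

```bash
# Convert, pointing at the GPTQ safetensors file and enabling int4 weight-only GPTQ.
python convert_checkpoint.py \
  --model_dir /path/to/Llama-2-7b-hf \
  --output_dir ./trt_ckpt_gptq \
  --dtype float16 \
  --quant_ckpt_path /path/to/llama-2-7b-gptq-4bit-128g.safetensors \
  --use_weight_only \
  --weight_only_precision int4_gptq \
  --per_group

# Build the engine from the quantized checkpoint.
trtllm-build \
  --checkpoint_dir ./trt_ckpt_gptq \
  --output_dir ./Llama-2-7b-GPTQ-TensorRT-LLM \
  --gemm_plugin float16
```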
More commands can be found on this page:
https://github.com/NVIDIA/TensorRT-LLM/tree/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/llama
Make sure to use the commit of TensorRT-LLM pinned in that URL (728cc0044bb76d1fafbcaa720c403e8de4f81906) for the commands above to work.
The commands generate output folders containing both the converted model and a copy of the tokenizer files.
Loading a model
Here is an example:
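The folder name below is a placeholder; this assumes the converted folder sits under `models/` and the new loader is selected by name, as with the other backends.

```bash
python server.py \
  --model Llama-2-7b-hf-TensorRT-LLM \
  --loader TensorRT-LLM
# Add --cpp-runner to use the faster ModelRunnerCpp path described under Details,
# at the cost of streaming.
```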
Details
There are two ways to load the model: with a class called `ModelRunnerCpp` or with another one called `ModelRunner`. The first is faster, but it does not support streaming yet. You can use it with the `--cpp-runner` flag.
TODO
- Figure out prefix matching. This is already implemented, but there is no clear documentation on how to use it -- see issues #1043 and #620. Does it work by default?
- Create a `TensorRT-LLM_HF` loader integrated with the existing sampling functions in the project. Left this for later.