Add TensorRT-LLM support #5715
Conversation
Heh, I was able to get flash attention working with torch 2.2.1. I had tried TRT on the SD side and the hassle wasn't worth it. I wonder how this does with multi-GPU inference. It not using flash attention probably also really balloons memory use at longer contexts. Will be fun to find out.
TensorRT and Triton Inference Server can reserve memory across several video cards at once and respond to several users in parallel. Is it possible to bring this functionality to text-generation-webui?
It should be possible -- the first step would be to remove the semaphore from modules/text_generation.py and figure out how to connect things together, maybe with a command-line flag for the maximum number of concurrent users. A PR with that addition would be welcome.
It would be nice to have both a queue mode and a parallel-processing mode.
I tried this, and I get this error:
because of this:
However, if I try to fix this by installing torch 2.4, I get a different error:
And this is what happens when I try to run the webui:
Just compile the AutoAWQ kernels yourself for your torch version.
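As a rough sketch of what that compile looks like (the repository URL is the upstream AutoAWQ kernels repo; it's assumed your installed CUDA toolkit matches your torch build):

```bash
# Sketch: build the AutoAWQ CUDA kernels against the torch version already installed.
# Requires nvcc from a CUDA toolkit compatible with that torch build.
git clone https://github.com/casper-hansen/AutoAWQ_kernels
cd AutoAWQ_kernels
pip install .

# Then install AutoAWQ itself without letting it pull in a pinned torch wheel.
pip install autoawq --no-deps
```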
TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/) is a new inference backend developed by NVIDIA.
In my testing, I found it to be consistently faster than ExLlamaV2 in both prompt processing and evaluation. That makes it the new SOTA inference backend in terms of speed.
Speed tests
I provided the models with a 3200-token input and measured the time to process those 3200 tokens and the time to generate 512 tokens afterwards. I did this over the API, and each number in the table above is the median of 20 measurements.
To accurately measure the TensorRT-LLM speeds, it was necessary to do a warmup generation before starting the measurements, as the first generation has an overhead due to module imports. The same warmup was done for ExLlamaV2 as well.
The tests were carried out on an RTX 6000 Ada GPU.
Installation
Option 1: Docker
Just use the included Dockerfile under `docker/TensorRT-LLM/Dockerfile`, which will automatically set everything up from scratch. I find the following commands useful (make sure to run them after moving into the folder containing the Dockerfile with `cd`).
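As a rough sketch, with the image tag, port, and models mount as placeholders:

```bash
# Build the image from the folder containing the Dockerfile
# (adjust the build context if the Dockerfile expects the repo root instead).
docker build -t text-generation-webui-trtllm .

# Run it with GPU access and the web UI port exposed; mount your models folder
# so converted engines are visible inside the container (the in-container path
# is a placeholder -- match it to where the image expects models).
docker run --rm -it --gpus all -p 7860:7860 \
  -v "$(pwd)/../../models:/app/models" \
  text-generation-webui-trtllm
```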
Option 2: Manually
TensorRT-LLM only works with Python 3.10 at the moment, while this project uses Python 3.11 by default, so it's necessary to create a separate Python 3.10 conda environment.
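A minimal sketch of those steps (the environment name is arbitrary, and the wheel comes from NVIDIA's package index; double-check against the exact pins you need):

```bash
# Dedicated Python 3.10 environment, since TensorRT-LLM does not support 3.11 yet.
conda create -n textgen-trt python=3.10 -y
conda activate textgen-trt

# Install the web UI's own requirements first, then the TensorRT-LLM wheel
# from NVIDIA's package index.
cd text-generation-webui
pip install -r requirements.txt
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
```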
Make sure to paste the commands above in the specified order.
For Windows setup and more information about installation, consult the official README.
Converting a model
Unlike with other backends, it's necessary to convert the model before using it so that it gets optimized for your GPU (or GPUs). These are the commands that I have used:
FP16 models
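Roughly, following the examples/llama workflow linked below (model paths and output folder names are placeholders):

```bash
# Convert the Hugging Face checkpoint to a TensorRT-LLM checkpoint in FP16.
python convert_checkpoint.py \
  --model_dir /path/to/Llama-2-7b-hf \
  --output_dir ./trt_ckpt_fp16 \
  --dtype float16

# Build the optimized engine for the local GPU.
trtllm-build \
  --checkpoint_dir ./trt_ckpt_fp16 \
  --output_dir ./Llama-2-7b-hf-TensorRT-LLM \
  --gemm_plugin float16
```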
GPTQ models
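And for a GPTQ-quantized checkpoint, something along these lines (again, paths are placeholders and the flags follow the linked examples/llama page):

```bash
# Convert, pointing at the GPTQ safetensors file and enabling int4 weight-only GPTQ.
python convert_checkpoint.py \
  --model_dir /path/to/Llama-2-7b-hf \
  --output_dir ./trt_ckpt_gptq \
  --dtype float16 \
  --quant_ckpt_path /path/to/llama-2-7b-gptq-4bit-128g.safetensors \
  --use_weight_only \
  --weight_only_precision int4_gptq \
  --per_group

# Build the engine from the quantized checkpoint.
trtllm-build \
  --checkpoint_dir ./trt_ckpt_gptq \
  --output_dir ./Llama-2-7b-GPTQ-TensorRT-LLM \
  --gemm_plugin float16
```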
More commands can be found on this page:
https://github.com/NVIDIA/TensorRT-LLM/tree/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/llama
Make sure to use the commit of TensorRT-LLM pinned in that URL (728cc0044bb76d1fafbcaa720c403e8de4f81906) for the commands above to work.
The commands generate output folders containing both the converted model and a copy of the tokenizer files.
Loading a model
Here is an example:
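The folder name below is a placeholder; this assumes the converted folder sits under `models/` and the new loader is selected by name, as with the other backends.

```bash
python server.py \
  --model Llama-2-7b-hf-TensorRT-LLM \
  --loader TensorRT-LLM
# Add --cpp-runner to use the faster ModelRunnerCpp path described under Details,
# at the cost of streaming.
```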
Details
There are two ways to load the model: with a class called `ModelRunnerCpp` or with another one called `ModelRunner`. The first is faster, but it does not support streaming yet. You can use it with the `--cpp-runner` flag.
TODO
- Figure out prefix matching. This is already implemented, but there is no clear documentation on how to use it -- see issues #1043 and #620. Does it work by default?
- Create a `TensorRT-LLM_HF` loader integrated with the existing sampling functions in the project. Left this for later.