This project accelerates inference of Google's T5 model on NVIDIA GPUs using the TensorRT C++ API. It includes an inference demo and an HTTP API service, supports concurrent requests, and runs on Linux/Windows.
For acceleration on CPU, see the companion project https://github.com/apple2333cream/t5-ort-cpp.git
Original model repository: https://huggingface.co/google-t5/t5-base
- TensorRT 10.0.1 (in principle any version >= 8.9.6 works)
- cuDNN 8.9.4
- CUDA 12.4
- Python 3.10
pip install -r HuggingFace/requirements.txt
2.1 HuggingFace -> ONNX (a minimal export sketch follows this section)
2.2 ONNX -> TensorRT engine
cd ./HuggingFace
bash gen_t5_bs1_beam2.sh (fp16)
Note: the encoder and decoder are converted separately. If they were merged into a single exported model, BeamSearch would have to be hand-written during the TensorRT conversion (the next version will export a single model for inference).
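The repository's own Python scripts under ./HuggingFace implement both steps; the sketch below only illustrates step 2.1 for the encoder, and the names in it (the wrapper class, file names, opset) are assumptions rather than the repo's actual code:

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-base").eval()

# Wrap the encoder so it returns a plain tensor instead of a ModelOutput,
# which keeps the ONNX graph's inputs/outputs simple. (Hypothetical wrapper.)
class Encoder(torch.nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids):
        return self.encoder(input_ids=input_ids).last_hidden_state

dummy = torch.ones(1, 8, dtype=torch.long)  # (batch, seq) dummy token ids
torch.onnx.export(
    Encoder(model.encoder),
    (dummy,),
    "t5_encoder.onnx",
    input_names=["input_ids"],
    output_names=["hidden_states"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "hidden_states": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)
# The decoder is exported analogously, with the encoder hidden states
# (and past key/values for fast decoding) as extra inputs.
```

For step 2.2, gen_t5_bs1_beam2.sh drives the ONNX -> engine build; roughly the same thing can be done with the TensorRT Python API, as sketched here (the input name and shape ranges are assumptions):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TRT 10
parser = trt.OnnxParser(network, logger)
with open("t5_encoder.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # the script builds an fp16 engine

# T5 inputs have dynamic batch/sequence dims, so an optimization profile is needed.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (1, 64), (1, 512))  # min / opt / max
config.add_optimization_profile(profile)

with open("t5_encoder.plan", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```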
mkdir build && cd build
cmake ..
make -j8
- demo ./t5_engine --use_mode=0
- test ./t5_engine --use_mode=1
- api ./t5_engine --use_mode=2
Example service request:
curl -X POST -d '{ "RequestID": "65423221", "InputText": "translate English to French: I was a victim of a series of accidents." }' http://127.0.0.1:17653/T5/register
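The same request can be issued programmatically; here is a minimal Python sketch, assuming the service accepts a raw JSON body exactly as in the curl example (the response format is not documented here):

```python
import json
import requests

payload = {
    "RequestID": "65423221",
    "InputText": "translate English to French: I was a victim of a series of accidents.",
}
# Endpoint and field names are taken from the curl example above.
resp = requests.post("http://127.0.0.1:17653/T5/register", data=json.dumps(payload))
print(resp.status_code, resp.text)
```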
├── CMakeLists.txt
├── HuggingFace        # model-conversion code, adapted from https://github.com/kshitizgupta21/triton-trt-oss.git
├── main.cpp
├── onnx2tensorrt.sh
├── readme.md
├── src
├── third_party
Using t5-base as an example, TensorRT (fp16) delivers a 2.78x speedup over native PyTorch.
CPU memory usage: 2.2 GB
| Inference framework | GPU memory (GB) | Latency (ms) |
|---------------------|-----------------|--------------|
| torch               | 1.4             | 513          |
| tensorrt (fp32)     | 1.2             | 275          |
| tensorrt (fp16)     | 1.2             | 184          |
Test environment: V100, TensorRT 10.0.1, cuDNN 8.9.4, CUDA 12.4
- 20240728 v1.0.0:
  - Added T5 TensorRT C++ API inference code
QQ: 807876904
- [triton-trt-oss](https://github.com/kshitizgupta21/triton-trt-oss.git)
- [t5-ort-cpp](https://github.com/apple2333cream/t5-ort-cpp.git)
- [ONNX model zoo: T5](https://github.com/onnx/models/tree/main/validated/text/machine_comprehension/t5)