Serve the yolov5_face model with TorchServe and a TensorRT backend (recommended): 11 ms latency and 700 queries per second (QPS) on a T4 GPU server.
For the traditional TorchServe pipeline with the JIT TorchScript backend, which has higher latency and lower throughput, see torchserve/
This repo is adapted from (Warning: GNU LICENSE), with the following changes:
- Add Torchserve as Inference server
- Accelerated with TensorRT via the torch2trt toolkit, with 10x lower latency and 2x higher throughput. As far as I know, this is the first demo to show how to serve a TensorRT model on TorchServe.
- Add Docker and logging.
TorchServe is a performant, flexible, and easy-to-use tool for serving PyTorch eager-mode and TorchScripted models.
TensorRT is a library developed by NVIDIA for faster inference on NVIDIA graphics processing units (GPUs). ... It can give around 4 to 5 times faster inference on many real-time services and embedded applications.
"torch2trt" is a PyTorch to TensorRT converter which uses the TensorRT Python API. It keeps the model's input and output as Torch tensors.
TorchServe can serve a torch2trt model well, simply by rewriting the handler like this:
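import logging
import torch
from torch2trt import TRTModule
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)

class Yolov5FaceHandler(BaseHandler):
    def initialize(self, context):
        serialized_file = context.manifest["model"]["serializedFile"]
        # If serializedFile ends with .torch2trt instead of .pt, swap in the torch2trt loader.
        if serialized_file.split(".")[-1] == "torch2trt":
            self._load_torchscript_model = self._load_torch2trt_model
        super().initialize(context)  # the base class then calls the swapped-in loader

    def _load_torch2trt_model(self, torch2trt_path):
        logger.info("Loading torch2trt model")
        model_trt = TRTModule()
        model_trt.load_state_dict(torch.load(torch2trt_path))
        return model_trt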
See pytorch/serve#1243 for discussion.
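The override mechanism can be tried without TorchServe or a GPU. The following is a minimal sketch, assuming a toy `ToyBaseHandler` that stands in for TorchServe's `BaseHandler`; the class names, the dict-shaped context, and the returned strings are all hypothetical. It only illustrates how rebinding `_load_torchscript_model` on the instance before calling the parent `initialize` redirects model loading based on the file extension.

```python
class ToyBaseHandler:
    """Hypothetical stand-in for TorchServe's BaseHandler (not the real class)."""

    def initialize(self, context):
        path = context["model"]["serializedFile"]
        # The parent always calls self._load_torchscript_model, so rebinding
        # that attribute on the instance changes which loader runs.
        self.model = self._load_torchscript_model(path)

    def _load_torchscript_model(self, path):
        return f"torchscript:{path}"


class ToyFaceHandler(ToyBaseHandler):
    def initialize(self, context):
        serialized_file = context["model"]["serializedFile"]
        if serialized_file.split(".")[-1] == "torch2trt":
            # Instance attribute shadows the class method of the same name.
            self._load_torchscript_model = self._load_torch2trt_model
        super().initialize(context)

    def _load_torch2trt_model(self, path):
        return f"torch2trt:{path}"


def load_with(serialized_file):
    """Initialize a toy handler and report which loader produced the model."""
    handler = ToyFaceHandler()
    handler.initialize({"model": {"serializedFile": serialized_file}})
    return handler.model
```

With this sketch, a file ending in `.torch2trt` goes through the TensorRT loader, while a `.pt` file falls through to the default TorchScript loader, so one handler can serve both backends.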
- As of 2021-09-04, a state-of-the-art face-detection model on the WIDER FACE benchmark, balancing speed and accuracy.
- Based on PyTorch: easy to fine-tune, and easy to build an inference server for via TorchServe.
python torchserve/ --mode 1 --vis 1 --image data/images/test.jpg
1. Decode the image from JPEG, then resize and pad it to a lower resolution of 320×320 for acceleration.
2. Run batch inference with the TensorRT backend.
3. Map the face coordinates back to the original resolution and return the result in JSON format.
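The resize, pad, and coordinate-restore steps can be sketched as plain arithmetic. This is a minimal illustration, assuming aspect-preserving "letterbox" resizing with centered padding, as is common in YOLOv5-style pipelines; the repo's actual preprocessing may differ, and the function names here are hypothetical.

```python
def letterbox_params(width, height, target=320):
    """Return (scale, pad_x, pad_y) for fitting an image into a target square."""
    # Scale so the longer side fits exactly into the target size.
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Split the leftover space evenly on both sides.
    pad_x = (target - new_w) / 2
    pad_y = (target - new_h) / 2
    return scale, pad_x, pad_y


def to_original(x, y, scale, pad_x, pad_y):
    """Map a point from the padded target frame back to original pixels."""
    return (x - pad_x) / scale, (y - pad_y) / scale
```

For a 1280×720 input, `letterbox_params(1280, 720)` gives a scale of 0.25 with 70 pixels of padding above and below, and `to_original` undoes that transform for each detected coordinate.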
Input: JPEG data stream. Output: JSON (bounding box, confidence, 5 landmarks).
[
  {
    "xywh_ratio": [0.7689772367477417, 0.25734335581461587, 0.11677041053771975, 0.26296865675184466],
    "conf": 0.8641895651817322,
    "landmarks_ratio": [0.754405927658081, 0.22680193583170574, 0.8030961990356446, 0.23478228251139324, 0.7799828529357911, 0.2754765404595269, 0.7510656356811524, 0.31618389553493925, 0.7911150932312012, 0.32295591566297743]
  },
  {
    "xywh_ratio": [0.4645264148712158, 0.47456512451171873, 0.12120456695556636, 0.29619462754991316],
    "conf": 0.7263935804367065,
    "landmarks_ratio": [0.4809267997741699, 0.44996253119574653, 0.5082815647125244, 0.4542162577311198, 0.5047649383544922, 0.5095860799153645, 0.4696146011352539, 0.5512683444552952, 0.4905359745025635, 0.5559690687391493]
  }
]
Each image yields N faces, and each face has 3 keys:
- xywh_ratio is the face center coordinates followed by the width and height, as ratios of the image size.
- conf is the face-detection confidence, from 0 to 1.
- landmarks_ratio holds the coordinates of 5 face landmarks, as ratios of the image size.
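To draw or crop a detection, the ratios have to be converted back to pixels. The following is a minimal sketch, assuming that `landmarks_ratio` is a flat list of interleaved x, y pairs (which matches the sample values but is not stated explicitly); the function name `face_to_pixels` is hypothetical.

```python
def face_to_pixels(face, img_w, img_h):
    """Convert one face entry from ratios to pixel coordinates.

    Returns ((left, top, right, bottom), [(x, y), ...]).
    """
    # xywh_ratio is center x, center y, width, height.
    cx, cy, w, h = face["xywh_ratio"]
    left = (cx - w / 2) * img_w
    top = (cy - h / 2) * img_h
    right = (cx + w / 2) * img_w
    bottom = (cy + h / 2) * img_h
    lm = face["landmarks_ratio"]
    # Assumption: landmarks are stored as [x1, y1, x2, y2, ...].
    points = [(lm[i] * img_w, lm[i + 1] * img_h) for i in range(0, len(lm), 2)]
    return (left, top, right, bottom), points
```

Because every value is a ratio, the same response can be applied to the original image or to any resized copy just by passing a different `img_w` and `img_h`.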
Follow the instructions below to deploy yolov5face:
- cd yolov5face/docker
- docker-compose up -d
Configurations: the yolov5face configuration is really TorchServe configuration. The configuration file is located at: yolov5face/config/
The configuration items are the IP addresses and ports that the service is bound to.
The number of TorchServe workers is currently hard-coded to 4.
Bottlenecks: each yolov5face TorchServe instance consumes 2.5 GB of memory on average, so system memory is a bottleneck.
pip install -r requirements.txt
Install the Java 11 dependency.
On a cloud server, if the CUDA version differs from CUDA 10.2, manually edit the PyTorch version in requirements.txt.
Download the 50 MB yolov5s file
Unzip to folder
Download TensorRT-7 (compatible with the torch2trt tool as of 2021; TensorRT-8 may also be compatible by now). Installing via tar.gz is recommended, since it works with conda environments. Take care to pick the correct Ubuntu, CUDA, and cuDNN versions.
cd ~/
git clone
cd torch2trt
python install
- Pack the models and Python code into TorchServe's .mar format, using the TensorRT backend.
python ./torchserve/ --trt 1
This generates the file "./torchserve/model_store/trt_fd1.mar".
- Start the server
torchserve --start --ncs --model-store ./torchserve/model_store/
- Register the model on localhost
curl -X POST ""
Note that
- url=trt_fd1.mar
- batch_size=1
- initial_workers=2, where 2 is the number of CPU cores on your server; this requires 3 * 2 GB of system memory.
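The three parameters above are passed as a query string to TorchServe's management API (`POST /models`). The following is a minimal sketch of how they combine, assuming the management API listens on TorchServe's default address `http://localhost:8081`; the exact URL used by this repo is not shown here, and the helper name `register_url` is hypothetical.

```python
from urllib.parse import urlencode


def register_url(mar_file, batch_size=1, initial_workers=2,
                 management="http://localhost:8081"):
    """Build the model-registration URL for the TorchServe management API."""
    query = urlencode({
        "url": mar_file,                     # .mar file inside the model store
        "batch_size": batch_size,            # requests batched per inference call
        "initial_workers": initial_workers,  # workers started at registration
    })
    return f"{management}/models?{query}"


print(register_url("trt_fd1.mar"))
```

The resulting string is what would be passed to `curl -X POST`. Raising `initial_workers` increases memory use accordingly, so it is worth sizing it against the memory estimate above.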
git clone --branch v0.3.0
docker pull
Why this step? docker build needs a GPU and torch2trt to convert the model; see
- Install nvidia-container-runtime:
sudo apt-get install nvidia-container-runtime
- Edit/create the /etc/docker/daemon.json with content:
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
- Restart docker daemon:
sudo systemctl restart docker
docker build -f Dockerfile_base --tag base --progress=plain .
docker build -f Dockerfile_torchserve_tensorrt --tag ts_trt --progress=plain .
docker run --gpus all -it --rm --name test -p 8080:8080 -p 8081:8081 -p 8082:8082 -p 7070:7070 -p 7071:7071 ts_trt
It will run torchserve/ inside the image.
1. Start TorchServe
2. Register the model
Success log:
{"status": "Model \"fd1\" Version: 1.0 registered with 4 initial workers"}
QPS test
python torchserve/ --mode 3
python torchserve/ --mode 3 --vis 1
python torchserve/
curl -T ./data/images/zidane.jpg
bash ./torchserve/
python torchserve/ --mode 3
QPS test on the local TorchScript model
python ./torchserve/ --trt 0
python torchserve/ --mode 1
QPS test on the local TensorRT model
python ./torchserve/ --trt 1
python torchserve/ --mode 2
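For readers who want to reproduce a QPS figure independently, the measurement itself is simple. The following is a minimal sketch, not the repo's test script: `measure_qps` and `request_fn` are hypothetical names, and `request_fn` stands in for whatever sends one image to the server or runs one local inference.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(request_fn, total_requests=100, concurrency=4):
    """Call request_fn from a thread pool; return (qps, mean_latency_seconds)."""
    def timed_call(_):
        start = time.perf_counter()
        request_fn()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall = time.perf_counter() - wall_start
    # Throughput uses wall-clock time; latency is averaged per request.
    return total_requests / wall, sum(latencies) / len(latencies)


if __name__ == "__main__":
    # Stand-in workload: a 10 ms sleep instead of a real inference request.
    qps, latency = measure_qps(lambda: time.sleep(0.01))
    print(f"QPS: {qps:.1f}, mean latency: {latency * 1000:.1f} ms")
```

Throughput and latency are reported separately because they are not simple inverses: with several concurrent clients, QPS can be well above 1 / latency.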