
Add TensorRT infer support #57

Open
wants to merge 13 commits into main

Conversation


@xiang-wuu xiang-wuu commented Jul 8, 2022

This PR exports the model from PyTorch to ONNX and then serializes the exported ONNX model to a native TRT engine, which is used for inference with TensorRT, i.e.:

  • Implement onnx_to_tensorrt.py script
  • Export the ONNX model to a TensorRT engine (a minimal build sketch follows the list)
  • Implement a Python module to run inference from the serialized TRT engine
  • Integrate the pre-process & post-process functions from the detect.py script into the TensorRT inference script
  • Draw bounding box detections over a sample image
  • Custom Detect plugin integration for TensorRT < 8.0
  • Implement an INT8 calibrator script for INT8 serialization
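For reference, a minimal sketch of the ONNX-to-engine step using the TensorRT Python API (TRT 8.x style; the file names, workspace size and FP16 flag here are illustrative, and the actual logic lives in onnx_to_tensorrt.py):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_engine(onnx_path="yolov7.onnx", engine_path="yolov7.engine", fp16=True):
    builder = trt.Builder(TRT_LOGGER)
    # ONNX models require an explicit-batch network
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX file")
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1 GiB; deprecated in newer TRT releases
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)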

@philipp-schmidt
Contributor

philipp-schmidt commented Jul 10, 2022

When exported with --grid:
python models/export.py --weights yolov7.pt --grid

Building the TensorRT engine fails:

root@3aa30b614471:/workspace/yolov7# python deploy/TensoorRT/onnx_to_tensorrt.py --onnx yolov7.onnx --fp16 --explicit-batch -o yolov7.engine
Namespace(calibration_batch_size=128, calibration_cache='calibration.cache', calibration_data=None, debug=False, explicit_batch=True, explicit_precision=False, fp16=True, gpu_fallback=False, int8=False, max_batch_size=None, max_calibration_size=2048, onnx='yolov7.onnx', output='yolov7.engine', refittable=False, simple=False, strict_types=False, verbosity=None)
2022-07-10 07:53:52 - __main__ - INFO - TRT_LOGGER Verbosity: Severity.ERROR
2022-07-10 07:53:52 - __main__ - INFO - Setting BuilderFlag.FP16
[TensorRT] ERROR: [graphShapeAnalyzer.cpp::throwIfError::1306] Error Code 9: Internal Error (Mul_378: broadcast dimensions must be conformable
)
ERROR: Failed to parse the ONNX file: yolov7.onnx
In node 378 (parseGraph): INVALID_NODE: Invalid Node - Mul_378
[graphShapeAnalyzer.cpp::throwIfError::1306] Error Code 9: Internal Error (Mul_378: broadcast dimensions must be conformable
)

Any idea how to fix that @xiang-wuu ?

@philipp-schmidt
Contributor

I have the same issue when using trtexec for conversion, so this is definitely a TensorRT / ONNX issue.
Here: #66

@xiang-wuu
Author

@philipp-schmidt that could be an issue with the PyTorch and ONNX versions; try upgrading both to their latest releases. However, I am still working on the post-processing part for the --grid option, which returns a primary output node with shape (1, 25200, 85); a rough decoding sketch follows.
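For context, the (1, 25200, 85) tensor from the --grid export follows the usual YOLO layout: 4 box coordinates, 1 objectness score and 80 class scores per candidate. A rough decoding sketch, assuming that layout and leaving NMS to the existing detect.py utilities:

import numpy as np

def decode(pred, conf_thres=0.25):
    # pred: (1, 25200, 85) -> xywh boxes, objectness, class scores
    pred = pred[0]
    boxes_xywh = pred[:, :4]
    scores = pred[:, 4:5] * pred[:, 5:]          # objectness * class confidence
    cls_ids = scores.argmax(axis=1)
    best = scores[np.arange(len(scores)), cls_ids]
    keep = best > conf_thres
    return boxes_xywh[keep], best[keep], cls_ids[keep]  # then run NMS on these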

@philipp-schmidt
Contributor

Yes, it was the PyTorch version. I also had to run onnx-simplifier, otherwise TensorRT had issues with a few Resize operations.

Looking forward to trying your implementation.
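For anyone hitting the same Resize issues, a minimal simplification pass with the onnx-simplifier Python API (a sketch; the file names are placeholders):

import onnx
from onnxsim import simplify

model = onnx.load("yolov7.onnx")
model_simp, ok = simplify(model)
assert ok, "simplified ONNX model could not be validated"
onnx.save(model_simp, "yolov7-sim.onnx")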

@xiang-wuu
Author

Almost done, with some final typos to be resolved.

@philipp-schmidt
Contributor

philipp-schmidt commented Jul 12, 2022

Quickly scanned the code and it looks really good!

A few questions / remarks:

  1. You use yolov7.cache for INT8; how do you put that together? Still a todo? I'm actually curious about the yolov7 INT8 performance-to-accuracy tradeoff, so that would be cool to see!

  2. Conversion from ONNX to TensorRT can also be done with TensorRT directly without any additional code. The NGC TensorRT docker images come with a precompiled tool "trtexec" which will happily turn ONNX into an engine.

  3. I'm looking into making the batch size dynamic so that e.g. Triton Inference Server can combine smaller requests into larger batch sizes via a feature called Dynamic Batching (e.g. pack multiple simultaneously arriving batch-1 requests into one larger batch 4).
    While coding this, did you somehow manage to make the input batch size of the TensorRT engine dynamic up to a maximum batch size (see the optimization-profile sketch below)?
    So basically the input shape will be either [-1,640,640,3] for explicit batching or [640,640,3] with implicit batching. In the past ONNX was unable to support implicit batching (this still seems to be the case), and custom plugins were a little hard to make work with dynamic (-1) + explicit batching.
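On point 3, the usual way to get a dynamic batch dimension with explicit batching is a TensorRT optimization profile. A sketch, assuming an NCHW input tensor named "images" and a builder/config as in the build sketch above (the input name and batch limits are placeholders):

profile = builder.create_optimization_profile()
profile.set_shape("images",                      # input tensor name (assumed)
                  min=(1, 3, 640, 640),
                  opt=(4, 3, 640, 640),
                  max=(8, 3, 640, 640))
config.add_optimization_profile(profile)
# The ONNX export must also mark the batch axis as dynamic,
# e.g. dynamic_axes={"images": {0: "batch"}} in torch.onnx.export.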

@albertfaromatics

Hi, sorry to write here. I've tried your branch with TensorRT and a custom-trained yolov7-tiny on an Nvidia Jetson Xavier NX. I converted the trained PyTorch model with no problem after some tries, but when testing the results, both mAP and FPS are much lower:

  • PyTorch + CUDA: ~40 fps, 78 mAP
  • TRT: ~24 fps, 64 mAP

Is this normal? Am I doing something wrong?

@xiang-wuu
Author

Replying to @philipp-schmidt's questions above:

  1. I will add a calibration script for PTQ; a minimal calibrator sketch follows.
  2. Yes, serialization with trtexec is possible, but if you are using TRT < 8.0 the custom plugin needs to be preloaded.
  3. I haven't tested a maximum dynamic batch size, but as far as I know dynamic batching is effectively abstracted by Triton, and exporting the ONNX model with implicit batching could make it work with Triton; still subject to trial & error!
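A minimal INT8 calibrator sketch for point 1 (TensorRT Python API; the batch size, input shape, file names and data loading are illustrative, and the actual calibration script may differ):

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import tensorrt as trt

class Yolov7Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="yolov7.cache", batch_size=8):
        super().__init__()
        self.batches = iter(batches)  # iterable of (batch_size, 3, 640, 640) float32 arrays
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.device_input = cuda.mem_alloc(batch_size * 3 * 640 * 640 * 4)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Hook it into the builder config:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = Yolov7Calibrator(my_batches)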

@xiang-wuu
Author

Replying to @albertfaromatics above: optimization is out of scope for this PR. This PR is intended to provide a minimal, deployable TRT implementation; optimization is altogether subject to further contribution.

@philipp-schmidt
Contributor

@albertfaromatics
How do you test FPS and mAP?
There is very little chance that your TensorRT engine is slower than PyTorch directly, especially on Jetson.

@albertfaromatics

@philipp-schmidt
For PyTorch + CUDA, I simply adapted the detect.py here to read a folder of images (around 200 of them), compute the prediction time (inference + NMS) and compute FPS.
For TensorRT, I followed the README in the repo, with export, simplify, onnx_to_tensorrt (I'm using TensorRT 8.4) and run.

These steps gave me the FPS (40ish vs 25ish). For mAP I used the test here and adapted code to get the detections from TensorRT and "manually" compute mAP.
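One thing worth double-checking is how the time is measured: CUDA launches are asynchronous, so timing without a synchronize call can misreport either path. A rough sketch for the PyTorch side (model and img are placeholders for your loaded model and a preprocessed tensor):

import time
import torch

def fps(model, img, warmup=10, iters=100):
    with torch.no_grad():
        for _ in range(warmup):      # warm up CUDA kernels / autotuning
            model(img)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(iters):
            model(img)
        torch.cuda.synchronize()     # wait for all GPU work before stopping the clock
    return iters / (time.time() - t0)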

@philipp-schmidt
Contributor

philipp-schmidt commented Jul 13, 2022

Try to run your engine with trtexec instead; it will give you a very good indication of the actual compute latency.

Last few steps of this: https://github.com/isarsoft/yolov4-triton-tensorrt#build-tensorrt-engine
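For example, something along these lines benchmarks a prebuilt engine (exact flags vary by TensorRT version; check trtexec --help):
trtexec --loadEngine=yolov7.engine --warmUp=500 --avgRuns=100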

@philipp-schmidt
Contributor

I don't think it comes prebuilt in the Linux 4 Tegra TensorRT docker images for Jetson, though.

@albertfaromatics

@philipp-schmidt I'll give it a try. I can compile it myself from the tensorrt/samples folder, but I have never used it before.

I'll try and see why I get these results.
Thanks!

@xiang-wuu
Author

@WongKinYiu good to merge.

@ccqedq

ccqedq commented Jul 14, 2022

It works, but no bounding box is drawn.

@xiang-wuu
Author

It works, but no bounding box is drawn.

Can you share the environment details?

@ccqedq

ccqedq commented Jul 15, 2022

torch 1.11.0+cu113, onnx 1.12.0, tensorrt 8.4.1.5.
I used the built-in ScatterND op plugin to run the code, but found that no bounding boxes are drawn.
Considering the built-in plugin is used, could this be a data preprocessing problem?
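A common cause of an engine that runs but draws nothing is a preprocessing mismatch. A minimal sketch of the usual YOLOv7 input pipeline (letterbox to 640, BGR to RGB, HWC to CHW, scale to 0-1); the exact resize/padding behaviour should be checked against detect.py:

import cv2
import numpy as np

def preprocess(bgr, size=640):
    h, w = bgr.shape[:2]
    r = min(size / h, size / w)                              # letterbox scale factor
    nh, nw = int(round(h * r)), int(round(w * r))
    resized = cv2.resize(bgr, (nw, nh))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)   # gray padding, as in detect.py
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    img = canvas[:, :, ::-1].transpose(2, 0, 1)              # BGR -> RGB, HWC -> CHW
    img = np.ascontiguousarray(img, dtype=np.float32) / 255.0
    return img[None], r, (left, top)                         # batch dim + scale/pad for rescaling boxes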

@ccqedq

ccqedq commented Jul 15, 2022

I used the deploy_onnx_trt branch to generate yolov7.onnx. To get yolov7.engine, I ran the following command:
python3 onnx_to_tensorrt.py --explicit-batch --onnx yolov7-sim.onnx -o yolov7.engine

@xiang-wuu
Author

@dongdengwei, try without building the plugin if you are using TRT > 8.0.

@ccqedq

ccqedq commented Jul 15, 2022

I ran the following command to do the inference:
python3 yolov7_trt.py video1.mp4
Still no bounding boxes.

@ccqedq

ccqedq commented Jul 15, 2022

It seems that I should replace "return x if self.training else (torch.cat(z, 1), x)" with "return x if self.training else (torch.cat(z, 1), ) if not self.export else (torch.cat(z, 1), x)" in yolo.py.
But in the environment torch 1.10.1+cu111, onnx 1.8.1, tensorrt 7.2.3.4, it gives the following error:
2022-07-15 18:18:59 - main - INFO - TRT_LOGGER Verbosity: Severity.ERROR
getFieldNames
createPlugin
[TensorRT] ERROR: Mul_378: elementwise inputs must have same dimensions or follow broadcast rules (input dimensions were [1,3,80,80,2] and [1,1,1,3,2]).
Should I upgrade torch 1.10.1 to 1.11.0 and onnx 1.8.1 to 1.12.0?

@xiang-wuu
Author

@dongdengwei PyTorch > 1.11.0 is required to make it work; 1.12.0 is recommended.

@akashAD98
Contributor

@xiang-wuu @philipp-schmidt @AlexeyAB @Linaom1214 can you share the mAP performance of the converted model? Is the accuracy the same after conversion, or how much does it drop? Also, it would be great if you added support for checking the mAP of the .trt model and its inference on video. Thanks.

@akashAD98
Contributor

akashAD98 commented Aug 10, 2022

Linaom1214/TensorRT-For-YOLO-Series#26: not able to do inference on videos.

@Stoooner

Stoooner commented Sep 5, 2022

Hi, replying to @albertfaromatics above: I have tested the yolov7-tiny TensorRT model on a Jetson Xavier NX with my own code, and the result is shown in issue #703; maybe you can check it.

@Linaom1214
Contributor

Linaom1214/TensorRT-For-YOLO-Series#26 not able to do inference on videos

The reason is that the Colab environment doesn't support the OpenCV imshow function.
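A common workaround, assuming the notebook is run in Colab, is to use the Colab display patch or to write the annotated frames to disk instead of calling cv2.imshow:

from google.colab.patches import cv2_imshow  # Colab-only replacement for cv2.imshow
cv2_imshow(frame)
# or skip display entirely and save the output with cv2.VideoWriter / cv2.imwrite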

@9friday

9friday commented Oct 14, 2023

Hi @xiang-wuu,
I'm using an Nvidia Jetson Xavier AGX with JetPack version 4.6.1 and CUDA version 10.2, and I would like to recreate the Xavier NX results quoted earlier in this thread (~40 fps / 78 mAP for PyTorch vs ~24 fps / 64 mAP for TRT) for both the .pt and TRT formats.
We have tried converting to .engine files using the 'trtexec' binary already present with the L4T installation on the Jetson device, but the inference timings are not good. For inference we used the 'official yolov7 deepstream inference script' from NVIDIA.

Environment setup:

Should the requirements.txt from the yolov7 repo be used on the Xavier AGX as it is?
Should we install PyTorch from 'PyTorch for Jetson'? The PyTorch wheel corresponding to JetPack 4.6.1 is this.

Inference on Jetson device:

Is the original detect.py sufficient for inference using .pt weights on Jetson devices?
Is the YOLOv7ONNXandTRT.ipynb file sufficient for inference using TRT format weights on Jetson devices?

Looking forward to your response.

Cheers :)
