Survey TensorRT for inference #8492
## Problem description

The main goal is to come up with an approach for integrating TensorRT with PaddlePaddle's inference library (which is in C++). We want to do this in order to use TensorRT for performing inference on a model saved using Fluid. To address this, we will first briefly discuss TensorRT and the functionality it offers, before proposing our first attempt.

## TensorRT

### Introduction

TensorRT is a deep learning inference optimizer and runtime from Nvidia, aimed at deploying trained deep networks for inference on a variety of production platforms. Using TensorRT involves two phases:
### Build phase

This step is performed only once, prior to deployment. A model trained using any popular deep learning framework has to first be parsed by TensorRT and imported into the TensorRT Optimizer module. The TensorRT Optimizer performs several optimizations (listed below) and outputs an optimized inference execution engine. This execution engine, when serialized to a file on disk, is known as a plan file.

The crucial part here is importing a trained model. For Caffe and TensorFlow, TensorRT provides simple Python and C++ APIs to import the models directly. However, for other frameworks, we need to use TensorRT's Network Definition API to specify the network description (either in C++ or Python) before loading it into TensorRT; a minimal C++ sketch of this API is given after the list below. An image summarizing this phase is: https://devblogs.nvidia.com/wp-content/uploads/2017/12/pasted-image-0-4-768x656.png

The various optimizations performed by the TensorRT Optimizer include:

- layer and tensor fusion (and elimination of unused layers),
- reduced-precision calibration (FP16/INT8),
- kernel auto-tuning for the target GPU,
- dynamic tensor memory reuse,
- multi-stream execution.
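To make the build phase concrete, here is a minimal sketch (not part of the original survey) of describing a tiny network with TensorRT's C++ Network Definition API and serializing the optimized engine to a plan file. It follows the TensorRT 3.x/4.x headers (`NvInfer.h`); the layer choice, weights, dimensions, and file name are placeholders.

```cpp
// Minimal sketch: describe a one-layer network with the TensorRT C++
// Network Definition API and serialize the optimized engine (plan file).
// Assumes the TensorRT 3.x/4.x API; weights and dims are placeholders.
#include <fstream>
#include <iostream>
#include <vector>
#include "NvInfer.h"

// TensorRT requires an ILogger implementation.
class Logger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) override {
    if (severity != Severity::kINFO) std::cerr << msg << std::endl;
  }
};

int main() {
  Logger logger;
  nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
  nvinfer1::INetworkDefinition* network = builder->createNetwork();

  // Network input: a single-channel 28x28 image (CHW layout).
  auto* data = network->addInput("data", nvinfer1::DataType::kFLOAT,
                                 nvinfer1::DimsCHW{1, 28, 28});

  // Placeholder weights for a fully connected layer with 10 outputs.
  std::vector<float> w(10 * 28 * 28, 0.f), b(10, 0.f);
  nvinfer1::Weights kernel{nvinfer1::DataType::kFLOAT, w.data(),
                           static_cast<int64_t>(w.size())};
  nvinfer1::Weights bias{nvinfer1::DataType::kFLOAT, b.data(),
                         static_cast<int64_t>(b.size())};
  auto* fc = network->addFullyConnected(*data, 10, kernel, bias);
  fc->getOutput(0)->setName("prob");
  network->markOutput(*fc->getOutput(0));

  // Build the optimized engine and serialize it to a plan file.
  builder->setMaxBatchSize(1);
  builder->setMaxWorkspaceSize(1 << 20);
  nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);
  nvinfer1::IHostMemory* plan = engine->serialize();

  std::ofstream out("model.plan", std::ios::binary);
  out.write(static_cast<const char*>(plan->data()), plan->size());

  plan->destroy();
  engine->destroy();
  network->destroy();
  builder->destroy();
  return 0;
}
```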
### Deploy phase

In this phase, the saved plan file is loaded and deserialized to create a TensorRT runtime engine object, which is then used to perform inference on new data (a minimal C++ sketch of this step is given at the end of this description).

## Our approach

As discussed in the "Build phase" subsection, the most important point for our use case is to import a model trained with PaddlePaddle Fluid into TensorRT. From the documentation, we find that networks from frameworks other than Caffe and TensorFlow can be imported via the UFF (Universal Framework Format), a data format that describes the execution graph of a deep network. The documentation contains an example of using TensorRT's Python API to convert a PyTorch model into a TensorRT engine. However, there is no example demonstrating how to use TensorRT's C++ API to convert a model from any other framework. So the first task is to come up with an example where we use TensorRT's C++ API to convert a model into the required format.

Regarding current support of ONNX with TensorRT:
Thus, we think it is reasonable to come up with a custom converter that imports a Fluid model into TensorRT (using the C++ API); a skeleton of what such a converter could look like is sketched at the end of this description.
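To make the deploy phase concrete, here is a minimal sketch (again, not from the original survey) of deserializing a plan file and running inference with the TensorRT C++ runtime. It follows the TensorRT 3.x/4.x API; the binding names (`data`, `prob`), buffer sizes, and plan file name are placeholders matching the build-phase sketch above.

```cpp
// Minimal sketch: deserialize a plan file and run inference with the
// TensorRT C++ runtime (TensorRT 3.x/4.x API). Buffer sizes and binding
// names are placeholders; error checking is omitted for brevity.
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <vector>
#include "NvInfer.h"

class Logger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) override {
    if (severity != Severity::kINFO) std::cerr << msg << std::endl;
  }
};

int main() {
  // Read the serialized engine (plan file) from disk.
  std::ifstream in("model.plan", std::ios::binary);
  std::stringstream ss;
  ss << in.rdbuf();
  std::string plan = ss.str();

  Logger logger;
  nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
  nvinfer1::ICudaEngine* engine =
      runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr);
  nvinfer1::IExecutionContext* context = engine->createExecutionContext();

  // Allocate device buffers for the input and output bindings; the sizes
  // must match the network used in the build phase.
  const int inputIndex = engine->getBindingIndex("data");
  const int outputIndex = engine->getBindingIndex("prob");
  void* buffers[2];
  cudaMalloc(&buffers[inputIndex], 1 * 28 * 28 * sizeof(float));
  cudaMalloc(&buffers[outputIndex], 10 * sizeof(float));

  // Copy input, run inference for batch size 1, copy output back.
  std::vector<float> input(28 * 28, 0.f), output(10, 0.f);
  cudaMemcpy(buffers[inputIndex], input.data(), input.size() * sizeof(float),
             cudaMemcpyHostToDevice);
  context->execute(/*batchSize=*/1, buffers);
  cudaMemcpy(output.data(), buffers[outputIndex],
             output.size() * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(buffers[inputIndex]);
  cudaFree(buffers[outputIndex]);
  context->destroy();
  engine->destroy();
  runtime->destroy();
  return 0;
}
```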
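As for the proposed custom converter itself, one possible shape for it is a registry of per-operator converters plus a loop over the ops of a Fluid `ProgramDesc` that adds the corresponding TensorRT layers. This is purely a sketch of the idea: the `FluidOpConverter` class, the registry, the header path, and the exact Paddle accessors used here are assumptions, not existing Paddle or TensorRT code.

```cpp
// Sketch of a Fluid -> TensorRT converter skeleton: walk the ops in a
// ProgramDesc block and map each op to TensorRT layers. The converter
// registry and helper names are hypothetical; the Paddle header path and
// accessors (Block, AllOps, Type) may differ across Paddle versions.
#include <string>
#include <unordered_map>
#include "NvInfer.h"
#include "paddle/fluid/framework/program_desc.h"  // assumed path

// Hypothetical base class: one converter per Fluid operator type.
class FluidOpConverter {
 public:
  virtual ~FluidOpConverter() = default;
  // Adds the TensorRT layer(s) corresponding to one Fluid op.
  virtual void Convert(const paddle::framework::OpDesc& op,
                       nvinfer1::INetworkDefinition* network) = 0;
};

// Hypothetical registry mapping Fluid op types (e.g. "mul", "relu") to
// their converters.
std::unordered_map<std::string, FluidOpConverter*>& ConverterRegistry() {
  static std::unordered_map<std::string, FluidOpConverter*> registry;
  return registry;
}

// Walk block 0 of the inference program and build the TensorRT network.
void ConvertProgram(const paddle::framework::ProgramDesc& program,
                    nvinfer1::INetworkDefinition* network) {
  for (auto* op : program.Block(0).AllOps()) {
    auto it = ConverterRegistry().find(op->Type());
    if (it == ConverterRegistry().end()) {
      // Ops without a converter would need a TensorRT plugin or a fallback
      // to the regular Fluid executor.
      continue;
    }
    it->second->Convert(*op, network);
  }
}
```

The resulting `INetworkDefinition` would then go through the same build-phase steps shown earlier (set batch size and workspace, build the engine, serialize a plan file).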
You can refer to the implementation in TensorFlow: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tensorrt
Integrating TensorRT into PaddlePaddle as a third-party inference platform would be a good serving option.
Basically, we want to see how we can integrate TensorRT to support models trained with Fluid.