Automatic Fallback #360
Replies: 8 comments 1 reply
-
cc: @peri044 @bowang95
-
I think actually we might be able to use the output shape calculation to get intermediate layer shapes, similar to how we do now. We will still need to explicitly pull them out and store them, and it's still a bit hard to reason about the dynamic shape case, but I think it will be the same as the static case, just with a bunch of placeholders. The runtime will need to ensure that it fills in the tensor sizes for each of the intermediate inputs.
-
We should make sure that we store some information about what each engine does in the final graph so people dumping the graph can see what got compiled and what didn't. #47
-
Hi Bo, I tested some models with fallback; most of them ended with this error:
-
This is the test suite which can cover all the components of the automatic fallback feature.

Test suite

Each test case would take a simple input graph with at least one unsupported node.

1) Segmentation. Fallback base model: conv1 -> conv2 -> log_sig -> conv3.
   a) Assert that the number of segments is correct: 2 TRT segments, 1 PyTorch segment.
   b) Also check that the segments have the right nodes. For example, the 1st TRT segment should have 2 conv nodes (conv1 + conv2), the PyTorch segment should have the log_sig node, and the 2nd TRT segment should have the conv3 layer.
   c) min_seg_size = 3: no. of segments for TRT: 0, for PyTorch: 4.
   d) forced_fallback operator test.
2) Check the validity of the mini-graphs: check that the nodes in each mini-graph are present in the global lowered graph through an unordered map. Iterate through the mini-graphs and ensure all the nodes exist.
3) Shape inference / mini-graph execution (after checking each of them). This step is to ensure all the shapes of a mini-graph are known.
4) TensorRT conversion of mini-graphs: conv -> conv -> log_sig -> conv with min_seg_size = 2.
5) Stitching mini-graphs: for every node in the new global fallback graph, there would be a corresponding input in the original graph.
6) End-to-end run of the hybrid graph vs. the PyTorch graph: a) build a hybrid graph and run it, b) run the fallback-optimized graph. Compare the outputs.
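A rough gtest-style sketch of test case 1 (a/b); `Partition`, `SegmentedBlock`, and the graph-building helper are hypothetical stand-ins for the partitioning API under test, not existing functions:

```cpp
#include <gtest/gtest.h>

// Hypothetical sketch of segmentation test 1a/1b on the fallback base model
// conv1 -> conv2 -> log_sig -> conv3. Partition, SegmentedBlock, and
// BuildConvConvLogSigConvGraph are assumed stand-ins for the API under test.
TEST(FallbackPartitioning, SegmentsBaseModel) {
  auto graph = BuildConvConvLogSigConvGraph();  // assumed helper building the base model
  auto segments = Partition(graph, /*min_seg_size=*/1, /*forced_fallback_operators=*/{});

  // a) 2 TRT segments and 1 PyTorch segment.
  ASSERT_EQ(segments.size(), 3u);
  // b) The segments contain the right nodes.
  EXPECT_EQ(segments[0].target, SegmentedBlock::kTensorRT);  // conv1 + conv2
  EXPECT_EQ(segments[0].nodes.size(), 2u);
  EXPECT_EQ(segments[1].target, SegmentedBlock::kTorch);     // log_sig
  EXPECT_EQ(segments[2].target, SegmentedBlock::kTensorRT);  // conv3
}
```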
-
There is a question of what happens if users use truncation and automatic fallback together: if weights get truncated in a TensorRT block and a following PyTorch block expects a tensor that is Float64 or Long, what should the correct semantics be so that we don't get type errors? Does the TensorRT block need to track uses of truncated tensors so it can cast back at the end?
-
Currently more corner cases are covered:
-
The feature should have the following principle post-segmentation.
Dependency resolution in segmentation
We would need all TRT segments in the graph to be self-sustainable. This requires duplicating the dependency chain of the inputs in TRT segments. E.g., input graph:
The output graph should look like:
We should make sure the entire dependency chain is duplicated, rather than just inserting the value of %2 directly in the TRT segment inputs.
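A minimal hypothetical illustration (the ops chosen here are assumptions), where `%2` is an `int[]` built from constants and consumed by a node that lands in a TRT segment:

```
# Hypothetical original graph (aten::reshape will land in a TRT segment):
graph(%x : Tensor):
  %0 : int = prim::Constant[value=-1]()
  %1 : int = prim::Constant[value=512]()
  %2 : int[] = prim::ListConstruct(%0, %1)
  %3 : Tensor = aten::reshape(%x, %2)
  return (%3)

# Not self-sustainable: the segment takes %2 as an external input.
graph(%x : Tensor, %2 : int[]):
  %3 : Tensor = aten::reshape(%x, %2)
  return (%3)

# Self-sustainable: the whole dependency chain producing %2 is duplicated
# inside the segment.
graph(%x : Tensor):
  %0 : int = prim::Constant[value=-1]()
  %1 : int = prim::Constant[value=512]()
  %2 : int[] = prim::ListConstruct(%0, %1)
  %3 : Tensor = aten::reshape(%x, %2)
  return (%3)
```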
-
Automatic Fallback
Goal: Allow the compiler to identify subgraphs that can be supported by TRTorch and correctly segment out these graphs, compile each engine and then link together TorchScript and TRTorch engines.
Currently, when users hit an unsupported op, compilation is aborted. Users have requested that in the event of an unsupported op it would be run in PyTorch. This wildly increases the complexity of the compiler. PyTorch also operates on a JIT basis, not requiring ahead-of-time understanding of metadata such as shape or dependencies. Since it executes in topological order while traversing the graph, correct execution is guaranteed. Therefore, in order to implement fallback, we will need to determine which subgraphs we can handle without interfering with the execution order, while also ensuring all ahead-of-time information is available. The goal is to be able to return to users a new graph which has both PyTorch ops and TensorRT ops and still ensure correct execution.
For example, the following forward function could be partitioned like so:
The scalar operations and dictionary accesses are handled by Torch but the Tensor computation is done by TensorRT.
Use case for Fallback
Fallback is useful if the following conditions cannot be met:
Proposed User-Level API
There are a few settings that we would want to support
By default fallback will be off; it is enabled by setting `enabled` to `True`. Next, there will be a specification for a minimum block size: if there is a series of (in this example) at least 3 consecutive operations that can be converted to TensorRT, then an engine will be created to encapsulate those operations. Finally, there is a list of strings naming the operations that the user explicitly wants to run in PyTorch.
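A sketch in C++ of what these settings could look like; the struct and field names are assumptions based on the description above, not a final API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape of the proposed fallback settings; names follow the
// description above and are not a final API.
struct TorchFallbackSettings {
  bool enabled = false;                                // fallback is off by default
  uint64_t min_block_size = 3;                         // min consecutive convertible ops per engine
  std::vector<std::string> forced_fallback_operators;  // ops the user wants to keep in PyTorch
};

// Example: enable fallback, require at least 3 consecutive convertible ops per
// engine, and force log_sigmoid to stay in PyTorch.
TorchFallbackSettings ExampleFallbackSettings() {
  TorchFallbackSettings settings;
  settings.enabled = true;
  settings.min_block_size = 3;
  settings.forced_fallback_operators = {"aten::log_sigmoid"};
  return settings;
}
```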
Internals
Lowering
The lowering used today will run prior to anything related to fallback. This will make sure we are working with the most TensorRT-compatible graph we can.
Partitioning
The first stage of fallback will be the partitioning stage. Here using the settings provided by the user, the graph will be segmented into parts that are going to be in TensorRT and parts that will remain in PyTorch.
Core Data Structures
We need to maintain information about the structure of the graph after partitioning that will allow us to reconstruct the graph once conversion of all TensorRT subgraphs is done. My suggestion is a structure like the following:
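A minimal sketch of what such a block structure could look like; the names are assumptions based on the field descriptions below, not an existing type:

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include <torch/csrc/jit/ir/ir.h>

// A sketch of the suggested block structure; field names are hypothetical.
struct SegmentedBlock {
  enum Target { kTorch, kTensorRT };

  Target target;                                    // stays in Torch or goes to TensorRT
  std::vector<std::vector<int64_t>> input_shapes;   // provided or calculated during partitioning
  std::vector<std::vector<int64_t>> output_shapes;
  std::vector<torch::jit::Value*> inputs;           // Values feeding this section of the graph
  std::vector<torch::jit::Value*> outputs;          // Values this section produces
  std::vector<torch::jit::Node*> nodes;             // nodes in the block, stored topologically
  std::string serialized_engine;                    // filled in after conversion (TensorRT blocks only)
};
```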
The target defines whether the section of the graph will remain in Torch or be converted to TensorRT. Input and output shapes are either provided or calculated through the course of partitioning. Inputs are a list of `Value` pointers which define the inputs to this section of the graph, and similarly for the outputs. As we traverse the list of segmented blocks, we should be able to reconstruct the connections between blocks. Nodes are the list of nodes in the block, stored topologically; in the case that the block targets TensorRT, this will be fed directly into the conversion system, otherwise it will be used to reconstruct the subgraph in the final synthesized program. Finally, there is a field for the serialized TensorRT engine, if applicable.

The partitioning system will go through the graph building these blocks and storing them sorted topologically in an ordered list.
Tensor Shape Calculation
As TRTorch operates now, we are given the input shape for the model, and as we traverse the graph we can maintain a running calculation of the current tensor shape. Since we are no longer guaranteed to have input shapes for each subgraph requested to be compiled, we need to explicitly calculate tensor shapes as we traverse the graph, so that when we go to compile a subgraph we have the input shape for the engine. This is going to be one of the hardest parts of this work, since we do not have a fully exhaustive library to calculate tensor shapes. We can also attempt to run the provided graph, or the subgraphs themselves, through the JIT with dummy data during compilation to calculate these sizes. In the case of dynamic shape, we would perhaps need to do this using the provided input range.
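A sketch of the dummy-data approach described above for the static-shape case; the helper name is an assumption, and it leans on the JIT's shape-analysis pass rather than a custom shape library:

```cpp
#include <ATen/ATen.h>
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/passes/shape_analysis.h>

#include <vector>

// Hypothetical helper: record complete tensor types on the subgraph inputs
// using dummy tensors of the known shapes, then run the JIT shape-analysis
// pass so intermediate Values pick up shapes where possible.
void InferBlockShapes(const std::shared_ptr<torch::jit::Graph>& g,
                      const std::vector<std::vector<int64_t>>& input_shapes) {
  size_t i = 0;
  for (auto* in : g->inputs()) {
    if (i < input_shapes.size() && in->type()->isSubtypeOf(c10::TensorType::get())) {
      auto dummy = at::randn(input_shapes[i++]);    // only the shape matters here
      in->setType(c10::TensorType::create(dummy));  // complete type with sizes and strides
    }
  }
  torch::jit::PropagateInputShapes(g);  // best-effort shape propagation through the subgraph
}
```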
How to Partition
Fundamental Assumptions
An operator placed in a TensorRT block consumes only a `Tensor` or list of `Tensor`s and produces only a `Tensor` or list of `Tensor`s, i.e. `Tensor or [Tensor] -> Tensor or [Tensor]`. TensorRT cannot handle side effects.
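A sketch of how this assumption could be checked for a given graph value (the helper is hypothetical):

```cpp
#include <torch/csrc/jit/ir/ir.h>

// Hypothetical check for the fundamental assumption: a value crossing a
// TensorRT block boundary must be a Tensor or a list of Tensors.
bool IsTensorOrTensorList(const torch::jit::Value* v) {
  const auto& type = v->type();
  if (type->isSubtypeOf(c10::TensorType::get())) {
    return true;
  }
  if (auto list_type = type->cast<c10::ListType>()) {
    return list_type->getElementType()->isSubtypeOf(c10::TensorType::get());
  }
  return false;
}
```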
Steps
Verifying Op Support
We have `torch::core::conversion::OpSupported` to help here, and we also must respect the `forced_fallback_operators` from the user.
Preliminary Partitioning
Group contiguous supported operators into candidate TensorRT blocks, leaving the remaining operators in Torch; values crossing into a TensorRT block will become `ITensors` from the TensorRT perspective later.
Calculate Dependencies
For every operator that we could potentially support, we need to verify that all input data to the operator can be resolved at compile time, given that some data may be produced by libtorch and hence only be available at runtime.
Each input must either be usable as an `ITensor` (i.e. it is either an input to the graph or an output of a previously identified Torch block) or be a graph constant; otherwise this op cannot be supported. As of now there is no case where we can support an op that takes an input which is anything but a Tensor and is not a constant, due to the fundamental assumptions.
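A sketch of that dependency check (hypothetical helper; `available_tensor_values` would contain the graph inputs and the outputs of previously identified Torch blocks):

```cpp
#include <torch/csrc/jit/ir/ir.h>

#include <unordered_set>

// Hypothetical check: every input of a candidate node must either be resolvable
// as an ITensor at conversion time (graph input or output of a previously
// identified Torch block) or be a compile-time constant.
bool InputsResolvable(const torch::jit::Node* n,
                      const std::unordered_set<const torch::jit::Value*>& available_tensor_values) {
  for (const torch::jit::Value* in : n->inputs()) {
    bool is_constant = in->node()->kind() == c10::prim::Constant;
    bool becomes_itensor = available_tensor_values.count(in) != 0;
    if (!is_constant && !becomes_itensor) {
      return false;  // this input cannot be resolved at compile time
    }
  }
  return true;
}
```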
Edge cases (non-exhaustive):
- There is a supported operator that has two consumers of its output, where one consuming block is supported and the other is not (`min_block_size`).
- There is a supported operator that consumes an output from a supported block and an unsupported block (`min_block_size`).
Prune TensorRT blocks according to number of nodes
Tensor Shape Calculation
Conversion
For conversion we iterate through the set of identified TensorRT blocks, running our current-day conversion phase on each of them, but using the subgraph and input shape defined in the block struct, producing a serialized TensorRT engine.
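A sketch of this loop, reusing the hypothetical `SegmentedBlock` structure sketched earlier; `ConvertSubgraphToEngine` stands in for the existing conversion phase and is not an existing API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include <torch/csrc/jit/ir/ir.h>

// Stand-in for the existing conversion phase: converts the nodes of one block,
// given its input shapes, into a serialized TensorRT engine.
std::string ConvertSubgraphToEngine(const std::vector<torch::jit::Node*>& nodes,
                                    const std::vector<std::vector<int64_t>>& input_shapes);

// Hypothetical driver for this stage: only TensorRT-targeted blocks are
// converted, and the serialized engine is stored back on the block.
void ConvertTensorRTBlocks(std::vector<SegmentedBlock>& blocks) {
  for (auto& block : blocks) {
    if (block.target != SegmentedBlock::kTensorRT) {
      continue;  // Torch blocks pass through to program synthesis unchanged
    }
    block.serialized_engine = ConvertSubgraphToEngine(block.nodes, block.input_shapes);
  }
}
```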
Program Synthesis
From this point we now have to reconstruct the program using the pieces from both LibTorch and TensorRT in a single new module. Using the list of blocks we can now reconstruct a copy of the program in a new module. The graph interface should be the same as the input graph. For each TensorRT Block, the subgraph will be the following series of operations:
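A rough sketch of what this sequence could look like in TorchScript IR; the `trt::execute_engine` op name and the engine attribute are assumptions, and the value numbering follows the description below:

```
%1 : __torch__.torch.classes.tensorrt.Engine = prim::GetAttr[name="trt_engine_0"](%self)
%3 : Tensor[] = prim::ListConstruct(%in_a, %in_b)
%4 : Tensor[] = trt::execute_engine(%3, %1)
%5 : Tensor, %6 : Tensor = prim::ListUnpack(%4)
```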
Where `%1` is the engine itself, `%3` aggregates inputs to the TensorRT engine (this should be all input values to the block) and `%5` disaggregates the outputs (corresponding to the output values of the block). Once each block has been added, the module can be returned to the user.