Conversation
@mxnet-label-bot add [pr-awaiting-review]
LGTM
What do you think @HahTK?
…to subgraph_compile
It sounds like the goal for passing data is to allow the data to be compiled into the bin (for example, a TensorRT bin) for that subgraph, to avoid an init step. If the weights are in the bin, are we structuring this such that the weights cannot be changed without calling optimize_for again with new weights? Is there any reason why we only do this for args and not auxs?
Using the weights in
Thanks for pointing this out. I'll add them too.
int idx = 0;
// find the beginning of the output shape for the particular output index
for (unsigned x=0; x < orig.index; x++)
  idx = shape.find("[", idx+1);
check if idx points to the right index
fixed
int idx = 0;
// find the beginning of the output dtype for the particular output index
for (unsigned x=0; x < orig.index; x++)
  idx = dtype.find("[", idx+1);
modify this for list of integers
fixed
for (unsigned i=0; i < sym.outputs.size(); i++) {
  nnvm::Node* n = sym.outputs[i].node.get();
  if (n->attrs.dict.count("__shape__") > 0) {
    std::string& shape = n->attrs.dict["__shape__"];
modify logic for the case when n is a subgraph node and shape is a list of lists
fixed
@mseth10 thanks! lmk what you think of the latest commit
All latest changes look good to me. Thanks @samskalicky for addressing multiple subgraph issues in this PR.
@ptrendx any concerns on TensorRT or is this good to merge?
ss << ","; | ||
} | ||
ss << "]"; | ||
n->attrs.dict["__shape__"] = ss.str(); |
nit: use MX_STR_SHAPE instead of "__shape__"
ss << ","; | ||
} | ||
ss << "]"; | ||
n->attrs.dict["__dtype__"] = ss.str(); |
nit: use MX_STR_DTYPE instead of "__dtype__"
@leezu, Reviewed offline with @ptrendx and @Caenorst. We decided that there are a few missing features needed for TensorRT to use this compile API, but decided to list them here #17236 and implement them in a subsequent PR (#17885). We can go ahead and merge this one as it is still useful for compiling for TVM. After this PR is merged I'll open another PR and start working on the missing features.
* 'master' of https://github.com/apache/incubator-mxnet: (192 commits)
  * impl - FFI for np einsum (apache#17869)
  * [Numpy] FFI for diag/diagonal/diag_indices_from (apache#17789)
  * [Numpy] Kron operator (apache#17323)
  * cmake: Set DMLC_LOG_FATAL_THROW only for building mxnet and not for tvm (apache#17878)
  * Add simplified HybridBlock.forward without F (apache#17530)
  * Use FP32 copy of weights for norm (multitensor LAMB optimizer) (apache#17700)
  * Use multi-tensor sumSQ in clip_global_norm (apache#17652)
  * [Numpy] Add op fmax, fmin, fmod (apache#17567)
  * Adding sparse support to MXTensor for custom operators (apache#17569)
  * Update 3rdparty/mkldnn to v1.2.2 (apache#17313)
  * Dynamic subgraph compile support (apache#17623)
  * Refactor cpp-package CMakeLists.txt & add missing inference/imagenet_inference (apache#17835)
  * staticbuild: Fix potential user-assisted execution of arbitrary code (apache#17860)
  * FFI for np.argmax and np.argmin (apache#17843)
  * ffi for roll/rot90 (apache#17861)
  * Skip test_multi_worker_dataloader_release_pool on OS X (apache#17797)
  * add ffi for full_like, binary (apache#17811)
  * HybridBlock.export() to return created filenames (apache#17758)
  * Fix SoftReLU fused operator numerical stability (apache#17849)
  * CI: Test clang10 cpu & gpu builds with -WError (apache#17830)
  * ...
…18069)
* Dynamic subgraph compile support (#17623)
  This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass. Feature changes: adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp; modifies the subgraph library example to optionally require args to be provided; adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped, and passes annotations to input nodes of subgraphs; adds support for tensors in MKLDNN format, calls Reorder2Default. New tests: adds a new test to partition operators that directly consume params; adds a new model to test where ops to be partitioned have args/params. Bug fixes: fixes bug in passing ids vector by value instead of by reference; fixes bug in passing copies of attributes instead of by reference; fixes bug where _cached_graph was not updated after partitioning; fixes memory leak where user-specified attributes on subgraph ops were not freed if the subgraph was rejected; fixes problem incorrectly indexing into shape/dtype maps when annotating the graph. Docs: updates the README doc with the latest changes described above.
* Adding sparse support to MXTensor for custom operators (#17569)
  Added enum for sparse storage; add structure for Dense and Sparse; redesign the data structure for MXSparse; pull out aux data from sparse NDArray; added more sparse arguments to API interface; passed sparse from c_api to lib_api.h and set in MXTensor; fix indent; fix segfault; fix NDArray to MXTensor errors; add a sample of sparse (CSR) transpose; make CSR transpose temporarily work by hardcoding; fixed sparse output size (refined); add tests for symbolic and stateful ops; added a sample for row sparse transpose; added real row sparse transpose; fix output size issue by adding lambda for CheckAndAlloc(); fix mixed storage formats error; added infer storage type function; set inferSType as optional function; add error messages; resolve comments; verify transpose ops results; fix sanity check; update MX_LIBRARY_VERSION to 5.
* Custom Operator Random Number Generator Support (#17762)
  Add random number generator support for custom operator libraries. Design: we pass from MXNet the initialized and seeded states, located on CPU and GPU, to the custom library, so users can use those seeds to generate deterministic values from a given seed passed to MXNet. Basically this workflow:
  mx.random.seed(128)
  r1 = mx.nd.some_custom_random_op(data)
  mx.random.seed(128)
  r2 = mx.nd.some_custom_random_op(data)
  assert (r1 == r2)
  This PR does not let the custom library generate exactly the same sequence of random numbers compared to MXNet. This is a continuation of the custom operator project #15921 and #17270.
Co-authored-by: guanxinq <[email protected]>
Co-authored-by: Ziyi Mu <[email protected]>
Description

This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes
- Adds new API to HybridBlock optimize_for that partitions the model but does not call the cachedOp
- Modifies the subgraph library example to optionally require args to be provided
- Adds annotation on subgraph inputs for the name of the original param so that inputs can be mapped, and passes annotations to input nodes of subgraphs
- Adds support for tensors in MKLDNN format, calls Reorder2Default

New tests
- Adds a new test to partition operators that directly consume params
- Adds a new model to test where ops to be partitioned have args/params

Bug Fixes
- Fixes bug in passing ids vector by value instead of by reference
- Fixes bug in passing copies of attributes instead of by reference
- Fixes bug where _cached_graph was not updated after partitioning
- Fixes memory leak where user-specified attributes on subgraph ops were not freed if the subgraph was rejected
- Fixes problem of incorrectly indexing into shape/dtype maps when annotating the graph

Docs
- Updates the README doc with the latest changes described above
Design - Passing NDArrays
In #15886 the optimize_for API was added to the Symbol class to give users an easy API to use to partition their models. The args argument took the params to the model to use for shape/type inference, but the NDArray values were never used. In this PR, we pass the NDArray data values to the backend library and also add the ability to pass auxiliary params too. Now, the optimize_for API looks like:
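Roughly like the following sketch, where the library path, the backend name 'myProp', and the checkpoint prefix are placeholders, and the exact keyword names may differ slightly from the merged code:

import mxnet as mx

# load an external backend/partitioner library (path and backend name are placeholders)
mx.library.load('/path/to/libsubgraph_lib.so')

# sym is the model graph; args/aux are dicts of param-name -> NDArray
sym, args, aux = mx.model.load_checkpoint('model', 0)

# args and aux are now forwarded down to the backend's reviewSubgraph callback,
# so the library can use the actual weight values while compiling subgraphs
partitioned_sym = sym.optimize_for('myProp', args=args, aux=aux)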
On the backend library side, the reviewSubgraph API has additional arguments for args and aux that are maps of named MXTensors.

These additional maps of args and aux will allow backends to compile subgraphs and use param/weight values during compilation. For example, this will enable TensorRT to be implemented as a backend library and eliminate the init_tensorrt_params API that is needed to provide the params to the TensorRT backend. It will also enable compiling subgraphs with TVM and other compile-centric backends.

This PR also tags subgraph inputs with a new attribute "argName" if the input to the subgraph comes directly from a model param/weight. This will be used when backend libraries look up params for subgraph inputs, since the names of subgraph inputs are modified from the original model param names. The "argName" can be looked up in the new args or aux arguments passed to the reviewSubgraph API to get the actual tensor values. Here is an example partitioned graph: notice that the original input param is named a and the input to the subgraph is named a0. You can see the "argName" attribute is set on the input to refer to the original name a.
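For illustration only (this snippet is not part of the PR): one way to see the annotation is to walk the serialized JSON of the partitioned symbol from the sketch above; depending on the backend, the annotated input may live inside a nested subgraph rather than at the top level.

import json

def print_argnames(graph):
    # walk the graph JSON, including nested subgraphs stored on subgraph op nodes,
    # and print any node that carries the "argName" annotation
    for node in graph.get('nodes', []):
        attrs = node.get('attrs', {})
        if 'argName' in attrs:
            print(node['name'], '->', attrs['argName'])
        for sub in node.get('subgraphs', []):
            print_argnames(sub)

print_argnames(json.loads(partitioned_sym.tojson()))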
Design - Partitioning HybridBlock without infer
In #15969 the hybridize API was modified to accept the backend name and partition the model during the hybridize flow. This flow is for users who intend to run inference immediately after partitioning, but users who want to partition without running inference are out of luck. When compiling the model, a user will compile on a machine suitable for compilation, and that compilation may take tens of minutes depending on the model and optimization strategy. Then they will export their model and copy it to the machine they intend to run inference on.

In this flow, the machine used for compilation may not be suitable for running a complete forward pass. In this PR we add an optimize_for API to the HybridBlock class that runs most of what is part of the hybridize flow, including the forward pass, but it does not actually call the cachedOp.

The new HybridBlock optimize_for API combines the argument lists of forward and hybridize like:
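Roughly as follows; this is a sketch assembled from the forward and hybridize argument lists described here, not the verbatim code:

# forward:   forward(self, x, *args)
# hybridize: hybridize(self, active=True, backend=None, backend_opts=None, **kwargs)
def optimize_for(self, x, *args, backend=None, backend_opts=None, **kwargs):
    """Partition this block for `backend` using example input(s) x,
    without executing the underlying cachedOp (no real forward pass)."""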
Notice that the first two args are the same as in the forward API and the last 3 args are the same as in the hybridize API. The active argument is dropped since it will default to true in this function in order to partition the model.

The expectation is that users will do the following:
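Something along these lines, where get_model, the backend name 'myProp', the input shape, and the export prefix are placeholders:

block = get_model()                      # any HybridBlock; get_model is a placeholder
x = mx.nd.ones((1, 3, 224, 224))         # example input, used only for shape/dtype inference
block.optimize_for(x, backend='myProp')  # partition and compile, no real forward pass
block.export('partitioned_model')        # export for the inference machine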
This is equivalent to:
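In sketch form, with the same placeholder backend name:

block.hybridize(backend='myProp')  # same partitioning path as optimize_for...
block(x)                           # ...but this call actually executes a forward pass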
But it does not actually execute the forward pass. Users can still use the partitioned block to run forward passes just like with hybridize, for example:
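For example (again with the placeholder backend name):

block.optimize_for(x, backend='myProp')  # partition once; the cachedOp is created here
out = block(x)                           # later forward passes reuse that cachedOp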
Calling optimize_for is equivalent to hybridize in this usage, and since the cachedOp is already created before block(x), it is not recreated during the forward pass.