Dynamic subgraph compile support #17623

samskalicky · 2020-02-19T01:37:33Z

Description

This PR adds support for passing the NDArrays from the existing optimize_for API down to the reviewSubgraph function in an external library. It also adds a new API for HybridBlock called optimize_for that can partition the model without running a forward pass.

Feature changes

Adds a new optimize_for API to HybridBlock that partitions the model but does not call the cachedOp
Modifies the subgraph library example to optionally require args to be provided
Annotates subgraph inputs with the name of the original param so that inputs can be mapped, and passes the annotations to the input nodes of subgraphs
Adds support for tensors in MKLDNN format by calling Reorder2Default

New tests

Adds a new test to partition operators that directly consume params
Adds a new model to test the case where the ops to be partitioned have args/params

Bug Fixes

Fixes bug where the ids vector was passed by value instead of by reference
Fixes bug where attributes were passed as copies instead of by reference
Fixes bug where _cached_graph was not updated after partitioning
Fixes memory leak where user-specified attributes on subgraph ops were not freed if the subgraph was rejected
Fixes incorrect indexing into the shape/dtype maps when annotating the graph

Docs

Updates the README doc with the latest changes described above

Design - Passing NDArrays

In #15886 the optimize_for API was added to the Symbol class to give users an easy way to partition their models. The args argument took the model params for shape/type inference, but the NDArray values themselves were never used. In this PR, we pass the NDArray data values to the backend library and add the ability to pass auxiliary params as well. The optimize_for API now looks like:

sym = sym.optimize_for('default', args, aux, ctx=mx.cpu())

On the backend library side, the reviewSubgraph API has two additional arguments, args and aux, which are maps of named MXTensors.

MXReturnValue reviewSubgraph(std::string json, int subgraph_id, bool* accept,
                             std::unordered_map<std::string, std::string>& options,
                             std::unordered_map<std::string, std::string>& attrs,
                             std::map<std::string, MXTensor>& args,
                             std::map<std::string, MXTensor>& aux);

These additional maps of args and aux will allow backends to compile subgraphs and use param/weight values during compilation. For example, this will enable TensorRT to be implemented as a backend library and eliminate the init_tensorrt_params API that is needed to provide the params to the TensorRT backend. It will also enable compiling subgraphs with TVM and other compile-centric backends.

This PR also tags subgraph inputs with a new attribute "argName" if the input to the subgraph comes directly from a model param/weight. Backend libraries will use this when they look up params for subgraph inputs, since the names of subgraph inputs are modified from the original model param names. The "argName" can be looked up in the new args or aux arguments passed to the reviewSubgraph API to get the actual tensor values. Here's an example partitioned graph. Notice that the original input param is named a and the input to the subgraph is named a0. The "argName" attribute set on the subgraph input refers to the original name a.

{
  "nodes": [
    {
      "op": "null", 
      "name": "a", 
      "attrs": {
        "dtype": "0", 
        "shape": "[3,2]"
      }, 
      "inputs": []
    }, 
    {
      "op": "_custom_subgraph_op", 
      "name": "_op0", 
      "attrs": {"myKey": "myVal"}, 
      "inputs": [[0, 0, 0]], 
      "subgraphs": [
        {
          "nodes": [
            {
              "op": "null", 
              "name": "a0", 
              "attrs": {
                "argName": "a", 
                "isArg": "True", 
                "isAux": "False"
              }, 
              "inputs": []
            }, 

            ...

        }
      ]
    }
  ]
}
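
The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the PR: the graph below is a hypothetical completion of the elided example (the data0 input and the _plus0 node are invented), plain Python lists stand in for MXTensor values, and map_subgraph_inputs is a made-up helper name.

```python
import json

# Hypothetical, completed version of the partitioned graph shown above.
# The "data0" input and the "_plus0" node are invented for illustration only.
PARTITIONED = """
{
  "nodes": [
    {"op": "null", "name": "a",
     "attrs": {"dtype": "0", "shape": "[3,2]"}, "inputs": []},
    {"op": "_custom_subgraph_op", "name": "_op0",
     "attrs": {"myKey": "myVal"}, "inputs": [[0, 0, 0]],
     "subgraphs": [
       {"nodes": [
          {"op": "null", "name": "a0",
           "attrs": {"argName": "a", "isArg": "True", "isAux": "False"},
           "inputs": []},
          {"op": "null", "name": "data0", "attrs": {}, "inputs": []},
          {"op": "elemwise_add", "name": "_plus0",
           "inputs": [[0, 0, 0], [1, 0, 0]]}
       ]}
     ]}
  ]
}
"""


def map_subgraph_inputs(graph_json, args, aux):
    """Return {subgraph input name: param value} for inputs tagged with argName."""
    graph = json.loads(graph_json)
    resolved = {}
    for node in graph["nodes"]:
        for subgraph in node.get("subgraphs", []):
            for sub_node in subgraph["nodes"]:
                attrs = sub_node.get("attrs", {})
                if sub_node["op"] != "null" or "argName" not in attrs:
                    continue  # not an input, or not fed directly by a param
                # isAux decides which of the two maps holds the value.
                source = aux if attrs.get("isAux") == "True" else args
                if attrs["argName"] in source:
                    resolved[sub_node["name"]] = source[attrs["argName"]]
    return resolved


args = {"a": [[1, 2], [3, 4], [5, 6]]}  # plain lists stand in for tensors
aux = {}
resolved = map_subgraph_inputs(PARTITIONED, args, aux)
print(resolved)
```

Here a0 resolves to the value stored under the original name a, while data0 carries no "argName" and is left unresolved, which is how a backend could tell params apart from runtime inputs.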

Design - Partitioning HybridBlock without infer

In #15969 the hybridize API was modified to accept the backend name and partition the model during the hybridize flow. This flow is for users who intend to run inference immediately after partitioning. Users who want to partition but not run inference have no equivalent option. When compiling the model, a user will compile on a machine suitable for compilation, and that compilation may take tens of minutes depending on the model and optimization strategy. Then they will export their model and copy it to the machine they intend to run inference on.

In this flow, the machine used for compilation may not be suitable for running a complete forward pass. In this PR we add an optimize_for API to the HybridBlock class that runs most of the hybridize flow, including the steps normally performed as part of the forward pass, but does not actually call the cachedOp.

The new HybridBlock optimize_for API combines the argument lists of forward and hybridize like:

def optimize_for(self, x, *args, backend=None, backend_opts=None, **kwargs)

Notice that the first two args are the same as in the forward API and the last 3 args are the same as in the hybridize API. The active argument is dropped since it will default to true in this function in order to partition the model.

The expectation is that users will do the following:

block.optimize_for(x, backend='myProp')
block.export('partitioned')

This is equivalent to:

block.hybridize(backend='myProp')
block(x)
block.export('partitioned')

However, optimize_for does not actually execute the forward pass. Users can still use the partitioned block to run forward passes just like with hybridize, for example:

block.optimize_for(x, backend='myProp')
block(x)

Calling optimize_for is equivalent to hybridize in this usage, and since the cachedOp is already created before block(x), it is not recreated during the forward pass.
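
The difference between the two flows can be modeled with a toy class. This is a minimal sketch and not MXNet code: ToyBlock, its counters, and the doubling "cached op" are all hypothetical stand-ins, assuming only the behavior described above (hybridize defers the build to the first call, optimize_for builds immediately without executing).

```python
class ToyBlock:
    """Toy stand-in for a HybridBlock; NOT MXNet code."""

    def __init__(self):
        self.backend = None
        self.cached_op = None   # stands in for the cachedOp
        self.builds = 0         # how many times the cached op was built
        self.forward_calls = 0  # how many times it was executed

    def hybridize(self, backend=None):
        # Only records the backend; the build is deferred to the first call.
        self.backend = backend
        self.cached_op = None

    def _build_cached_op(self, x):
        # Pretend to partition for the backend and create the cached op.
        self.builds += 1
        self.cached_op = lambda values: [2 * v for v in values]

    def optimize_for(self, x, backend=None):
        # Build right away, but never execute the cached op here.
        self.backend = backend
        self._build_cached_op(x)

    def __call__(self, x):
        if self.cached_op is None:
            self._build_cached_op(x)
        self.forward_calls += 1
        return self.cached_op(x)


x = [1, 2, 3]

a = ToyBlock()
a.optimize_for(x, backend='myProp')
after_optimize = (a.builds, a.forward_calls)  # built once, never executed

b = ToyBlock()
b.hybridize(backend='myProp')
b(x)
after_hybridize = (b.builds, b.forward_calls)  # built and executed once

out = a(x)  # reuses the cached op created by optimize_for
print(after_optimize, after_hybridize, out, a.builds)
```

In this model, exporting right after optimize_for would see a block that has already been partitioned even though nothing was executed, which is the property the compile-then-copy workflow relies on.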

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

samskalicky · 2020-02-19T01:54:59Z

If #17585 gets merged first, then I'll update the doc with the new compile info. If this one gets merged first, I'll go modify #17585 to add the description.

samskalicky · 2020-02-19T08:27:30Z

@mxnet-label-bot add [pr-awaiting-review]

src/c_api/c_api.cc

src/operator/subgraph/build_subgraph.cc

mseth10

LGTM

samskalicky · 2020-02-20T04:53:32Z

What do you think @HahTK?

HahTK · 2020-02-21T21:06:06Z

It sounds like the goal for passing data is to allow data to be compiled into the bin (for example, a TensorRT bin) for that subgraph to avoid an init step.

If the weights are in the bin, then are we structuring this such that weights cannot be changed with calling optimize_for again using new weights?

Is there any reason why we only do this for args and not auxs?

samskalicky · 2020-02-21T22:09:16Z

> It sounds like the goal for passing data is to allow data to be compiled into the bin (for example, a TensorRT bin) for that subgraph to avoid an init step.
>
> If the weights are in the bin, then are we structuring this such that weights cannot be changed with calling optimize_for again using new weights?

Using the weights in optimize_for is a one-way flow. You cannot call optimize_for again with new weights. You would need to call it on the original graph with new weights. The assumption is that weights would only be used for inference. So presumably the model is already trained, and the weights are frozen.

> Is there any reason why we only do this for args and not auxs?

Thanks for pointing this out. I'll add them too

src/operator/subgraph/partitioner/custom_subgraph_property.h

mseth10 · 2020-03-17T05:16:45Z

src/operator/subgraph/partitioner/custom_subgraph_property.h

+        int idx = 0;
+        // find the beginning of the output shape for the particular output index
+        for (unsigned x=0; x < orig.index; x++)
+          idx = shape.find("[", idx+1);


check if idx points to the right index

mseth10 · 2020-03-17T05:22:28Z

src/operator/subgraph/partitioner/custom_subgraph_property.h

+        int idx = 0;
+        // find the beginning of the output dtype for the particular output index
+        for (unsigned x=0; x < orig.index; x++)
+          idx = dtype.find("[", idx+1);


modify this for list of integers

mseth10 · 2020-03-17T05:28:14Z

src/operator/subgraph/partitioner/custom_subgraph_property.h

+        for (unsigned i=0; i < sym.outputs.size(); i++) {
+          nnvm::Node* n = sym.outputs[i].node.get();
+          if (n->attrs.dict.count("__shape__") > 0) {
+            std::string& shape = n->attrs.dict["__shape__"];


modify logic for the case when n is a subgraph node and shape a list of lists
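
To make the concern in this thread concrete, here is a small Python sketch. It assumes, and this is only an assumption, that a multi-output node stores its shapes as a nested string such as "[[3,2],[4,5]]" while a single-output node stores "[3,2]". naive_offset mirrors the bracket-scanning loop under review, and nth_shape is a hypothetical alternative that parses the whole string instead.

```python
import json


def naive_offset(shape_str, index):
    """Mirror of the loop under review: skip `index` opening brackets."""
    idx = 0
    for _ in range(index):
        idx = shape_str.find("[", idx + 1)
    return idx


def nth_shape(shape_str, index):
    """Return the shape for output `index` from a flat or nested shape string."""
    parsed = json.loads(shape_str)
    if parsed and isinstance(parsed[0], list):
        return parsed[index]  # list of lists: one shape per output
    if index != 0:
        raise IndexError("single-shape string has only output 0")
    return parsed  # flat list: the only output's shape


nested = "[[3,2],[4,5]]"
offset = naive_offset(nested, 1)
print(offset, nested[offset:offset + 5])  # where the bracket scan lands
print(nth_shape(nested, 1))
print(nth_shape("[3,2]", 0))
```

Under the assumed nested format, the scan for output index 1 stops at offset 1, which is the start of the first shape rather than the second, so handling the list-of-lists case needs either an adjusted starting point or a full parse.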

samskalicky · 2020-03-17T07:31:30Z

@mseth10 thanks! lmk what you think of the latest commit

mseth10 · 2020-03-17T20:56:19Z

> @mseth10 thanks! lmk what you think of the latest commit

All latest changes look good to me. Thanks @samskalicky for addressing multiple subgraph issues in this PR.

leezu · 2020-03-17T21:24:07Z

@ptrendx any concerns on TensorRT or is this good to merge?

samskalicky · 2020-03-18T16:56:49Z

src/operator/subgraph/partitioner/custom_subgraph_property.h

+            ss << ",";
+        }
+        ss << "]";
+        n->attrs.dict["__shape__"] = ss.str();


nit: use MX_STR_SHAPE instead of "__shape__"

samskalicky · 2020-03-18T16:57:04Z

src/operator/subgraph/partitioner/custom_subgraph_property.h

+            ss << ",";
+        }
+        ss << "]";
+        n->attrs.dict["__dtype__"] = ss.str();


nit: use MX_STR_DTYPE instead of "__dtype__"

samskalicky · 2020-03-18T18:18:48Z

@leezu, I reviewed this offline with @ptrendx and @Caenorst. We decided that a few features needed for TensorRT to use this compile API are still missing, and that we will list them in #17236 and implement them in a subsequent PR (#17885). We can go ahead and merge this one as it is still useful for compiling for TVM.

After this PR is merged I'll open another PR and start working on the missing features next.

samskalicky added 2 commits February 17, 2020 08:07

passed args down to acceptSubgraph

639db17

added example and set param names on inputs to subgraph to map

5898d53

samskalicky requested review from aaronmarkham, anirudh2290, eric-haibin-lin and szha as code owners February 19, 2020 01:37

increased lib_api version number

2294584

samskalicky mentioned this pull request Feb 19, 2020

[WIP] passing ndarrays to acceptSubgraph API #17564

Closed

7 tasks

fixed whitespace

fad8e74

lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 19, 2020

mseth10 reviewed Feb 19, 2020

src/c_api/c_api.cc (outdated)

mseth10 reviewed Feb 19, 2020

src/operator/subgraph/build_subgraph.cc (outdated)

fixed spacing

734f1c4

mseth10 approved these changes Feb 19, 2020

samskalicky mentioned this pull request Feb 19, 2020

Dynamic subgraph property doc #17585

Merged

4 tasks

samskalicky added 9 commits February 20, 2020 22:19

Merge branch 'master' of https://github.com/apache/incubator-mxnet in…

934ae8f

…to subgraph_compile

added info about lib_api.h to README

ceed9be

updated readme for new args argument to reviewSubgraph

098db85

added more tests

cfcc0a6

added example for partitioning HybridBlock in-place without forward pass

1fa7f1d

added example for partitioning

8f37c48

fixed whitespace

729173f

fixed sanity

bb90d70

fixed lint

06c3841

added support for passing aux

f8f6191

samskalicky added 4 commits March 14, 2020 05:49

fixed whitespace

c1d3f5e

added subgraph property API to let subg_prop initialize subgraph inputs

0b38e5c

moved custom code to subgraph property API, cleaned up build_subgraph.cc

d59a4dc

added support for ops with multiple outputs and InitSubgraphInputs

5abc8c3

mseth10 reviewed Mar 16, 2020

src/operator/subgraph/partitioner/custom_subgraph_property.h (outdated)

samskalicky added 2 commits March 16, 2020 22:30

fixed sanity, removed prints

90f6973

fixed whitespace

28b6bef

mseth10 reviewed Mar 17, 2020

fixed shape/dtype parsing

516d149

fixed lint

4e2efec

leezu requested a review from ptrendx March 17, 2020 21:24

samskalicky commented Mar 18, 2020

leezu merged commit f7c4323 into apache:master Mar 19, 2020

This was referenced Apr 15, 2020

[1.7] Dynamic subgraph compile support (#17623) #18062

Closed

[1.7] MXNet Extension PRs (#17623, #17569, #17762) #18063

Merged

[1.7] Backport MXNet Extension PRs (#17623, #17569, #17762) #18063 #18069

Merged

Kh4L mentioned this pull request Jun 5, 2020

MXNet-TRT: Add PrePartition param caching - move init_tensorrt_params logic #18490

Merged
