
[Runtime] Allow parameter sharing between modules #3489

Merged
merged 1 commit into from
Sep 3, 2019
Conversation

yongsun (Contributor) commented Jul 3, 2019

As GraphRuntime does not provide control-flow logic, we have to split
our model into two parts, while still sharing parameters between them
to reduce memory usage.

Solution:

  1. add "lazy_init_input" in graph's attributes
   "attrs": {
     ... ...
     "lazy_init_input": [
       "list_str",
       [
         "p0"
       ]
     ]
    }
  2. allow un-allocated NDArray entries in SetupStorage
  3. utilize the "set_input_zero_copy" function to set parameters
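To make step 1 concrete, here is a hedged sketch (plain Python, not TVM code; `parse_lazy_init_inputs` is a made-up helper) of reading the proposed "lazy_init_input" attribute out of a graph JSON string:

```python
import json

# Hypothetical graph-JSON fragment following the "lazy_init_input"
# layout proposed above: a "list_str" type tag followed by the names.
graph_json = """
{
  "attrs": {
    "lazy_init_input": [
      "list_str",
      ["p0"]
    ]
  }
}
"""

def parse_lazy_init_inputs(graph_str):
    """Return the list of input names marked for lazy initialization."""
    attrs = json.loads(graph_str)["attrs"]
    entry = attrs.get("lazy_init_input")
    if entry is None:
        return []          # attribute is optional; absence means no lazy inputs
    type_tag, names = entry
    # Mirror the runtime-side check: the attribute must be a list of strings.
    assert type_tag == "list_str", "unexpected type tag: %s" % type_tag
    return names

print(parse_lazy_init_inputs(graph_json))  # ['p0']
```

The runtime would then skip allocating storage for exactly these entries until set_input_zero_copy binds them.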


yongsun (Contributor, Author) commented Jul 10, 2019

@tqchen @icemelon9 Could you please help to review this change? Thanks!

tqchen (Member) commented Jul 10, 2019

Thanks @yongsun. Given that this changes the spec and API, it would be helpful to hold a small RFC discussion. Sorry for the delayed action on this PR. Can you also open an RFC thread to discuss the changes to the graph and API?

cc @ajtulloch @yinghai @hlu1 @yzhliu @srkreddy1238

@tqchen tqchen added the "status: need RFC" label Jul 10, 2019
yongsun (Contributor, Author) commented Jul 10, 2019

@tqchen Thanks so much for your reply. I just created an RFC thread here: https://discuss.tvm.ai/t/new-graphruntime-api-to-share-parameter-between-modules/3284

yinghai (Contributor) left a comment

I think this can be built on top of #3416. You can set up the data_entry_'s for the inputs as before and continue to set up the TVM ops. Then free the storage in data_entry_[i] with NDArray() for each i in inputs. After that you can probably just use set_input_zero_copy as usual.

yongsun (Contributor, Author) commented Jul 19, 2019

When deploying the model on an embedded device, we are trying to avoid the unnecessary memory allocation in the first place. Allocating an additional 7–8 MB is not affordable in our case, even if it can be freed later.

yongsun (Contributor, Author) commented Jul 19, 2019

@tqchen How could I access the CI logs? Thanks!

tqchen (Member) commented Jul 19, 2019

You can click on the Details link to access the CI logs.

yongsun (Contributor, Author) commented Jul 19, 2019

It turns out that the CI site is not accessible from our corp net :( though I can access it from a public network.

yinghai (Contributor) commented Jul 19, 2019

But you have to allocate those inputs outside anyway? Your peak memory shouldn't change, as you free them during initialization of the runtime. And normally inputs won't take up 7–8 MB. Have you tried and tested this for your case?

yongsun (Contributor, Author) commented Jul 19, 2019

In our use case, the RNN model was split into two parts; the output layer in part 2 shares the weights of the embedding layer in part 1. We will not allocate memory for the output layer, but instead point its weight matrix at the NDArray from the embedding layer. The output weight matrix size for an RNN language model is vocab_size * embedding_dim; even for a small/compact RNN model, 7–8 MB is very common.
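For scale, a quick back-of-envelope calculation (with assumed, illustrative sizes, not numbers from the actual model in this thread) shows how a tied output/embedding matrix reaches the quoted range:

```python
# Back-of-envelope check of the 7-8 MB figure above. vocab_size and
# embedding_dim are illustrative assumptions.
vocab_size = 50_000
embedding_dim = 40        # small/compact RNN language model
bytes_per_float = 4       # float32

weight_bytes = vocab_size * embedding_dim * bytes_per_float
print(round(weight_bytes / 2**20, 1))  # 7.6 (MiB) -- duplicated if not shared
```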

yinghai (Contributor) commented Jul 19, 2019

OK. I think you can have data_entry_ as NDArray() and set up the TVM ops with the correct shape info (which is available). Then use set_input_zero_copy to set the input.

yongsun (Contributor, Author) commented Jul 19, 2019

Thanks for your suggestion, it makes a lot of sense to me. I'll update the PR later.

yongsun (Contributor, Author) commented Aug 12, 2019

@yinghai sorry for the late update, I just got some time for this; please help review. Thanks!

yongsun (Contributor, Author) commented Aug 13, 2019

@yinghai I'm sorry that the revised pull request seems to have overwritten your review comments. Could you add your comments again? Thanks! BTW, the CI failed on the doc-gpu task, which should not be impacted by this change.

yinghai (Contributor) left a comment

I get the lazy init part and it makes sense to me. But where is the parameter sharing part?

yongsun (Contributor, Author) commented Aug 14, 2019

The sharing will be done by using set_input_zero_copy, something like this:

decoder.set_input_zero_copy("p0", encoder.get_input("p0"))
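The aliasing semantics behind that call can be mimicked outside TVM. The following is a toy plain-Python sketch (ToyModule and its methods are made-up stand-ins, not TVM API): both modules hold a reference to the same buffer, so no copy is made and a write through one is visible through the other.

```python
# Toy stand-ins for two runtime modules that alias one parameter buffer.
# Nothing here is TVM code; it only illustrates zero-copy sharing.

class ToyModule:
    def __init__(self):
        self.params = {}

    def set_input_zero_copy(self, name, buf):
        # Store a reference to the caller's buffer instead of copying it.
        self.params[name] = buf

    def get_input(self, name):
        return self.params[name]

encoder, decoder = ToyModule(), ToyModule()
encoder.set_input_zero_copy("p0", bytearray(4))

# Share the same buffer with the second module, as in the line above.
decoder.set_input_zero_copy("p0", encoder.get_input("p0"))

# A write through one module is visible through the other: same memory.
encoder.get_input("p0")[0] = 42
print(decoder.get_input("p0")[0])  # 42
```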

yinghai (Contributor) commented Aug 14, 2019

I see. LGTM then. Could you add some unit tests? You can take https://github.com/dmlc/tvm/blob/master/tests/cpp/relay_build_module_test.cc as an example.

yongsun (Contributor, Author) commented Aug 14, 2019

Sure, I'll need to learn how to build a relay module with the C++ API and bind it to a JSON graph with lazy_init_input attributes.

yongsun (Contributor, Author) commented Aug 14, 2019

@yinghai unit test case added and passing, please help review. Thanks!

yinghai (Contributor) commented Aug 15, 2019

@tqchen it's looking good to me. Could you do a pass?

yongsun (Contributor, Author) commented Aug 23, 2019

@tqchen could you please help review? Thanks!

yongsun (Contributor, Author) commented Aug 25, 2019

@yinghai is it possible for you to give approval? Thanks!

icemelon (Member) left a comment

Could you check whether all NDArrays have been allocated before run in the GraphRuntime?

std::vector<uint32_t> lazy_init_entries;
for (auto const& name : attrs_.lazy_init_input) {
int in_idx = GetInputIndex(name);
CHECK_GE(in_idx, 0) << "input \"" << name << "\"does not exist!";
Review comment (Member):

Suggested change
CHECK_GE(in_idx, 0) << "input \"" << name << "\"does not exist!";
CHECK_GE(in_idx, 0) << "input \"" << name << "\" does not exist!";

yongsun (Contributor, Author) replied:

Will fix in next rev.

yongsun (Contributor, Author) commented Aug 28, 2019

Regarding checking whether all NDArrays have been allocated before "run": since set_input_zero_copy directly modifies the DLTensor's data pointer in op_args_, the originally unallocated NDArray instance in data_entry_ will remain unallocated.

Member replied:

set_input_zero_copy doesn't mean that the NDArray data pointer is nullptr. You need to change set_input_zero_copy to set allocated to true in the function.
I do think this check is necessary. Otherwise it will cause a seg fault when you run the graph runtime and some NDArray is not allocated.

yongsun (Contributor, Author) replied:

I think we can do the following check; please help review.

 void GraphRuntime::Run() {
   // setup the array and requirements.
   for (size_t i = 0; i < op_execs_.size(); ++i) {
-    if (op_execs_[i]) op_execs_[i]();
+    if (op_execs_[i]) {
+      auto& op_arg = op_args_[i];
+      if (op_arg) {
+        for (auto& arg : op_arg->args) {
+          CHECK(arg.data != nullptr) << "Un-initialized input!";
+        }
+      }
+      op_execs_[i]();
+    }
   }
 }
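The intent of that guard can be illustrated with a short plain-Python sketch (toy names, not the runtime's actual code): execution is refused for any op whose argument buffer was never bound, with None standing in for a null DLTensor data pointer.

```python
# Toy model of the pre-execution check above. Illustrative only; not TVM code.

def run(op_execs, op_args):
    for exec_fn, args in zip(op_execs, op_args):
        if exec_fn is None:
            continue  # mirror `if (op_execs_[i])`
        for arg in args:
            if arg is None:
                raise RuntimeError("Un-initialized input!")
        exec_fn()

results = []
run([lambda: results.append("ran")], [[b"weights"]])  # bound input: executes

try:
    run([lambda: results.append("ran")], [[None]])    # lazy input never set
except RuntimeError as e:
    print(e)  # Un-initialized input!

print(results)  # ['ran']
```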

src/runtime/graph/graph_runtime.h — two review threads resolved
reader->BeginArray();
CHECK(reader->NextArrayItem());
reader->Read(&type);
CHECK_EQ(type, "list_str");
Review comment (Member):

Could you use "list_int", just to be compatible with "arg_nodes"?

yongsun (Contributor, Author) commented Aug 28, 2019

I thought it should be aligned with the type defined in the JSON file?

"attrs": {
     ... ...
     "lazy_init_input": [
       "list_str",
       [
         "p0"
       ]
     ]
    }

Member replied:

No, I mean you should use "list_int" in the JSON file as well.

yongsun (Contributor, Author) commented Aug 28, 2019

It is very likely that customers will manually edit the JSON file to specify lazy_init_input entries. It would be much easier to use the parameter name rather than its numeric id.

icemelon (Member) left a comment

LGTM.
I've restarted CI and we can merge after CI passes.

@icemelon icemelon merged commit 224cc24 into apache:master Sep 3, 2019
icemelon (Member) commented Sep 3, 2019

Thanks @yongsun @yinghai

@icemelon icemelon removed the "status: need RFC" label Sep 3, 2019
@@ -53,7 +53,15 @@ inline size_t GetDataAlignment(const DLTensor& arr) {
 void GraphRuntime::Run() {
   // setup the array and requirements.
   for (size_t i = 0; i < op_execs_.size(); ++i) {
-    if (op_execs_[i]) op_execs_[i]();
+    if (op_execs_[i]) {
+      auto& op_arg = op_args_[i];
A Contributor commented:

@yongsun @icemelon9 where is this op_args_? There is only op_args in the code, and my build failed because op_args_ cannot be found. How did this even get through CI?

tqchen (Member) commented Sep 3, 2019

This PR is temporarily reverted in https://github.com/dmlc/tvm/pull/3884 due to a problem in master (and the CI test was against an older version). @yongsun please send a new PR that adds the change back. Sorry for the inconvenience.

yongsun (Contributor, Author) commented Sep 3, 2019

@tqchen @icemelon9 @yinghai
Created a new PR based on the latest master branch: #3887
Please help review, thanks!

MarisaKirisame pushed a commit to MarisaKirisame/tvm that referenced this pull request Sep 7, 2019
MarisaKirisame pushed a commit to MarisaKirisame/tvm that referenced this pull request Sep 7, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 16, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 16, 2019