

Should we split Executor::Run into Executor::Prepare and Executor::Exec? #6285

Closed
reyoung opened this issue Dec 5, 2017 · 8 comments

Comments

@reyoung
Collaborator

reyoung commented Dec 5, 2017

Problem

We create new operators in C++ every time Executor::Run is invoked, since we assume the topology may change between runs. However, the program is usually unchanged. Creating operators locally, or sending the protobuf to a remote node again and again, is very time-consuming.

Solution

To reduce the time cost of creating operators in local mode and network communication in cluster mode, we can extract a method named Executor::Prepare.

class Executor {
 public:
  using HANDLE = int;
  // Create the operators (or send the program to a remote node) once.
  virtual HANDLE Prepare(program, feed_list, fetch_list) = 0;
  // Execute a previously prepared program.
  virtual void Exec(HANDLE handle) = 0;
  // Convenience wrapper that keeps today's one-shot behavior.
  void Run(program, feed_list, fetch_list) {
    Exec(Prepare(program, feed_list, fetch_list));
  }
 private:
  std::vector<Ops> prepared_ops_;
};

Prepare returns a HANDLE.

In local mode, it could be an index into an internal data structure of the Executor. That data structure holds the C++ operators that the program contains.

In cluster mode, Prepare could simply send the protobuf of the program to a remote node, and the handle could be an RPC return value. We can then send just the HANDLE to the remote node to execute the associated program, instead of serializing and sending the protobuf again and again.
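A minimal local-mode sketch of what the split could look like (the `Op`/`Program` stand-ins and the index-as-handle scheme are illustrative assumptions, not the actual Paddle types):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative stand-ins for Paddle's OperatorBase and ProgramDesc.
using Op = std::function<void()>;
using Program = std::vector<Op>;

class LocalExecutor {
 public:
  using Handle = std::size_t;

  // Create the operator instances once and cache them; the handle is
  // simply an index into the cache.
  Handle Prepare(const Program& program) {
    prepared_ops_.push_back(program);
    return prepared_ops_.size() - 1;
  }

  // Run a previously prepared program without re-creating its operators.
  void Exec(Handle handle) {
    for (const Op& op : prepared_ops_.at(handle)) op();
  }

  // One-shot convenience wrapper matching the current Run behavior.
  void Run(const Program& program) { Exec(Prepare(program)); }

 private:
  std::vector<Program> prepared_ops_;
};
```

Calling Exec(handle) repeatedly reuses the cached operators, which is where the saving over the current one-shot Run would come from.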

@qingqing01
Contributor

To my understanding, there are two problems:

If we can solve these problems with a better design without losing flexibility, that would be good.

@tonyyang-svail

In local mode, it could be an index into an internal data structure of the Executor. That data structure holds the C++ operators that the program contains.

Looks like the cost of creating an operator instance is not significant.

In cluster mode, Prepare could simply send the protobuf of the program to a remote node, and the handle could be an RPC return value. We can then send just the HANDLE to the remote node to execute the associated program, instead of serializing and sending the protobuf again and again.

Although we expect the serialization and sending are slow, can we first measure how slow it is? Also, @reyoung how does an executor determine current_program == previous_program?

@helinwang
Contributor

helinwang commented Dec 5, 2017

  • To avoid premature optimization, we need to profile before doing optimization.

  • If we need to optimize, the interface does not have to change. We can still have one method called Run, but internally Run caches the states so it can be reused in the next run. We probably should not expose the detail (optimization) to the API.
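A rough sketch of that idea, keeping the single Run method while caching internally (the stand-in types are illustrative, and keying the cache by program identity rather than structural equality is an assumption):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Illustrative stand-ins for Paddle's OperatorBase and ProgramDesc.
using Op = std::function<void()>;
using Program = std::vector<Op>;

class CachingExecutor {
 public:
  // Same one-method API as today; preparation happens lazily inside Run.
  void Run(const Program* program) {
    auto it = cache_.find(program);
    if (it == cache_.end()) {
      // First time we see this program: "create the operators" once.
      it = cache_.emplace(program, *program).first;
    }
    for (const Op& op : it->second) op();
  }

 private:
  // Keyed by pointer identity, which avoids a costly structural
  // comparison but assumes the caller does not mutate the program.
  std::map<const Program*, Program> cache_;
};
```

Keying by identity sidesteps the cost of deciding whether two programs are structurally equal, at the price of a stale cache if a program object is mutated in place.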

In cluster mode, Prepare could simply send the protobuf of the program to a remote node, and the handle could be an RPC return value. We can then send just the HANDLE to the remote node to execute the associated program, instead of serializing and sending the protobuf again and again.

I think different executors should never communicate with each other; the module that sends the ExecutionPlan to the executors should communicate with all of them.

@reyoung
Collaborator Author

reyoung commented Dec 6, 2017

@tonyyang-svail

Looks like the cost of creating an operator instance is not significant.

The cost of creating operators in an RNN is very significant, since operators are created at every time step. It can also be significant in the remote case.

@tonyyang-svail @helinwang

To avoid premature optimization, we need to profile before doing optimization.

@chengduoZH @qingqing01
We have profiling results for a plain network right now. We need to accumulate the time cost of Program.clone() in Python and Executor.Run in C++; the total time cost could be listed here as a comment.

From my experience running a plain network and an RNN, the time cost of Executor.Run might be around 8% and 20%, respectively.

@tonyyang-svail @helinwang

how does an executor determine current_program == previous_program?

We can still have one method called Run, but internally Run caches the states so it can be reused in the next run. We probably should not expose the detail (optimization) to the API.

The time complexity of determining current_program == previous_program is exactly the same as that of creating the C++ operators, because we would need to compare every operator and variable between the two programs.

In this issue, I suggest letting end users or a higher-level API manage the cache handle, not the Executor. We could provide a Trainer API and manage the cache handle inside the trainer. However, the low-level API should still be provided, since users can train many different programs in the same Python file.

Alternatively, a straightforward approach is to make the program a constructor parameter of Executor, i.e., each executor can only execute one fixed program.
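That alternative could look roughly like this (again with illustrative stand-in types; binding the program at construction makes the "prepare once" step implicit):

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-ins for Paddle's OperatorBase and ProgramDesc.
using Op = std::function<void()>;
using Program = std::vector<Op>;

// Hypothetical sketch: the program is fixed at construction, so the
// operators are created exactly once and Run needs no arguments.
class BoundExecutor {
 public:
  explicit BoundExecutor(Program program) : ops_(std::move(program)) {}

  void Run() {
    for (const Op& op : ops_) op();
  }

 private:
  Program ops_;  // the operators built from the one bound program
};
```

This removes the need for a handle entirely, at the cost of constructing a new executor for each distinct program.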

@helinwang
Contributor

helinwang commented Dec 6, 2017

In this issue, I suggest to let end users or a higher level API manage the cache handle, not the Executor. We could provide a Trainer API, and manage the cache handle inside the trainer.

Curious: what is the benefit of letting end users or a higher-level API manage the cache handle? I can see the benefits of not exposing cache handling: a simpler executor API, and no chance for the user to mess up the caching.

@reyoung
Collaborator Author

reyoung commented Dec 22, 2017

The time consumed creating and destroying operators in a dynamic RNN is pretty large: about 12.4% of computation time, which could be fully optimized away.

@reyoung
Collaborator Author

reyoung commented Dec 22, 2017

Related issue #6885

@Xreki
Contributor

Xreki commented May 14, 2018

This was done in #9000.

@Xreki Xreki closed this as completed May 14, 2018