

Should we split Executor::Run into Executor::Prepare and Executor::Exec? #6285

Closed
reyoung opened this issue Dec 5, 2017 · 8 comments

Comments

@reyoung
Collaborator

reyoung commented Dec 5, 2017

Problem

We create new operators in C++ every time Executor::Run is invoked, since we assume the topology may change between runs. However, the program is usually unchanged. Creating operators locally, or sending the protobuf to a remote node again and again, is very time-consuming.

Solution

To reduce the time cost of creating operators in local mode and network communication in cluster mode, we can extract a method named Executor::Prepare.

class Executor {
 public:
  using HANDLE = int;
  // Create the operators (or send the program to a remote node) once.
  virtual HANDLE Prepare(program, feed_list, fetch_list) = 0;
  // Execute a previously prepared program.
  virtual void Exec(HANDLE handle) = 0;
  // Convenience wrapper that keeps today's one-shot behavior.
  void Run(program, feed_list, fetch_list) {
    Exec(Prepare(program, feed_list, fetch_list));
  }
 private:
  std::vector<Ops> prepared_ops_;
};

Prepare returns a HANDLE.

In local mode, it could be an index into an internal data structure of the Executor. That data structure holds the C++ operators that the program contains.

In cluster mode, Prepare could simply send the protobuf of the program to a remote node, and the handle could be an RPC return value. We can then send just the HANDLE to the remote node to execute the associated program, instead of serializing and sending the protobuf again and again.
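A minimal local-mode sketch of what the split could look like (the `Op`/`Program` stand-ins and the index-as-handle scheme are illustrative assumptions, not the actual Paddle types):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative stand-ins for Paddle's OperatorBase and ProgramDesc.
using Op = std::function<void()>;
using Program = std::vector<Op>;

class LocalExecutor {
 public:
  using Handle = std::size_t;

  // Create the operator instances once and cache them; the handle is
  // simply an index into the cache.
  Handle Prepare(const Program& program) {
    prepared_ops_.push_back(program);
    return prepared_ops_.size() - 1;
  }

  // Run a previously prepared program without re-creating its operators.
  void Exec(Handle handle) {
    for (const Op& op : prepared_ops_.at(handle)) op();
  }

  // One-shot convenience wrapper matching the current Run behavior.
  void Run(const Program& program) { Exec(Prepare(program)); }

 private:
  std::vector<Program> prepared_ops_;
};
```

Calling Exec(handle) repeatedly reuses the cached operators, which is where the saving over the current one-shot Run would come from.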

@qingqing01
Contributor

To my understanding, there are two problems:

If we can solve these problems with a better design without losing flexibility, that would be good.

@tonyyang-svail

In local mode, it could be an index into an internal data structure of the Executor. That data structure holds the C++ operators that the program contains.

Looks like the cost of creating an operator instance is not significant.

In cluster mode, Prepare could simply send the protobuf of the program to a remote node, and the handle could be an RPC return value. We can then send just the HANDLE to the remote node to execute the associated program, instead of serializing and sending the protobuf again and again.

Although we expect the serialization and sending are slow, can we first measure how slow it is? Also, @reyoung how does an executor determine current_program == previous_program?

@helinwang
Contributor

helinwang commented Dec 5, 2017

  • To avoid premature optimization, we need to profile before doing optimization.

  • If we need to optimize, the interface does not have to change. We can still have one method called Run, but internally Run caches the states so it can be reused in the next run. We probably should not expose the detail (optimization) to the API.
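A rough sketch of that idea, keeping the single Run method while caching internally (the stand-in types are illustrative, and keying the cache by program identity rather than structural equality is an assumption):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Illustrative stand-ins for Paddle's OperatorBase and ProgramDesc.
using Op = std::function<void()>;
using Program = std::vector<Op>;

class CachingExecutor {
 public:
  // Same one-method API as today; preparation happens lazily inside Run.
  void Run(const Program* program) {
    auto it = cache_.find(program);
    if (it == cache_.end()) {
      // First time we see this program: "create the operators" once.
      it = cache_.emplace(program, *program).first;
    }
    for (const Op& op : it->second) op();
  }

 private:
  // Keyed by pointer identity, which avoids a costly structural
  // comparison but assumes the caller does not mutate the program.
  std::map<const Program*, Program> cache_;
};
```

Keying by identity sidesteps the cost of deciding whether two programs are structurally equal, at the price of a stale cache if a program object is mutated in place.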

In cluster mode, Prepare could simply send the protobuf of the program to a remote node, and the handle could be an RPC return value. We can then send just the HANDLE to the remote node to execute the associated program, instead of serializing and sending the protobuf again and again.

I think different executors should never communicate with each other; the module that sends the ExecutionPlan to the executors should communicate with all of them.

@reyoung
Collaborator Author

reyoung commented Dec 6, 2017

@tonyyang-svail

Looks like the cost of creating an operator instance is not significant.

The cost of creating operators in an RNN is very significant, since operators are created at every time step. It can also be significant in the remote case.

@tonyyang-svail @helinwang

To avoid premature optimization, we need to profile before doing optimization.

@chengduoZH @qingqing01
We have profiling results for a plain network right now. We need to accumulate the time cost of Program.clone() in Python and Executor.Run in C++; the total time cost could be listed here as a comment.

From my experience running a plain network and an RNN, the time cost of Executor.Run might be around 8% and 20%, respectively.

@tonyyang-svail @helinwang

how does an executor determine current_program == previous_program?

We can still have one method called Run, but internally Run caches the states so it can be reused in the next run. We probably should not expose the detail (optimization) to the API.

The time complexity of determining current_program == previous_program is exactly the same as that of creating the C++ operators, because we would need to compare every operator and variable between the two programs.

In this issue, I suggest letting end users or a higher-level API manage the cache handle, not the Executor. We could provide a Trainer API and manage the cache handle inside the trainer. However, the low-level API should still be provided, since users can train many different programs in the same Python file.

Alternatively, a straightforward approach is to make the program a constructor parameter of Executor, i.e., each executor can only execute one fixed program.
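That alternative could look roughly like this (again with illustrative stand-in types; binding the program at construction makes the "prepare once" step implicit):

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-ins for Paddle's OperatorBase and ProgramDesc.
using Op = std::function<void()>;
using Program = std::vector<Op>;

// Hypothetical sketch: the program is fixed at construction, so the
// operators are created exactly once and Run needs no arguments.
class BoundExecutor {
 public:
  explicit BoundExecutor(Program program) : ops_(std::move(program)) {}

  void Run() {
    for (const Op& op : ops_) op();
  }

 private:
  Program ops_;  // the operators built from the one bound program
};
```

This removes the need for a handle entirely, at the cost of constructing a new executor for each distinct program.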

@helinwang
Contributor

helinwang commented Dec 6, 2017

In this issue, I suggest to let end users or a higher level API manage the cache handle, not the Executor. We could provide a Trainer API, and manage the cache handle inside the trainer.

Curious: what is the benefit of letting end users or a higher-level API manage the cache handle? I can see the benefits of not exposing cache handling: a simpler executor API, and no chance for the user to mess up the caching.

@reyoung
Collaborator Author

reyoung commented Dec 22, 2017

The time consumed creating and destroying operators in a dynamic RNN is pretty large: about 12.4% of computation time, which could be fully optimized away.

@reyoung
Collaborator Author

reyoung commented Dec 22, 2017

Related issue #6885

@Xreki
Contributor

Xreki commented May 14, 2018

This was done in #9000.

@Xreki Xreki closed this as completed May 14, 2018