Insert send op while backward op finished #9382
Conversation
paddle/fluid/operators/recv_op.cc
Outdated
mutable detail::RPCClient client_;
for (size_t i = 0; i < outs.size(); i++) {
  VLOG(2) << "getting " << outs[i] << " from " << epmap[i];
  rpc_client->AsyncGetVariable(epmap[i], ctx, scope, outs[i]);
Why are there two rpc_client->AsyncGetVariable(epmap[i], ctx, scope, outs[i]); calls in this function (another one at line 56)?
Sorry, that was a mistake; it has been fixed.
4. append send_op to send splited variables to server and fetch
   params(splited blocks or origin param) from server.
5. append concat_op to merge splited blocks to update local weights.
3. modify trainer program add split_op and send_op to each grad variable.
Is our current implementation one send_op for each grad variable, or one send_op for all grad variables?
From my understanding, these lines seem to mean one for each, but the implementation seems to be one for all. Maybe the description should match the implementation?
Done.
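For context, a minimal sketch of the "one send_op for all grad variables" shape, assuming the fluid-style Block.append_op API used elsewhere in this PR; the helper name and argument names below are illustrative, not the transpiler's actual code:

def append_single_send_op(program, splited_grad_vars, fetched_param_vars,
                          eplist, pserver_endpoints):
    # Hypothetical helper: a single send_op carries every splited gradient,
    # instead of one send_op per grad variable.
    program.global_block().append_op(
        type="send",
        inputs={"X": splited_grad_vars},      # all splited grads at once
        outputs={"Out": fetched_param_vars},  # params fetched back from pservers
        attrs={"epmap": eplist,               # endpoint for each splited block
               "endpoints": pserver_endpoints})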
def _find_op_by_out(self, program, var):
    for idx, op in enumerate(program.global_block().ops):
        if var.name in op.output_arg_names:
            return idx + 1
Maybe return idx here and have the caller do insert_idx = self._find_op_by_out(program, splited_vars[0]) + 1, since the name of this function indicates the returned index is the index of the op, not the index at which the new op should be inserted.
Done.
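For illustration, the reviewer's suggestion could look like the following sketch (it mirrors the snippet above; splited_vars comes from the surrounding transpiler code and is taken from the review comment):

def _find_op_by_out(self, program, var):
    # Return the index of the op that produces var, or -1 if none does.
    for idx, op in enumerate(program.global_block().ops):
        if var.name in op.output_arg_names:
            return idx
    return -1

# The caller then decides where to insert: right after the producing op.
insert_idx = self._find_op_by_out(program, splited_vars[0]) + 1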
            return idx + 1
    return -1


def _insert_send_vars_op(self, program, index, send_vars, epmap, eps):
This function seems unused; are we planning to use it?
Deleted the unused function, done.
"epmap": epmap, | ||
"endpoints": eps}) | ||
|
||
def _append_trainer_op(self, program, gradblocks, spliter): |
I think "trainer op" is too general, the trainer could do forward, calculate grad, send grad, recv param. But this function seems to only do split and send grad? Maybe change to another more specific name?
Thanks, I changed the name to dispatch_trainer_grads and added some comments here.
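As a rough sketch of the split-then-send flow this function covers (assuming the fluid-style block API; the exact split op type and variable bookkeeping in the PR may differ, and append_single_send_op refers to the hypothetical helper sketched earlier):

def dispatch_trainer_grads_sketch(program, grad_to_blocks, dispatcher, endpoints):
    # Hypothetical sketch, not the PR's implementation. grad_to_blocks maps
    # each grad variable to its pre-created splited block variables;
    # dispatcher is the PSDispatcher discussed below.
    block = program.global_block()
    all_splited = []
    for grad_var, splited_vars in grad_to_blocks.items():
        # One split op per grad variable, producing its splited blocks.
        block.append_op(
            type="split",
            inputs={"X": grad_var},
            outputs={"Out": splited_vars},
            attrs={"sections": [v.shape[0] for v in splited_vars]})
        all_splited.extend(splited_vars)
    # A single send_op then carries all splited grads (see the earlier sketch).
    eplist = dispatcher.dispatch(all_splited)
    append_single_send_op(program, all_splited, [], eplist, endpoints)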
@@ -13,38 +13,63 @@
# limitations under the License.


def hash_name(varlist, pserver_endpoints):
class DistributedSpliter(object):
"Distributed" seems very general, consider rename to PSStrategy (TF uses this name in https://www.tensorflow.org/api_docs/python/tf/train/replica_device_setter) or PSSpliter, PSBalancer, PSDispatcher? Just some ideas, feel free to come up with your own naming :D
Thanks! A specific name looks better; I changed the name to PSDispatcher.
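For readers following the rename, a minimal sketch of what a PSDispatcher-style abstraction could look like (the hash variant mirrors the existing hash_name helper; the PR's actual class layout may differ):

import hashlib


class PSDispatcher(object):
    # Sketch: map splited variables to pserver endpoints; subclasses
    # decide the placement policy.
    def __init__(self, pserver_endpoints):
        self._eps = pserver_endpoints

    def dispatch(self, varlist):
        raise NotImplementedError("use a concrete dispatcher")


class RoundRobinDispatcher(PSDispatcher):
    # Assign variables to endpoints in round-robin order.
    def dispatch(self, varlist):
        return [self._eps[i % len(self._eps)] for i, _ in enumerate(varlist)]


class HashNameDispatcher(PSDispatcher):
    # Assign each variable by hashing its name, like the old hash_name().
    def dispatch(self, varlist):
        eplist = []
        for var in varlist:
            digest = hashlib.md5(var.name.encode("utf-8")).hexdigest()
            eplist.append(self._eps[int(digest, 16) % len(self._eps)])
        return eplist

Usage would then match the call site shown just below, e.g. eplist = RoundRobinDispatcher(pserver_endpoints).dispatch(splited_vars).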
eplist = dispatcher.dispatch(splited_vars)
rpc_client_var = layers.io.get_rpc_client_var(program)

program.global_block().insert_op(
This probably cannot overlap the GPU-to-host memory copy with GPU computation:
operations in the same CUDA stream execute in issue order. @chengduoZH
We may need a separate CUDA stream to do the copy.
Thanks @gongweibao. The point here is that the RPC I/O and the computation of the backward ops can run in parallel.
Also, @typhoonzero has an earlier PR #9425 that implements pinned memory first, which should allow the copies to run in parallel; I will discuss with @typhoonzero this afternoon to confirm.
This PR has too many conflicts with the latest code, so I will close this PR and reopen a new one to implement #9161
Related #9161
Benchmark with fc size 4096:
4 trainers + 4 pservers:
local:
The speedup ratio seems to be only about 31%.