Adding fluid distributed training guide doc #7619

putcn · 2018-01-17T18:26:11Z

helinwang

Awesome!

helinwang · 2018-01-17T19:24:57Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+
+### Have PaddlePaddle installed
+
+PaddlePaddle must be installed on all nodes. It would be great if you have GPU cards on your nodes, be sure to properly install drivers and CUDA libraries.


Maybe change "It would be great if you have GPU cards" to "If you have GPU cards"? Doesn't seem GPU is any better than CPU only nodes in the distributed settings.

let me update, thanks!

helinwang · 2018-01-17T19:26:03Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+exit(1)
+```
+
+We created a simple fully connected net program and handled it to the fluid executor to run for 100 passes.


Maybe change "net program" to "neural networks training program"

abhinavarora · 2018-01-17T19:54:23Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+
+## Introduction
+
+In this article, we'll explain how to config and run distributed training job with PaddlePaddle Fluid in a bare metal cluster.


job -> jobs

abhinavarora · 2018-01-17T19:55:03Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+
+### Get your cluster ready
+
+Prepare your computer node in the cluster. Nodes in this cluster can be any spec that runs PaddlePaddle, and with a unique IP address assigned to it. Make sure they can talk to each other.


node -> nodes

can be any spec -> can be of any specification

I think it would be better to use the word communicate instead of talk as this is a formal document.

will update

abhinavarora · 2018-01-17T19:57:49Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+
+#### Introducing parameter server
+
+As you see from the non-cluster version of training script, there is only one role in it: the trainer, who does the computing as well as holding parameters. In cluster training, since multi-trainers are working on the same task, they need one centralized the place to hold and distribute parameters. This centralized place is called Parameter Server in PaddlePaddle.


they need one centralized the place -> they need one centralized place

called Parameter Server -> called the Parameter Server
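
For readers of this thread, the trainer/parameter-server split described in the quoted paragraph can be illustrated with a toy, framework-free sketch. Nothing below is PaddlePaddle API: `ToyParameterServer`, `pull`, `push_gradient`, and `trainer_step` are hypothetical names, and the model (a single weight `w` fitted to `y = 2 * x`) is an assumption chosen only to keep the example short.

```python
class ToyParameterServer:
    """Holds the parameters centrally and applies gradients sent by trainers."""

    def __init__(self, params, learning_rate=0.1):
        self.params = dict(params)
        self.learning_rate = learning_rate

    def pull(self):
        # Trainers fetch a copy of the current parameters.
        return dict(self.params)

    def push_gradient(self, grads):
        # Plain SGD update, applied on the server side.
        for name, grad in grads.items():
            self.params[name] -= self.learning_rate * grad


def trainer_step(server, x, y):
    """One trainer step for the model y_hat = w * x with squared error."""
    w = server.pull()["w"]
    grad_w = 2.0 * (w * x - y) * x
    server.push_gradient({"w": grad_w})


server = ToyParameterServer({"w": 0.0})
# Two "trainers" take turns, each working on its own sample of y = 2 * x.
for _ in range(50):
    trainer_step(server, 1.0, 2.0)
    trainer_step(server, 2.0, 4.0)
```

The trainers only compute gradients; the shared state lives in one place, which is the role the document assigns to the parameter server.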

abhinavarora · 2018-01-17T19:59:07Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+
+![parameter server architect](src/trainer.png)
+
+Parameter Server in fluid does not only hold parameters but also assigned with part of the program. Trainers communicate with parameter servers via send/receive OPs. For more tech detail, please refer to this [document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/dist_refactor/distributed_architecture.md).


but also -> but is also
with part of the program -> with a part of the program

abhinavarora · 2018-01-17T20:00:02Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+
+Fluid provides a tool called "Distribute Transpiler" to automatically convert the non-cluster program into cluster program.
+
+The idea behind this tool is to find optimize OPs and gradient parameters, slice the program into 2 pieces and connect then with send/receive OP.


then -> them

abhinavarora · 2018-01-17T20:00:26Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+
+The idea behind this tool is to find optimize OPs and gradient parameters, slice the program into 2 pieces and connect then with send/receive OP.
+
+And optimize OPs and gradient parameters can be found from the return values of optimizer's minimize function.


You can drop the And
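
As a rough illustration of the "slice the program into 2 pieces" idea in the quoted text, here is a toy sketch that treats a program as a flat list of op names. It is not the Distribute Transpiler's implementation: `split_program` and the op names are hypothetical, and modelling the connection as a `send` op on the trainer side and a `recv` op on the pserver side is a simplifying assumption.

```python
def split_program(ops, optimize_ops):
    """Split a flat op list into a trainer piece and a pserver piece."""
    optimize_set = set(optimize_ops)
    # Everything that is not an optimize op stays on the trainer,
    # followed by a "send" op that ships gradients out.
    trainer_ops = [op for op in ops if op not in optimize_set] + ["send"]
    # The pserver receives gradients, then runs the optimize ops.
    pserver_ops = ["recv"] + [op for op in ops if op in optimize_set]
    return trainer_ops, pserver_ops


program = ["mul", "mean", "mean_grad", "mul_grad", "sgd"]
trainer_ops, pserver_ops = split_program(program, optimize_ops=["sgd"])
```

In this toy model the list passed as `optimize_ops` plays the part of the optimize OPs that the document says come from the return values of the optimizer's minimize function.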

abhinavarora · 2018-01-17T20:01:07Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+#current_endpoint here means current pserver IP:PORT you wish to run on
+exe.run(t.get_pserver_program(current_endpoint, optimize_ops)) 
+
+# in trianer, run this


trianer -> trainer

abhinavarora

Thank you for this amazing document. There are some grammatical mistakes; please fix them before merging. Everything else looks good.

helinwang

LGTM++!

abhinavarora

LGTM!

typhoonzero · 2018-01-18T02:21:24Z

doc/howto/usage/cluster/fluid_cluster_train_en.md

+... #create executor
+
+# in pserver, run this
+exe.run(fluid.default_startup_program())


Sorry, but the latest transpiler also updates the startup_program for each pserver, so you have to run the pserver like this:

pserver_prog = t.get_pserver_program(current_endpoint, optimize_ops)
pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
exe.run(pserver_startup)
exe.run(pserver_prog)

Got it, let me update the doc.
Just to confirm: the trainers will still work with the same default_startup_program, right?
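
A sketch of the pserver-side call order suggested above, wrapped in a helper. It takes the executor `exe` and the transpiler `t` as arguments and uses only the calls quoted in the comment; `run_pserver` is a hypothetical name, and the trainer side is left out because the thread leaves that question open.

```python
def run_pserver(exe, t, current_endpoint, optimize_ops):
    """Run a parameter server using the per-pserver startup program.

    `exe` and `t` are assumed to be an already-created executor and an
    already-transpiled Distribute Transpiler; `current_endpoint` is the
    IP:PORT this pserver should run on.
    """
    # Build the program assigned to this pserver.
    pserver_prog = t.get_pserver_program(current_endpoint, optimize_ops)
    # Ask the transpiler for the matching startup program, rather than
    # using the default startup program.
    pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
    # Initialise first, then run the pserver program.
    exe.run(pserver_startup)
    exe.run(pserver_prog)
```

Passing `exe` and `t` in keeps the helper independent of how they were created, so the ordering can be checked with stand-in objects.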

putcn added 3 commits January 16, 2018 16:04

init check in for fluid dist train doc

51f3447

gramma update

f9c76f4

minor tweaks

000e236

putcn requested review from Yancey1989, helinwang, gongweibao and typhoonzero January 17, 2018 18:26

helinwang reviewed Jan 17, 2018

abhinavarora reviewed Jan 17, 2018

abhinavarora previously approved these changes Jan 17, 2018

update following comments

cc619a5

putcn dismissed abhinavarora’s stale review via cc619a5 January 17, 2018 20:59

helinwang approved these changes Jan 17, 2018

abhinavarora approved these changes Jan 17, 2018

abhinavarora merged commit f6dfccb into PaddlePaddle:develop Jan 17, 2018

putcn deleted the doc_howto_fluid_dist_train branch January 18, 2018 00:53

typhoonzero reviewed Jan 18, 2018

putcn added a commit to putcn/Paddle that referenced this pull request Jan 19, 2018

update doc and dist test due to API change PaddlePaddle#7619 (review)

95d6dce

putcn mentioned this pull request Jan 19, 2018

update doc and dist test due to transpiler change #7669

Merged

