Update design of dist train refactor #5776
Changes from 1 commit
@@ -16,9 +16,9 @@ limitations:
    write the inter-model-shard communication code.

 3. The user can not directly specify the parameter update rule: need
-   to modify the parameter server C++ code and compile a new
-   binary. This adds complication for researchers: A lot of extra
-   effort is required. Besides, the training job submission program
+   to modify the parameter server C++ code and compile a new binary.
+   This adds complication for researchers: A lot of extra effort is
+   required. Besides, the training job submission program
    may not allow running arbitrary binaries.

 This design doc discusses PaddlePaddle's new distributed training
@@ -44,7 +44,7 @@ replicated Python instances are running on different nodes: both the
 training logic and the neural network computation is replicated.

 The tasks that should only run once all belong to the training logic,
-if we only replicate the neural network computation, but do **not**
+if we only replicate the neural network computation but do **not**
 replicate the training logic, the limitation could be solved.

 ### Limitation 2
@@ -53,13 +53,13 @@ Model parallelism means running a single model on multiple nodes by
 partitioning the model onto different nodes and managing the
 inter-model-shard communications.

-PaddlePaddle should be able to modify the nerual network computation
+PaddlePaddle should be able to modify the neural network computation
 definition to support model parallelism automatically. However, the
-computation is only specified in Python code, and PaddlePaddle can not
+computation is only specified in Python code, and PaddlePaddle cannot
 modify Python code.

-Just like compiler uses a intermediate representation (IR) so that
-programmer does not need to manually optimize their code in most of
+Just like compiler uses an intermediate representation (IR) so that
+the programmer does not need to manually optimize their code in most of
 the cases - the compiler will optimize the IR:

 <img src="src/compiler.png"/>
@@ -75,20 +75,20 @@ Python:
 ### Limitation 3

 The user can not directly specify the parameter update rule for the
-parameter server because the previous implementaion hard coded that
+parameter server because the previous implementation hard coded that
 parameter server only do vector's optimization algorithm by
 configuration. The user can not specify the parameter server's
 computation layer by layer.

 This could be fixed by making the parameter server run a separated
-IR according to the trainer's varialble (tensors, selectedrows)
-defination.
+IR according to the trainer's variable (tensors, selectedrows)
+definition.

 the same
-computation definition as the trainer. For a detailed explanation,
+computation definition of the trainer. For a detailed explanation,
 please
 see
-[Design Doc: Operation Graph Based Parameter Server](./parameter_server.md)
+[Design Doc: Operation Graph-Based Parameter Server](./parameter_server.md)
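To make the fix described above more concrete (the parameter server running its own IR derived from the trainer's variable definitions), here is a toy, self-contained sketch of splitting one program into a trainer part and a parameter-server part. None of these names are PaddlePaddle APIs; they are invented for illustration only.

```python
# Illustrative only: a toy "program" is a list of (op_type, inputs, outputs)
# tuples. Real PaddlePaddle programs are protobuf ProgramDescs.
TRAINER_PROGRAM = [
    ("forward",  ["image", "W"],      ["logits"]),
    ("loss",     ["logits", "label"], ["cost"]),
    ("backward", ["cost"],            ["W@GRAD"]),
    ("sgd",      ["W", "W@GRAD"],     ["W"]),   # the parameter update rule
]

OPTIMIZE_OP_TYPES = {"sgd", "adam", "momentum"}

def split_program(program):
    """Split one program into a trainer IR and a parameter-server IR.

    Optimize ops move to the pserver program; the trainer instead sends the
    gradient and receives the updated parameter.
    """
    trainer, pserver = [], []
    for op_type, inputs, outputs in program:
        if op_type in OPTIMIZE_OP_TYPES:
            param, grad = inputs[0], inputs[1]
            # the update rule now runs on the parameter server ...
            pserver.append(("recv", [grad], []))
            pserver.append((op_type, inputs, outputs))
            pserver.append(("send", [param], []))
            # ... and the trainer only exchanges tensors with the pserver
            trainer.append(("send", [grad], []))
            trainer.append(("recv", [], [param]))
        else:
            trainer.append((op_type, inputs, outputs))
    return trainer, pserver

if __name__ == "__main__":
    trainer_prog, pserver_prog = split_program(TRAINER_PROGRAM)
    print("trainer ops:", [op[0] for op in trainer_prog])
    print("pserver ops:", [op[0] for op in pserver_prog])
```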
 ## Distributed Training Architecture
@@ -136,18 +136,43 @@ iteratively.

 As shown in the graph, `RemoteExecutor.run` sends the IR to the
 PaddlePaddle cluster for Execution. You can also use parameter
-`fetch_list` to interactively fetch varirable back to local for
+`fetch_list` to interactively fetch variable back to local for
 log printing.

+The Python `RemoteExecutor` is derived from `Executor` class.
+For more information about `RemoteExecutor`, please
+see [Design Doc: RemoteExecutor](./remote_executor.md).
+The `RemoteExecutor.run` interface definition is:

```python
run(self,
    program=None,
    feed=None,
    fetch_list=None,
    feed_var_name='feed',
    fetch_var_name='fetch',
    job_desc=JobDesc(
        jobname,
        num_trainer,
        num_pserver,
        cpu_per_trainer,
        gpu_per_trainer,
        mem_per_trainer,
        cpu_per_pserver,
        mem_per_pserver
    ))
```

**Review discussion attached to the `job_desc` parameter:**

- Maybe we want to separate job creation and …

- Is it fine to do job submission when creating …

- We probably need to support many actions for a job (e.g., creation, view state, deletion). Maybe we can use a CLI to manage the resource (it can be a Python package); the CLI returns a job ID and the Python script can use the job ID in …

- OR is it more reasonable to implement it like below? The Python program runs on every node, and the k8s job is still managed by the CLI, which is the same as the current implementation?

  ```python
  job_desc = JobDesc(num_trainer=4, num_ps=2, GPU_per_trainer=2)
  pserver = RemoteExecutor(placement="/pserver/*", job_desc=job_desc)
  trainer = RemoteExecutor(placement="/trainer/*", job_desc=job_desc)

  placement = os.getenv('PADDLE_PLACEMENT')
  if placement.startswith('/pserver'):
      pserver.run(startup_program)
      pserver.run(optimize_program)
      pserver.wait()
  elif placement.startswith('/trainer'):
      for pass_num in xrange(num_passes):
          trainer.run(main_program)
          trainer.wait()  # wait for sync SGD
  ```

- Please refer to this graph: there is only a single Python instance that controls the C++ instances in the cluster. The Python instance could run locally or in the cluster. Please see Limitation 1 on why a single Python instance is better. The Python code sends the entire graph to a service process in the cluster that will do the graph conversion and placement and send sub-graphs to each runtime instance. So I think the Python code that the user writes should not contain the placement logic below:

  ```python
  placement = os.getenv('PADDLE_PLACEMENT')
  if placement.startswith('/pserver'):
      pserver.run(startup_program)
      pserver.run(optimize_program)
      pserver.wait()
  elif placement.startswith('/trainer'):
      for pass_num in xrange(num_passes):
          trainer.run(main_program)
          trainer.wait()  # wait for sync SGD
  ```

  And only one executor is necessary. So maybe something like:

  ```python
  job = paddle.createJob(num_trainer=4, num_ps=2, GPU_per_trainer=2)  # or create using CLI.
  exe = paddle.RemoteExecutor(job)
  exe.run(main_program)
  ```

- An `Orchestrator`-based alternative:

  ```python
  # the orchestrator instance here is just a client of Orchestrator which runs in machine_a.cluster at port 2222
  orchestrator = paddle.Orchestrator(uri="machine_a.cluster:2222")
  # add an executor. orchestrator will talk to Executor and fetch device type (CPU, GPU) and other device hardware data
  orchestrator.addExecutors(uri="machine_a.cluster:3333")
  # add another remote Executor
  orchestrator.addExecutors(uri="machine_b.cluster:3333")
  job_desc = paddle.createJob(num_trainer=4, num_ps=2)  # there should be other config regarding affinity.
  # also, should we wrap job and main_program together? I'm not sure
  orchestrator.run(main_program, job_desc)  # submit the main_program and job_desc to the remote orchestrator
  ```

  If we just need to run a local training job, create an orchestrator on localhost and add localhost Executors.

- Just discussed with @helinwang; here are the ideas we want to talk to you about:

  - `./orchestrator -port 8080 -ip 192.168.1.2`: the orchestrator starts listening for incoming resource registration and protobuf.
  - `./executor -orchestrator_uri 192.168.1.2:8080`: an executor is created and registers itself with the orchestrator, along with its hardware features.
  - `python train.py 192.168.1.2:8080`: the protobuf (IR) is sent to the orchestrator; the orchestrator changes/cuts the IR and sends sub-IRs to the executors. Based on the hardware features of the executors, they are assigned different roles (pserver or trainer).
  - `python train.py`: the IR is delivered to a lite orchestrator which only does the graph changes needed for the multi-thread or multi-GPU case; the updated graph is then delivered to the executors and run.

- Since we are only considering Kubernetes, the …

- The Python code example above is:

  ```python
  for data in train_reader():
      loss, acc = exe.run(trainer_prog,
                          feed=feeder.feed(data),
                          fetch_list=[avg_cost])
  ```

  The …

- See the discussions here: because we use Python code to form the iteration loop, the "remote executor" cannot just run a …

- Thanks for the discussion. The Python loop feels native but is not as efficient for remote training; I think we should provide both options and let the user choose, so we probably still need to support the Python loop well. Can a training job only be one ProgramDesc run (e.g., one run with a while op)? I think we don't need to couple the training job with a ProgramDesc run. The training job can just be some resource, waiting to run however many ProgramDescs; it preserves state between multiple ProgramDesc runs. In this way we can support the Python loop in remote training.

- Agree with supporting both methods.
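Following the last comments above, a minimal sketch of the "job as a long-lived resource" idea: the cluster job is created once and then driven from a plain Python loop with multiple ProgramDesc runs. The names (`create_job`, `RemoteExecutor`, `JobDesc`) follow the snippets in this thread and are illustrative, not a settled API.

```python
# Sketch only: the job is created once and owns the cluster resources; every
# exe.run call ships a ProgramDesc (IR) to the same job, and parameters and
# other state persist in the job between runs.
job_desc = JobDesc(num_trainer=4, num_pserver=2, gpu_per_trainer=1)
job = paddle.create_job(job_desc)      # or create the job via a CLI and pass its ID
exe = paddle.RemoteExecutor(job)

for pass_num in xrange(num_passes):    # the training loop stays in Python
    for data in train_reader():
        loss, = exe.run(main_program,
                        feed=feeder.feed(data),
                        fetch_list=[avg_cost])
```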
+The `JobDesc` object describes the distributed job resource specification to run on a cluster environment.
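To make the resource specification concrete, here is a small, self-contained calculation (plain Python; the field values are hypothetical) of what a cluster scheduler would have to reserve for one such job:

```python
# Hypothetical values for the JobDesc fields listed above (memory in GiB).
job = dict(jobname="dist-train-demo",
           num_trainer=4, num_pserver=2,
           cpu_per_trainer=4, gpu_per_trainer=1, mem_per_trainer=8,
           cpu_per_pserver=2, mem_per_pserver=4)

# Total cluster resources this job description asks for.
total_cpu = (job["num_trainer"] * job["cpu_per_trainer"]
             + job["num_pserver"] * job["cpu_per_pserver"])
total_gpu = job["num_trainer"] * job["gpu_per_trainer"]
total_mem = (job["num_trainer"] * job["mem_per_trainer"]
             + job["num_pserver"] * job["mem_per_pserver"])

print(total_cpu, total_gpu, total_mem)  # -> 20 CPUs, 4 GPUs, 40 GiB
```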
 By default, `Executor.run` starts a PaddlePaddle Cloud
-[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/develop/doc/autoscale/README.md#training-job-resource), or you can run each component in the
+[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/develop/doc/autoscale/README.md#training-job-resource),
+or you can run each component in the
 executor by your own method:

-- Data Parrallelism
+- Data Parallelism
  ```python
  if os.getenv('PLACE_PSERVER'):
      exe.run_pserver()
  ```
@@ -164,10 +189,10 @@ executor by your own method:

 As mentioned above, the implementation of IR is [Program](../program.md).

-[Executor](../executor.md) converts and parses the IR to a prefered
+[Executor](../executor.md) converts and parses the IR to a preferred
 graph for final execution. For local training you generally use
 `Executor` to run the graph locally. For any kind of distributed
 training, you can use `RemoteExecutor` to specify desired distributed
 training method with some optional arguments.

 ### PaddlePaddle Converter
@@ -182,7 +207,7 @@ to different PaddlePaddle runtimes. Below are the steps:

 1. Extract a new computation (sub)graph with `feed` and `fetch` OP as
    the boundary. The runtime does not need to run the OP that is not
-   dependent by the `fetch` OP.
+   dependent on the `fetch` OP.

 1. Optimizes the computation graph.
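The sub-graph extraction in the first step above is essentially dependency pruning from the `fetch` boundary. A toy, self-contained sketch of that idea (not the converter's actual code; ops are plain tuples):

```python
# Toy illustration: keep only the ops that the fetch targets transitively
# depend on. Ops are (name, inputs, outputs); not PaddlePaddle code.
OPS = [
    ("feed",    [],                ["image", "label"]),
    ("fc",      ["image", "W"],    ["logits"]),
    ("softmax", ["logits"],        ["prob"]),
    ("loss",    ["prob", "label"], ["cost"]),
    ("debug",   ["prob"],          ["prob_hist"]),  # nothing fetched depends on this
    ("fetch",   ["cost"],          []),
]

def prune(ops, fetch_targets):
    """Walk backwards from the fetch OP and drop ops nobody depends on."""
    needed = set(fetch_targets)
    kept = []
    for name, inputs, outputs in reversed(ops):
        if name == "fetch" or any(v in needed for v in outputs):
            kept.append((name, inputs, outputs))
            needed.update(inputs)
    return list(reversed(kept))

if __name__ == "__main__":
    for op in prune(OPS, fetch_targets=["cost"]):
        print(op[0])   # feed, fc, softmax, loss, fetch -- "debug" is dropped
```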
@@ -238,7 +263,7 @@ the Python reader will need to read from the distributed filesystem
 network traffic.

 When doing distributed training, the user can still use Python data
-reader: the training data are sent with `Executor.run`. However should
+reader: the training data are sent with `Executor.run`. However, should
 be used for debugging purpose only. The users are encouraged to use
 the read data OPs.
**Review discussion:**

- Can we also give an example ProgramDesc generated by the transpiler from the above sample code?

- Please see the dist-graph.png in parameter_server.md.