Cluster train doc for v2 API #2072
Conversation
- I'll take a closer look at the English doc after the Chinese doc revisions are finished.
- The code currently all lives under demo/word2vec, but the demo directory is going to be removed, so @lcy-seso could this cluster code be moved to the book repo?
os.getenv("PADDLE_INIT_NUM_GRADIENT_SERVERS", "1")),
trainer_id=int(os.getenv("PADDLE_INIT_TRAINER_ID", "0")),
pservers=os.getenv("PADDLE_INIT_PSERVERS", "127.0.0.1"))
#word_dict = paddle.dataset.imikolov.build_dict()
The commented-out code on line 63 can be deleted.
Done
The example code has been moved to the src directory.
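For readers following the review, here is a minimal sketch of how the environment variables in the quoted snippet typically feed into `paddle.init` for a cluster trainer. It is not the file in this PR; the `use_gpu` and `trainer_count` values are placeholder assumptions.

```python
import os
import paddle.v2 as paddle

# Sketch only: initialize the trainer process from the environment variables
# that the cluster launcher exports for every trainer.
paddle.init(
    use_gpu=False,
    trainer_count=1,
    num_gradient_servers=int(
        os.getenv("PADDLE_INIT_NUM_GRADIENT_SERVERS", "1")),
    trainer_id=int(os.getenv("PADDLE_INIT_TRAINER_ID", "0")),
    pservers=os.getenv("PADDLE_INIT_PSERVERS", "127.0.0.1"))
```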
@@ -1,135 +1,214 @@
```eval_rst
.. _cluster_train:
* [概述](#概述)
Both the Chinese and English docs are missing an overall title, so the page cannot be referenced and displayed in the TOC, and all of the listed headings would then become one level lower.
Done
# 概述
本文将介绍如何使用PaddlePaddle在不同的集群框架下完成分布式训练。分布式训练架构如下图所示:

![cluster train](src/trainer.png)
- The figure looks too large on GitHub; please shrink it a bit.
- In the Chinese doc, can the text inside the figure be in Chinese? Do "local/global model shard" in the figure mean local/global parameter shards? Neither of these is mentioned in the three bullet points below.
Done
![cluster train](src/trainer.png)

- Data shard(数据分片): 用于训练神经网络的数据,被切分成多个部分,每个部分分别给每个trainer使用
- Line 25 is missing a period.
- Since this is the Chinese doc, writing 数据分片(Data shard), with the Chinese term first, would be better; same below.
Done
![cluster train](src/trainer.png)

- Data shard(数据分片): 用于训练神经网络的数据,被切分成多个部分,每个部分分别给每个trainer使用
- Trainer(计算节点): 每个trainer启动后读取切分好的一部分数据,并开始神经网络的“前馈”和“后馈”计算,并和parameter server通信。在完成一定量数据的训练后,上传计算得出的梯度(gradients)然后下载优化更新后的神经网络参数(parameters)。
- Add a comma before 然后.
- Change "parameter server" to 参数服务器.
Done
`PADDLE_NIC` 集群通信通道的 NIC(Network Interface Card, 网络接口卡) 接口名称,例如以太网的 eth0,infiniband 的 ib0。
```
- `train_data_dir`:此目录可以是从分布式存储挂载过来包含训练数据的目录,也可以是在任务启动前下载到本地的。
Suggested wording: 包含训练数据的目录,可以是从分布式存储挂载过来的,也可以是在任务启动前下载到本地的。 (The directory containing the training data; it can either be mounted from distributed storage or downloaded to the node before the job starts.)
Done
`PADDLE_NIC` 集群通信通道的 NIC(Network Interface Card, 网络接口卡) 接口名称,例如以太网的 eth0,infiniband 的 ib0。
```
- `train_data_dir`:此目录可以是从分布式存储挂载过来包含训练数据的目录,也可以是在任务启动前下载到本地的。
- `test_data_dir`:包含测试数据集,同样可以是挂载或下载生成。
Suggested wording: 包含测试数据集的目录 (the directory containing the test dataset).
Done
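For reference, a minimal sketch of the Fabric `conf.py` settings referenced in this review (`HOSTS`, `ROOT_DIR`, `PADDLE_NIC`, `PADDLE_PORTS_NUM_FOR_SPARSE`, `LD_LIBRARY_PATH`). The concrete values below are placeholders, not values from this PR; the `train_data_dir` and `test_data_dir` discussed above would live inside the workspace these settings point to.

```python
# conf.py -- cluster settings consumed by paddle.py (sketch; values are placeholders)

# hostnames or IPs of all nodes; a user and ssh port may be appended
HOSTS = [
    "[email protected]:9090",
    "[email protected]",
]

# workspace ROOT directory for placing the JOB workspace directory
ROOT_DIR = "/home/paddle/jobs"

# NIC used for cluster communication, e.g. eth0 for Ethernet, ib0 for InfiniBand
PADDLE_NIC = "eth0"

# number of communication ports; sparse remote update uses its own port count
PADDLE_PORTS_NUM = 2
PADDLE_PORTS_NUM_FOR_SPARSE = 2

# environment settings for all processes in the cluster job
LD_LIBRARY_PATH = "/usr/local/cuda/lib64:/usr/lib64"
```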
* [启动trainer](#启动trainer)
* [准备数据集](#准备数据集)
* [准备训练程序](#准备训练程序)
* [使用分布式计算平台或工具](#使用分布式计算平台或工具)
The computing platforms and the tools are not at the same level here, which is confusing:
"Launching a cluster job with Fabric" contains "Starting a test cluster with Kubernetes", and "Submitting a training job on an OpenMPI cluster" contains "Starting a test OpenMPI cluster with Kubernetes".
So are there two ways to launch (Fabric and Kubernetes) and then two ways to submit (OpenMPI and Kubernetes)? Or are there two platforms, both of which can be launched with Fabric?
Please reorganize the TOC and the corresponding content; I'll review the content in detail in the next pass.
Done
`PADDLE_PORTS_NUM_FOR_SPARSE` 用于 sparse remote updater 集群通信信道的端口数。如果使用 sparse remote update,则可以像 `PADDLE_PORTS_NUM` 一样设置。
对于不同的集群平台,会分别介绍集群作业的启动和停止方法。这些例子都可以在[cluster_train_v2](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/scripts/cluster_train_v2)找到。
This link is broken.
The link points to the develop branch, so it will start working once this PR is merged.
## 在Kubernetes集群中提交训练作业

此部分的使用方法可以参考[k8s_cn.md](../k8s/k8s_cn.md)
Suggested wording: 请参考Kubernetes单机训练文档。 (Please refer to the Kubernetes single-node training document.)
Chinese could not be used here before; Sphinx would report an error in CI.
Could @helinwang please help review the English part?
* [检查集群训练结果](#检查集群训练结果)
* [检查模型输出](#检查模型输出)
* [在OpenMPI集群中提交训练作业](#在openmpi集群中提交训练作业)
* [准备openmpi集群](#准备openmpi集群)
openmpi->OpenMPI
这样,通过计算节点和参数服务器的分布式协作,可以完成神经网络的SGD方法的训练。PaddlePaddle可以同时支持同步随机梯度下降(SGD)和异步随机梯度下降。

在使用同步SGD训练神经网络时,PaddlePaddle使用同步屏障(barrier),使梯度的提交和参数的更新按照顺序方式执行。在异步SGD中,则并不会等待所有trainer提交梯度才更新参数,这样极大地提高了计算的并行性:参数服务器之间不相互依赖,并行的接收梯度和更新参数,参数服务器也不会等待计算节点全部都提交梯度之后才开始下一步,计算节点之间也不会相互依赖,并行地执行模型的训练。可以看出,虽然异步SGD方式会提高参数更新并行度, 但是并不能保证参数同步更新,在任意时间某一台参数服务器上保存的参数可能比另一台要更新,与同步SGD相比,梯度会有噪声。
并行的接收梯度 -> 并行地接收梯度 (use 地, not 的).
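As a side note for readers of this thread, the sync/async distinction in the quoted paragraph can be illustrated with a small, purely conceptual sketch (ordinary Python threads standing in for trainers; this is not PaddlePaddle API):

```python
import threading

NUM_TRAINERS = 4
params = [0.0] * 8
lock = threading.Lock()
barrier = threading.Barrier(NUM_TRAINERS)
pending = []  # gradients submitted for the current mini-batch

def apply_gradients(grads, lr=0.01):
    # shared parameter update rule
    for g in grads:
        for i, gi in enumerate(g):
            params[i] -= lr * gi / len(grads)

def sync_sgd_step(my_grad):
    # synchronous SGD: all trainers submit, everyone waits at a barrier,
    # then exactly one of them applies the aggregated update.
    with lock:
        pending.append(my_grad)
    if barrier.wait() == 0:
        apply_gradients(pending)
        pending.clear()
    barrier.wait()  # make sure the update is finished before anyone proceeds

def async_sgd_step(my_grad):
    # asynchronous SGD: apply immediately, without waiting for other trainers;
    # updates interleave, which is more parallel but makes gradients noisier.
    with lock:
        apply_gradients([my_grad])
```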
<img src="src/trainer_cn.png" width="500">

- 数据分片(Data shard): 用于训练神经网络的数据,被切分成多个部分,每个部分分别给每个trainer使用。
- 计算节点(Trainer): 每个trainer启动后读取切分好的一部分数据,并开始神经网络的“前馈”和“后馈”计算,并和参数服务器通信。在完成一定量数据的训练后,上传计算得出的梯度(gradients),然后下载优化更新后的神经网络参数(parameters)。
并开始神经网络 -> 开始神经网络 (drop the first 并).
以下步骤基于 demo 目录中的 [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation)。
参考样例数据准备脚本"prepare.py"(位于`doc/howto/usage/cluster/src/word2vec/prepare.py`),准备训练数据和验证数据集,我们使用paddle.dataset.imikolov数据集,并根据分布式训练并发数(trainer节点个数),指定`SPLIT_COUNT`将数据切分成多份。
- Change the reference to URL format: link prepare.py.
- Does `SPLIT_COUNT` need to be set somewhere, or is it a transparent variable that readers don't need to care about?
It needs to be set in `prepare.py`.
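For readers wondering what `SPLIT_COUNT` does, here is a minimal sketch of the splitting step in the spirit of prepare.py (the file naming and sample formatting are assumptions, not the exact script in this PR):

```python
import paddle.v2 as paddle

SPLIT_COUNT = 3  # set to the number of trainer nodes
N = 5            # imikolov n-gram length

word_dict = paddle.dataset.imikolov.build_dict()

def split(reader, prefix):
    # round-robin the samples into SPLIT_COUNT files, one per trainer
    files = [open("%s-%05d" % (prefix, i), "w") for i in range(SPLIT_COUNT)]
    for i, sample in enumerate(reader()):
        files[i % SPLIT_COUNT].write(" ".join(str(w) for w in sample) + "\n")
    for f in files:
        f.close()

split(paddle.dataset.imikolov.train(word_dict, N), "train")
split(paddle.dataset.imikolov.test(word_dict, N), "test")
```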
`HOSTS` 所有节点运行集群作业的主机名或 IP 。你还可以将用户和 ssh 端口附加到主机名上,例如 [email protected]:9090。
- `mylib.py`:会被`train.py`调用的一些库函数。
- `word_dict.pickle`:在`train.py`中会使用到的字典数据文件。
- `train.py`:训练程序,代码参考"api_train_v2_cluster.py"(位于`doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py`)。
Change the reference to URL format: link api_train_v2_cluster.py.
`PADDLE_PORTS_NUM_FOR_SPARSE` 用于 sparse remote updater 集群通信信道的端口数。如果使用 sparse remote update,则可以像 `PADDLE_PORTS_NUM` 一样设置。
在使用分布式计算平台进行训练时,任务被平台调度在集群中时会使用计算平台提供的API或环境变量获取启动的参数。
The second half of the sentence is not very clear. Does it mean the job will automatically use the platform's API or the launch parameters?
## 在Kubernetes集群中提交训练作业

此部分的使用方法可以参考[Kubernetes分布式训练](../k8s/k8s_distributed_cn.md)
Missing a period.
### 启动集群作业

`paddle.py` 提供了自动化脚本来启动不同节点中的所有 PaddlePaddle 集群进程。默认情况下,所有命令行选项可以设置为```paddle.py``` 命令选项并且 `paddle.py` 将透明、自动地将这些选项应用到 PaddlePaddle 底层进程。

`paddle.py` 为方便作业启动提供了两个独特的命令选项。

`job_dispatch_package` 设为本地 `workspace` 目录,它将被分发到 conf.py 中设置的所有节点。 它有助于帮助频繁修改和访问工作区文件的用户减少负担,否则频繁的多节点工作空间部署可能会很麻烦。
`job_workspace` 设为已部署的工作空间目录,`paddle.py` 将跳过分发阶段直接启动所有节点的集群作业。它可以帮助减少分发延迟。
`job_dispatch_package` 设为本地 `workspace` 目录,它将被分发到 conf.py 中设置的所有节点。它有助于帮助频繁修改和访问工作区文件的用户减少负担,否则频繁的多节点工作空间部署可能会很麻烦。
`job_workspace` 设为已部署的工作空间目录,`paddle.py` 将跳过分发阶段直接启动所有节点的集群作业。它可以帮助减少分发延迟。
`paddle.py` 提供了自动化脚本来启动不同节点中的所有 PaddlePaddle 集群进程。默认情况下,所有命令行选项可以设置为```paddle.py``` 命令选项并且 `paddle.py` 将透明、自动地将这些选项应用到 PaddlePaddle 底层进程。

`paddle.py` 为方便作业启动提供了两个独特的命令选项。

`job_dispatch_package` 设为本地 `workspace` 目录,它将被分发到 conf.py 中设置的所有节点。 它有助于帮助频繁修改和访问工作区文件的用户减少负担,否则频繁的多节点工作空间部署可能会很麻烦。
`job_workspace` 设为已部署的工作空间目录,`paddle.py` 将跳过分发阶段直接启动所有节点的集群作业。它可以帮助减少分发延迟。

`cluster_train/run.sh` 提供了命令样例来运行 `demo/recommendation` 集群工作,只需用你定义的目录修改 `job_dispatch_package` 和 `job_workspace`,然后:
`cluster_train/run.sh` 提供了命令样例来运行 `doc/howto/usage/cluster/src/word2vec` 集群任务,只需用你定义的目录修改 `job_dispatch_package` 和 `job_workspace`,然后:
你 -> 您 (use the polite form).
### 启动集群作业

`paddle.py` 提供了自动化脚本来启动不同节点中的所有 PaddlePaddle 集群进程。默认情况下,所有命令行选项可以设置为```paddle.py``` 命令选项并且 `paddle.py` 将透明、自动地将这些选项应用到 PaddlePaddle 底层进程。
Suggested wording: 所有命令行选项可以设置为 `paddle.py` 中的选项,并且XXX (all command line options can be set as options of `paddle.py`, and ...).
@luotao1 Sure, will review.
# Introduction

In this article, we'll explain how to do distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the mail architecture of a distributed trainning job:
how to do -> how to run
mail -> main
<img src="src/trainer.png" width="500">

- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
multiple parts -> multiple partitions
trainers use some parts of the whole dataset -> trainers use the partitions
This might read better.
<img src="src/trainer.png" width="500">

- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
- Trainer: each trainer reads the data shard, and do Neural Network training algorithms like "forward" and "backward". Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training.
"Neural Network" probably doesn't need to be capitalized.
<img src="src/trainer.png" width="500">

- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
- Trainer: each trainer reads the data shard, and do Neural Network training algorithms like "forward" and "backward". Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training.
Regarding "and do Neural Network training algorithms like forward and backward": forward and backward are not really separate training algorithms; maybe this can be written as "and train the neural network".
- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
- Trainer: each trainer reads the data shard, and do Neural Network training algorithms like "forward" and "backward". Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training.
- Parameter server: every parameter server stores part of the whole Neural Network model data. They will do optimization calculations when gradients are uploaded from trainers, and then send updated parameters to trainers.
"Neural Network" probably doesn't need to be capitalized.
`dataprovider.py`
used to read train/test samples. It's same as local training.
When job started, every trainer needs to get it's own part of data. In some distributed systems they will provide a storage service, so the date under that path can be accessed by all the trainer nodes. Without storage service, you must copy the training data to each trainer node.
they will provide a storage service -> a storage service will be provided.
Without storage service -> Without the storage service
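To make the "every trainer needs to get its own part of data" point concrete, here is a hypothetical v2-style reader sketch (the function and file names are assumptions, not part of this PR): each trainer opens only the split files assigned to it, whether the data directory is mounted from shared storage or copied to the node beforehand.

```python
import glob
import os

def cluster_train_reader(data_dir, trainer_id, trainer_count):
    # Sketch only: take every trainer_count-th split file, offset by this
    # trainer's id, so the trainers cover the dataset without overlap.
    def reader():
        files = sorted(glob.glob(os.path.join(data_dir, "train-*")))
        for path in files[trainer_id::trainer_count]:
            with open(path) as f:
                for line in f:
                    yield tuple(int(tok) for tok in line.split())
    return reader
```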
`ROOT_DIR` workspace ROOT directory for placing JOB workspace directory
- `mylib.py`: some library functions. This is optional.
- `word_dict.pickle`: dict file for training word embeding.
- `train.py`: training program. Sample code: "api_train_v2_cluster.py"(located in `doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py`).
Maybe use a link:
[api_train_v2_cluster.py](./src/word2vec/api_train_v2_cluster.py)
`ROOT_DIR` workspace ROOT directory for placing JOB workspace directory
- `mylib.py`: some library functions. This is optional.
After reading this I still don't know what `mylib.py` actually does. Could you give a more detailed explanation or an example?
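Purely as an illustration of the kind of helper that could live in `mylib.py` (the real file is not shown in this PR), it might hold small utilities shared by `train.py`, for example loading the dictionary shipped in the workspace as `word_dict.pickle`:

```python
# mylib.py -- illustrative sketch, not the actual file
import pickle

def load_word_dict(path="word_dict.pickle"):
    """Return the pre-built {word: id} mapping used by train.py."""
    with open(path, "rb") as f:
        return pickle.load(f)
```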
### Launching Cluster Job
`paddle.py` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can set as `paddle.py` command options and `paddle.py` will transparently and automatically set these options to PaddlePaddle lower level processes.

`paddle.py`provides two distinguished command option for easy job launching.

`job_dispatch_package` set it with local `workspace`directory, it will be dispatched to all nodes set in conf.py. It could be helpful for frequent hacking workspace files, otherwise frequent mulit-nodes workspace deployment could make your crazy.
`job_dispatch_package` set it with local `workspace`directory, it will be dispatched to all nodes set in conf.py. It could be helpful for frequent hacking workspace files, otherwise, frequent multi-nodes workspace deployment could make your crazy.
maybe change "could make your crazy" to "is very troublesome"?
Changed it to "frequent multi-nodes workspace deployment is very annoying".
#environments setting for all processes in cluster job
LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
```
Run `kubectl -f ssh_servers.yaml` under the directory: `paddle/scripts/cluster_train_v2/fabric/docker_cluster` will launch a demo cluster. Run `kubectl get po -o wide` to get IP addresses of these nodes.

### Launching Cluster Job
`paddle.py` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can set as `paddle.py` command options and `paddle.py` will transparently and automatically set these options to PaddlePaddle lower level processes.
can set as -> can be set as
@luotao1 @helinwang All comments are addressed. Thank you very much for such detailed review comments.
LGTM
LGTM.
Fix #1942. Work in progress.
Fix #3003
Fix #3008
Better to view from here and here.
This document uses https://github.com/ekalinin/github-markdown-toc to generate the TOC.