
Commit

update launch
esythan committed May 18, 2022
1 parent 84903cf commit 9ed062c
Showing 2 changed files with 49 additions and 58 deletions.
106 changes: 48 additions & 58 deletions docs/guides/06_distributed_training/cluster_quick_start_ps_cn.rst
@@ -146,65 +146,55 @@

.. code-block:: bash
fleetrun --server_num=1 --worker_num=2 train.py
fleetrun --server_num=1 --trainer_num=2 train.py
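Equivalently, the job can be started through the launch module itself. A minimal sketch, assuming that fleetrun is the console entry point for paddle.distributed.launch:

.. code-block:: bash

    # assumed-equivalent invocation of the same parameter-server job:
    # 1 server process and 2 trainer processes on the local machine
    python -m paddle.distributed.launch --server_num=1 --trainer_num=2 train.py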
You will see the following log messages displayed
You will see the following log messages in the terminal where the command was run

.. code-block:: bash
----------- Configuration Arguments -----------
gpus: 0,1
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: 1
servers:
training_script: train.py
training_script_args: []
worker_num: 2
workers:
------------------------------------------------
INFO 2021-05-06 12:14:26,890 launch.py:298] Run parameter-sever mode. pserver arguments:['--worker_num', '--server_num'], cuda count:8
INFO 2021-05-06 12:14:26,892 launch_utils.py:973] Local server start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINERS_NUM 2 |
| TRAINING_ROLE PSERVER |
| POD_IP 127.0.0.1 |
| PADDLE_GLOO_RENDEZVOUS 3 |
| PADDLE_PSERVERS_IP_PORT_LIST 127.0.0.1:34008 |
| PADDLE_PORT 34008 |
| PADDLE_WITH_GLOO 0 |
| PADDLE_HETER_TRAINER_IP_PORT_LIST |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:18913,127.0.0.1:10025 |
| PADDLE_GLOO_HTTP_ENDPOINT 127.0.0.1:23053 |
| PADDLE_GLOO_FS_PATH /tmp/tmp8vqb8arq |
+=======================================================================================+
INFO 2021-05-06 12:14:26,902 launch_utils.py:1041] Local worker start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_GLOO_HTTP_ENDPOINT 127.0.0.1:23053 |
| PADDLE_GLOO_RENDEZVOUS 3 |
| PADDLE_PSERVERS_IP_PORT_LIST 127.0.0.1:34008 |
| PADDLE_WITH_GLOO 0 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:18913,127.0.0.1:10025 |
| FLAGS_selected_gpus 0 |
| PADDLE_GLOO_FS_PATH /tmp/tmp8vqb8arq |
| PADDLE_TRAINERS_NUM 2 |
| TRAINING_ROLE TRAINER |
| XPU_VISIBLE_DEVICES 0 |
| PADDLE_HETER_TRAINER_IP_PORT_LIST |
| PADDLE_TRAINER_ID 0 |
| CUDA_VISIBLE_DEVICES 0 |
| FLAGS_selected_xpus 0 |
+=======================================================================================+
INFO 2021-05-06 12:14:26,921 launch_utils.py:903] Please check servers, workers and heter_worker logs in log/workerlog.*, log/serverlog.* and log/heterlog.*
INFO 2021-05-06 12:14:33,446 launch_utils.py:914] all workers exit, going to finish parameter server and heter_worker.
INFO 2021-05-06 12:14:33,446 launch_utils.py:926] all parameter server are killed
LAUNCH INFO 2022-05-18 11:27:17,761 ----------- Configuration ----------------------
LAUNCH INFO 2022-05-18 11:27:17,761 devices: None
LAUNCH INFO 2022-05-18 11:27:17,761 elastic_level: -1
LAUNCH INFO 2022-05-18 11:27:17,761 elastic_timeout: 30
LAUNCH INFO 2022-05-18 11:27:17,761 gloo_port: 6767
LAUNCH INFO 2022-05-18 11:27:17,761 host: None
LAUNCH INFO 2022-05-18 11:27:17,761 job_id: default
LAUNCH INFO 2022-05-18 11:27:17,761 legacy: False
LAUNCH INFO 2022-05-18 11:27:17,761 log_dir: log
LAUNCH INFO 2022-05-18 11:27:17,761 log_level: INFO
LAUNCH INFO 2022-05-18 11:27:17,762 master: None
LAUNCH INFO 2022-05-18 11:27:17,762 max_restart: 3
LAUNCH INFO 2022-05-18 11:27:17,762 nnodes: 1
LAUNCH INFO 2022-05-18 11:27:17,762 nproc_per_node: None
LAUNCH INFO 2022-05-18 11:27:17,762 rank: -1
LAUNCH INFO 2022-05-18 11:27:17,762 run_mode: collective
LAUNCH INFO 2022-05-18 11:27:17,762 server_num: 1
LAUNCH INFO 2022-05-18 11:27:17,762 servers:
LAUNCH INFO 2022-05-18 11:27:17,762 trainer_num: 2
LAUNCH INFO 2022-05-18 11:27:17,762 trainers:
LAUNCH INFO 2022-05-18 11:27:17,762 training_script: train.py
LAUNCH INFO 2022-05-18 11:27:17,762 training_script_args: []
LAUNCH INFO 2022-05-18 11:27:17,762 with_gloo: 0
LAUNCH INFO 2022-05-18 11:27:17,762 --------------------------------------------------
LAUNCH INFO 2022-05-18 11:27:17,772 Job: default, mode ps, replicas 1[1:1], elastic False
LAUNCH INFO 2022-05-18 11:27:17,775 Run Pod: evjsyn, replicas 3, status ready
LAUNCH INFO 2022-05-18 11:27:17,795 Watching Pod: evjsyn, replicas 3, status running
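In the configuration dump above, job_id defaults to default, which is why the log files below are named default.*. A hedged example of customizing the job name and log directory, using the options shown in that dump; the resulting file names follow the job_id.pod.role.index.log pattern inferred from the run above:

.. code-block:: bash

    # hypothetical variant of the same job with an explicit job id and log dir;
    # the server log would then be named ps_demo.<pod>.ps.0.log, and so on
    fleetrun --job_id=ps_demo --log_dir=ps_log --server_num=1 --trainer_num=2 train.py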
Meanwhile, log files for the server nodes and trainer nodes are generated under the log directory.
Server node log: default.evjsyn.ps.0.log. The log must contain the following content, which confirms that the server node started successfully and can provide service.

.. code-block:: bash
I0518 11:27:20.730531 177420 brpc_ps_server.cc:73] running server with rank id: 0, endpoint: 10.108.119.16:47837
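A quick way to check for this line, assuming the job id and pod name from the run above:

.. code-block:: bash

    # confirm the parameter server registered its endpoint
    grep "running server" log/default.evjsyn.ps.0.log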
Trainer node log: default.evjsyn.trainer.0.log, which prints the values of some variables during training.

.. code-block:: bash
time: [2022-05-18 11:27:27], batch: [1], loss[1]:[0.666739]
time: [2022-05-18 11:27:27], batch: [2], loss[1]:[0.690405]
time: [2022-05-18 11:27:27], batch: [3], loss[1]:[0.681693]
time: [2022-05-18 11:27:27], batch: [4], loss[1]:[0.703863]
time: [2022-05-18 11:27:27], batch: [5], loss[1]:[0.670717]
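To follow the trainer output while the job is running, assuming the same log layout as above:

.. code-block:: bash

    # stream trainer 0's log; the loss is printed per batch as shown above
    tail -f log/default.evjsyn.trainer.0.log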
Note: for issues related to launching jobs, please refer to `launch <https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/distributed/launch_cn.html>`_
1 change: 1 addition & 0 deletions docs/guides/06_distributed_training/index_cn.rst
@@ -5,6 +5,7 @@
The following content introduces the features of PaddlePaddle distributed training and how to use it:

- `Distributed Training Quick Start <./cluster_quick_start_cn.html>`_ : quickly get started with distributed training using the PaddlePaddle framework.
- `Parameter Server Quick Start <./cluster_quick_start_ps_cn.html>`_ : quickly get started with distributed training using the PaddlePaddle parameter server.
- `Distributed Training with FleetAPI <./fleet_api_howto_cn.html>`_ : run distributed training with the PaddlePaddle FleetAPI.

.. toctree::
