diff --git a/docs/guides/06_distributed_training/cluster_quick_start_ps_cn.rst b/docs/guides/06_distributed_training/cluster_quick_start_ps_cn.rst
index 2f6506ac8ca..1fc821f963a 100644
--- a/docs/guides/06_distributed_training/cluster_quick_start_ps_cn.rst
+++ b/docs/guides/06_distributed_training/cluster_quick_start_ps_cn.rst
@@ -146,65 +146,55 @@
 .. code-block:: bash
 
-    fleetrun --server_num=1 --worker_num=2 train.py
+    fleetrun --server_num=1 --trainer_num=2 train.py
 
-You will see the following log messages:
+You will see the following log messages in the terminal where you launched the job:
 
 .. code-block:: bash
 
-    ----------- Configuration Arguments -----------
-    gpus: 0,1
-    heter_worker_num: None
-    heter_workers:
-    http_port: None
-    ips: 127.0.0.1
-    log_dir: log
-    nproc_per_node: None
-    server_num: 1
-    servers:
-    training_script: train.py
-    training_script_args: []
-    worker_num: 2
-    workers:
-    ------------------------------------------------
-    INFO 2021-05-06 12:14:26,890 launch.py:298] Run parameter-sever mode. pserver arguments:['--worker_num', '--server_num'], cuda count:8
-    INFO 2021-05-06 12:14:26,892 launch_utils.py:973] Local server start 1 processes. First process distributed environment info (Only For Debug):
-    +=======================================================================================+
-    |                        Distributed Envs                      Value                    |
-    +---------------------------------------------------------------------------------------+
-    |                      PADDLE_TRAINERS_NUM                        2                     |
-    |                        TRAINING_ROLE                         PSERVER                  |
-    |                           POD_IP                            127.0.0.1                 |
-    |                    PADDLE_GLOO_RENDEZVOUS                       3                     |
-    |                 PADDLE_PSERVERS_IP_PORT_LIST             127.0.0.1:34008              |
-    |                         PADDLE_PORT                          34008                    |
-    |                       PADDLE_WITH_GLOO                          0                     |
-    |              PADDLE_HETER_TRAINER_IP_PORT_LIST                                        |
-    |                  PADDLE_TRAINER_ENDPOINTS        127.0.0.1:18913,127.0.0.1:10025      |
-    |                  PADDLE_GLOO_HTTP_ENDPOINT               127.0.0.1:23053              |
-    |                    PADDLE_GLOO_FS_PATH                  /tmp/tmp8vqb8arq               |
-    +=======================================================================================+
-
-    INFO 2021-05-06 12:14:26,902 launch_utils.py:1041] Local worker start 2 processes. First process distributed environment info (Only For Debug):
-    +=======================================================================================+
-    |                        Distributed Envs                      Value                    |
-    +---------------------------------------------------------------------------------------+
-    |                  PADDLE_GLOO_HTTP_ENDPOINT               127.0.0.1:23053              |
-    |                    PADDLE_GLOO_RENDEZVOUS                       3                     |
-    |                 PADDLE_PSERVERS_IP_PORT_LIST             127.0.0.1:34008              |
-    |                       PADDLE_WITH_GLOO                          0                     |
-    |                  PADDLE_TRAINER_ENDPOINTS        127.0.0.1:18913,127.0.0.1:10025      |
-    |                     FLAGS_selected_gpus                         0                     |
-    |                    PADDLE_GLOO_FS_PATH                  /tmp/tmp8vqb8arq               |
-    |                      PADDLE_TRAINERS_NUM                        2                     |
-    |                        TRAINING_ROLE                         TRAINER                  |
-    |                     XPU_VISIBLE_DEVICES                         0                     |
-    |              PADDLE_HETER_TRAINER_IP_PORT_LIST                                        |
-    |                      PADDLE_TRAINER_ID                          0                     |
-    |                     CUDA_VISIBLE_DEVICES                        0                     |
-    |                     FLAGS_selected_xpus                         0                     |
-    +=======================================================================================+
-
-    INFO 2021-05-06 12:14:26,921 launch_utils.py:903] Please check servers, workers and heter_worker logs in log/workerlog.*, log/serverlog.* and log/heterlog.*
-    INFO 2021-05-06 12:14:33,446 launch_utils.py:914] all workers exit, going to finish parameter server and heter_worker.
-    INFO 2021-05-06 12:14:33,446 launch_utils.py:926] all parameter server are killed
\ No newline at end of file
+    LAUNCH INFO 2022-05-18 11:27:17,761 -----------  Configuration  ----------------------
+    LAUNCH INFO 2022-05-18 11:27:17,761 devices: None
+    LAUNCH INFO 2022-05-18 11:27:17,761 elastic_level: -1
+    LAUNCH INFO 2022-05-18 11:27:17,761 elastic_timeout: 30
+    LAUNCH INFO 2022-05-18 11:27:17,761 gloo_port: 6767
+    LAUNCH INFO 2022-05-18 11:27:17,761 host: None
+    LAUNCH INFO 2022-05-18 11:27:17,761 job_id: default
+    LAUNCH INFO 2022-05-18 11:27:17,761 legacy: False
+    LAUNCH INFO 2022-05-18 11:27:17,761 log_dir: log
+    LAUNCH INFO 2022-05-18 11:27:17,761 log_level: INFO
+    LAUNCH INFO 2022-05-18 11:27:17,762 master: None
+    LAUNCH INFO 2022-05-18 11:27:17,762 max_restart: 3
+    LAUNCH INFO 2022-05-18 11:27:17,762 nnodes: 1
+    LAUNCH INFO 2022-05-18 11:27:17,762 nproc_per_node: None
+    LAUNCH INFO 2022-05-18 11:27:17,762 rank: -1
+    LAUNCH INFO 2022-05-18 11:27:17,762 run_mode: collective
+    LAUNCH INFO 2022-05-18 11:27:17,762 server_num: 1
+    LAUNCH INFO 2022-05-18 11:27:17,762 servers:
+    LAUNCH INFO 2022-05-18 11:27:17,762 trainer_num: 2
+    LAUNCH INFO 2022-05-18 11:27:17,762 trainers:
+    LAUNCH INFO 2022-05-18 11:27:17,762 training_script: train.py
+    LAUNCH INFO 2022-05-18 11:27:17,762 training_script_args: []
+    LAUNCH INFO 2022-05-18 11:27:17,762 with_gloo: 0
+    LAUNCH INFO 2022-05-18 11:27:17,762 --------------------------------------------------
+    LAUNCH INFO 2022-05-18 11:27:17,772 Job: default, mode ps, replicas 1[1:1], elastic False
+    LAUNCH INFO 2022-05-18 11:27:17,775 Run Pod: evjsyn, replicas 3, status ready
+    LAUNCH INFO 2022-05-18 11:27:17,795 Watching Pod: evjsyn, replicas 3, status running
+
+Meanwhile, log files for the server node and the trainer nodes are generated under the log directory.
+Server node log: default.evjsyn.ps.0.log. It must contain the following line, which confirms that the server node started successfully and is ready to serve.
+
+.. code-block:: bash
+
+    I0518 11:27:20.730531 177420 brpc_ps_server.cc:73] running server with rank id: 0, endpoint: 10.108.119.16:47837
+
+Trainer node log: default.evjsyn.trainer.0.log. It prints the values of some variables during training.
+
+.. code-block:: bash
+
+    time: [2022-05-18 11:27:27], batch: [1], loss[1]:[0.666739]
+    time: [2022-05-18 11:27:27], batch: [2], loss[1]:[0.690405]
+    time: [2022-05-18 11:27:27], batch: [3], loss[1]:[0.681693]
+    time: [2022-05-18 11:27:27], batch: [4], loss[1]:[0.703863]
+    time: [2022-05-18 11:27:27], batch: [5], loss[1]:[0.670717]
+
+Note: for questions about launching jobs, please refer to \ `launch `_\ 
\ No newline at end of file
diff --git a/docs/guides/06_distributed_training/index_cn.rst b/docs/guides/06_distributed_training/index_cn.rst
index e869479edad..3fc2e0664f9 100644
--- a/docs/guides/06_distributed_training/index_cn.rst
+++ b/docs/guides/06_distributed_training/index_cn.rst
@@ -5,6 +5,7 @@
 You can learn about the features of PaddlePaddle distributed training and how to use them through the following sections:
 
 - `Quick start for distributed training <./cluster_quick_start_cn.html>`_ : Get started quickly with distributed training using the PaddlePaddle framework.
+- `Quick start for the parameter server <./cluster_quick_start_ps_cn.html>`_ : Get started quickly with distributed training using the PaddlePaddle parameter server.
 - `Distributed training with FleetAPI <./fleet_api_howto_cn.html>`_ : Complete distributed training using the PaddlePaddle FleetAPI.
 
 .. toctree::
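
Note on the examples above: the quick start invokes a user-supplied ``train.py``, whose contents are not part of this diff. The sketch below shows one minimal way such a parameter-server ``train.py`` could be structured with the Fleet API; the toy linear model, the random batch data, and the five-step loop are illustrative assumptions, not code from this PR.

.. code-block:: python

    # Minimal parameter-server training sketch (assumptions: toy linear model,
    # random data; this is NOT the script that produced the logs above).
    import numpy as np
    import paddle
    import paddle.distributed.fleet as fleet

    paddle.enable_static()  # parameter-server mode runs on the static graph
    fleet.init()            # role (server/trainer) is read from env vars set by fleetrun

    x = paddle.static.data(name="x", shape=[None, 10], dtype="float32")
    y = paddle.static.data(name="y", shape=[None, 1], dtype="float32")
    pred = paddle.static.nn.fc(x, size=1)
    loss = paddle.mean(paddle.nn.functional.square_error_cost(pred, y))

    strategy = fleet.DistributedStrategy()
    strategy.a_sync = True  # asynchronous parameter-server training
    optimizer = fleet.distributed_optimizer(paddle.optimizer.SGD(0.01), strategy)
    optimizer.minimize(loss)

    if fleet.is_server():
        # server process: hold the parameters and serve trainer requests
        fleet.init_server()
        fleet.run_server()
    elif fleet.is_worker():
        # trainer process: run the training loop against the servers
        exe = paddle.static.Executor(paddle.CPUPlace())
        exe.run(paddle.static.default_startup_program())
        fleet.init_worker()
        for batch in range(1, 6):
            loss_val, = exe.run(
                paddle.static.default_main_program(),
                feed={
                    "x": np.random.rand(8, 10).astype("float32"),
                    "y": np.random.rand(8, 1).astype("float32"),
                },
                fetch_list=[loss],
            )
            print(f"batch: [{batch}], loss: [{float(loss_val):.6f}]")
        fleet.stop_worker()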
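
The command in the diff starts one server and two trainers on a single machine. A multi-machine launch would instead pass explicit endpoint lists via the ``--servers`` and ``--trainers`` options that appear in the configuration dump above; the addresses below are placeholders, and the exact value format should be verified against the launch documentation referenced at the end of the page.

.. code-block:: bash

    # assumed endpoint-list form; IPs and ports are placeholders
    fleetrun --servers="192.168.0.1:6170" \
             --trainers="192.168.0.2:6171,192.168.0.3:6172" train.py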