
Support CPU Parallel in DataParallel Interface by GLOO to speed up training #35745

Merged: 23 commits merged into PaddlePaddle:develop on Oct 21, 2021

Conversation

@2742195759 (Contributor) commented Sep 15, 2021

PR types

New features

PR changes

APIs

Describe

Background

This PR lets users pass a custom backend argument when calling spawn or launch. The backend argument specifies whether the user wants CPU, GPU, XPU, or PS parallelism. The available backend choices are "gloo", "nccl", "bkcl", and "auto".

Example scenarios

  1. Users can now run CPU parallelism under a CUDA build: spawn(main, backend='gloo') just works, whereas previous versions raised an error.
  2. Added a fallback for CPU-only builds of Paddle: when a user requests parallel training on a CPU build, backend='gloo' is inferred automatically and CPU parallelism is used by default, without changing the original call (see the sketch below).
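
A minimal sketch of the 'auto' fallback described in item 2. The helper name infer_backend is hypothetical, not Paddle's internal API; only paddle.is_compiled_with_cuda() is a real Paddle function:

import paddle

def infer_backend(backend='auto'):
    # Hypothetical helper: explicit backends pass through unchanged.
    if backend != 'auto':
        return backend
    # On a CUDA build keep the previous NCCL default; on a CPU-only
    # build fall back to Gloo instead of raising an error.
    return 'nccl' if paddle.is_compiled_with_cuda() else 'gloo'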

Impact

  1. This change is a compatible upgrade: on a GPU build of Paddle, the original API calls behave as before.
  2. The parallel strategy on CPU builds changed: previously an error was raised; now Gloo-based parallelism is started.
  3. The PS vs. Collective semantics of launch() changed. Previously the mode was inferred from the arguments; now it is selected as follows (see the sketch after this list):
    gloo / nccl / bkcl -> Collective
    auto -> inferred with the previous logic.
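
A hypothetical sketch of this selection logic; infer_mode_from_args stands in for the previous inference and is not a real function:

def choose_mode(backend):
    # Explicit backends always select Collective mode.
    if backend in ('gloo', 'nccl', 'bkcl'):
        return 'collective'
    if backend == 'auto':
        return infer_mode_from_args()  # previous PS/Collective inference
    raise ValueError("unknown backend: {}".format(backend))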

Error handling

1. Error when using Gloo on Mac and Windows

Currently only Linux builds of Paddle support GLOO, so when a user migrating code tries to use gloo as the backend on macOS or Windows, a ValueError is raised.

2. Error when the specified backend does not match the Paddle build

For example, using backend='nccl' on a CPU build of Paddle raises an error.

3. Argument validation in gloo mode

For example, when backend=gloo, any extra arguments that CPU parallelism does not support are caught and reported as errors.

4. NPU mode does not support parallel training yet

NPU does not currently support parallel training, but the parallel interface contains some NPU-specific logic, which is kept in place. In that case, when backend is auto it is inferred as unknown, to make the subsequent error reporting straightforward. A combined sketch of these checks follows.
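
A minimal sketch of the platform and build checks in cases 1 and 2 above, assuming illustrative names (only paddle.is_compiled_with_cuda() is a real Paddle function):

import sys
import paddle

def validate_backend(backend):
    # Case 1: Gloo is only supported on Linux builds of Paddle.
    if backend == 'gloo' and sys.platform in ('darwin', 'win32'):
        raise ValueError("gloo backend is not supported on macOS/Windows")
    # Case 2: the backend must match the installed Paddle build.
    if backend == 'nccl' and not paddle.is_compiled_with_cuda():
        raise ValueError("nccl backend requires a GPU build of Paddle")
    return backend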

Usage example

# Code sample
import paddle
import paddle.distributed as dist

def train():
    # 1. initialize parallel environment
    dist.init_parallel_env()

    # 2. create data parallel layer & optimizer
    layer = LinearNet()
    dp_layer = paddle.DataParallel(layer)

    # ..... (omitted)
    adam.clear_grad()

if __name__ == '__main__':
    # 1. start by ``paddle.distributed.launch``
    train()

    # 2. start by ``paddle.distributed.spawn`` (default)
    dist.spawn(train, nproc=4, backend='gloo')

Launching the code above with launch:

# Force CPU parallelism: both paddle-cpu and paddle-gpu builds will run parallel training on CPU.
python -m paddle.distributed.launch --nproc_per_node=4 --backend='gloo'

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Aurelius84 previously approved these changes Sep 15, 2021

@Aurelius84 (Contributor) left a comment:

LGTM

XieYunshen previously approved these changes Sep 16, 2021

@XieYunshen (Contributor) left a comment:

LGTM for
bash_test_modules(test_cpuonly_launch START_BASH test_cpuonly_launch.sh SERIAL LABELS "RUN_TYPE=EXCLUSIVE" ENVS "PADDLE_DIST_UT_PORT=${dist_ut_port}" PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR}

Aurelius84 previously approved these changes Sep 16, 2021

@Aurelius84 (Contributor) left a comment:

LGTM

TCChenlong previously approved these changes Sep 16, 2021
lanxianghit previously approved these changes Sep 16, 2021
@XieYunshen (Contributor) left a comment:

LGTM for
set_tests_properties(test_parallel_dygraph_unused_variables_gloo PROPERTIES TIMEOUT 120)
set_tests_properties(test_parallel_dygraph_sparse_embedding_gloo PROPERTIES TIMEOUT 120)
set_tests_properties(test_parallel_dygraph_sparse_embedding_over_height_gloo PROPERTIES TIMEOUT 120)

@ForFishes (Member) left a comment:

LGTM

@sandyhouse left a comment:

LGTM

@Aurelius84 requested a review from Xreki October 21, 2021 02:51
@Xreki (Contributor) left a comment:

LGTM for const_cast

@XiaoguangHu01 (Contributor) left a comment:

LGTM

The following review exchange concerns this excerpt from gloo_wrapper:

opts.setOutput(output_ptr, element_num * size_);
gloo::allgather(opts);
#else
LOG(WARNING) << "AllGather does nothing when WITH_GLOO=OFF";
A Contributor asked:

Should this be a thrown exception instead?

A Contributor replied:

This keeps the handling convention of the other gloo_wrapper interfaces; see line 221.

@gongweibao merged commit b6e7f8e into PaddlePaddle:develop Oct 21, 2021
2742195759 added a commit to 2742195759/Paddle that referenced this pull request Oct 21, 2021
@Aurelius84 changed the title from "User specified backend" to "Support CPU Parallel in DataParallel Interface by GLOO to speed up training" Oct 26, 2021
XiaoguangHu01 pushed a commit that referenced this pull request Oct 26, 2021
…to speed up training (#35745) (#36605)

* User specified backend (#35745)

* remove tensordot
@wduo commented Nov 3, 2021

@2742195759 Hi! After building the latest develop branch from source and training with python -m paddle.distributed.launch --nproc_per_node=4 --backend=gloo train.py, the line dist.init_parallel_env() fails with:
AttributeError: module 'paddle.fluid.core_avx' has no attribute 'GLOOParallelContext'
What causes this? Do I need to roll the develop branch back to the version from your commit?

@wduo commented Nov 3, 2021

I built the CPU version in a docker environment:

cmake .. -DPY_VERSION=3.7 -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON

@2742195759 (Contributor, Author) replied:

You need to add -DWITH_GLOO=ON to enable CPU parallelism.
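
One quick way to confirm whether a local build was compiled with Gloo support is to check for the attribute named in the error (an illustrative check, not an official API):

import paddle.fluid.core as core
# True on a build configured with -DWITH_GLOO=ON
print(hasattr(core, 'GLOOParallelContext'))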

@wduo commented Nov 4, 2021

Hi, after adding -DWITH_GLOO=ON that error is gone, but a new one appears:

AttributeError: module 'paddle.fluid.core_avx.ops' has no attribute 'c_broadcast'

What causes this? Could some third-party libraries have been downloaded incompletely?

@wduo commented Nov 4, 2021

@2742195759 The complete error message is:

Traceback (most recent call last):
  File "tools/train.py", line 125, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 78, in main
    model = paddle.DataParallel(model)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/parallel.py", line 582, in __init__
    sync_params_buffers(self._layers)
  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 276, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 230, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/parallel.py", line 377, in sync_params_buffers
    coalesced_var, src=src_rank, group=comm_group, use_calc_stream=True)
  File "/usr/local/lib/python3.7/dist-packages/paddle/distributed/collective.py", line 394, in broadcast
    return _C_ops.c_broadcast(tensor, tensor, 'root', gsrc,
AttributeError: module 'paddle.fluid.core_avx.ops' has no attribute 'c_broadcast'

@2742195759 (Contributor, Author) commented Nov 4, 2021

It looks like some cmake option is still missing, so the c_broadcast op was not compiled in. I'll take a look tonight; meanwhile, you can try whether -DWITH_DISTRIBUTE=ON helps.
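
Likewise, whether the collective ops from the traceback made it into the build can be checked directly (illustrative, not an official API):

import paddle.fluid.core as core
# True when the distributed collective ops were compiled in
print(hasattr(core.ops, 'c_broadcast'))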

@wduo commented Nov 4, 2021

I just ran cmake .. -DPY_VERSION=3.7 -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON -DWITH_GLOO=ON -DWITH_DISTRIBUTE=ON and saw this MESSAGE near the end of the output:

...
ccache version                      3.7
cache directory                     /root/.ccache
primary config                      /root/.ccache/ccache.conf
secondary config      (readonly)    /usr/local/ccache-3.7.9/etc/ccache.conf
stats updated                       Thu Nov  4 02:24:53 2021
cache hit (direct)                  1061
cache hit (preprocessed)             937
cache miss                           962
cache hit rate                     67.50 %
called for link                     4990
cleanups performed                     0
files in cache                      2801
cache size                         560.3 MB
max cache size                       5.0 GB

-- Paddle version is 0.0.0
-- Enable Intel OpenMP with /paddle/build/third_party/install/mklml/lib/libiomp5.so
CMake Warning at CMakeLists.txt:403 (message):
  On inference mode, will take place some specific optimization.  Turn on the
  ON_INFER flag when building inference_lib only.


-- generating brpc sendrecv.proto
-- commit: e512aa9a4b
-- branch: develop
WITH_DLNNE:
-- MESSAGE: This is just a message for publishing release.
      You are building AVX version without NOAVX core.
      So the wheel package may fail on NOAVX machine.
      You can add -DNOAVX_CORE_FILE=/path/to/your/core_noavx.* in cmake command
      to get a full wheel package to resolve this warning.
      While, this version will still work on local machine.
-- Configuring done
-- Generating done
-- Build files have been written to: /paddle/build

Could this be the cause?

@wduo commented Nov 4, 2021

I'm now also trying -DWITH_AVX=OFF; or do I need to use -DNOAVX_CORE_FILE=/path/to/your/core_noavx.*? Where can I find the core_noavx.* file?

@wduo commented Nov 4, 2021

@2742195759 Hi! The issues above are all resolved; they were caused by network problems while downloading third-party libraries, which left the corresponding packages incomplete. Thanks for your replies!

@2742195759 (Contributor, Author) replied:

OK~
