
Support CPU Parallel in DataParallel Interface by GLOO to speed up training #35745

Merged: 23 commits merged into PaddlePaddle:develop on Oct 21, 2021

Conversation

@2742195759 (Contributor) commented Sep 15, 2021

PR types

New features

PR changes

APIs

Describe

Background

This PR lets users pass a custom backend argument when calling spawn or launch. The backend argument specifies whether the user wants CPU, GPU, XPU, or PS parallelism. The available backend choices are "gloo", "nccl", "bkcl", and "auto".

Example scenarios

  1. Users can now run CPU parallelism under a CUDA build: spawn(main, backend='gloo') just works, whereas previous versions raised an error.
  2. Added a fallback for CPU-only builds of Paddle: when a user requests parallel training on a CPU build, backend='gloo' is inferred automatically and CPU parallelism is used by default, without changing the original call (see the sketch below).
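
A minimal sketch of the 'auto' fallback described in item 2. The helper name infer_backend is hypothetical, not Paddle's internal API; only paddle.is_compiled_with_cuda() is a real Paddle function:

import paddle

def infer_backend(backend='auto'):
    # Hypothetical helper: explicit backends pass through unchanged.
    if backend != 'auto':
        return backend
    # On a CUDA build keep the previous NCCL default; on a CPU-only
    # build fall back to Gloo instead of raising an error.
    return 'nccl' if paddle.is_compiled_with_cuda() else 'gloo'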

Impact

  1. This change is a compatible upgrade: on a GPU build of Paddle, the original API calls behave as before.
  2. The parallel strategy on CPU builds changed: previously an error was raised; now Gloo-based parallelism is started.
  3. The PS vs. Collective semantics of launch() changed. Previously the mode was inferred from the arguments; now it is selected as follows (see the sketch after this list):
    gloo / nccl / bkcl -> Collective
    auto -> inferred with the previous logic.
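
A hypothetical sketch of this selection logic; infer_mode_from_args stands in for the previous inference and is not a real function:

def choose_mode(backend):
    # Explicit backends always select Collective mode.
    if backend in ('gloo', 'nccl', 'bkcl'):
        return 'collective'
    if backend == 'auto':
        return infer_mode_from_args()  # previous PS/Collective inference
    raise ValueError("unknown backend: {}".format(backend))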

Error handling

1. Error when using Gloo on Mac and Windows

Currently only Linux builds of Paddle support GLOO, so when a user migrating code tries to use gloo as the backend on macOS or Windows, a ValueError is raised.

2. Error when the specified backend does not match the Paddle build

For example, using backend='nccl' on a CPU build of Paddle raises an error.

3. Argument validation in gloo mode

For example, when backend=gloo, any extra arguments that CPU parallelism does not support are caught and reported as errors.

4. NPU mode does not support parallel training yet

NPU does not currently support parallel training, but the parallel interface contains some NPU-specific logic, which is kept in place. In that case, when backend is auto it is inferred as unknown, to make the subsequent error reporting straightforward. A combined sketch of these checks follows.
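
A minimal sketch of the platform and build checks in cases 1 and 2 above, assuming illustrative names (only paddle.is_compiled_with_cuda() is a real Paddle function):

import sys
import paddle

def validate_backend(backend):
    # Case 1: Gloo is only supported on Linux builds of Paddle.
    if backend == 'gloo' and sys.platform in ('darwin', 'win32'):
        raise ValueError("gloo backend is not supported on macOS/Windows")
    # Case 2: the backend must match the installed Paddle build.
    if backend == 'nccl' and not paddle.is_compiled_with_cuda():
        raise ValueError("nccl backend requires a GPU build of Paddle")
    return backend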

Usage example

# Code sample
import paddle
import paddle.distributed as dist

def train():
    # 1. initialize parallel environment
    dist.init_parallel_env()

    # 2. create data parallel layer & optimizer
    layer = LinearNet()
    dp_layer = paddle.DataParallel(layer)

    # ..... (omitted)
    adam.clear_grad()

if __name__ == '__main__':
    # 1. start by ``paddle.distributed.launch``
    train()

    # 2. start by ``paddle.distributed.spawn`` (default)
    dist.spawn(train, nproc=4, backend='gloo')

Launching the code above with launch:

# Force CPU parallelism: both paddle-cpu and paddle-gpu builds will run parallel training on CPU.
python -m paddle.distributed.launch --nproc_per_node=4 --backend='gloo'

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Aurelius84 previously approved these changes Sep 15, 2021

@Aurelius84 (Contributor) left a comment:

LGTM

XieYunshen previously approved these changes Sep 16, 2021

@XieYunshen (Contributor) left a comment:

LGTM for
bash_test_modules(test_cpuonly_launch START_BASH test_cpuonly_launch.sh SERIAL LABELS "RUN_TYPE=EXCLUSIVE" ENVS "PADDLE_DIST_UT_PORT=${dist_ut_port}" PADDLE_BINARY_DIR=${PADDLE_BINARY_DIR}

Aurelius84 previously approved these changes Sep 16, 2021

@Aurelius84 (Contributor) left a comment:

LGTM

TCChenlong previously approved these changes Sep 16, 2021
lanxianghit previously approved these changes Sep 16, 2021
@XieYunshen (Contributor) left a comment:

LGTM for
set_tests_properties(test_parallel_dygraph_unused_variables_gloo PROPERTIES TIMEOUT 120)
set_tests_properties(test_parallel_dygraph_sparse_embedding_gloo PROPERTIES TIMEOUT 120)
set_tests_properties(test_parallel_dygraph_sparse_embedding_over_height_gloo PROPERTIES TIMEOUT 120)

@ForFishes (Member) left a comment:

LGTM

@sandyhouse left a comment:

LGTM

@Aurelius84 requested a review from Xreki October 21, 2021 02:51
@Xreki (Contributor) left a comment:

LGTM for const_cast

@XiaoguangHu01 (Contributor) left a comment:

LGTM

The following review exchange concerns this excerpt from gloo_wrapper:

opts.setOutput(output_ptr, element_num * size_);
gloo::allgather(opts);
#else
LOG(WARNING) << "AllGather does nothing when WITH_GLOO=OFF";
A Contributor asked:

Should this be a thrown exception instead?

A Contributor replied:

This keeps the handling convention of the other gloo_wrapper interfaces; see line 221.

@gongweibao merged commit b6e7f8e into PaddlePaddle:develop Oct 21, 2021
2742195759 added a commit to 2742195759/Paddle that referenced this pull request Oct 21, 2021
@Aurelius84 changed the title from "User specified backend" to "Support CPU Parallel in DataParallel Interface by GLOO to speed up training" Oct 26, 2021
XiaoguangHu01 pushed a commit that referenced this pull request Oct 26, 2021
…to speed up training (#35745) (#36605)

* User specified backend (#35745)

* remove tensordot
@wduo commented Nov 3, 2021

@2742195759 Hi! After building the latest develop branch from source and training with python -m paddle.distributed.launch --nproc_per_node=4 --backend=gloo train.py, the line dist.init_parallel_env() fails with:
AttributeError: module 'paddle.fluid.core_avx' has no attribute 'GLOOParallelContext'
What causes this? Do I need to roll the develop branch back to the version from your commit?

@wduo commented Nov 3, 2021

I built the CPU version in a docker environment:

cmake .. -DPY_VERSION=3.7 -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON

@2742195759 (Contributor, Author) replied:

You need to add -DWITH_GLOO=ON to enable CPU parallelism.
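
One quick way to confirm whether a local build was compiled with Gloo support is to check for the attribute named in the error (an illustrative check, not an official API):

import paddle.fluid.core as core
# True on a build configured with -DWITH_GLOO=ON
print(hasattr(core, 'GLOOParallelContext'))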

@wduo commented Nov 4, 2021

Hi, after adding -DWITH_GLOO=ON that error is gone, but a new one appears:

AttributeError: module 'paddle.fluid.core_avx.ops' has no attribute 'c_broadcast'

What causes this? Could some third-party libraries have been downloaded incompletely?

@wduo commented Nov 4, 2021

@2742195759 The complete error message is:

Traceback (most recent call last):
  File "tools/train.py", line 125, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 78, in main
    model = paddle.DataParallel(model)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/parallel.py", line 582, in __init__
    sync_params_buffers(self._layers)
  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/base.py", line 276, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 230, in __impl__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/parallel.py", line 377, in sync_params_buffers
    coalesced_var, src=src_rank, group=comm_group, use_calc_stream=True)
  File "/usr/local/lib/python3.7/dist-packages/paddle/distributed/collective.py", line 394, in broadcast
    return _C_ops.c_broadcast(tensor, tensor, 'root', gsrc,
AttributeError: module 'paddle.fluid.core_avx.ops' has no attribute 'c_broadcast'

@2742195759 (Contributor, Author) commented Nov 4, 2021

It looks like some cmake option is still missing, so the c_broadcast op was not compiled in. I'll take a look tonight; meanwhile, you can try whether -DWITH_DISTRIBUTE=ON helps.
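
Likewise, whether the collective ops from the traceback made it into the build can be checked directly (illustrative, not an official API):

import paddle.fluid.core as core
# True when the distributed collective ops were compiled in
print(hasattr(core.ops, 'c_broadcast'))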

@wduo commented Nov 4, 2021

I just ran cmake .. -DPY_VERSION=3.7 -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON -DWITH_GLOO=ON -DWITH_DISTRIBUTE=ON and saw this MESSAGE near the end of the output:

...
ccache version                      3.7
cache directory                     /root/.ccache
primary config                      /root/.ccache/ccache.conf
secondary config      (readonly)    /usr/local/ccache-3.7.9/etc/ccache.conf
stats updated                       Thu Nov  4 02:24:53 2021
cache hit (direct)                  1061
cache hit (preprocessed)             937
cache miss                           962
cache hit rate                     67.50 %
called for link                     4990
cleanups performed                     0
files in cache                      2801
cache size                         560.3 MB
max cache size                       5.0 GB

-- Paddle version is 0.0.0
-- Enable Intel OpenMP with /paddle/build/third_party/install/mklml/lib/libiomp5.so
CMake Warning at CMakeLists.txt:403 (message):
  On inference mode, will take place some specific optimization.  Turn on the
  ON_INFER flag when building inference_lib only.


-- generating brpc sendrecv.proto
-- commit: e512aa9a4b
-- branch: develop
WITH_DLNNE:
-- MESSAGE: This is just a message for publishing release.
      You are building AVX version without NOAVX core.
      So the wheel package may fail on NOAVX machine.
      You can add -DNOAVX_CORE_FILE=/path/to/your/core_noavx.* in cmake command
      to get a full wheel package to resolve this warning.
      While, this version will still work on local machine.
-- Configuring done
-- Generating done
-- Build files have been written to: /paddle/build

Could this be the cause?

@wduo commented Nov 4, 2021

I'm now also trying -DWITH_AVX=OFF; or do I need to use -DNOAVX_CORE_FILE=/path/to/your/core_noavx.*? Where can I find the core_noavx.* file?

@wduo commented Nov 4, 2021

@2742195759 Hi! The issues above are all resolved; they were caused by network problems while downloading third-party libraries, which left the corresponding packages incomplete. Thanks for your replies!

@2742195759 (Contributor, Author) replied:

OK~
