Make lars cpp code flexible #36450

Conversation

@JamesLim-sy (Contributor) commented Oct 14, 2021

PR types

Function optimization

PR changes

OPs

Describe

Features:

  • Following the debugging of the L2-norm computation bug in PR36428, the code path for lower CUDA versions is fixed in the same way;
  • Following the ParamMerge strategy adopted in PR36380, LarsParam is adjusted accordingly;
  • Since master_param and master_param_out, velocity_param and velocity_param_out, param and param_out are three pairs of mutually inplace tensors, only master_param_out, velocity_param_out, and param_out are used as the computation tensors;
  • Inside the C++ code, the merged LarsMomentum ops are automatically packed and grouped for computation (see the sketch after this list);
  • Since #36409 already distinguishes the weight_decay == 0 case, all LarsMomentum ops on the Merged_larsMomentum computation path are adjusted to share the same weight_decay;
  • Add a unit test for Merged_larsMomentum.
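
For illustration, a minimal sketch of the packing idea above (the field names loosely follow the diff and the kOpNum capacity is an assumption, not the PR's actual code): one wrapper carries per-op device pointers for up to kOpNum merged LarsMomentum ops, so a single kernel launch can update many parameters at once.

template <typename T, typename MT, int kOpNum>
struct MergedLarsParamSketch {
  int numel_arr[kOpNum];                  // element count of each parameter
  const T* __restrict__ g_arr[kOpNum];    // gradient of each op
  T* __restrict__ p_arr[kOpNum];          // param_out (inplace with param)
  MT* __restrict__ v_arr[kOpNum];         // velocity_out (inplace with velocity)
  const MT* __restrict__ lr_arr[kOpNum];  // learning rate of each op
  MT weight_decay;                        // shared by all merged ops (see #36409)
};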

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@JamesLim-sy changed the title from "first commit" to "Make lars cpp code more flexible" (Oct 14, 2021)
T* __restrict__ p_arr[kOpNum];
MT* __restrict__ v_arr[kOpNum];
MT weight_decay_arr[kOpNum];
};
Contributor Author:

Here the data type T is used directly to decide which LarsParamWarpper type is generated, which effectively assumes that fp16 must always be used together with master_param. This change does not cover pure fp16 computation that does not rely on master_param.

MT grad_norm = Sqrt(rescale_grad_pow *
math::blockReduceSum<MT>(grad_part_norm, FINAL_MASK));
MT param_norm = Sqrt(s_buffer[0]);
MT grad_norm = Sqrt(rescale_pow * s_buffer[1]);
Contributor Author:

For lower CUDA versions, fix how the result of the L2-norm computation is retrieved.
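
For illustration, a minimal sketch (not the PR's actual kernel; kernel name, block size, and buffer layout are assumptions) of the two-stage pattern involved: each block accumulates a partial sum of squares into a global per-block buffer, and a second pass over that buffer produces the final square root. On CUDA versions without grid-wide synchronization, that second pass is where the result has to be retrieved, which is the step being corrected here.

template <typename MT>
__global__ void PartialSquareSum(const MT* __restrict__ x,
                                 MT* __restrict__ block_buffer, int64_t numel) {
  // Assumes the kernel is launched with blockDim.x == 256 (a power of two).
  __shared__ MT smem[256];
  MT local = static_cast<MT>(0);
  // Grid-stride accumulation of x[i] * x[i] into a per-thread partial sum.
  for (int64_t i = threadIdx.x + static_cast<int64_t>(blockDim.x) * blockIdx.x;
       i < numel; i += static_cast<int64_t>(blockDim.x) * gridDim.x) {
    local += x[i] * x[i];
  }
  smem[threadIdx.x] = local;
  __syncthreads();
  // Tree reduction within the block.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
    __syncthreads();
  }
  // One partial sum per block; a second kernel (or the next stage of the lars
  // kernel) reduces block_buffer and takes the square root to get the L2 norm.
  if (threadIdx.x == 0) block_buffer[blockIdx.x] = smem[0];
}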

@JamesLim-sy force-pushed the Make_lars_cpp_code_more_flexible branch from d32884c to 3da1b1f (October 14, 2021 15:06)
@JamesLim-sy force-pushed the Make_lars_cpp_code_more_flexible branch from 9ba1d75 to dc103de (October 16, 2021 08:43)
lars_warpper.g_arr[i] = grad[start_idx + i]->data<T>();
lars_warpper.p_arr[i] = param_out[start_idx + i]->data<T>();
lars_warpper.v_arr[i] = velocity_out[start_idx + i]->data<MT>();
lars_warpper.lr_arr[i] = learning_rate[i]->data<MT>();
Contributor Author:

If the learning_rate can be shared, this part can be optimized away, reducing the number of global memory accesses in the subsequent CUDA kernel.

@JamesLim-sy force-pushed the Make_lars_cpp_code_more_flexible branch from 7bb3c4b to 7be6434 (October 17, 2021 14:30)
@JamesLim-sy changed the title from "Make lars cpp code more flexible" to "Make lars cpp code flexible" (Oct 17, 2021)
auto weight_decay_arr = ctx.Attr<std::vector<float>>("lars_weight_decay");
MT lars_weight_decay = weight_decay_arr[0];
Contributor Author (@JamesLim-sy, Oct 17, 2021):

Since the optimizer.py file already explicitly separates the weight_decay == 0 special case out of lars_momentum, the Merged LarsMomentum optimizer computation branch is adjusted so that all ops share the same weight_decay value.

# create the momentum optimize op
momentum_op = block.append_op(
    type=self.type if _lars_weight_decay != 0.0 else 'momentum',
    inputs=inputs,
    outputs=outputs,
    attrs=attrs,
    stop_gradient=True)
return momentum_op

This handling avoids the problem that, during merged_lars training, every op in the merged group fetches its own data from global memory.

Contributor:

Apart from the ResNet50 scenario, can't there be cases where the weight_decay values are non-zero and differ from each other?

Contributor Author (@JamesLim-sy, Oct 18, 2021):

Whether an op goes through the LARS computation depends on whether it is in the 'self._exclude_from_weight_decay' list; the resnet50 model passes in exclude_from_weight_decay=['bn', 'batch_norm', '.b_0'].

@@ -1961,6 +1961,7 @@ def __init__(self,
exclude_from_weight_decay=None,
epsilon=0,
multi_precision=False,
merge_option=False,
Contributor Author:

This was left in by mistake; it will be deleted in the next commit.

MT* __restrict__ g_n = nullptr) {
__shared__ MT s_buffer[2];
MT* __restrict__ g_buffer, const int64_t numel, const MT rescale_grad,
MT* __restrict__ p_n = nullptr, MT* __restrict__ g_n = nullptr) {
int tid = threadIdx.x + blockDim.x * blockIdx.x;
int grid_stride = LARS_BLOCK_SIZE * gridDim.x;
Contributor:

Using blockDim.x here feels a bit safer than using LARS_BLOCK_SIZE.

Contributor Author:

Will change as suggested.
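
A minimal sketch of the suggested change, under the assumption that nothing else in the loop depends on the compile-time constant: deriving the stride from the launch configuration keeps the loop correct even if the kernel is ever launched with a block size other than LARS_BLOCK_SIZE.

template <typename T>
__global__ void GridStrideLoopSketch(const T* __restrict__ in,
                                     T* __restrict__ out, int64_t numel) {
  int64_t tid = threadIdx.x + static_cast<int64_t>(blockDim.x) * blockIdx.x;
  // Stride comes from blockDim.x, not from the LARS_BLOCK_SIZE macro.
  int64_t grid_stride = static_cast<int64_t>(blockDim.x) * gridDim.x;
  for (int64_t i = tid; i < numel; i += grid_stride) {
    out[i] = in[i];
  }
}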

const MT rescale_grad, const int thresh = 0, MT* __restrict__ p_n = nullptr,
MT* __restrict__ g_n = nullptr) {
__shared__ MT s_buffer[2];
MT* __restrict__ g_buffer, const int64_t numel, const MT rescale_grad,
Contributor:

Could p_buffer and g_buffer be given more intuitive names, e.g. p_norm_for_blocks?

Contributor Author:

Planning to rename them to buffer_for_grad_norm and buffer_for_param_norm.

template <typename MT, int kOpNum, typename T>
struct MergedLarsMasterParam {
DEVICE inline MT* GetMasterParam(size_t) const { return nullptr; }
constexpr void SetMasterParam(size_t, MT*) {}
Contributor:

Doesn't this function need the DEVICE qualifier?

Contributor Author:

SetMasterParam is executed on the host side, so the DEVICE qualifier was not added.
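
A minimal sketch, under assumptions, of the split described here: GetMasterParam is read on the device inside the kernel and so keeps the DEVICE (__device__) qualifier, while SetMasterParam is only called on the host while packing the merged op, so it needs none. The boolean specialization key and the struct name are illustrative, not the PR's actual code.

template <typename MT, int kOpNum, bool NeedMasterParam>
struct MasterParamSketch {
  MT* __restrict__ master_p_arr[kOpNum];
  __device__ __forceinline__ MT* GetMasterParam(size_t idx) const {
    return master_p_arr[idx];  // read inside the CUDA kernel
  }
  void SetMasterParam(size_t idx, MT* ptr) {  // host-side packing only
    master_p_arr[idx] = ptr;
  }
};

template <typename MT, int kOpNum>
struct MasterParamSketch<MT, kOpNum, false> {
  // Path without master parameters: nothing is stored and nothing is updated.
  __device__ __forceinline__ MT* GetMasterParam(size_t) const { return nullptr; }
  void SetMasterParam(size_t, MT*) {}
};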

MT* __restrict__ master_p_out_arr[LARS_MAX_MERGED_OPS];
MT weight_decay_arr[LARS_MAX_MERGED_OPS];
template <typename MT, int kOpNum, typename T>
struct MergedLarsMasterParam {
Contributor:

Could this structure be made more generic? For example, name the class MasterParamHelper.

Contributor Author:

Will change as suggested.

constexpr void SetMasterParam(size_t, MT*) {}
};

template <typename MT, int kOpNum>
Contributor:

The template parameters probably shouldn't be named kXxx, right?

Contributor Author:

OK, then I'll just change it to OpNum.


"Input(MasterParam) and Output(MasterParamOut) "
"must be the same Tensors."));
lars_warpper.weight_decay = lars_weight_decay;
int merge_times = (op_num + lars_warpper.kNum - 1) / lars_warpper.kNum;
Contributor:

If a model has 160 parameters, will it still have only one optimizer op, with that op launching 2 CUDA kernels and each kernel updating 80 parameters?

Also, the variable name merge_times...

Contributor Author:

  • We still need to distinguish AMP LarsMomentum from non-AMP LarsMomentum at the Python level, and then pass the AMP LARS ops and the non-AMP LARS ops in for computation separately. If too many ops are passed in at once, they are computed in groups of at most 80 (see the sketch after this list).
  • The variable name will be changed to loop.
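
A small self-contained sketch of the grouping described above, assuming a per-launch capacity of 80 tensors (the actual capacity comes from the wrapper's kNum): a merged optimizer op holding 160 parameters still maps to one op, which issues two kernel launches of 80 parameters each.

#include <algorithm>
#include <cstdio>

int main() {
  constexpr int kNum = 80;                      // assumed per-launch capacity
  const int op_num = 160;                       // e.g. a model with 160 parameters
  const int loop = (op_num + kNum - 1) / kNum;  // ceiling division -> 2 launches
  for (int i = 0; i < loop; ++i) {
    const int start_idx = i * kNum;
    const int cur_num = std::min(kNum, op_num - start_idx);
    // In the real code path, the wrapper is refilled with the pointers of these
    // cur_num tensors and one CUDA kernel is launched to update the whole group.
    std::printf("launch %d updates params [%d, %d)\n", i, start_idx,
                start_idx + cur_num);
  }
  return 0;
}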

reinterpret_cast<void*>(&rescale_grad),
reinterpret_cast<void*>(&multi_precision)};
// Launch across all SMs; the threads of each block cooperate synchronously.
cudaLaunchCooperativeKernel(
Contributor:

This API call could indeed be wrapped up later; it could be implemented in gpu_launch_config.h, though that file would be better named gpu_launch_helper.h.

Contributor Author:

OK. Writing it out this way really does take up too much space.
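
A minimal sketch of the wrapper idea discussed here, not Paddle's actual API: a small helper hides the void** argument packing that makes the cudaLaunchCooperativeKernel call site so verbose. The helper name and its placement in a gpu_launch_config.h / gpu_launch_helper.h style header are assumptions.

#include <cuda_runtime.h>

template <typename KernelT>
inline cudaError_t LaunchCooperativeKernel(KernelT kernel, dim3 grid, dim3 block,
                                           void** kernel_args, size_t shared_mem,
                                           cudaStream_t stream) {
  // cudaLaunchCooperativeKernel takes the kernel as an untyped pointer and all
  // arguments packed into an array of pointers; grid-wide sync inside the
  // kernel is only valid when the device supports cooperative launch.
  return cudaLaunchCooperativeKernel(reinterpret_cast<void*>(kernel), grid,
                                     block, kernel_args, shared_mem, stream);
}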

@@ -0,0 +1,210 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Contributor:

How about renaming the file to test_merged_optimizer.py? If other optimizers are merged later, their unit tests could also be built on top of this one.

Contributor Author:

Sure. This test was written by adapting Jinle's unit-test framework anyway.

@paddle-bot-old

Sorry to inform you that c5d06e0's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@paddle-bot

paddle-bot bot commented Nov 1, 2022

Sorry to inform you that, after our repeated discussion, this PR has not met the merging standard (reference: the Paddle native operator development specification / Paddle Custom Operator Design Doc). You may submit a new PR; we are closing this one for now. Thank you for your contribution.
