improve gradient_clip doc #4959
Conversation
Thanks for contributing to the PaddlePaddle docs! The documentation preview is building; once the Docs-New job finishes you can preview it at: http://preview-pr-4959.paddle-docs-preview.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html
✅ This PR's description meets the template requirements!
for t in range(100):
    idx = np.random.choice(total_data, batch_size, replace=False)
    x = paddle.to_tensor(x_data[idx, :])
    y = paddle.to_tensor(y_data[idx, :])
Call y "label" and y_pred "pred"?
done
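For reference, a minimal sketch of how the loop would read after the suggested rename (hypothetical reconstruction; model stands for whatever network the example builds):

for t in range(100):
    idx = np.random.choice(total_data, batch_size, replace=False)
    x = paddle.to_tensor(x_data[idx, :])
    label = paddle.to_tensor(y_data[idx, :])  # was: y
    pred = model(x)                           # was: y_pred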
train()
Part of the log without gradient clipping is shown below. You can see that both the loss and the gradients keep growing, reaching positive infinity at step 4 and turning into NaN.
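Not part of the quoted diff, but one way to see which parameters' gradients blow up (the question raised below) is to log per-parameter gradient norms after each backward pass; model and loss here are placeholders for the example's network and loss value:

loss.backward()
for name, param in model.named_parameters():
    if param.grad is not None:  # paddle.nn.Layer exposes parameter gradients as Tensors
        print(name, float(paddle.linalg.norm(param.grad)))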
This could be more detailed: under this network design, which parameters end up with very large gradients, and why do they keep growing?
done
loss = \frac{1}{n} \sum_{i=1}^n(y_i-y_i')^2
- After obtaining the loss, run one backward pass to adjust the weights and biases. To update the network parameters, first compute the gradient of the loss function with respect to the parameters, then use a gradient-update algorithm to take one gradient-descent step and reduce the loss, as in the formula below, where alpha is the learning rate.
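The update rule the sentence refers to, written out (reconstructed from the surrounding description; :math:`w` stands for any trainable parameter):

.. math::
    w \leftarrow w - \alpha \frac{\partial loss}{\partial w}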
The formatting here is wrong.
done
Part 1. Clipping by value range
--------------------
.. math::
    O^k = f(W O^{k-1} + b)
O should be explained.
done
After the network's predictions are computed, a loss defined by the gap between the target values and the predictions is calculated, for example with mean squared error.

.. math::
    loss = \frac{1}{n} \sum_{i=1}^n(y_i-y_i')^2
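As a quick numeric check of the formula (illustrative only; y holds the targets :math:`y_i` and y_pred the predictions :math:`y_i'`):

import numpy as np

y = np.array([1.0, 2.0, 3.0])       # targets y_i
y_pred = np.array([1.5, 1.5, 2.0])  # predictions y_i'
mse = np.mean((y - y_pred) ** 2)    # (0.25 + 0.25 + 1.0) / 3 = 0.5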
y_i and y_i' should also be explained.
done
.. math::
    loss = \frac{1}{n} \sum_{i=1}^n(y_i-y_i')^2

- After obtaining the loss, run one backward pass to adjust the weights and biases. To update the network parameters, first compute the gradient of the loss function with respect to the parameters, then use a gradient-update algorithm to take one gradient-descent step.
It would be best to explain the concept of a gradient, and which part of the formula it corresponds to.
Added the corresponding part. As for the gradient itself, it is the most fundamental concept in deep learning, so I feel users should already have it down.
- Model parameters and the loss become NaN

If "gradient explosion" occurs, the network will jump straight past the optimum during learning, so gradient clipping is needed to keep it from overshooting. Paddle provides three gradient clipping methods: clipping by value range, clipping by L2 norm, and clipping by global L2 norm. Clipping by value range is simple, but it is hard to pick a suitable threshold. Clipping by L2 norm and clipping by global L2 norm both use a threshold to bound the L2 norm of the gradient vector; the former clips only specific gradients, while the latter clips all gradients in the optimizer.
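The three methods map onto the classes below (a minimal sketch; the thresholds are arbitrary example values):

import paddle

# clip each gradient element into [-1.0, 1.0]
clip_value = paddle.nn.ClipGradByValue(min=-1.0, max=1.0)
# rescale any single gradient tensor whose L2 norm exceeds 1.0
clip_by_norm = paddle.nn.ClipGradByNorm(clip_norm=1.0)
# rescale all gradients together if their global L2 norm exceeds 1.0
clip_by_global_norm = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)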
The preview shows multiple extra spaces here.
done
Clipping by value range: constrain each parameter gradient to a range, and clip any values that fall outside it.

Usage: create an instance of the :ref:`paddle.nn.ClipGradByValue <cn_api_fluid_clip_ClipGradByValue>` class and pass it to the optimizer; the optimizer will clip the gradients before updating the parameters.
**1. Clip all parameters (default)**
- **Clip all parameters (default)**

By default, the gradients of all parameters in the optimizer are clipped:
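A minimal sketch of that default usage, following the pattern in the Paddle API docs (layer sizes and learning rate are arbitrary):

import paddle

linear = paddle.nn.Linear(10, 10)
clip = paddle.nn.ClipGradByValue(min=-1.0, max=1.0)
# the optimizer applies the clip to every parameter's gradient before the update
sgd = paddle.optimizer.SGD(learning_rate=0.1,
                           parameters=linear.parameters(),
                           grad_clip=clip)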
Could you state the clipping range?
done
Part 2. Clipping by L2 norm
--------------------
2. Clipping by L2 norm
###################

Clipping by L2 norm: treat the gradient as a multidimensional Tensor and compute its L2 norm; if it exceeds the maximum value, scale the gradient down proportionally, otherwise leave it unclipped.
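Written out, the rule being described is (as in the ClipGradByNorm docs, where :math:`X` is the gradient tensor and :math:`clip\_norm` the threshold):

.. math::
    Out =
    \begin{cases}
    X, & norm(X) \leq clip\_norm \\
    \frac{clip\_norm \cdot X}{norm(X)}, & norm(X) > clip\_norm
    \end{cases}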
Explain X & clip_norm?
done
@@ -114,7 +146,7 @@ Paddle provides three gradient clipping methods:

where :math:`norm(X)` denotes the L2 norm of :math:`X`
Explain X, global_norm, and clip_norm?
done
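For reference, the global-norm variant the question refers to computes (per the ClipGradByGlobalNorm docs, with :math:`X_i` the gradient tensors in the list):

.. math::
    global\_norm = \sqrt{\sum_{i=1}^{n} norm(X_i)^2}

.. math::
    Out_i = \frac{clip\_norm \cdot X_i}{\max(global\_norm, clip\_norm)}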
Clipping only part of the parameters is configured the same way as above, via the parameter's :ref:`paddle.ParamAttr <cn_api_fluid_ParamAttr>`; its ``need_clip`` defaults to True, meaning the gradient is clipped, and setting it to False disables clipping. See the example code above.

As the introduction above shows, clipping by value range can change the direction of the gradient vector. For example, with a threshold of 1.0 and an original gradient vector of [0.8, 89.0], the clipped gradient becomes [0.8, 1.0], a drastic change of direction. For the two L2-norm-based methods with a threshold of 1.0, the clipped gradient vector
With a threshold of 1.0, the clipped gradient vector is [] ?
Is it empty?
Data added.
As the introduction above shows, clipping by value range can change the direction of the gradient vector. For example, with a threshold of 1.0 and an original gradient vector of [0.8, 89.0], the clipped gradient becomes [0.8, 1.0], a drastic change of direction. For the two L2-norm-based methods with a threshold of 1.0, the clipped gradient vector
becomes approximately [0.009, 1.0]. The direction of the original gradient vector is preserved, but because component 2 is so large, component 1 ends up close to 0. In practice, if gradient explosion occurs during training, try the different clipping methods and compare their effect on the validation set.
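A quick numeric check of that comparison (illustrative numpy sketch):

import numpy as np

g = np.array([0.8, 89.0])
threshold = 1.0

# clipping by value range: direction changes drastically
by_value = np.clip(g, -threshold, threshold)                 # [0.8, 1.0]

# clipping by L2 norm: direction preserved, magnitude bounded
by_norm = g * threshold / max(np.linalg.norm(g), threshold)  # ~[0.009, 1.0]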
Part 3. Examples
This section has formatting problems; please fix them.
done
LGTM
Part 2. How to use gradient clipping in Paddle
---------------------------

1. Clipping by value range
Change this to "2.1 Clipping by value range".
done
@@ -38,8 +69,8 @@ Paddle provides three gradient clipping methods:

linear = paddle.nn.Linear(10, 10, bias_attr=paddle.ParamAttr(need_clip=False))

Part 2. Clipping by L2 norm
--------------------
2. Clipping by L2 norm
Change this to "2.2 Clipping by L2 norm".
done
Part 3. Clipping by global L2 norm
--------------------
3. Clipping by global L2 norm
Change this to "2.3 Clipping by global L2 norm".
done