Cost going to NaN with Paddle v0.10.0 for MT example #2563
Comments
Thanks for reporting the problem to us. I will tune the parameters carefully and fix the NMT demo ASAP. Actually, I have hit the same problem myself.
Adding the error clipping at https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py#L51 avoided the explosion:
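For anyone hitting the same thing, here is a minimal sketch of what per-layer error clipping looks like in the v2 API. The layer, sizes, and threshold below are illustrative, not the exact change in `train.py`:

```python
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# A toy input layer just so the snippet stands alone.
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(32))

# error_clipping_threshold clips the error (activation gradient)
# propagated back through this layer, which is what stops the blow-up.
h = paddle.layer.fc(
    input=x,
    size=64,
    act=paddle.activation.Tanh(),
    layer_attr=paddle.attr.ExtraAttr(error_clipping_threshold=10.0))
```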
Do you know what the previous setting for the error/gradient clipping was in v0.8 and v0.9? Any idea why the gradient/error didn't explode in those versions?
Previously, we usually set a global clipping threshold directly and it worked fine, but currently such a globally set parameter does not take effect. I think this bug is terrible: it also means that globally set regularizers and other parameters are all invalid. Sorry, this must be fixed; I am working on it.
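For reference, here is a sketch of how that global threshold was typically set in the pre-v2 trainer config (names per `trainer_config_helpers`; the values are illustrative):

```python
# Old-style (v0.8/v0.9) trainer config: one global clipping threshold
# applied to every parameter.
from paddle.trainer_config_helpers import *

settings(
    batch_size=64,
    learning_rate=1e-3,
    learning_method=RMSPropOptimizer(),
    gradient_clipping_threshold=10.0)
```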
Ah, now I understand. Thank you @lcy-seso!

A related issue, though not about the NaN: I just realized that the current

You are right.
@lcy-seso Thanks in advance for fixing it!!

It is a terrible bug that must be fixed. Sorry about it.

No worries, it's open source, so it should always "kaizen" (改善, continuous improvement) =)
Hi @alvations, about the bug that some globally set parameters do not take effect: there is a way to work around it until we ship a proper fix. Set them directly on the optimizer:
```python
# Workaround: pass the clipping threshold and regularization directly
# to the optimizer instead of relying on the (broken) global settings.
optimizer = paddle.optimizer.RMSProp(
    learning_rate=1e-3,
    gradient_clipping_threshold=10.0,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4))

cost = seq2seq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
```
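For completeness, a sketch of wiring this optimizer into a v2 trainer; `cost` and `parameters` are the ones created above, and the reader and event handler are left out:

```python
# The per-optimizer settings above take effect through update_equation.
trainer = paddle.trainer.SGD(
    cost=cost,
    parameters=parameters,
    update_equation=optimizer)
```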
Installing from source off the `develop` branch, the `paddle` command seems to be working fine. I then cloned the book repo and ran `train.py` from the machine translation example, but the CPU training stopped with a `Floating point exception`. Changing to use GPU, the cost goes to NaN:
Similarly with 1 GPU trainer:
I've tried changing the values for:
But somehow the cost still goes to NaN, and I can't seem to get through one epoch without it happening.
Possibly this is a related issue: #1738
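To fail fast while debugging this, something like the following event handler (a sketch against the v2 trainer API) aborts as soon as the cost turns NaN instead of burning a whole epoch:

```python
import math
import paddle.v2 as paddle

def event_handler(event):
    # Abort the run the moment the cost blows up.
    if isinstance(event, paddle.event.EndIteration):
        if math.isnan(event.cost):
            raise RuntimeError("cost went NaN at pass %d, batch %d" %
                               (event.pass_id, event.batch_id))
```

Pass it to the trainer via `trainer.train(..., event_handler=event_handler)`.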