Cost going to NaN with Paddle v0.10.0 for MT example #2563
Comments
Thanks for reporting the problem to us. I will tune the parameters carefully and fix the NMT demo ASAP. Actually, I have hit the same problem myself.
Adding the error clipping at https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py#L51 avoided the explosion:
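For anyone hitting the same thing, here is a minimal sketch of what per-layer error clipping looks like in the v2 API. The layer, sizes, and threshold below are illustrative, not the exact change in `train.py`:

```python
import paddle.v2 as paddle

paddle.init(use_gpu=False, trainer_count=1)

# A toy input layer just so the snippet stands alone.
x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(32))

# error_clipping_threshold clips the error (activation gradient)
# propagated back through this layer, which is what stops the blow-up.
h = paddle.layer.fc(
    input=x,
    size=64,
    act=paddle.activation.Tanh(),
    layer_attr=paddle.attr.ExtraAttr(error_clipping_threshold=10.0))
```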
Do you know what the previous setting for the error/gradient clipping was in v0.8 and v0.9? Any idea why the gradient/error didn't explode in those versions?
Previously, we usually set a global clipping threshold directly and it worked fine, but currently such a globally set parameter does not take effect. I think this bug is terrible: it also means that globally set regularizers and other parameters are all invalid. Sorry, this must be fixed; I am working on it.
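For reference, here is a sketch of how that global threshold was typically set in the pre-v2 trainer config (names per `trainer_config_helpers`; the values are illustrative):

```python
# Old-style (v0.8/v0.9) trainer config: one global clipping threshold
# applied to every parameter.
from paddle.trainer_config_helpers import *

settings(
    batch_size=64,
    learning_rate=1e-3,
    learning_method=RMSPropOptimizer(),
    gradient_clipping_threshold=10.0)
```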
Ah, now I understand. Thank you @lcy-seso!

A related issue, though not about the NaN: I just realized that the current

You are right.
@lcy-seso Thanks in advance for fixing it!!

It is a terrible bug that must be fixed. Sorry about it.

No worries, it's open source, so it should always "kaizen" (改善, continuous improvement) =)
Hi @alvations, about the bug that some globally set parameters do not take effect: there is a way to work around it until we ship a proper fix. Set them directly on the optimizer:
```python
# Workaround: pass the clipping threshold and regularization directly
# to the optimizer instead of relying on the (broken) global settings.
optimizer = paddle.optimizer.RMSProp(
    learning_rate=1e-3,
    gradient_clipping_threshold=10.0,
    regularization=paddle.optimizer.L2Regularization(rate=8e-4))

cost = seq2seq_net(source_dict_dim, target_dict_dim)
parameters = paddle.parameters.create(cost)
```
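For completeness, a sketch of wiring this optimizer into a v2 trainer; `cost` and `parameters` are the ones created above, and the reader and event handler are left out:

```python
# The per-optimizer settings above take effect through update_equation.
trainer = paddle.trainer.SGD(
    cost=cost,
    parameters=parameters,
    update_equation=optimizer)
```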
Installing from source off the `develop` branch, the `paddle` command seems to be working fine. I then cloned the book repo and ran `train.py` from the machine translation example, but the CPU training stopped with a `Floating point exception`. Changing to use GPU, the cost goes to NaN:
Similarly with 1 GPU trainer:
I've tried changing the values for:
But somehow the cost still goes to NaN, and I can't seem to get through one epoch without it happening.
Possibly this is a related issue: #1738
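To fail fast while debugging this, something like the following event handler (a sketch against the v2 trainer API) aborts as soon as the cost turns NaN instead of burning a whole epoch:

```python
import math
import paddle.v2 as paddle

def event_handler(event):
    # Abort the run the moment the cost blows up.
    if isinstance(event, paddle.event.EndIteration):
        if math.isnan(event.cost):
            raise RuntimeError("cost went NaN at pass %d, batch %d" %
                               (event.pass_id, event.batch_id))
```

Pass it to the trainer via `trainer.train(..., event_handler=event_handler)`.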