Cannot use --init_model_path #1738

ghost · 2017-04-01T07:46:33Z

I am trying to use the NMT model you trained (fr->en).
When I use --init_model_path=<your_pretrained_model>, I get the following error:

I0401 16:41:14.938364 3162 TrainerInternal.cpp:165] Batch=2 samples=100 AvgCost=nan CurrentCost=nan Eval: classification_error_evaluator=0.696074 CurrentEval: classification_error_evaluator=1

The command line is
paddle train
--config='translation/train.conf'
--save_dir='translation/model/wmt14_model'
--use_gpu=1
--num_passes=16
--show_parameter_stats_period=1
--trainer_count=1
--log_period=1
--dot_period=5
--init_model_path=model/wmt14_model/pass-00012
2>&1 | tee 'translation/train.log'

Can anyone help me?

Thanks

helinwang · 2017-04-03T21:40:18Z

@KeepLearning12138 Please refer to this guide on how to run PaddlePaddle: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst

ghost · 2017-04-04T03:36:39Z

@livc @helinwang Thanks a lot for your reply. Following your comments, I installed the Docker image on an isolated server, ran the demo code shown in ch. 7 of the paddle-book (i.e., train.py), and got the following results:

Pass 0, Batch 0, Cost 164.959180, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 241.103516, {'classification_error_evaluator': 0.9316239356994629}
.........
Pass 0, Batch 20, Cost 341.873877, {'classification_error_evaluator': 0.9216867685317993}
.........
Pass 0, Batch 30, Cost 189.436011, {'classification_error_evaluator': 0.9130434989929199}
.........
Pass 0, Batch 40, Cost 209.991748, {'classification_error_evaluator': 0.9509803652763367}
.........
Pass 0, Batch 50, Cost 267.384741, {'classification_error_evaluator': 0.8999999761581421}
.........
Pass 0, Batch 60, Cost 213.720215, {'classification_error_evaluator': 0.942307710647583}
.........
Pass 0, Batch 70, Cost 195.130457, {'classification_error_evaluator': 0.9473684430122375}
.........
Pass 0, Batch 80, Cost 158.140320, {'classification_error_evaluator': 0.9610389471054077}
.........
Pass 0, Batch 90, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 100, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 110, Cost nan, {'classification_error_evaluator': 1.0}

Can we directly use "train.py"?

Best Wishes

helinwang · 2017-04-04T03:40:58Z

@KeepLearning12138 Sure, please see the link that I posted in the previous comment: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst. The specific line is:

docker run --rm -v ~/workspace:/workspace paddlepaddle/paddle:0.10.0rc2 python /workspace/train.py

The above command mounts ~/workspace into /workspace inside the Docker container, and runs python /workspace/train.py when the container starts.
You can copy the train.py from https://github.com/PaddlePaddle/book into ~/workspace.

ghost · 2017-04-04T03:56:38Z

@helinwang Thanks for the quick reply. Actually, I followed your suggestion. What I mean is that although I can run "train.py", the training easily gets trapped in "NaN". When I set the learning rate to "0.0", the NaN disappears. I just want to confirm where the error comes from.

helinwang · 2017-04-04T04:15:25Z

Good to know that you are experimenting around! However, with the learning rate set to 0.0 the network will stop learning anything :)
I think this PR is trying to fix the problem you are encountering: PaddlePaddle/book#234. It adds regularization to the network to prevent the weights from becoming too big and eventually causing a floating point exception (FPE, which is why you got "NaN", not a number).
Could you check whether your train.py contains the changes in the PR? If so, you can even make the regularization rate higher.
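To make the two points above concrete, here is a minimal sketch of one plain SGD update with an L2 penalty. It is not Paddle's optimizer (the book example uses Adam), and `sgd_step` is a hypothetical helper written only for illustration. It shows that a learning rate of 0.0 leaves the weights untouched, while a non-zero rate with an L2 term pulls every weight toward zero even when the data gradient is zero.

```python
def sgd_step(weights, grads, learning_rate, l2_rate=0.0):
    """One plain SGD update with an L2 penalty (hypothetical helper, not Paddle's API).

    The L2 term adds l2_rate * w to each gradient, so large weights are
    pushed back toward zero on every step.
    """
    return [w - learning_rate * (g + l2_rate * w) for w, g in zip(weights, grads)]


weights = [4.0, -3.0]
grads = [0.0, 0.0]  # zero data gradient isolates the effect of the penalty

# learning_rate=0.0: nothing changes, so nothing is learned.
frozen = sgd_step(weights, grads, learning_rate=0.0, l2_rate=8e-4)

# learning_rate>0 with an L2 term: every weight shrinks slightly in magnitude.
decayed = sgd_step(weights, grads, learning_rate=0.5, l2_rate=8e-4)
```

A larger `l2_rate` shrinks the weights faster, which is the sense in which a higher regularization rate can help keep them from growing without bound.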

ghost · 2017-04-04T09:09:12Z

Thanks for the reply.
I do not think L2 regularization is the key problem for neural machine translation, because in my previous experience a 1-layer GRU works fine without weight decay. What I need to solve is this: since you have already released a pretrained model (fr->en), I want to use your model as a warm start. After passing 100 samples, AvgCost=nan. Even with lr=0, the same bug appears. I am really curious about this problem. Can you solve it? Or, where can I find a script that can run the demo-paddle4NMT (including warm start)? Many thanks in advance.

luotao1 · 2017-04-05T08:08:17Z

@KeepLearning12138 after #1750, you can use

parameters = paddle.dataset.wmt14.model()

instead of

parameters = paddle.parameters.create(cost)

to do a warm start.

ghost · 2017-04-06T02:11:09Z

It still does not work. Many NaNs are detected. Only when I set batch_size to 1 do the NaNs seem to disappear. Testing is OK, but training does not work.

The command line is:
sudo nvidia-docker run --rm -v ~/jack/Sandbox:/Sandbox paddlepaddle/paddle:0.10.0rc2-gpu python /Sandbox/train.py

And the log is:
I0406 02:33:20.648465 1 Util.cpp:161] commandline: --use_gpu=True --trainer_count=1
[INFO 2017-04-06 02:33:25,736 networks.py:1472] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2017-04-06 02:33:25,737 networks.py:1478] The output order is [classification_cost_0]
[INFO 2017-04-06 02:33:25,755 networks.py:1472] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2017-04-06 02:33:25,755 networks.py:1478] The output order is [classification_cost_0]
I0406 02:33:25.675101 1 GradientMachine.cpp:86] Initing parameters..
I0406 02:33:28.887544 1 GradientMachine.cpp:93] Init parameters done.
Cache file /root/.cache/paddle/dataset/wmt14/wmt14.tgz not found, downloading http://paddlepaddle.cdn.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz

[==================================================]
Pass 0, Batch 0, Cost 232.995337, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 342.125854, {'classification_error_evaluator': 0.9698795080184937}
.........
Pass 0, Batch 20, Cost 191.533044, {'classification_error_evaluator': 0.9462365508079529}
.........
Pass 0, Batch 30, Cost 226.437500, {'classification_error_evaluator': 0.9545454382896423}
.........
Pass 0, Batch 40, Cost 220.021777, {'classification_error_evaluator': 0.9345794320106506}
.........
Pass 0, Batch 50, Cost 293.818042, {'classification_error_evaluator': 0.9510489702224731}
.........
Pass 0, Batch 60, Cost 225.875562, {'classification_error_evaluator': 0.9272727370262146}
.........
Pass 0, Batch 70, Cost 201.168530, {'classification_error_evaluator': 0.9489796161651611}
.........
Pass 0, Batch 80, Cost 309.581934, {'classification_error_evaluator': 0.9602649211883545}
.........
Pass 0, Batch 90, Cost 252.174194, {'classification_error_evaluator': 0.9593495726585388}
.........
Pass 0, Batch 100, Cost 190.457666, {'classification_error_evaluator': 0.9462365508079529}
.........
Pass 0, Batch 110, Cost 284.358179, {'classification_error_evaluator': 0.9064748287200928}
.........
Pass 0, Batch 120, Cost 161.713489, {'classification_error_evaluator': 0.8987341523170471}
.........
Pass 0, Batch 130, Cost 227.047119, {'classification_error_evaluator': 0.9279279112815857}
.........
Pass 0, Batch 140, Cost 253.434033, {'classification_error_evaluator': 0.9516128897666931}
.........
Pass 0, Batch 150, Cost 375.509229, {'classification_error_evaluator': 0.907608687877655}
.........
Pass 0, Batch 160, Cost 279.544653, {'classification_error_evaluator': 0.9270073175430298}
.........
Pass 0, Batch 170, Cost 295.797314, {'classification_error_evaluator': 0.9241379499435425}
.........
Pass 0, Batch 180, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 190, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 200, Cost nan, {'classification_error_evaluator': 1.0}

Besides, I tried some examples about sentiment classification and I can get reasonable results.
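When comparing runs like the ones above, it can help to pull the first non-finite cost out of a saved log automatically. The following is a small sketch that assumes the log lines have the `Pass P, Batch B, Cost C, {...}` shape shown in this thread; `first_non_finite` is a hypothetical helper, not part of Paddle.

```python
import math
import re

# Matches lines such as "Pass 0, Batch 180, Cost nan, {...}" from the logs above.
LINE_RE = re.compile(r"Pass (\d+), Batch (\d+), Cost (\S+),")


def first_non_finite(log_text):
    """Return (pass_id, batch_id) of the first NaN/inf cost, or None if all are finite."""
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip progress dots and unrelated log lines
        try:
            cost = float(match.group(3))
        except ValueError:
            continue  # cost field was not a number; ignore the line
        if not math.isfinite(cost):
            return int(match.group(1)), int(match.group(2))
    return None


sample_log = """\
Pass 0, Batch 170, Cost 295.797314, {'classification_error_evaluator': 0.9241379499435425}
.........
Pass 0, Batch 180, Cost nan, {'classification_error_evaluator': 1.0}
"""
result = first_non_finite(sample_log)
```

Because the cost is printed only every 10 batches in these logs, the reported batch is an upper bound: the first bad update happened somewhere after the previous printed batch.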

alvations · 2017-06-15T10:26:07Z

I'm having a similar issue with the cost going to NaN when training using the new https://github.com/PaddlePaddle/book/blob/develop/07.machine_translation/README.en.md

livc · 2017-06-15T11:17:41Z

@alvations How about trying to reduce the learning rate?

alvations · 2017-06-15T23:14:16Z

I've tried the older setup, which used to work in v0.8.0 and v0.9.0, with the v0.10.0 code:

        optimizer = paddle.optimizer.Adam(
            learning_rate=5e-4,
            regularization=paddle.optimizer.L2Regularization(rate=8e-4))

        trainer = paddle.trainer.SGD(
            cost=cost, parameters=parameters, update_equation=optimizer)

        # define data reader
        wmt14_reader = paddle.batch(
            paddle.reader.shuffle(
                paddle.dataset.wmt14.train(dict_size), buf_size=8192),
            batch_size=50)

The current default from https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py#L143 (as below) is frustratingly slow.

    batch_size = 5,
    learning_rate = 5e-5
    L2Regularization(rate=8e-4)

What changed in the optimizer such that the old settings don't work any more?

@livc I've also tried lowering the learning rate, but at some point the cost still goes to NaN and the training breaks =(

Is it because gradient clipping was made global in 2e4c0bd?
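For readers unfamiliar with the term, here is a generic sketch of one common form of gradient clipping, clipping by the global norm of all gradients. It is an illustration of the general technique only: this thread does not establish what commit 2e4c0bd actually changed, and `clip_by_global_norm` is a hypothetical helper, not Paddle's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm or total == 0.0:
        return list(grads)  # already small enough; leave untouched
    scale = max_norm / total
    return [g * scale for g in grads]


# A gradient of norm 5 is scaled down to norm 1; its direction is preserved.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))

# A gradient that is already within the limit is returned unchanged.
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, a single very large gradient shrinks all the others too, which is one reason a change in clipping behavior can alter how a previously working learning rate behaves.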

lcy-seso · 2017-08-27T09:48:00Z

Gradient clipping has been fixed in the current develop branch, and we also fixed a terrible bug in sequence_softmax. The NMT training problem in 0.10.0 has now been fixed.

I am closing this issue due to inactivity; please feel free to reopen it.

luotao1 mentioned this issue Apr 5, 2017

add wmt14 pretrained model #1750

Merged

alvations mentioned this issue Jun 22, 2017

Cost going to NaN with Paddle v0.10.0 for MT example #2563

Closed

lcy-seso closed this as completed Aug 27, 2017
