
Cannot use --init_model_path #1738

Closed
ghost opened this issue Apr 1, 2017 · 12 comments
Comments

@ghost

ghost commented Apr 1, 2017

I'm trying to use the NMT model you trained (fr->en).
When I use --init_model_path=<your_pretrained_model>, I get the following error:

I0401 16:41:14.938364 3162 TrainerInternal.cpp:165] Batch=2 samples=100 AvgCost=nan CurrentCost=nan Eval: classification_error_evaluator=0.696074 CurrentEval: classification_error_evaluator=1

The command line is:

paddle train \
    --config='translation/train.conf' \
    --save_dir='translation/model/wmt14_model' \
    --use_gpu=1 \
    --num_passes=16 \
    --show_parameter_stats_period=1 \
    --trainer_count=1 \
    --log_period=1 \
    --dot_period=5 \
    --init_model_path=model/wmt14_model/pass-00012 \
    2>&1 | tee 'translation/train.log'

Can anyone help me?

Thanks

@helinwang
Contributor

@KeepLearning12138 Please refer to this guide on how to run PaddlePaddle: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst

@ghost
Author

ghost commented Apr 4, 2017

@livc @helinwang Thanks a lot for your reply. Following your comments, I installed the Docker image on an isolated server, ran the demo code from Chapter 7 of the PaddlePaddle book (i.e., train.py), and got the following results:

Pass 0, Batch 0, Cost 164.959180, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 241.103516, {'classification_error_evaluator': 0.9316239356994629}
.........
Pass 0, Batch 20, Cost 341.873877, {'classification_error_evaluator': 0.9216867685317993}
.........
Pass 0, Batch 30, Cost 189.436011, {'classification_error_evaluator': 0.9130434989929199}
.........
Pass 0, Batch 40, Cost 209.991748, {'classification_error_evaluator': 0.9509803652763367}
.........
Pass 0, Batch 50, Cost 267.384741, {'classification_error_evaluator': 0.8999999761581421}
.........
Pass 0, Batch 60, Cost 213.720215, {'classification_error_evaluator': 0.942307710647583}
.........
Pass 0, Batch 70, Cost 195.130457, {'classification_error_evaluator': 0.9473684430122375}
.........
Pass 0, Batch 80, Cost 158.140320, {'classification_error_evaluator': 0.9610389471054077}
.........
Pass 0, Batch 90, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 100, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 110, Cost nan, {'classification_error_evaluator': 1.0}

Can we directly use "train.py"?

Best Wishes

@helinwang
Contributor

@KeepLearning12138 Sure, please see the link I posted in the previous comment: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst. The specific line is:

docker run --rm -v ~/workspace:/workspace paddlepaddle/paddle:0.10.0rc2 python /workspace/train.py

The above command mounts ~/workspace into /workspace inside the Docker container, and runs python /workspace/train.py when the container starts.
You can copy train.py from https://github.com/PaddlePaddle/book into ~/workspace.
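
For reference, here is a minimal sketch of what such a train.py driver typically looks like with the v2 API; the network builder seqToseq_net and its signature are assumptions here (the book's actual script may differ):

    import paddle.v2 as paddle

    def main():
        # The CPU Docker image requires use_gpu=False; trainer_count=1 matches the log above.
        paddle.init(use_gpu=False, trainer_count=1)

        dict_size = 30000
        # Assumed to come from the book's machine-translation example (hypothetical name).
        cost = seqToseq_net(source_dict_dim=dict_size, target_dict_dim=dict_size)
        parameters = paddle.parameters.create(cost)

        optimizer = paddle.optimizer.Adam(
            learning_rate=5e-5,
            regularization=paddle.optimizer.L2Regularization(rate=8e-4))
        trainer = paddle.trainer.SGD(
            cost=cost, parameters=parameters, update_equation=optimizer)

        wmt14_reader = paddle.batch(
            paddle.reader.shuffle(
                paddle.dataset.wmt14.train(dict_size), buf_size=8192),
            batch_size=5)

        def event_handler(event):
            # Prints the "Pass X, Batch Y, Cost ..." lines seen in the logs above.
            if isinstance(event, paddle.event.EndIteration) and event.batch_id % 10 == 0:
                print("Pass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics))

        trainer.train(
            reader=wmt14_reader, event_handler=event_handler, num_passes=2)

    if __name__ == '__main__':
        main()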

@ghost
Author

ghost commented Apr 4, 2017

@helinwang Thanks for the quick reply. Actually, I did follow your suggestion. What I mean is that although I can run "train.py", the training easily falls into "NaN". When I set the learning rate to "0.0", the NaN disappears. I just want to confirm where the error comes from.

@helinwang
Contributor

helinwang commented Apr 4, 2017

Good to know that you are experimenting! However, with the learning rate set to 0.0 the network will stop learning anything :)
I think this PR is trying to fix the problem you are encountering: PaddlePaddle/book#234. It adds regularization to the network to prevent the weights from growing too large and eventually causing a floating point exception (FPE, which is why you got "NaN", i.e. not a number).
Could you check whether your train.py contains the changes from that PR? If so, you can also try a higher regularization rate.
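
For concreteness, in the v2 API that regularization is a one-line change on the optimizer. A minimal sketch (the rate 8e-4 is just the value discussed in this thread, not a tuned recommendation; cost and parameters come from the network config as in train.py):

    import paddle.v2 as paddle

    # Attach L2 weight decay so the weights cannot grow unbounded and overflow into NaN.
    optimizer = paddle.optimizer.Adam(
        learning_rate=5e-5,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4))

    # cost and parameters are built elsewhere, as in the book's train.py.
    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=optimizer)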

@ghost
Author

ghost commented Apr 4, 2017

Thanks for the reply.
I do not think L2 regularization is the key problem for neural machine translation; in my previous experience, a 1-layer GRU trains fine without weight decay. What I need to solve is this: since you have already released a pretrained model (fr->en), I want to use it as a warm start. After passing 100 samples, AvgCost=nan. Even with lr=0, the same bug appears. I am really curious about this problem. Can you solve it? Or, where can I find a script that runs the Paddle NMT demo (including the warm start)? Many thanks in advance.

@luotao1
Contributor

luotao1 commented Apr 5, 2017

@KeepLearning12138 after #1750, you can use

parameters = paddle.dataset.wmt14.model()

instead of

parameters = paddle.parameters.create(cost)

to do a warm start.
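
Put together, the warm start then looks roughly like this (a sketch; only the parameters line changes relative to the usual train.py, where cost and optimizer are built as before):

    import paddle.v2 as paddle

    # Warm start: load the released wmt14 (fr->en) parameters instead of random initialization.
    parameters = paddle.dataset.wmt14.model()
    # parameters = paddle.parameters.create(cost)   # cold start (random init)

    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=optimizer)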

@ghost
Author

ghost commented Apr 6, 2017

It still does not work; many NaNs are detected. Only if I set batch_size to 1 does the NaN seem to disappear. Testing is OK, but training does not work.

The command line is:
sudo nvidia-docker run --rm -v ~/jack/Sandbox:/Sandbox paddlepaddle/paddle:0.10.0rc2-gpu python /Sandbox/train.py

And the log is:
I0406 02:33:20.648465 1 Util.cpp:161] commandline: --use_gpu=True --trainer_count=1
[INFO 2017-04-06 02:33:25,736 networks.py:1472] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2017-04-06 02:33:25,737 networks.py:1478] The output order is [classification_cost_0]
[INFO 2017-04-06 02:33:25,755 networks.py:1472] The input order is [source_language_word, target_language_word, target_language_next_word]
[INFO 2017-04-06 02:33:25,755 networks.py:1478] The output order is [classification_cost_0]
I0406 02:33:25.675101 1 GradientMachine.cpp:86] Initing parameters..
I0406 02:33:28.887544 1 GradientMachine.cpp:93] Init parameters done.
Cache file /root/.cache/paddle/dataset/wmt14/wmt14.tgz not found, downloading http://paddlepaddle.cdn.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz

[==================================================]
Pass 0, Batch 0, Cost 232.995337, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 10, Cost 342.125854, {'classification_error_evaluator': 0.9698795080184937}
.........
Pass 0, Batch 20, Cost 191.533044, {'classification_error_evaluator': 0.9462365508079529}
.........
Pass 0, Batch 30, Cost 226.437500, {'classification_error_evaluator': 0.9545454382896423}
.........
Pass 0, Batch 40, Cost 220.021777, {'classification_error_evaluator': 0.9345794320106506}
.........
Pass 0, Batch 50, Cost 293.818042, {'classification_error_evaluator': 0.9510489702224731}
.........
Pass 0, Batch 60, Cost 225.875562, {'classification_error_evaluator': 0.9272727370262146}
.........
Pass 0, Batch 70, Cost 201.168530, {'classification_error_evaluator': 0.9489796161651611}
.........
Pass 0, Batch 80, Cost 309.581934, {'classification_error_evaluator': 0.9602649211883545}
.........
Pass 0, Batch 90, Cost 252.174194, {'classification_error_evaluator': 0.9593495726585388}
.........
Pass 0, Batch 100, Cost 190.457666, {'classification_error_evaluator': 0.9462365508079529}
.........
Pass 0, Batch 110, Cost 284.358179, {'classification_error_evaluator': 0.9064748287200928}
.........
Pass 0, Batch 120, Cost 161.713489, {'classification_error_evaluator': 0.8987341523170471}
.........
Pass 0, Batch 130, Cost 227.047119, {'classification_error_evaluator': 0.9279279112815857}
.........
Pass 0, Batch 140, Cost 253.434033, {'classification_error_evaluator': 0.9516128897666931}
.........
Pass 0, Batch 150, Cost 375.509229, {'classification_error_evaluator': 0.907608687877655}
.........
Pass 0, Batch 160, Cost 279.544653, {'classification_error_evaluator': 0.9270073175430298}
.........
Pass 0, Batch 170, Cost 295.797314, {'classification_error_evaluator': 0.9241379499435425}
.........
Pass 0, Batch 180, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 190, Cost nan, {'classification_error_evaluator': 1.0}
.........
Pass 0, Batch 200, Cost nan, {'classification_error_evaluator': 1.0}

Besides, I tried some examples about sentiment classification and I can get reasonable results.

@alvations
Contributor

I'm having a similar issue with the cost going to NaN when training with the new https://github.com/PaddlePaddle/book/blob/develop/07.machine_translation/README.en.md

@livc
Member

livc commented Jun 15, 2017

@alvations How about trying to reduce the learning rate?

@alvations
Contributor

alvations commented Jun 15, 2017

I've tried the older setup, which used to work in v0.8.0 and v0.9.0, with the v0.10.0 code:

        optimizer = paddle.optimizer.Adam(
            learning_rate=5e-4,
            regularization=paddle.optimizer.L2Regularization(rate=8e-4))

        trainer = paddle.trainer.SGD(
            cost=cost, parameters=parameters, update_equation=optimizer)

        # define data reader
        wmt14_reader = paddle.batch(
            paddle.reader.shuffle(
                paddle.dataset.wmt14.train(dict_size), buf_size=8192),
            batch_size=50)

The current default from https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py#L143 (as below) is frustratingly slow.

    batch_size = 5,
    learning_rate = 5e-5
    L2Regularization(rate=8e-4)

What changed in the optimizer such that the old settings no longer work?

@livc I've also tried lowering the learning rate, but at some point the cost still goes to NaN and the training breaks =(

Is it because gradient clipping was made global in 2e4c0bd?
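
In the meantime, a more conservative configuration is the usual workaround for the NaN. A sketch under the assumption that this build accepts gradient_clipping_threshold as an optimizer argument (verify against your installed version; the values are examples, not recommendations):

    import paddle.v2 as paddle

    # Conservative setup against exploding costs: lower learning rate, L2 weight decay,
    # and a global gradient clipping threshold (the clipping argument is an assumption
    # for this version).
    optimizer = paddle.optimizer.Adam(
        learning_rate=5e-5,
        regularization=paddle.optimizer.L2Regularization(rate=8e-4),
        gradient_clipping_threshold=25.0)

    trainer = paddle.trainer.SGD(
        cost=cost, parameters=parameters, update_equation=optimizer)

    # Small batches were also reported above to avoid the NaN.
    wmt14_reader = paddle.batch(
        paddle.reader.shuffle(
            paddle.dataset.wmt14.train(dict_size), buf_size=8192),
        batch_size=5)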

@lcy-seso
Contributor

Gradient clipping has been fixed in the current develop branch, and we also fixed a serious bug in sequence_softmax. NMT training in 0.10.0 now works.

I am closing this issue due to inactivity; please feel free to reopen it.
