
The loss was always 87.3365. Is that normal, or is there an error somewhere? Thanks #2

Open
sidianshuia opened this issue May 2, 2018 · 3 comments

@sidianshuia

I0502 17:46:41.519320 22378 solver.cpp:218] Iteration 0 (0 iter/s, 1.67459s/100 iters), loss = 184.354
I0502 17:46:41.519423 22378 solver.cpp:237] Train net output #0: accuracy = 0
I0502 17:46:41.519461 22378 solver.cpp:237] Train net output #1: cross_entropy_loss = 9.21768 (* 20 = 184.354 loss)
I0502 17:46:41.519623 22378 sgd_solver.cpp:105] Iteration 0, lr = 0.01
I0502 17:48:26.913466 22378 solver.cpp:218] Iteration 100 (0.948815 iter/s, 105.395s/100 iters), loss = 1746.73
I0502 17:48:26.913585 22378 solver.cpp:237] Train net output #0: accuracy = 1
I0502 17:48:26.913609 22378 solver.cpp:237] Train net output #1: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0502 17:48:26.913686 22378 sgd_solver.cpp:105] Iteration 100, lr = 0.00998333
I0502 17:50:12.863924 22378 solver.cpp:218] Iteration 200 (0.943834 iter/s, 105.951s/100 iters), loss = 1746.73
I0502 17:50:12.864044 22378 solver.cpp:237] Train net output #0: accuracy = 1
I0502 17:50:12.864061 22378 solver.cpp:237] Train net output #1: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)
I0502 17:50:12.864110 22378 sgd_solver.cpp:105] Iteration 200, lr = 0.00996667
I0502 17:51:58.913350 22378 solver.cpp:218] Iteration 300 (0.942953 iter/s, 106.05s/100 iters), loss = 1746.73
I0502 17:51:58.913466 22378 solver.cpp:237] Train net output #0: accuracy = 1
I0502 17:51:58.913488 22378 solver.cpp:237] Train net output #1: cross_entropy_loss = 87.3365 (* 20 = 1746.73 loss)

@peteanderson80
Owner

You definitely have an issue with your install. It should look something like the log below. Check each install step, and in particular make sure your PYTHONPATH doesn't pick up another Caffe version or anything similar.

I1222 13:13:32.939466 3413 solver.cpp:218] Iteration 0 (-2.29907e-05 iter/s, 1.32516s/100 iters), loss = 184.287
I1222 13:13:32.939579 3413 solver.cpp:237] Train net output #0: accuracy = 0
I1222 13:13:32.939606 3413 solver.cpp:237] Train net output #1: cross_entropy_loss = 9.21434 (* 20 = 184.287 loss)
I1222 13:13:32.939687 3413 sgd_solver.cpp:105] Iteration 0, lr = 0.01
I1222 13:13:32.975464 3413 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 28.243 > 10) by scale factor 0.35407
I1222 13:13:32.975529 3414 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 28.243 > 10) by scale factor 0.35407
I1222 13:14:18.270639 3413 solver.cpp:218] Iteration 100 (2.20603 iter/s, 45.3303s/100 iters), loss = 118.122
I1222 13:14:18.270699 3413 solver.cpp:237] Train net output #0: accuracy = 0.16179
I1222 13:14:18.270714 3413 solver.cpp:237] Train net output #1: cross_entropy_loss = 5.03472 (* 20 = 100.694 loss)
I1222 13:14:18.270750 3413 sgd_solver.cpp:105] Iteration 100, lr = 0.00998333
I1222 13:14:18.297389 3413 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 42.3492 > 10) by scale factor 0.236132
I1222 13:14:18.297430 3414 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 42.3492 > 10) by scale factor 0.236132
I1222 13:15:03.837746 3413 solver.cpp:218] Iteration 200 (2.19461 iter/s, 45.5662s/100 iters), loss = 97.5572
I1222 13:15:03.837810 3413 solver.cpp:237] Train net output #0: accuracy = 0.218412
I1222 13:15:03.837826 3413 solver.cpp:237] Train net output #1: cross_entropy_loss = 4.85795 (* 20 = 97.159 loss)
I1222 13:15:03.837858 3413 sgd_solver.cpp:105] Iteration 200, lr = 0.00996667
I1222 13:15:03.864506 3413 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 31.2961 > 10) by scale factor 0.319529
I1222 13:15:03.864542 3414 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 31.2961 > 10) by scale factor 0.319529
I1222 13:15:49.513130 3413 solver.cpp:218] Iteration 300 (2.1894 iter/s, 45.6746s/100 iters), loss = 92.0657
I1222 13:15:49.513195 3413 solver.cpp:237] Train net output #0: accuracy = 0.182131
I1222 13:15:49.513217 3413 solver.cpp:237] Train net output #1: cross_entropy_loss = 4.51677 (* 20 = 90.3353 loss)
I1222 13:15:49.513250 3413 sgd_solver.cpp:105] Iteration 300, lr = 0.00995
I1222 13:15:49.539894 3413 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 30.2248 > 10) by scale factor 0.330855
I1222 13:15:49.539932 3414 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 30.2248 > 10) by scale factor 0.330855
I1222 13:16:35.308599 3413 solver.cpp:218] Iteration 400 (2.18366 iter/s, 45.7947s/100 iters), loss = 90.0139
I1222 13:16:35.308663 3413 solver.cpp:237] Train net output #0: accuracy = 0.233628
I1222 13:16:35.308679 3413 solver.cpp:237] Train net output #1: cross_entropy_loss = 4.33856 (* 20 = 86.7711 loss)
I1222 13:16:35.308713 3413 sgd_solver.cpp:105] Iteration 400, lr = 0.00993333
I1222 13:16:35.335464 3413 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 19.7085 > 10) by scale factor 0.507395
I1222 13:16:35.335469 3414 sgd_solver.cpp:92] Gradient clipping: scaling down gradients (L2 norm 19.7085 > 10) by scale factor 0.507395
I1222 13:17:21.180307 3413 solver.cpp:218] Iteration 500 (2.18003 iter/s, 45.8709s/100 iters), loss = 88.1978
I1222 13:17:21.180373 3413 solver.cpp:237] Train net output #0: accuracy = 0.250417
I1222 13:17:21.180390 3413 solver.cpp:237] Train net output #1: cross_entropy_loss = 4.29903 (* 20 = 85.9806 loss)
I1222 13:17:21.180424 3413 sgd_solver.cpp:105] Iteration 500, lr = 0.00991667
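A quick way to act on the PYTHONPATH point above (a suggested sketch, not part of the original reply) is to check which Caffe build Python actually resolves:

```shell
# Show which caffe module Python imports first on the current PYTHONPATH
python -c "import caffe; print(caffe.__file__)" 2>/dev/null \
  || echo "caffe not importable"

# List any Caffe-related entries on PYTHONPATH; a stale entry from another
# Caffe checkout can shadow the build this repo expects
echo "$PYTHONPATH" | tr ':' '\n' | grep -i caffe \
  || echo "no caffe entries in PYTHONPATH"
```

If the printed path points at a different Caffe checkout than the one built for this repo, that stale entry is the likely culprit.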

@akira-l

akira-l commented Oct 31, 2018

It seems that if I use 4 GPUs for training, the loss is always 87.3.
If I use 2 GPUs, per the original setting, the loss quickly decreases to 4.3.

@peteanderson80
Owner

To change from 2 GPUs to 4 GPUs, I think changes would be needed in several parts of the code. It would be best to build a new experiment using create_caption_lstm.py: change param['gpu_ids'] = '0,1' to list four GPUs, and change param['train_batch_size'] = 50 to 25. It would also help to split the training tsvs into 4 files instead of 2, and set the new filenames in param['train_feature_sources'] = ['data/tsv/trainval/karpathy_train_resnet101_faster_rcnn_genome.tsv.%d' % i for i in range(2)] (using range(4) for four files). That way each GPU uses exactly one training file.
I'm not sure what went wrong before, but if you try these steps you might have more luck.
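Pulling the changes above into one place, the edited config might look like this (a sketch only: the param keys and original values are from the reply above, while the GPU ids '0,1,2,3' and the range(4) shard count are assumptions for a four-GPU setup):

```python
# Hypothetical four-GPU variant of the experiment config built by
# create_caption_lstm.py, following the steps in the reply above.
param = {}

# Four GPUs instead of the default two (ids assumed to be 0-3)
param['gpu_ids'] = '0,1,2,3'

# Halve the per-GPU batch size so the effective batch stays the same
param['train_batch_size'] = 25

# One training tsv shard per GPU: range(4) instead of range(2),
# after splitting the training tsvs into four files
param['train_feature_sources'] = [
    'data/tsv/trainval/karpathy_train_resnet101_faster_rcnn_genome.tsv.%d' % i
    for i in range(4)
]
```

The intent is that the shard count always matches the GPU count, so each GPU reads exactly one training file.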
