Did anyone reproduce the result listed in the paper with multi-GPUs? #9
Comments
Carefully check data_format in all Keras convolutions. The code does not actually support channels_first.
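For anyone hitting this, a minimal sketch of what that check looks like, assuming the model is built with tf.keras layers (the function and argument names below are illustrative, not the repo's actual code): every Conv2D and BatchNormalization has to stay on channels_last, since the input pipeline feeds NHWC tensors.
```python
# Minimal sketch (illustrative, not the repo's actual code): keep every
# Conv2D / BatchNormalization on data_format='channels_last', because the
# inputs are NHWC and channels_first is not supported by this code.
import tensorflow as tf

def conv_bn_relu(inputs, filters, kernel_size, data_format='channels_last'):
    x = tf.keras.layers.Conv2D(
        filters, kernel_size, padding='same', use_bias=False,
        data_format=data_format)(inputs)
    # BatchNormalization must normalize over the channel axis that matches
    # the chosen data_format.
    channel_axis = 1 if data_format == 'channels_first' else -1
    x = tf.keras.layers.BatchNormalization(axis=channel_axis)(x)
    return tf.nn.relu(x)
```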
Step 6255 is where warmup ends, after which the dropout rate becomes 100. My loss also diverged after step 6255. Have you solved this problem?
Any updates?
I'm also interested to know whether anyone has reproduced the results at all (on multi-GPU or on TPU).
From what I understand, it is normal for the total loss to become very high, because the runtime loss is only included after the warmup stage: code
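To make the jump in the log below concrete, here is a rough sketch of that gating in TF 1.x style; the names ce_loss, runtime_loss and rt_lambda, and the dummy values, are placeholders rather than the repo's actual identifiers.
```python
# Rough sketch of the warmup gating described above (TF 1.x style);
# ce_loss, runtime_loss and rt_lambda are placeholder names with dummy
# values, not the repo's actual identifiers.
import tensorflow as tf

ce_loss = tf.constant(6.8)          # cross-entropy term (dummy value)
runtime_loss = tf.constant(3400.0)  # runtime/latency term (dummy value)
rt_lambda = 0.020                   # trade-off weight (from the run dir name)

global_step = tf.train.get_or_create_global_step()
warmup_steps = tf.constant(6255, dtype=tf.int64)  # warmup length in this thread

# During warmup only the cross-entropy loss is optimized; once the global
# step passes warmup_steps the runtime term is added as well, so the logged
# total loss jumps (here from ~6.8 to ~75-94) even if CE itself is fine.
total_loss = tf.cond(
    tf.less(global_step, warmup_steps),
    lambda: ce_loss,
    lambda: ce_loss + rt_lambda * runtime_loss)
```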
@xingwangsfu Hi, I replaced 'tf.contrib.tpu.TPUEstimatorSpec' with 'tf.estimator.EstimatorSpec', but I found that the latter does not have a 'host_call' argument. How should I handle this?
Hello, have you solved this problem?
@marsggbo "host_call" can be ignored by setting "skip_host_call" to True.
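In case it helps others making the same switch, a rough sketch of what that can look like in the model_fn, assuming TF 1.x; "skip_host_call" follows the thread, but "use_tpu" and the rest of the surrounding code are illustrative, not the repo's exact implementation.
```python
# Rough sketch (illustrative, not the repo's exact code) of returning an
# EstimatorSpec instead of a TPUEstimatorSpec when running on GPUs.
import tensorflow as tf

def build_spec(mode, loss, train_op, host_call, params):
    if params.get('use_tpu', False):
        # On TPU, host_call is how summaries are written on the host;
        # skip_host_call=True simply drops it.
        return tf.contrib.tpu.TPUEstimatorSpec(
            mode=mode, loss=loss, train_op=train_op,
            host_call=None if params.get('skip_host_call') else host_call)
    # tf.estimator.EstimatorSpec has no host_call argument, so on GPU the
    # host_call is dropped; summaries can instead be written with
    # tf.summary ops inside the graph.
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
```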
@xingwangsfu, did you fix the issue? In the paper, Fig. 4 (left) shows the CE loss going from about 7.0 down to about 3.0, but it doesn't mention the total loss.
Hello, there are some exceptions when I run the code. Could you give me some advice about this dropout issue? Many thanks! Question link:
I did nothing to the code except replace the TPU with multiple GPUs, and my training is stuck at a very high loss. I assume this is caused by the large learning rate after warm-start, since my loss is normal before iteration 6255, which is the number of warm-start iterations.
Here is my training log:
I0618 17:29:12.284162 140720170149632 tf_logging.py:115] global_step/sec: 1.26958
I0618 17:29:12.284616 140720170149632 tf_logging.py:115] loss = 6.7994366, step = 5760 (50.410 sec)
I0618 17:30:03.805228 140720170149632 tf_logging.py:115] global_step/sec: 1.24221
I0618 17:30:03.805763 140720170149632 tf_logging.py:115] loss = 6.8188353, step = 5824 (51.521 sec)
I0618 17:30:54.878273 140720170149632 tf_logging.py:115] global_step/sec: 1.25311
I0618 17:30:54.878723 140720170149632 tf_logging.py:115] loss = 6.8027916, step = 5888 (51.073 sec)
I0618 17:31:46.115418 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:31:46.144254 140720170149632 tf_logging.py:115] loss = 6.8010216, step = 5952 (51.266 sec)
I0618 17:32:37.585529 140720170149632 tf_logging.py:115] global_step/sec: 1.24344
I0618 17:32:37.585925 140720170149632 tf_logging.py:115] loss = 6.789137, step = 6016 (51.442 sec)
I0618 17:33:28.797896 140720170149632 tf_logging.py:115] global_step/sec: 1.2497
I0618 17:33:28.798456 140720170149632 tf_logging.py:115] loss = 6.799903, step = 6080 (51.213 sec)
I0618 17:34:19.681088 140720170149632 tf_logging.py:115] global_step/sec: 1.25778
I0618 17:34:19.681564 140720170149632 tf_logging.py:115] loss = 6.803883, step = 6144 (50.883 sec)
I0618 17:35:09.831330 140720170149632 tf_logging.py:115] global_step/sec: 1.27617
I0618 17:35:09.831943 140720170149632 tf_logging.py:115] loss = 6.7922, step = 6208 (50.150 sec)
I0618 17:35:46.901006 140720170149632 tf_logging.py:115] Saving checkpoints for 6255 into /mnt/cephfs_wj/cv/wangxing/tmp/model-single-path-search/lambda-val-0.020/model.ckpt.
I0618 17:36:07.512706 140720170149632 tf_logging.py:115] global_step/sec: 1.10954
I0618 17:36:07.513106 140720170149632 tf_logging.py:115] loss = 94.23678, step = 6272 (57.681 sec)
I0618 17:36:57.975293 140720170149632 tf_logging.py:115] global_step/sec: 1.26827
I0618 17:36:57.975636 140720170149632 tf_logging.py:115] loss = 84.60893, step = 6336 (50.463 sec)
I0618 17:37:49.209366 140720170149632 tf_logging.py:115] global_step/sec: 1.24917
I0618 17:37:49.210039 140720170149632 tf_logging.py:115] loss = 83.81077, step = 6400 (51.234 sec)
I0618 17:38:40.446595 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:38:40.447212 140720170149632 tf_logging.py:115] loss = 83.7096, step = 6464 (51.237 sec)
I0618 17:39:31.800470 140720170149632 tf_logging.py:115] global_step/sec: 1.24625
I0618 17:39:31.800811 140720170149632 tf_logging.py:115] loss = 75.41687, step = 6528 (51.354 sec)
I0618 17:40:22.979326 140720170149632 tf_logging.py:115] global_step/sec: 1.25052
I0618 17:40:22.979668 140720170149632 tf_logging.py:115] loss = 75.42241, step = 6592 (51.179 sec)
I0618 17:41:14.112971 140720170149632 tf_logging.py:115] global_step/sec: 1.25162
I0618 17:41:14.137188 140720170149632 tf_logging.py:115] loss = 75.344826, step = 6656 (51.157 sec)
I0618 17:42:05.177355 140720170149632 tf_logging.py:115] global_step/sec: 1.25332
I0618 17:42:05.177694 140720170149632 tf_logging.py:115] loss = 75.358315, step = 6720 (51.041 sec)
I0618 17:42:56.014090 140720170149632 tf_logging.py:115] global_step/sec: 1.25893
I0618 17:42:56.014433 140720170149632 tf_logging.py:115] loss = 75.37303, step = 6784 (50.837 sec)
I0618 17:43:47.115759 140720170149632 tf_logging.py:115] global_step/sec: 1.25241
I0618 17:43:47.116162 140720170149632 tf_logging.py:115] loss = 75.35231, step = 6848 (51.102 sec)
I0618 17:44:38.047000 140720170149632 tf_logging.py:115] global_step/sec: 1.2566
I0618 17:44:38.047545 140720170149632 tf_logging.py:115] loss = 75.34932, step = 6912 (50.931 sec)