Got an error during training #51

Closed
fearless77 opened this issue Jul 31, 2018 · 2 comments

@fearless77

I got the following error, but I don't know what's wrong or how to handle it.
I really need your help!

/usr/bin/python2.7 /home/vipsl-422-1/Desktop/self-critical.pytorch-master/train.py --id fc --caption_model fc --input_json ./data/cocotalk.json --input_fc_dir ./data/cocotalk_fc --input_att_dir ./data/cocotalk_att --input_label_h5 ./data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30 --language_eval 1
tensorboardX is not installed
DataLoader loading json file: ./data/cocotalk.json
vocab size is 9487
DataLoader loading h5 file: ./data/cocotalk_fc ./data/cocotalk_att data/cocotalk_box ./data/cocotalk_label.h5
max sequence length in data is 16
read 123287 image features
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py:24: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
Read data: 0.1044049263
/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:1006: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:995: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
Terminating BlobFetcher
Traceback (most recent call last):
File "/home/vipsl-422-1/Desktop/self-critical.pytorch-master/train.py", line 224, in
train(opt)
File "/home/vipsl-422-1/Desktop/self-critical.pytorch-master/train.py", line 125, in train
loss = crit(dp_model(fc_feats, att_feats, labels, att_masks), labels[:,1:], masks[:,1:])
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.gather(outputs, self.output_device)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
return gather(outputs, output_device, dim=self.dim)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [25, 15, 9488], but expected [25, 17, 9488] (gather at torch/csrc/cuda/comm.cpp:183)
frame #0: + 0xc4f7fa (0x7f556503f7fa in /usr/local/lib/python2.7/dist-packages/torch/_C.so)
frame #1: + 0x3913db (0x7f55647813db in /usr/local/lib/python2.7/dist-packages/torch/_C.so)
frame #2: PyEval_EvalFrameEx + 0x5ca (0x4bc3fa in /usr/bin/python2.7)
frame #3: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #4: PyEval_EvalFrameEx + 0x58b7 (0x4c16e7 in /usr/bin/python2.7)
frame #5: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #6: /usr/bin/python2.7() [0x4d54b9]
frame #7: PyObject_Call + 0x3e (0x4a577e in /usr/bin/python2.7)
frame #8: PyEval_CallObjectWithKeywords + 0x30 (0x4c5e10 in /usr/bin/python2.7)
frame #9: THPFunction_apply(_object*, _object*) + 0x38f (0x7f5564b60c8f in /usr/local/lib/python2.7/dist-packages/torch/_C.so)
frame #10: PyEval_EvalFrameEx + 0x729e (0x4c30ce in /usr/bin/python2.7)
frame #11: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #12: PyEval_EvalFrameEx + 0x603f (0x4c1e6f in /usr/bin/python2.7)
frame #13: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #14: PyEval_EvalFrameEx + 0x58b7 (0x4c16e7 in /usr/bin/python2.7)
frame #15: PyEval_EvalFrameEx + 0x553f (0x4c136f in /usr/bin/python2.7)
frame #16: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #17: /usr/bin/python2.7() [0x4d55f3]
frame #18: PyObject_Call + 0x3e (0x4a577e in /usr/bin/python2.7)
frame #19: PyEval_EvalFrameEx + 0x2f0d (0x4bed3d in /usr/bin/python2.7)
frame #20: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #21: /usr/bin/python2.7() [0x4d54b9]
frame #22: /usr/bin/python2.7() [0x4eebee]
frame #23: PyObject_Call + 0x3e (0x4a577e in /usr/bin/python2.7)
frame #24: /usr/bin/python2.7() [0x548253]
frame #25: PyEval_EvalFrameEx + 0x578f (0x4c15bf in /usr/bin/python2.7)
frame #26: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #27: PyEval_EvalFrameEx + 0x603f (0x4c1e6f in /usr/bin/python2.7)
frame #28: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #29: /usr/bin/python2.7() [0x4eb30f]
frame #30: PyRun_FileExFlags + 0x82 (0x4e5422 in /usr/bin/python2.7)
frame #31: PyRun_SimpleFileExFlags + 0x186 (0x4e3cd6 in /usr/bin/python2.7)
frame #32: Py_Main + 0x612 (0x493ae2 in /usr/bin/python2.7)
frame #33: __libc_start_main + 0xf0 (0x7f559dba9830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x29 (0x4933e9 in /usr/bin/python2.7)

Process finished with exit code 1
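
For what it's worth, the RuntimeError comes from DataParallel's gather step: the batch is split across the two GPUs, each model replica unrolled a different number of decoding steps for its chunk (15 vs. 17 here), and the per-GPU outputs can no longer be concatenated back into one tensor. Below is a minimal CPU-only sketch of the same mismatch, using the sizes from the traceback; the tensors are placeholders, not the real model outputs.

import torch

# DataParallel gathers per-GPU outputs by concatenating them along dim 0,
# which requires every other dimension to match. Here one replica produced
# logits with 17 time steps and the other with 15, so the gather fails.
out_gpu0 = torch.zeros(25, 17, 9488)   # shape reported as "expected"
out_gpu1 = torch.zeros(25, 15, 9488)   # shape reported as "invalid"
try:
    torch.cat([out_gpu0, out_gpu1], dim=0)   # what the gather effectively does
except RuntimeError as e:
    print(e)   # sizes must match except in the concatenation dimension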

@KaiKangSDU

I got the same error. It did not occur the first time I ran the program, but it appeared after I ran it with PG. Have you dealt with this error?

@fearless77
Author

I don't use multi-GPU anymore; I changed the code to train on only one GPU, and training now succeeds.
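
For anyone hitting the same thing, here is a minimal sketch of the two single-GPU workarounds already hinted at by the DataParallel warning in the log above. The exact way train.py builds and wraps the model is assumed, so treat the names below as placeholders.

import os
# Option 1: expose only one GPU before CUDA is initialised (can also be done
# from the shell: CUDA_VISIBLE_DEVICES=0 python train.py ...)
os.environ.setdefault('CUDA_VISIBLE_DEVICES', '0')

import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # stand-in for the captioning model from train.py
if torch.cuda.is_available():
    model = model.cuda()

# Option 2: keep the DataParallel wrapper but pin it to a single device, so
# there is no cross-GPU gather at all (train.py is assumed to wrap the model
# roughly as `dp_model = torch.nn.DataParallel(model)`).
dp_model = nn.DataParallel(model, device_ids=[0]) if torch.cuda.is_available() else model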
