Got an error during training #51

Closed
fearless77 opened this issue Jul 31, 2018 · 2 comments

@fearless77

I got the following error, but I don't know what's wrong or how to handle it.
I really need your help!

/usr/bin/python2.7 /home/vipsl-422-1/Desktop/self-critical.pytorch-master/train.py --id fc --caption_model fc --input_json ./data/cocotalk.json --input_fc_dir ./data/cocotalk_fc --input_att_dir ./data/cocotalk_att --input_label_h5 ./data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30 --language_eval 1
tensorboardX is not installed
DataLoader loading json file: ./data/cocotalk.json
vocab size is 9487
DataLoader loading h5 file: ./data/cocotalk_fc ./data/cocotalk_att data/cocotalk_box ./data/cocotalk_label.h5
max sequence length in data is 16
read 123287 image features
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py:24: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
Read data: 0.1044049263
/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:1006: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py:995: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
Terminating BlobFetcher
Traceback (most recent call last):
File "/home/vipsl-422-1/Desktop/self-critical.pytorch-master/train.py", line 224, in
train(opt)
File "/home/vipsl-422-1/Desktop/self-critical.pytorch-master/train.py", line 125, in train
loss = crit(dp_model(fc_feats, att_feats, labels, att_masks), labels[:,1:], masks[:,1:])
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.gather(outputs, self.output_device)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
return gather(outputs, output_device, dim=self.dim)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [25, 15, 9488], but expected [25, 17, 9488] (gather at torch/csrc/cuda/comm.cpp:183)
frame #0: + 0xc4f7fa (0x7f556503f7fa in /usr/local/lib/python2.7/dist-packages/torch/_C.so)
frame #1: + 0x3913db (0x7f55647813db in /usr/local/lib/python2.7/dist-packages/torch/_C.so)
frame #2: PyEval_EvalFrameEx + 0x5ca (0x4bc3fa in /usr/bin/python2.7)
frame #3: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #4: PyEval_EvalFrameEx + 0x58b7 (0x4c16e7 in /usr/bin/python2.7)
frame #5: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #6: /usr/bin/python2.7() [0x4d54b9]
frame #7: PyObject_Call + 0x3e (0x4a577e in /usr/bin/python2.7)
frame #8: PyEval_CallObjectWithKeywords + 0x30 (0x4c5e10 in /usr/bin/python2.7)
frame #9: THPFunction_apply(_object*, _object*) + 0x38f (0x7f5564b60c8f in /usr/local/lib/python2.7/dist-packages/torch/_C.so)
frame #10: PyEval_EvalFrameEx + 0x729e (0x4c30ce in /usr/bin/python2.7)
frame #11: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #12: PyEval_EvalFrameEx + 0x603f (0x4c1e6f in /usr/bin/python2.7)
frame #13: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #14: PyEval_EvalFrameEx + 0x58b7 (0x4c16e7 in /usr/bin/python2.7)
frame #15: PyEval_EvalFrameEx + 0x553f (0x4c136f in /usr/bin/python2.7)
frame #16: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #17: /usr/bin/python2.7() [0x4d55f3]
frame #18: PyObject_Call + 0x3e (0x4a577e in /usr/bin/python2.7)
frame #19: PyEval_EvalFrameEx + 0x2f0d (0x4bed3d in /usr/bin/python2.7)
frame #20: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #21: /usr/bin/python2.7() [0x4d54b9]
frame #22: /usr/bin/python2.7() [0x4eebee]
frame #23: PyObject_Call + 0x3e (0x4a577e in /usr/bin/python2.7)
frame #24: /usr/bin/python2.7() [0x548253]
frame #25: PyEval_EvalFrameEx + 0x578f (0x4c15bf in /usr/bin/python2.7)
frame #26: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #27: PyEval_EvalFrameEx + 0x603f (0x4c1e6f in /usr/bin/python2.7)
frame #28: PyEval_EvalCodeEx + 0x306 (0x4b9ab6 in /usr/bin/python2.7)
frame #29: /usr/bin/python2.7() [0x4eb30f]
frame #30: PyRun_FileExFlags + 0x82 (0x4e5422 in /usr/bin/python2.7)
frame #31: PyRun_SimpleFileExFlags + 0x186 (0x4e3cd6 in /usr/bin/python2.7)
frame #32: Py_Main + 0x612 (0x493ae2 in /usr/bin/python2.7)
frame #33: __libc_start_main + 0xf0 (0x7f559dba9830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x29 (0x4933e9 in /usr/bin/python2.7)

Process finished with exit code 1
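
For what it's worth, the RuntimeError comes from DataParallel's gather step: the batch is split across the two GPUs, each model replica unrolled a different number of decoding steps for its chunk (15 vs. 17 here), and the per-GPU outputs can no longer be concatenated back into one tensor. Below is a minimal CPU-only sketch of the same mismatch, using the sizes from the traceback; the tensors are placeholders, not the real model outputs.

import torch

# DataParallel gathers per-GPU outputs by concatenating them along dim 0,
# which requires every other dimension to match. Here one replica produced
# logits with 17 time steps and the other with 15, so the gather fails.
out_gpu0 = torch.zeros(25, 17, 9488)   # shape reported as "expected"
out_gpu1 = torch.zeros(25, 15, 9488)   # shape reported as "invalid"
try:
    torch.cat([out_gpu0, out_gpu1], dim=0)   # what the gather effectively does
except RuntimeError as e:
    print(e)   # sizes must match except in the concatenation dimension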

@KaiKangSDU

I got the same error. It did not occur the first time I ran the program, but it appeared after I ran it with PG. Have you dealt with this error?

@fearless77
Author

I don't use multi-GPU anymore; I changed the code to train on only one GPU, and training now succeeds.
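
For anyone hitting the same thing, here is a minimal sketch of the two single-GPU workarounds already hinted at by the DataParallel warning in the log above. The exact way train.py builds and wraps the model is assumed, so treat the names below as placeholders.

import os
# Option 1: expose only one GPU before CUDA is initialised (can also be done
# from the shell: CUDA_VISIBLE_DEVICES=0 python train.py ...)
os.environ.setdefault('CUDA_VISIBLE_DEVICES', '0')

import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # stand-in for the captioning model from train.py
if torch.cuda.is_available():
    model = model.cuda()

# Option 2: keep the DataParallel wrapper but pin it to a single device, so
# there is no cross-GPU gather at all (train.py is assumed to wrap the model
# roughly as `dp_model = torch.nn.DataParallel(model)`).
dp_model = nn.DataParallel(model, device_ids=[0]) if torch.cuda.is_available() else model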
