Resume training fails #786
Comments
This might be a CUDA error rather than a Caffe one; check whether all the Caffe tests pass. It could also be an out-of-memory issue if your model/batch is too big.
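For what it's worth, a quick way to rule out a broken CUDA setup (independent of Caffe's own `make runtest` suite) is a standalone host/device round-trip copy. This is just a minimal sketch, with an arbitrary buffer size, not anything from Caffe:

```cpp
// Quick standalone CUDA round-trip check: if this fails, the problem is in
// the CUDA driver/runtime setup rather than in Caffe itself.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
  const size_t n = 1 << 20;                   // 1M floats, ~4 MB
  const size_t bytes = n * sizeof(float);
  std::vector<float> host(n, 1.0f);
  float* device = nullptr;

  cudaError_t err = cudaMalloc(&device, bytes);
  if (err == cudaSuccess)
    err = cudaMemcpy(device, host.data(), bytes, cudaMemcpyHostToDevice);
  if (err == cudaSuccess)
    err = cudaMemcpy(host.data(), device, bytes, cudaMemcpyDeviceToHost);
  cudaFree(device);

  std::printf("CUDA round-trip: %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}
```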
I think it is very likely out of memory, as my model is quite big. Maybe the extra overhead pushes it over the limit, since training without resuming works fine (GPU memory is around 12 GB; training without resuming uses ~11 GB).
I tried this with a smaller model and still hit the same issue, so I don't think it is a memory problem; it looks like a problem with CUDA or Caffe.
Resuming on CPU works fine but GPU does not. Maybe a CUDA problem? The error happens at inline void SyncedMemory::to_gpu() {. Thanks a lot!
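For context, the function named in the check is Caffe's lazy host-to-device sync. The sketch below is only a paraphrase of what that path does, not the verbatim syncedmem code, and member names other than cpu_ptr_ and size_ (which appear later in this thread) are assumptions. The point is that every CUDA call there is error-checked, so an "unspecified launch failure" left behind by an earlier kernel can be reported from to_gpu() even when that function is not the real cause:

```cpp
// Paraphrased sketch of the host-to-device sync path the stack trace points
// at (not the actual Caffe source). Any pending CUDA error surfaces at the
// first checked call, which is why the fatal check names to_gpu().
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                        \
  do {                                                          \
    cudaError_t error = (call);                                 \
    if (error != cudaSuccess) {                                 \
      std::fprintf(stderr, "Check failed: %s\n",                \
                   cudaGetErrorString(error));                  \
      std::abort();                                             \
    }                                                           \
  } while (0)

class SyncedMemorySketch {
 public:
  explicit SyncedMemorySketch(size_t size)
      : cpu_ptr_(std::malloc(size)), gpu_ptr_(nullptr), size_(size) {}

  ~SyncedMemorySketch() {
    std::free(cpu_ptr_);
    if (gpu_ptr_ != nullptr) cudaFree(gpu_ptr_);
  }

  // Roughly what the HEAD_AT_CPU path does: lazily allocate the device
  // buffer, then copy the host data into it, checking every CUDA call.
  void to_gpu() {
    if (gpu_ptr_ == nullptr) {
      CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
    }
    CUDA_CHECK(cudaMemcpy(gpu_ptr_, cpu_ptr_, size_, cudaMemcpyHostToDevice));
  }

 private:
  void* cpu_ptr_;
  void* gpu_ptr_;
  size_t size_;
};
```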
Thank you all for your help, I solved my problem! Cheers,
Hi @dutran, I'm hitting the same problem as you. Could you share the solution? Thanks!
@chocolate9624: I was under-allocating memory on the CPU side. I guess the cudaMemcpy check detects that cpu_ptr is smaller than size_.
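To illustrate the mistake described here (hypothetical buffers, not Caffe code): if the host side is allocated with fewer bytes than the size_ the copy expects, the host-to-device transfer reads past the end of the host buffer, which is undefined behavior and can surface as a CUDA error at the copy.

```cpp
// Hypothetical buffers illustrating the under-allocation described above
// (not Caffe code). size_ is the number of bytes the device copy will move.
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  const size_t size_ = 1024 * sizeof(float);  // bytes the device copy expects

  // Bug pattern: allocating fewer host bytes than size_, e.g.
  //   void* cpu_ptr = std::calloc(1, size_ / 2);
  // A cudaMemcpy of size_ bytes from that pointer reads out of bounds
  // (undefined behavior) and can show up as a CUDA error.

  // Correct: the host allocation covers the full size_ that will be copied.
  void* cpu_ptr = std::calloc(1, size_);

  void* gpu_ptr = nullptr;
  if (cudaMalloc(&gpu_ptr, size_) != cudaSuccess) return 1;
  cudaError_t err = cudaMemcpy(gpu_ptr, cpu_ptr, size_, cudaMemcpyHostToDevice);

  cudaFree(gpu_ptr);
  std::free(cpu_ptr);
  return err == cudaSuccess ? 0 : 1;
}
```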
@dutran Do you mean your CPU memory was not enough to run Caffe in GPU mode, even though CPU mode was OK? Thanks!
I found the problem: it was an issue with my data. Thanks!
My problem was running out of memory, thank you! @Yangqing
Same error
I'm using the AlexNet model with 256x256 images on a GTX 1070 with 8 GB of memory and 8 GB of host memory; during training, memory usage stayed below 4 GB, so I don't think this is a memory issue. I'm using NVIDIA branch 0.15.
I have the exact same issue; if anyone has an idea, please share. I think this is a CUDA problem, not Caffe.
One more crash on a fresh master branch:
BTW: I have successfully run the AlexNet model with batch sizes 256 and 128, but with batch size 64 it crashed somewhere in the middle of training.
@chocolate9624 what was your problem?
Hi all,
I was trying to resume training (from the 25k snapshot) and got the message below; does anyone have ideas or hints that could help me out?
Many thanks,
Du
I0725 01:06:17.916695 10039 solver.cpp:66] Restoring previous solver status from convnet_iter_25000.solverstate
I0725 01:06:18.531533 10039 solver.cpp:312] SGDSolver: restoring history
I0725 01:06:18.621152 10039 solver.cpp:106] Iteration 25000, Testing net
I0725 01:08:52.277266 10039 solver.cpp:147] Test score #0: 0.3901
I0725 01:08:52.277325 10039 solver.cpp:147] Test score #1: 3.1283
F0725 01:08:55.576004 10039 syncedmem.cpp:55] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
*** Check failure stack trace: ***
@ 0x7f1c4da37b4d google::LogMessage::Fail()
@ 0x7f1c4da3bb67 google::LogMessage::SendToLog()
@ 0x7f1c4da399e9 google::LogMessage::Flush()
@ 0x7f1c4da39ced google::LogMessageFatal::~LogMessageFatal()
@ 0x4709f3 caffe::SyncedMemory::to_gpu()
@ 0x470579 caffe::SyncedMemory::mutable_gpu_data()
@ 0x45aadd caffe::Blob<>::mutable_gpu_data()
@ 0x4465dc caffe::SGDSolver<>::ComputeUpdateValue()
@ 0x44776e caffe::Solver<>::Solve()
@ 0x41af86 main
@ 0x7f1c4ad09cdd __libc_start_main
@ 0x41abe9 (unknown)
Aborted
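For reference, the resume shown in the log above is roughly what the Solver API does when handed a .solverstate file: it restores the iteration count and the SGD history blobs ("restoring history" in the log) and then continues training, which is when the GPU is touched again. A minimal sketch, assuming the C++ SGDSolver interface of that era; the solver.prototxt path is a placeholder and exact helper names may differ across Caffe versions:

```cpp
// Minimal sketch of resuming training from a solverstate via the C++ API.
#include <caffe/caffe.hpp>

int main() {
  caffe::Caffe::set_mode(caffe::Caffe::GPU);

  // Load the solver definition and build the SGD solver.
  caffe::SolverParameter solver_param;
  caffe::ReadProtoFromTextFileOrDie("solver.prototxt", &solver_param);
  caffe::SGDSolver<float> solver(solver_param);

  // Passing the .solverstate restores the iteration count and SGD history
  // before training continues from iteration 25000.
  solver.Solve("convnet_iter_25000.solverstate");
  return 0;
}
```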