Resume training fails #786
Comments
This might be a CUDA error rather than a Caffe one; check whether all the Caffe tests pass. It could also be an out-of-memory issue if your model/batch is too big.
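For what it's worth, a quick way to rule out a broken CUDA setup (independent of Caffe's own `make runtest` suite) is a standalone host/device round-trip copy. This is just a minimal sketch, with an arbitrary buffer size, not anything from Caffe:

```cpp
// Quick standalone CUDA round-trip check: if this fails, the problem is in
// the CUDA driver/runtime setup rather than in Caffe itself.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
  const size_t n = 1 << 20;                   // 1M floats, ~4 MB
  const size_t bytes = n * sizeof(float);
  std::vector<float> host(n, 1.0f);
  float* device = nullptr;

  cudaError_t err = cudaMalloc(&device, bytes);
  if (err == cudaSuccess)
    err = cudaMemcpy(device, host.data(), bytes, cudaMemcpyHostToDevice);
  if (err == cudaSuccess)
    err = cudaMemcpy(host.data(), device, bytes, cudaMemcpyDeviceToHost);
  cudaFree(device);

  std::printf("CUDA round-trip: %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}
```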
I think it is very likely out of memory, as my model is quite big. Maybe the extra overhead pushes it over the limit, since training without resuming works fine (GPU memory is around 12 GB; training without resuming uses ~11 GB).
I tried this with a smaller model and still hit the same issue, so I don't think it is a memory problem; it looks like a problem with CUDA or Caffe.
Resuming on CPU works fine but GPU does not. Maybe a CUDA problem? The error happens at inline void SyncedMemory::to_gpu() {. Thanks a lot!
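For context, the function named in the check is Caffe's lazy host-to-device sync. The sketch below is only a paraphrase of what that path does, not the verbatim syncedmem code, and member names other than cpu_ptr_ and size_ (which appear later in this thread) are assumptions. The point is that every CUDA call there is error-checked, so an "unspecified launch failure" left behind by an earlier kernel can be reported from to_gpu() even when that function is not the real cause:

```cpp
// Paraphrased sketch of the host-to-device sync path the stack trace points
// at (not the actual Caffe source). Any pending CUDA error surfaces at the
// first checked call, which is why the fatal check names to_gpu().
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                        \
  do {                                                          \
    cudaError_t error = (call);                                 \
    if (error != cudaSuccess) {                                 \
      std::fprintf(stderr, "Check failed: %s\n",                \
                   cudaGetErrorString(error));                  \
      std::abort();                                             \
    }                                                           \
  } while (0)

class SyncedMemorySketch {
 public:
  explicit SyncedMemorySketch(size_t size)
      : cpu_ptr_(std::malloc(size)), gpu_ptr_(nullptr), size_(size) {}

  ~SyncedMemorySketch() {
    std::free(cpu_ptr_);
    if (gpu_ptr_ != nullptr) cudaFree(gpu_ptr_);
  }

  // Roughly what the HEAD_AT_CPU path does: lazily allocate the device
  // buffer, then copy the host data into it, checking every CUDA call.
  void to_gpu() {
    if (gpu_ptr_ == nullptr) {
      CUDA_CHECK(cudaMalloc(&gpu_ptr_, size_));
    }
    CUDA_CHECK(cudaMemcpy(gpu_ptr_, cpu_ptr_, size_, cudaMemcpyHostToDevice));
  }

 private:
  void* cpu_ptr_;
  void* gpu_ptr_;
  size_t size_;
};
```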
Thank you all for your help, I solved my problem! Cheers,
Hi @dutran, I'm hitting the same problem as you. Could you share the solution? Thanks!
@chocolate9624: I was under-allocating memory on the CPU side. I guess the cudaMemcpy check detects that cpu_ptr is smaller than size_.
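To illustrate the mistake described here (hypothetical buffers, not Caffe code): if the host side is allocated with fewer bytes than the size_ the copy expects, the host-to-device transfer reads past the end of the host buffer, which is undefined behavior and can surface as a CUDA error at the copy.

```cpp
// Hypothetical buffers illustrating the under-allocation described above
// (not Caffe code). size_ is the number of bytes the device copy will move.
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  const size_t size_ = 1024 * sizeof(float);  // bytes the device copy expects

  // Bug pattern: allocating fewer host bytes than size_, e.g.
  //   void* cpu_ptr = std::calloc(1, size_ / 2);
  // A cudaMemcpy of size_ bytes from that pointer reads out of bounds
  // (undefined behavior) and can show up as a CUDA error.

  // Correct: the host allocation covers the full size_ that will be copied.
  void* cpu_ptr = std::calloc(1, size_);

  void* gpu_ptr = nullptr;
  if (cudaMalloc(&gpu_ptr, size_) != cudaSuccess) return 1;
  cudaError_t err = cudaMemcpy(gpu_ptr, cpu_ptr, size_, cudaMemcpyHostToDevice);

  cudaFree(gpu_ptr);
  std::free(cpu_ptr);
  return err == cudaSuccess ? 0 : 1;
}
```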
@dutran Do you mean your CPU memory was not enough to run Caffe in GPU mode, even though CPU mode was OK? Thanks!
I found the problem: it was an issue with my data. Thanks!
My problem was running out of memory, thank you! @Yangqing
Same error
I'm using the AlexNet model with 256x256 images on a GTX 1070 with 8 GB of memory and 8 GB of host memory; during training, memory usage stayed below 4 GB, so I don't think this is a memory issue. I'm using NVIDIA branch 0.15.
I have the exact same issue; if anyone has an idea, please share. I think this is a CUDA problem, not Caffe.
One more crash on a fresh master branch:
BTW: I have successfully run the AlexNet model with batch sizes 256 and 128, but with batch size 64 it crashed somewhere in the middle of training.
@chocolate9624 what was your problem?
Hi all,
I was trying to resume training (from the 25k snapshot) and got the message below; does anyone have ideas or hints that could help me out?
Many thanks,
Du
I0725 01:06:17.916695 10039 solver.cpp:66] Restoring previous solver status from convnet_iter_25000.solverstate
I0725 01:06:18.531533 10039 solver.cpp:312] SGDSolver: restoring history
I0725 01:06:18.621152 10039 solver.cpp:106] Iteration 25000, Testing net
I0725 01:08:52.277266 10039 solver.cpp:147] Test score #0: 0.3901
I0725 01:08:52.277325 10039 solver.cpp:147] Test score #1: 3.1283
F0725 01:08:55.576004 10039 syncedmem.cpp:55] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
*** Check failure stack trace: ***
@ 0x7f1c4da37b4d google::LogMessage::Fail()
@ 0x7f1c4da3bb67 google::LogMessage::SendToLog()
@ 0x7f1c4da399e9 google::LogMessage::Flush()
@ 0x7f1c4da39ced google::LogMessageFatal::~LogMessageFatal()
@ 0x4709f3 caffe::SyncedMemory::to_gpu()
@ 0x470579 caffe::SyncedMemory::mutable_gpu_data()
@ 0x45aadd caffe::Blob<>::mutable_gpu_data()
@ 0x4465dc caffe::SGDSolver<>::ComputeUpdateValue()
@ 0x44776e caffe::Solver<>::Solve()
@ 0x41af86 main
@ 0x7f1c4ad09cdd __libc_start_main
@ 0x41abe9 (unknown)
Aborted
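For reference, the resume shown in the log above is roughly what the Solver API does when handed a .solverstate file: it restores the iteration count and the SGD history blobs ("restoring history" in the log) and then continues training, which is when the GPU is touched again. A minimal sketch, assuming the C++ SGDSolver interface of that era; the solver.prototxt path is a placeholder and exact helper names may differ across Caffe versions:

```cpp
// Minimal sketch of resuming training from a solverstate via the C++ API.
#include <caffe/caffe.hpp>

int main() {
  caffe::Caffe::set_mode(caffe::Caffe::GPU);

  // Load the solver definition and build the SGD solver.
  caffe::SolverParameter solver_param;
  caffe::ReadProtoFromTextFileOrDie("solver.prototxt", &solver_param);
  caffe::SGDSolver<float> solver(solver_param);

  // Passing the .solverstate restores the iteration count and SGD history
  // before training continues from iteration 25000.
  solver.Solve("convnet_iter_25000.solverstate");
  return 0;
}
```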