Cuda kernel crash #39

jermainewang · 2014-01-19T15:00:16Z

Hi Yangqing,

We ran into a problem when running Caffe. We followed the steps at http://caffe.berkeleyvision.org/imagenet.html for ImageNet training. Our training set is 1000 images selected from the full ImageNet dataset, and the test set is the same.
It crashed after 600 iterations with the error message:
“F0118 21:19:41.088841 1589 padding_layer.cu:131] Cuda kernel failed. Error: unspecified launch failure”
We ran it again and it crashed after 740 iterations, this time with the error message:
“F0118 20:57:41.093628 27945 math_functions.cpp:45] Check failed: (cublasSgemm_v2(Caffe::cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha, B, ldb, A, lda, &beta, C, N)) == CUBLAS_STATUS_SUCCESS (14 vs. 0)”

Our testbed is Ubuntu 12.04 LTS with a GTX Titan and CUDA 5.5.

Do you have any idea what causes this crash? Thank you very much.

Best regards,
Minjie

mavenlin · 2014-01-19T15:22:46Z

I have not encountered this crash, but error code 14 often means that memory was released or copied before the preceding kernel finished. It might be due to cudaDeviceSynchronize timing out early, which would mean the kernel is still executing when the memory operation occurs. I also ran into this error code in cuda-convnet in conserve-memory mode and have not yet figured out the problem. My guess is that the kernel may take longer than the time limit allowed by the watchdog, which terminates cudaDeviceSynchronize early. In my case, cudaDeviceSynchronize returns 4. Devices like the K20 may have a longer timeout, so the problem may not be noticed there, but I'm probably wrong.
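
As an aid for reading the "(14 vs. 0)" in the cuBLAS check failure quoted earlier in this thread, here is a minimal Python sketch that maps the number back to its cublasStatus_t name. The table follows the enum in cuBLAS's cublas_api.h as I understand it (some of the higher-numbered entries were added in toolkits later than CUDA 5.5). The helper name decode_cublas_check and the regular expression are assumptions made for illustration and are not part of Caffe or cuBLAS.

```python
import re

# cublasStatus_t values as declared in cublas_api.h.
# Entries 15 and 16 come from toolkits newer than CUDA 5.5.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def decode_cublas_check(log_line):
    """Hypothetical helper: name the status in a glog 'Check failed' line.

    Returns None when the line is not a failed comparison against
    CUBLAS_STATUS_SUCCESS.
    """
    match = re.search(r"==\s*CUBLAS_STATUS_SUCCESS\s*\((\d+) vs\. 0\)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return CUBLAS_STATUS.get(code, "unknown cuBLAS status %d" % code)
```

Applied to the second log line in the original report, this returns CUBLAS_STATUS_INTERNAL_ERROR, which names the failure but does not by itself identify the cause.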

Yangqing · 2014-01-19T18:17:25Z

It would be great if you could provide a backtrace (e.g. from gdb).
My feeling is that the GPU might be set to some mode that causes race
conditions...? I'm not sure about the exact cause, as we have also run the
program on Titans and it seemed to work fine.

Yangqing


jermainewang · 2014-01-20T06:11:05Z

Hi Lin, Yangqing,

It seems that it is not related to cudaDeviceSynchronize, since it still
crashes without this statement.

Here is the backtrace from the log Caffe printed:

*** Check failure stack trace: ***
@ 0x7fd5b3fb8b7d google::LogMessage::Fail()
@ 0x7fd5b3fbac7f google::LogMessage::SendToLog()
@ 0x7fd5b3fb876c google::LogMessage::Flush()
@ 0x7fd5b3fbb51d google::LogMessageFatal::~LogMessageFatal()
@ 0x4622b9 caffe::PaddingLayer<>::Backward_gpu()
@ 0x42cfab caffe::Net<>::Backward()
@ 0x424858 caffe::Solver<>::Solve()
@ 0x40e745 main
@ 0x7fd5b269976d (unknown)
@ 0x40fa2d (unknown)

And also the backtrace from gdb:

(gdb) bt
#0 0x00007f7108833425 in __GI_raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f7108836b8b in __GI_abort () at abort.c:91
#2 0x00007f710a147039 in google::DumpStackTraceAndExit () at src/utilities.cc:147
#3 0x00007f710a13db7d in google::LogMessage::Fail () at src/logging.cc:1458
#4 0x00007f710a13fc7f in google::LogMessage::SendToLog (this=0x7fff9907ed60) at src/logging.cc:1412
#5 0x00007f710a13d76c in google::LogMessage::Flush (this=0x7fff9907ed60) at src/logging.cc:1281
#6 0x00007f710a14051d in google::LogMessageFatal::~LogMessageFatal (this=0x7fff9907ed60, __in_chrg=) at src/logging.cc:1984
#7 0x0000000000436512 in caffe::caffe_gpu_gemm (TransA=CblasTrans, TransB=CblasNoTrans, M=1728, N=169, K=128, alpha=1, A=0x235b9f8000, B=0x237b36ba00, beta=0, C=0x238289d300) at src/caffe/util/math_functions.cpp:44
#8 0x0000000000461069 in caffe::ConvolutionLayer::Backward_gpu (this=0x5b63dc0, top=..., propagate_down=true, bottom=0x2332778) at src/caffe/layers/conv_layer.cpp:245
#9 0x000000000043fc7f in caffe::Layer::Backward (this=0x5b63dc0, top=..., propagate_down=true, bottom=0x2332778) at ./include/caffe/layer.hpp:114
#10 0x000000000043a5ff in caffe::Net::Backward (this=0x2332420) at src/caffe/net.cpp:232
#11 0x0000000000431cc3 in caffe::Net::ForwardBackward (this=0x2332420, bottom=...) at ./include/caffe/net.hpp:52
#12 0x000000000042e21c in caffe::Solver::Solve (this=0x7fff9907f080, resume_file=0x0) at src/caffe/solver.cpp:58
#13 0x000000000040e6f4 in main (argc=2, argv=0x7fff9907f2e8) at examples/train_net.cpp:32

Hope this could help.

Thanks,
Minjie


Minjie Wang
Shanghai Jiao Tong University | Computer Science & Technology
Embedded & Pervasive Computing Center (EPCC)
No.800 Dongchuan Road, Shanghai, China
Homepage: https://sites.google.com/site/minjiewanghomepage/home

kloudkl · 2014-01-23T17:04:27Z

@Lin-Min, CCCP (Cascaded Cross Channel Parametric) Pooling is cool! There used to be another CCCP (Союз Советских Социалистических Республик) which was cold. :)

jermainewang · 2014-02-21T12:12:58Z

We've found the problem. It was due to the configuration of our GPU card. After we adjusted the fan speed, it ran without crashing.

shelhamer · 2014-02-25T01:21:40Z

@sergeyk the install docs could perhaps mention CUDA failures and advise checking hardware settings, since in every case so far it has been a hardware issue and not a Caffe problem. Then this issue could be closed.
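
One way to act on the advice about checking hardware settings is to log temperature, fan speed and SM clock while training runs. The following is a rough Python sketch, assuming an nvidia-smi recent enough to support --query-gpu with CSV output (drivers from early 2014 may not). The function names and the 85 °C threshold are illustrative assumptions, not values taken from this thread or from NVIDIA documentation.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "temperature.gpu", "fan.speed", "clocks.sm"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into a list of dicts.

    Values that are not integers (for example '[N/A]' on cards without
    a readable fan sensor) are stored as None.
    """
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the expected layout
        row = {}
        for name, value in zip(QUERY_FIELDS, values):
            try:
                row[name] = int(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


def read_gpu_health():
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        universal_newlines=True)
    return parse_gpu_csv(out)


def overheated(rows, limit_c=85):
    """Return indices of GPUs at or above limit_c (an assumed threshold)."""
    return [r["index"] for r in rows
            if r["temperature.gpu"] is not None and r["temperature.gpu"] >= limit_c]
```

Calling read_gpu_health() every few seconds alongside training, and noting the readings just before a crash, would show whether failures line up with high temperatures.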

Cheng-Wang · 2014-07-23T09:27:20Z

@jermainewang, I encountered the same problem. Can you let me know the method you used to adjust the fan speed? I tried some methods, but they don't seem to work.

Thanks,
Cheng

jermainewang · 2014-07-23T15:54:22Z

Hi Cheng,

In fact, we still encounter the problem, but not as frequently. I guess
this is because the GTX Titan does not have ECC, which harms stability
for long-running jobs, especially compute-intensive programs like
Caffe.

Best regards,
Minjie


shelhamer · 2014-07-24T01:37:21Z

ECC doesn't matter. We have run all our Caffe jobs with ECC disabled on
cards that support it, and on cards that don't have it, like Titans and
GTX 770s, and never had these issues.

Look into other system issues.



shelhamer mentioned this issue Jan 26, 2014

Crash after the iteration 1620. Check failed,cublasSgemm #58

Closed

shelhamer added the downstream problem? label Feb 25, 2014

shelhamer closed this as completed Mar 20, 2014

edwardhsiao mentioned this issue May 23, 2014

Caffe crashes with multiple GPUs in machine #441

Closed

kloudkl mentioned this issue Jun 17, 2014

CCCP pooling layer #498

Closed

Cheng-Wang mentioned this issue Jul 23, 2014

cuda crashing in training imagenet #770

Closed

roseperrone mentioned this issue Oct 6, 2014

boost --with-python required on osx for pycaffe target #465 #1193

Closed

lukeyeager added a commit to lukeyeager/caffe that referenced this issue Oct 21, 2015

Merge pull request BVLC#39 from lukeyeager/nvidia/versioning

ec192dd

Add versioning for v0.14

chensiqin mentioned this issue Nov 28, 2015

Output accuracies per class. #2935

Merged

anandthakker pushed a commit to anandthakker/caffe that referenced this issue Jul 19, 2016

Merge pull request BVLC#39 from arassadin/fix_tr#38

ccd56fa

BVLC#38: pre-class accuracy fix

Zacrain mentioned this issue Oct 9, 2018

Error in make mattest #6270

Open
