Cuda kernel crash #39
Comments
I have not encountered this crash, but error code 14 often means memory was released or copied before the preceding kernel finished. It might be due to an early return from cudaDeviceSynchronize, which would mean the kernel is still executing when the memory operation occurs. I also ran into this error code in cuda-convnet in conserve-memory mode and have not yet figured out the problem. My guess is that the kernel takes longer than the time limit allowed by the watchdog, so cudaDeviceSynchronize terminates early; in my case, cudaDeviceSynchronize returns 4. Devices like the K20 may have a longer timeout, so the problem may go unnoticed there, but I'm probably wrong. |
It would be great if you could provide a backtrace (e.g. from gdb).
Yangqing
|
Hi Lin, Yangqing,
It seems that it is not related to cudaDeviceSynchronize, since it still
Here is the backtrace from the log Caffe printed:
And also the backtrace from gdb
Hope this helps. Thanks,
Minjie Wang |
@Lin-Min, CCCP (Cascaded Cross Channel Parametric) Pooling is cool! There used to be another CCCP (Союз Советских Социалистических Республик, the Union of Soviet Socialist Republics) which was cold. :) |
We've found the problem. It is due to the configuration of our GPU card. After we adjusted the fan speed, it runs without crashing. |
@sergeyk the install docs could perhaps mention CUDA failures and advise checking hardware settings, since in every case so far it has been a hardware issue and not a Caffe problem. Then this issue could be closed. |
@jermainewang, I encountered the same problem. Can you let me know the method you used to adjust the fan speed? I tried some methods, but they don't seem to work. Thanks |
Hi Cheng,
In fact, we still encounter the problem, but not that frequently. I guess
Best regards,
Minjie Wang |
ECC doesn't matter. We have run all our Caffe jobs with ECC disabled on
Look into other system issues.
|
Hi Yangqing,
We ran into a problem when running Caffe. We ran it following the steps at http://caffe.berkeleyvision.org/imagenet.html for ImageNet training. Our training set is 1000 images selected from the whole ImageNet dataset, and the testing set is the same.
It crashes after 600 iterations with the error message:
“F0118 21:19:41.088841 1589 padding_layer.cu:131] Cuda kernel failed. Error: unspecified launch failure”
We ran it again and it still crashed, this time after 740 iterations, with the error message:
“F0118 20:57:41.093628 27945 math_functions.cpp:45] Check failed: (cublasSgemm_v2(Caffe::cublas_handle(), cuTransB, cuTransA, N, M, K, &alpha, B, ldb, A, lda, &beta, C, N)) == CUBLAS_STATUS_SUCCESS (14 vs. 0)”
Our testbed is Ubuntu 12.04 LTS with a GTX Titan and CUDA 5.5.
Do you have any idea what causes this crash? Thank you very much.
Best regards,
Minjie
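For readers decoding the "(14 vs. 0)" in the cublasSgemm_v2 check above: the failed check compares a cuBLAS status value against CUBLAS_STATUS_SUCCESS. The following is a minimal sketch of a lookup helper, not part of Caffe. The function name CublasStatusName is hypothetical, and the numeric values are taken from the cublasStatus_t enum as commonly listed for cuBLAS v2; they should be checked against the cublas_api.h shipped with your CUDA toolkit.

```cpp
#include <string>

// Hypothetical helper (not a Caffe or cuBLAS function): maps the integer
// printed in a failed cuBLAS check to the name of the status it represents.
// Values are assumed from the cuBLAS v2 cublasStatus_t enum; verify them
// against the cublas_api.h header of the CUDA version in use.
std::string CublasStatusName(int status) {
  switch (status) {
    case 0:  return "CUBLAS_STATUS_SUCCESS";
    case 1:  return "CUBLAS_STATUS_NOT_INITIALIZED";
    case 3:  return "CUBLAS_STATUS_ALLOC_FAILED";
    case 7:  return "CUBLAS_STATUS_INVALID_VALUE";
    case 8:  return "CUBLAS_STATUS_ARCH_MISMATCH";
    case 11: return "CUBLAS_STATUS_MAPPING_ERROR";
    case 13: return "CUBLAS_STATUS_EXECUTION_FAILED";
    case 14: return "CUBLAS_STATUS_INTERNAL_ERROR";
    default: return "unknown cuBLAS status";
  }
}
```

Under these assumed values, CublasStatusName(14) yields CUBLAS_STATUS_INTERNAL_ERROR, a generic failure inside the library. That does not point to a specific Caffe call site, which fits the later finding in this thread that the cause was on the hardware side.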