Cuda 8: invalid device ordinal #568

Open
lhk opened this issue May 6, 2017 · 4 comments

Comments

@lhk

lhk commented May 6, 2017

I've downloaded the repository and installed Caffe according to the instructions.

But during training, the program crashes with the following error message:

F0506 09:40:54.267431 26334 parallel.cpp:130] Check failed: error == cudaSuccess (10 vs. 0)  invalid device ordinal
*** Check failure stack trace: ***
    @     0x7faf2a0b55cd  google::LogMessage::Fail()
    @     0x7faf2a0b7433  google::LogMessage::SendToLog()
    @     0x7faf2a0b515b  google::LogMessage::Flush()
    @     0x7faf2a0b7e1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7faf2a8f6cd9  caffe::DevicePair::compute()
    @     0x7faf2a8fc910  caffe::P2PSync<>::Prepare()
    @     0x7faf2a8fd41e  caffe::P2PSync<>::Run()
    @           0x40c341  train()
    @           0x4088f8  main
    @     0x7faf287ea830  __libc_start_main
    @           0x4091c9  _start
    @              (nil)  (unknown)
Aborted (core dumped)

This seems to be a rather long-standing issue with Caffe: BVLC#138

Apparently, it is caused by an incompatible CUDA version.
I'm running CUDA 8.

Just to be sure, I wanted to ask about the problem here, too.
Could it be related to this fork of Caffe?
Which CUDA version do you train on?

Thanks,
Lars

@weiliu89
Owner

weiliu89 commented May 9, 2017

You probably don't have 4 GPUs. You need to adjust the GPU IDs in the script.
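A minimal sketch of how one might check this before editing the script. It assumes `nvidia-smi` is on the PATH and that the script takes a comma-separated GPU id string such as "0,1,2,3"; the helper names `count_gpus` and `clamp_gpu_ids` are hypothetical and not part of Caffe or the SSD scripts.

```python
import subprocess


def count_gpus():
    """Count the GPUs reported by `nvidia-smi -L` (0 if the tool is unavailable)."""
    try:
        out = subprocess.check_output(["nvidia-smi", "-L"], universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return 0
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))


def clamp_gpu_ids(requested, available_count):
    """Keep only the requested GPU ids that exist on this machine (hypothetical helper)."""
    ids = [int(tok) for tok in requested.split(",") if tok.strip()]
    valid = [i for i in ids if 0 <= i < available_count]
    return ",".join(str(i) for i in valid)


if __name__ == "__main__":
    # "0,1,2,3" is used here only as an example of a four-GPU setting.
    print(clamp_gpu_ids("0,1,2,3", count_gpus()))
```

Requesting a device index that does not exist on the machine is the kind of situation that the "invalid device ordinal" error reports, so the resulting string should only list ids below the detected count.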

@jtara1

jtara1 commented Jun 25, 2017

@weiliu89

Sorry, but it's not clear to me what I need to edit. I'm running python examples/ssd/ssd_pascal.py and having the same issue.

I changed this line from gpus = "0,1,2,3" to gpus = "0", but now I'm getting a similar crash with a different message (out of memory, I think).


Update

I made the same change to
https://github.com/weiliu89/caffe/blob/ssd/examples/ssd/ssd_pascal_orig.py and I've run it. So far it seems to be running smoothly.

Edit: it threw a different error after running for 20 minutes:

F0624 19:48:34.887409 26803 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
    @     0x7f703a4665cd  google::LogMessage::Fail()
    @     0x7f703a468433  google::LogMessage::SendToLog()
    @     0x7f703a46615b  google::LogMessage::Flush()
    @     0x7f703a468e1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f703ac065a0  caffe::SyncedMemory::to_gpu()
    @     0x7f703ac05569  caffe::SyncedMemory::mutable_gpu_data()
    @     0x7f703aa26553  caffe::Blob<>::mutable_gpu_diff()
    @     0x7f703accee6f  caffe::CuDNNConvolutionLayer<>::Backward_gpu()
    @     0x7f703ac3055b  caffe::Net<>::BackwardFromTo()
    @     0x7f703ac305bf  caffe::Net<>::Backward()
    @     0x7f703abfe3fc  caffe::Solver<>::Step()
    @     0x7f703abfee9e  caffe::Solver<>::Solve()
    @           0x40cece  train()
    @           0x4088c8  main
    @     0x7f7038b9a830  __libc_start_main
    @           0x409199  _start
    @              (nil)  (unknown)
Aborted

@griffintin

@jtara1
I get exactly the same error as you after changing the line to gpus = "0".
Have you resolved the problem? Any advice?
Thank you

@griffintin

griffintin commented Aug 30, 2017

Finally, training from scratch runs now.
My GPU has only 2 GB of memory, which is why "error == cudaSuccess (2 vs. 0) out of memory" happened. The only thing I could do was change batch_size to 1 and keep iter_size a little larger. Accordingly, parameters such as base_lr and the step values were also adjusted...
Hopefully a good detection_eval can still be achieved.
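
The trade-off between batch_size and iter_size can be sketched roughly as follows. In Caffe, the solver's iter_size accumulates gradients over that many forward/backward passes before each weight update, so the effective batch is batch_size * iter_size. The function name `accumulation_plan` and the target of 32 are illustrative assumptions, not values taken from this thread.

```python
import math


def accumulation_plan(target_batch, per_step_batch):
    """Return (iter_size, effective_batch) for a desired effective batch size.

    target_batch   -- the effective batch size one would like to train with (assumed value)
    per_step_batch -- the batch_size that actually fits in GPU memory
    """
    # Round up so the effective batch is never smaller than the target.
    iter_size = int(math.ceil(float(target_batch) / per_step_batch))
    effective_batch = per_step_batch * iter_size
    return iter_size, effective_batch


if __name__ == "__main__":
    # With batch_size = 1 (as on the 2 GB card above) and an assumed target of 32:
    print(accumulation_plan(32, 1))
```

Accumulation keeps the gradient statistics close to those of the larger batch at the cost of more passes per update. It does not reduce the memory needed for a single image, so a batch_size of 1 is the lower limit of this approach.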
