-
Hello all! I'm having a bit of trouble running SLEAP on my university cluster and I think I need some help to get pointed in the right direction. First off, everything works great on my local desktop with a GPU installed. Then I copied the SLP file and its associated models to the cluster. That's where the problem begins. I've tried this several different ways. It's probably most clear if I show you what happens in an interactive session.
So far so good. Now I load my model, which I had trained on a different desktop and a different GPU, and copied here.
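It's roughly this (the path is a placeholder for my real model folder):

```python
import sleap

# Load the trained model copied over from the desktop.
# (Path is a stand-in for my real model folder; load_model also
# accepts a list of model folders.)
predictor = sleap.load_model("models/my_trained_model")
```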
Now I load my video and try to predict, and that's when the error occurs.
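Again, roughly:

```python
# Load the video and run inference -- the predict() call is where
# the error below gets raised.
video = sleap.load_video("my_video.mp4")
predictions = predictor.predict(video)
```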
I have left out a lot of the intervening error trace. I think the problem is most likely CUDNN_STATUS_NOT_INITIALIZED. When I google this, people suggest this is due to running out of GPU memory. But on this cluster, no one else is using the GPU, and all of its memory is available. People suggest enabling or disabling memory growth, and setting per_process_gpu_memory_fraction. Here are things I tried that made no difference:
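For example, the standard TF2 memory-growth toggle, which has to run before the GPU is initialized (I tried it with both True and False):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all upfront.
# Must be called before any model is loaded or ops are run.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```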
I haven't tried setting per_process_gpu_memory_fraction because I can only find examples that do it by specifying the tf.Session, which I don't know how to do in sleap.
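(If there's a session-free way to do it in TF2, I'd guess it's the logical-device memory limit below, but I haven't verified that it plays nicely with sleap; the 4096 MB value is just a made-up example.)

```python
import tensorflow as tf

# Guess at the TF2 replacement for per_process_gpu_memory_fraction:
# cap this process at a fixed amount of GPU memory, no tf.Session needed.
# Must run before the GPU is initialized (i.e., before loading any model).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )
```

Here's the info about my cluster: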
Here are some things I think might be issues:
Thanks for any tips for things to try!!
-
Hi @cxrodgers,

Thanks for providing the extensive diagnostic data! From the error messages, it actually looks like an issue with CUDA/TensorFlow x GPU drivers. Specifically:
The culprit being this bit:

I presume you're installing from source using

This should work with your driver version (510.108.03), but seems to be causing issues. If you can update to 525.60.13 or newer, that would eliminate that as a possibility, but I suspect there may be something else funky going on (see the P.S. below for a quick way to cross-check versions from Python).

Addressing some of the questions:
Yes! There should be no issues with portability.
Possibly, but I think we'd get a very different error message.
Possibly -- can you share the full output of

Let us know when you get a chance and we'll go from there!

Cheers,
Talmo
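P.S. For a quick cross-check in the meantime, something like this should show what your TensorFlow build expects versus what the driver provides (a rough sketch; the build-info keys can vary between builds):

```python
import tensorflow as tf

# Versions TensorFlow itself was built against...
info = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)
print("CUDA (TF build):", info.get("cuda_version"))
print("cuDNN (TF build):", info.get("cudnn_version"))
# ...and whether TF can see the GPU at all under the current driver.
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```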
-
Wow, it worked! Thank you!
-
Hi @cxrodgers,
I figured... I imagine they'll want to bring that up to >525 at some point soon since it's required for newer CUDA versions. Probably not the culprit here though.