-
Hello all! I'm having a bit of trouble running SLEAP on my university cluster and I think I need some help to get pointed in the right direction. First off, everything works great on my local desktop with a GPU installed. Then I copied the SLP file and its associated models to the cluster. That's where the problem begins. I've tried this several different ways. It's probably most clear if I show you what happens in an interactive session.
So far so good. Now I load my model, which I had trained on a different desktop and a different GPU, and copied here.
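It's roughly this (the path is a placeholder for my real model folder):

```python
import sleap

# Load the trained model copied over from the desktop.
# (Path is a stand-in for my real model folder; load_model also
# accepts a list of model folders.)
predictor = sleap.load_model("models/my_trained_model")
```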
Now I load my video and try to predict, and that's when the error occurs.
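Again, roughly:

```python
# Load the video and run inference -- the predict() call is where
# the error below gets raised.
video = sleap.load_video("my_video.mp4")
predictions = predictor.predict(video)
```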
I have left out a lot of the intervening error trace. I think the problem is most likely CUDNN_STATUS_NOT_INITIALIZED. When I google this, people suggest this is due to running out of GPU memory. But on this cluster, no one else is using the GPU, and all of its memory is available. People suggest enabling or disabling memory growth, and setting per_process_gpu_memory_fraction. Here are things I tried that made no difference:
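For example, the standard TF2 memory-growth toggle, which has to run before the GPU is initialized (I tried it with both True and False):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all upfront.
# Must be called before any model is loaded or ops are run.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```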
I haven't tried setting per_process_gpu_memory_fraction because I can only find examples that do it by specifying the tf.Session, which I don't know how to do in sleap.
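(If there's a session-free way to do it in TF2, I'd guess it's the logical-device memory limit below, but I haven't verified that it plays nicely with sleap; the 4096 MB value is just a made-up example.)

```python
import tensorflow as tf

# Guess at the TF2 replacement for per_process_gpu_memory_fraction:
# cap this process at a fixed amount of GPU memory, no tf.Session needed.
# Must run before the GPU is initialized (i.e., before loading any model).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )
```

Here's the info about my cluster: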
Here are some things I think might be issues:
Thanks for any tips for things to try!!
-
Hi @cxrodgers,

Thanks for providing the extensive diagnostic data! From the error messages, it actually looks like an issue with CUDA/TensorFlow x GPU drivers. Specifically:
The culprit being this bit:

I presume you're installing from source using

This should work with your driver version (510.108.03), but seems to be causing issues. If you can update to 525.60.13 or newer, that would eliminate that as a possibility, but I suspect there may be something else funky going on (see the P.S. below for a quick way to cross-check versions from Python).

Addressing some of the questions:
Yes! There should be no issues with portability.
Possibly, but I think we'd get a very different error message.
Possibly -- can you share the full output of

Let us know when you get a chance and we'll go from there!

Cheers,
Talmo
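P.S. For a quick cross-check in the meantime, something like this should show what your TensorFlow build expects versus what the driver provides (a rough sketch; the build-info keys can vary between builds):

```python
import tensorflow as tf

# Versions TensorFlow itself was built against...
info = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)
print("CUDA (TF build):", info.get("cuda_version"))
print("cuDNN (TF build):", info.get("cudnn_version"))
# ...and whether TF can see the GPU at all under the current driver.
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```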
-
Wow, it worked! Thank you!
-
Hi @cxrodgers,
I figured... I imagine they'll want to bring that up to >525 at some point soon since it's required for newer CUDA versions. Probably not the culprit here though.