We should add something in the docs about adding the CUDA paths manually to get around the annoying TensorFlow-not-finding-CUDA issue:
conda activate sleap
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap
# ... run things as usual
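It might also be worth showing how to confirm the variable took effect after re-activating. A minimal sketch (the Python one-liner assumes TensorFlow is importable in the sleap env and is run somewhere a GPU is actually visible, e.g. inside a GPU allocation):

conda env config vars list -n sleap   # should show the LD_PRELOAD entry
conda activate sleap
echo "$LD_PRELOAD"                    # should print the colon-separated library paths
python -c "import tensorflow as tf; print(tf.__version__, tf.sysconfig.get_build_info().get('cuda_version')); print(tf.config.list_physical_devices('GPU'))"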
Original discussion:

Hi @cxrodgers,

> I'm unable to update NVIDIA drivers because I don't have admin privileges, and in any case this is a shared cluster so they won't want to break other people's stuff.
I figured... I imagine they'll want to bring that up to >525 at some point soon, since it's required for newer CUDA versions. Probably not the culprit here, though.
> Previous clusters I've used had some kind of "head node" where you could submit jobs but couldn't see any GPUs. This one is set up differently: I can see all 4 GPUs as soon as I log in, but they told me never to use a GPU outside of a SLURM job or I'd get kicked off the cluster. So when I submit a SLURM job, sleap.system_setup can only see the GPU that I reserve, and that's when the problem occurs.

Yeah, weird that they have GPUs on the head node at all! Maybe they're just reusing a compute node?

In any case, assuming the compute nodes where the SLURM jobs actually run are also using A6000s, I don't think this would be the issue.

This brings us back to the CUDA/cuDNN version. Because we install it via conda, the appropriate versions should be used, but if you're on TensorFlow 2.7.0, you might be running into a known issue where newer versions of TensorFlow fail to detect the conda CUDA/cuDNN libraries. Normally you'd get errors about failing to load the CUDA/cuDNN binaries, but because you have a system-level CUDA/cuDNN installed, TensorFlow might be falling back to that and generating the weird driver version errors we're seeing.

Try adding these lines to your SLURM script as a workaround:
conda activate sleap
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap
# ... run things as usual
Give that a go and let us know if it works!
Talmo
Originally posted by @talmo in #1380 (reply in thread)
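For the docs, a complete minimal SLURM batch script applying the workaround could look something like the sketch below. The #SBATCH resources, the conda.sh path, and the final sleap-train command and file names are placeholders that will differ per cluster:

#!/bin/bash
#SBATCH --job-name=sleap-train
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=08:00:00

# Make conda available in the non-interactive batch shell (adjust the path to your install)
source ~/miniconda3/etc/profile.d/conda.sh

conda activate sleap

# Register the LD_PRELOAD workaround on the env, then re-activate so it gets exported
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap

# Placeholder training command; substitute whatever you normally run
sleap-train baseline.centroid.json labels.v001.slp

Since conda env config vars set persists the variable in the environment's own config, it only needs to be run once; after that, jobs just need the conda activate line.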