Add instructions for CUDA issues with new TensorFlow versions #1390

Open
talmo opened this issue Jul 17, 2023 · 1 comment
Labels
documentation This issue exists because the documentation is outdated.

Comments

talmo (Collaborator) commented Jul 17, 2023

We should add something in the docs about adding the CUDA paths manually to get around the annoying TensorFlow-not-finding-CUDA issue:

conda activate sleap
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap

# ... run things as usual
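
A quick sanity check after re-activating (a sketch, assuming the sleap env is active): TensorFlow should now list at least one GPU rather than silently falling back to CPU.

# Should print a non-empty list of PhysicalDevice entries
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"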

Original discussion:

Hi @cxrodgers,

I'm unable to update the NVIDIA drivers because I don't have admin privileges, and in any case this is a shared cluster, so they won't want to break other people's stuff.

I figured. I imagine they'll want to bring that up to >525 at some point soon, since it's required for newer CUDA versions. Probably not the culprit here, though.
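
For reference, you can check the installed driver version without admin rights (a standard nvidia-smi query):

# Prints just the installed driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader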

Previous clusters I've used had some kind of "head node" where you could submit jobs but couldn't see any GPUs. This one is set up differently: I can see all 4 GPUs as soon as I log in. But they told me never to use a GPU outside of a SLURM job or I'd get kicked off the cluster. So when I submit a SLURM job, sleap.system_setup can only see the GPU that I reserve, and that's when the problem occurs.

Yeah weird that they have GPUs on the head node at all! Maybe they're just reusing a compute node?

In any case, assuming the compute nodes where the SLURM jobs actually get run are also using A6000s, I don't think this would be the issue.
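
One quick way to confirm which GPUs a compute node actually exposes inside a job (the --gres flag is illustrative; match it to your cluster's configuration):

# Lists the GPU(s) visible inside a one-GPU allocation
srun --gres=gpu:1 nvidia-smi -L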

This brings us back to the CUDA/cuDNN version. Because we install it via conda, the appropriate versions should be used. But if you're on TensorFlow 2.7.0, you might be running into this issue, where newer versions of TensorFlow fail to detect the conda CUDA/cuDNN libraries. Normally you'd get some errors about failing to load the CUDA/cuDNN binaries, but because you have a system-level CUDA/cuDNN installed, TensorFlow might be falling back to that and generating the weird driver-version errors we're seeing.
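
One way to check whether both copies are present (a sketch, assuming the sleap env is active):

# CUDA/cuDNN libraries shipped inside the conda env
ls $CONDA_PREFIX/lib | grep -E 'libcud(art|nn)'
# System-wide copies that TensorFlow may be falling back to
ldconfig -p | grep -E 'libcud(art|nn)'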

Try adding these lines in your SLURM script as a workaround:

conda activate sleap
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap

# ... run things as usual
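
For context, once the env vars are set, a minimal SLURM script might look like this (the conda path, SLURM flags, and file paths are placeholders; adapt them to your cluster):

#!/bin/bash
#SBATCH --job-name=sleap-train
#SBATCH --gres=gpu:1

# Make conda available in the batch shell (adjust to your install location)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate sleap  # the LD_PRELOAD variable set above is applied on activation

# Example command; substitute your own config and labels
sleap-train path/to/training_config.json path/to/labels.slp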

Give that a go and let us know if it works!

Talmo

Originally posted by @talmo in #1380 (reply in thread)

talmo added the documentation label on Jul 17, 2023
roomrys (Collaborator) commented Sep 12, 2023

This should be outdated for the current source code, where the environment.yml for computers with NVIDIA GPUs uses tensorflow from the sleap channel:

- sleap::tensorflow >=2.6.3,<2.11 # No windows GPU support for >2.10

But...

if we decide to pip install tensorflow in the near future (say to upgrade our python version 👀), then we should add this to the install docs.
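
If we do go the pip route, the docs addition would presumably be the same env-var trick, pointing the linker at the conda env's libraries (a sketch only; the exact variable and versions may differ):

conda activate sleap
pip install tensorflow  # hypothetical future install path
# Persist the library path in the env so pip's TensorFlow can find conda's CUDA libs
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
conda deactivate && conda activate sleap  # re-activate so the variable takes effect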
