We should add something in the docs about adding the CUDA paths manually to get around the annoying TensorFlow-not-finding-CUDA issue:
conda activate sleap
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap
# ... run things as usual
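It might also be worth showing how to confirm the variable took effect after re-activating. A minimal sketch (the Python one-liner assumes TensorFlow is importable in the sleap env and is run somewhere a GPU is actually visible, e.g. inside a GPU allocation):

conda env config vars list -n sleap   # should show the LD_PRELOAD entry
conda activate sleap
echo "$LD_PRELOAD"                    # should print the colon-separated library paths
python -c "import tensorflow as tf; print(tf.__version__, tf.sysconfig.get_build_info().get('cuda_version')); print(tf.config.list_physical_devices('GPU'))"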
Original discussion:

Hi @cxrodgers,

> I'm unable to update NVIDIA drivers because I don't have admin privileges, and in any case this is a shared cluster so they won't want to break other people's stuff.
I figured... I imagine they'll want to bring that up to >525 at some point soon, since it's required for newer CUDA versions. Probably not the culprit here, though.
> Previous clusters I've used had some kind of "head node" where you could submit jobs but couldn't see any GPUs. This one is set up differently: I can see all 4 GPUs as soon as I log in, but they told me never to use a GPU outside of a SLURM job or I'd get kicked off the cluster. So when I submit a SLURM job, sleap.system_setup can only see the GPU that I reserve, and that's when the problem occurs.

Yeah, weird that they have GPUs on the head node at all! Maybe they're just reusing a compute node?

In any case, assuming the compute nodes where the SLURM jobs actually run are also using A6000s, I don't think this would be the issue.

This brings us back to the CUDA/cuDNN version. Because we install it via conda, the appropriate versions should be used, but if you're on TensorFlow 2.7.0, you might be running into a known issue where newer versions of TensorFlow fail to detect the conda CUDA/cuDNN libraries. Normally you'd get errors about failing to load the CUDA/cuDNN binaries, but because you have a system-level CUDA/cuDNN installed, TensorFlow might be falling back to that and generating the weird driver version errors we're seeing.

Try adding these lines to your SLURM script as a workaround:
conda activate sleap
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap
# ... run things as usual
Give that a go and let us know if it works!
Talmo
Originally posted by @talmo in #1380 (reply in thread)
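For the docs, a complete minimal SLURM batch script applying the workaround could look something like the sketch below. The #SBATCH resources, the conda.sh path, and the final sleap-train command and file names are placeholders that will differ per cluster:

#!/bin/bash
#SBATCH --job-name=sleap-train
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=08:00:00

# Make conda available in the non-interactive batch shell (adjust the path to your install)
source ~/miniconda3/etc/profile.d/conda.sh

conda activate sleap

# Register the LD_PRELOAD workaround on the env, then re-activate so it gets exported
conda env config vars set LD_PRELOAD=$CONDA_PREFIX/lib/libcudart.so:$CONDA_PREFIX/lib/libcublas.so:$CONDA_PREFIX/lib/libcublasLt.so:$CONDA_PREFIX/lib/libcufft.so:$CONDA_PREFIX/lib/libcurand.so:$CONDA_PREFIX/lib/libcusolver.so:$CONDA_PREFIX/lib/libcusparse.so:$CONDA_PREFIX/lib/libcudnn.so
conda deactivate
conda activate sleap

# Placeholder training command; substitute whatever you normally run
sleap-train baseline.centroid.json labels.v001.slp

Since conda env config vars set persists the variable in the environment's own config, it only needs to be run once; after that, jobs just need the conda activate line.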