TensorFlow 2.7 does not detect CUDA installed through conda #52988

Closed
drasmuss opened this issue Nov 8, 2021 · 34 comments
Labels: stale, stat:awaiting response, subtype: ubuntu/linux, TF 2.7, type:bug, type:build/install

Comments

@drasmuss
Contributor

drasmuss commented Nov 8, 2021

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.7.0
  • Python version: 3.8
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 11.2/8.1
  • GPU model and memory: RTX 2080 Ti

Describe the current behavior

After installing cuda/cudnn through conda (conda install cudatoolkit=11.2 cudnn=8.1), TensorFlow 2.7 reports that it cannot find the cuda libraries.

2021-11-08 14:49:16.412959: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-08 14:49:16.413006: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-11-08 14:49:22.640508: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640617: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640698: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640776: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640853: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.640941: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.641022: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.641099: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-11-08 14:49:22.641120: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

Installing TensorFlow 2.6 (or earlier) in the same environment, with the same cuda/cudnn installation, doesn't show any problem: it detects the libraries and GPU support works as expected.

The problem can be worked around by manually adding the conda lib directory to LD_LIBRARY_PATH (export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib). However, obviously this is not ideal, as it needs to be repeated/adjusted for every new conda environment. It would be better if TensorFlow just detected the conda installed libraries, as it did in TensorFlow < 2.7.
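For concreteness, the per-session form of that workaround looks like this (a minimal sketch; it assumes the libraries landed in the active environment's lib directory, as with the conda install above):

# one-off workaround: point the dynamic loader at the conda environment's libraries
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"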

Describe the expected behavior

TensorFlow should detect cuda/cudnn libraries installed through conda, as it did in TensorFlow<2.7.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution (if contributing):

Standalone code to reproduce the issue

conda create -n tmp python=3.8
conda activate tmp
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1
pip install "tensorflow==2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays []
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
pip install "tensorflow<2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]]
@drasmuss drasmuss added the type:bug Bug label Nov 8, 2021
@tilakrayal tilakrayal added TF 2.7 Issues related to TF 2.7.0 subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues type:build/install Build and install issues labels Nov 9, 2021
@tilakrayal
Contributor

@drasmuss,
We can see that you have installed TensorFlow from a conda environment. Installation issues within the Anaconda environment are tracked in the Anaconda repo. Please try to install it in a new virtual environment from this link and let us know if it is still an issue. Thanks!

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label Nov 9, 2021
@drasmuss
Contributor Author

drasmuss commented Nov 9, 2021

I'm not installing TensorFlow from conda, just cuda/cudnn. TensorFlow is being installed from pip as normal. And you can see in the reproduction steps I posted above that we're starting from a new virtual environment (repeated below for convenience).

conda create -n tmp python=3.8
conda activate tmp
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1
pip install "tensorflow==2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays []
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
pip install "tensorflow<2.7.0"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  # displays [[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]]

Also note that nothing has changed on the conda side of things; we're still using the exact same environment with the same cuda/cudnn libraries, but it works in TF 2.6 and fails in TF 2.7. So I don't think the issue is on the conda side; something has changed in TensorFlow that has made this stop working.

@tilakrayal tilakrayal removed the stat:awaiting response Status - Awaiting response from author label Nov 9, 2021
@tilakrayal tilakrayal assigned Saduf2019 and sanatmpa1 and unassigned Saduf2019 and tilakrayal Nov 9, 2021
@sanatmpa1 sanatmpa1 assigned jvishnuvardhan and unassigned sanatmpa1 Nov 9, 2021
@pradyyadav

pradyyadav commented Nov 12, 2021

Open the terminal and type

nano ~/.bashrc

At the end of the file, add the following two lines:

export PATH=$PATH:/usr/local/cuda-11.2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64

Ensure there are no spaces on either side of the '=' sign.

If it still does not work, try adding the same lines for version 11.0:

export PATH=$PATH:/usr/local/cuda-11.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64
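If you try this, a quick way to check whether the loader can now resolve the CUDA runtime is shown below (a sketch only; the library name libcudart.so.11.0 and the /usr/local/cuda install paths are assumptions based on the versions discussed in this thread):

# reload the shell configuration so the new PATH/LD_LIBRARY_PATH take effect
source ~/.bashrc
# dlopen (via ctypes) respects LD_LIBRARY_PATH, so this fails loudly if the path is still wrong
python3 -c "import ctypes; ctypes.CDLL('libcudart.so.11.0'); print('libcudart found')"
# then confirm TensorFlow sees the GPU
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"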

@drasmuss
Contributor Author

As mentioned, CUDA is being installed through conda, so /usr/local/cuda- is not the correct path (the correct path is given in the original post: $CONDA_PREFIX/lib). However, hard coding that into .bashrc isn't a solution, because $CONDA_PREFIX changes depending on which conda environment you have active.
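To see where conda actually put the libraries for the currently active environment (a minimal sketch; the exact .so file names depend on the cudatoolkit/cudnn builds installed):

# the prefix differs for every environment, which is why a hard-coded path in ~/.bashrc does not generalize
echo "$CONDA_PREFIX"
ls "$CONDA_PREFIX/lib" | grep -E 'libcudart|libcudnn'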

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 15, 2021
@mihaimaruseac
Collaborator

Conda installs are not officially supported by Google

@ddaspit

ddaspit commented Nov 29, 2021

I installed TensorFlow 2.7 on Windows with CUDA 11.2 and cuDNN 8.1 (no conda involved). I received the same "Could not load dynamic library" errors. I switched CUDA to 11.0 and it worked. I am guessing that the pip packages for TensorFlow 2.7 were accidentally built against CUDA 11.0 instead of 11.2.

@janniksinz

janniksinz commented Nov 29, 2021

Open the terminal and type

nano ~/.bashrc

At the end of the file, add the following two lines:

export PATH=$PATH:/usr/local/cuda-11.2/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2/lib64

Ensure there are no spaces on either side of the '=' sign.

If it still does not work, try adding the same lines for version 11.0:

export PATH=$PATH:/usr/local/cuda-11.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64

Thank you, this also works with cuda-11.4. But how would you fix this issue in a Jupyter notebook, for the fairly niche use case where you need tf==2.7.0 features?

When I start a Jupyter server within an env that has these paths exported, it only shows the CPU. Exporting the paths in the notebook doesn't work either.

@jesusdpa1

jesusdpa1 commented Dec 1, 2021

This seems to solve the issue:

conda activate ENVNAME

cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh

Edit ./etc/conda/activate.d/env_vars.sh as follows:

#!/bin/sh

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib

Edit ./etc/conda/deactivate.d/env_vars.sh as follows:

#!/bin/sh

unset LD_LIBRARY_PATH

Source

https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#macos-and-linux
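To sanity-check that the hooks actually fire, something like the following should work (a sketch, assuming the env_vars.sh files above were created in the environment named ENVNAME):

# re-activating runs activate.d/env_vars.sh and extends LD_LIBRARY_PATH
conda deactivate
conda activate ENVNAME
echo "$LD_LIBRARY_PATH"   # should now end with .../ENVNAME/lib
# deactivating runs deactivate.d/env_vars.sh, which unsets the variable entirely
conda deactivate
echo "$LD_LIBRARY_PATH"   # empty, including anything it held before activation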

@holongate

holongate commented Dec 12, 2021

I don't want to be dismissive here, but there is a lack of understanding of the problem specifically introduced by TF 2.7:

  • A conda environment does install native libraries and does ensure they will be found by the OS dynamic loader for the programs that want them.
  • Until TF 2.7 this was the way it worked, like the gazillion other native apps (including CUDA ones).
  • TF 2.7, not conda, specifically broke that by ignoring the OS loading mechanism, for an unknown/undocumented reason.

This problem is not just a technical nitpick; it has deep implications for businesses that build real products.
This way of working is the only reliable one for teams that work on more than one TF project or require multiple TF/CUDA/Python combinations on the same workstation (without root access).
By the way, the CUDA stack from the official nvidia channel (nvcc, ptxas, etc.) works perfectly in conda and is recommended by NVIDIA itself.

For my suffering peers, if you don't have access to root, you can use this small poorly-documented feature in your environment.yml:

name: base-tf-cuda-env
channel:
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - python=3.8
# Install cuda libs + ptxas compiler from nvidia channel
# This will accelerate the compilation of kernels for your specific card
  - cudatoolkit=11
  - cudnn=8
  - cupti=11
  - cuda-nvcc
...
  - pip
  - pip:
     - tensorflow==2.7.*
variables:
  # In case you want to see your own logs and tame the TF loggorrhea
  TF_CPP_MIN_LOG_LEVEL: 3
  # Adjust to point to your local env path:
  LD_LIBRARY_PATH: /home/me/.conda/envs/thisenvname/lib

Upon conda activate, the env variables will be set for you, and unset on deactivation.
Better than nothing, but might interfere with some other configuration...
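One possible way to use a file like the above (a sketch under the assumptions that it is saved as environment.yml and that the LD_LIBRARY_PATH line has been adjusted to your own env path):

# create the environment from the YAML and activate it; the variables: block is applied on activation
conda env create -f environment.yml
conda activate base-tf-cuda-env
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"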

@chainyo

chainyo commented Jan 19, 2022

Upon conda activate, the env variables will be set for you, and unset on deactivation.
Better than nothing, but might interfere with some other configuration...

Really appreciate the file you provided!
There is a typo in the channels part (it should be channels:, not channel:), but that's awesome, thanks 👍

@filippocastelli

Upon conda activate, the env variables will be set for you, and unset on deactivation. Better than nothing, but might interfere with some other configuration...

@holongate's env is a good workaround and solves the problem for me.

I'm quite astonished by how little thought has been given to the issue - which is clearly a problem with TF 2.7 itself, and not with conda - and by how much time you spend commenting that conda installs are not supported by Google.

@drasmuss
Contributor Author

For anyone looking for a one-liner solution, you can do

conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib

(with the environment you want to modify activated). This has a similar effect to @jesusdpa1's solution here #52988 (comment): it'll set LD_LIBRARY_PATH when the environment is activated and unset it when it's deactivated.

You still need to repeat that for every new conda environment though. It would be better if TensorFlow just detected the conda installed libraries, as it did in TensorFlow<=2.6.
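For reference, a quick way to confirm the variable is registered and applied (a minimal sketch, using the tmp environment from the reproduction steps above; conda only exports the variable after the environment is re-activated):

conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib
conda env config vars list                # should list LD_LIBRARY_PATH for this environment
conda deactivate && conda activate tmp    # re-activate so the variable is actually exported
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"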

@drasmuss
Contributor Author

drasmuss commented Nov 22, 2022

The official documentation suggests manually doing export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/ every time you want to use TensorFlow, which obviously isn't really a feasible solution.

This is discussed above, but I'll reiterate the main points here for anyone coming across this thread:

  1. Currently the best solution is to use the community-maintained TensorFlow installation from conda-forge (e.g. conda install -c conda-forge tensorflow). Generally speaking that should just work (a short sketch follows at the end of this comment).
  2. If 1. isn't possible or isn't working for some reason (e.g. because you need a very recent release of TensorFlow that isn't yet available on conda-forge), the easiest solution is #52988 (comment).
  3. However, sometimes 2. can cause problems with other system packages, since you're modifying the global LD_LIBRARY_PATH (note that this is also a problem with the approach recommended in the official documentation). If you run into issues like that, you can try the approach in #52988 (comment), with the caveat mentioned there that it might break in future updates.

I'll reiterate again that all of these solutions are a downgrade in the user experience from TensorFlow < 2.7, when TensorFlow just correctly detected the conda-installed CUDA libraries without any fiddling required from the user.
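As a rough illustration of option 1 above (a sketch only; the environment name and Python version are arbitrary, and whether the GPU-enabled conda-forge build gets selected depends on your driver setup and the package variants available for your platform):

# fresh environment using the community-maintained conda-forge build of TensorFlow
conda create -n tf-forge python=3.10
conda activate tf-forge
conda install -c conda-forge tensorflow
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"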

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Nov 22, 2022
@mohantym mohantym removed their assignment Nov 22, 2022
@TuanBC

TuanBC commented Nov 23, 2022

A kind-of semi-automated snippet that I am using to solve the cudatoolkit PATH problem in a conda environment:

conda activate tf_env
conda install -c conda-forge cudatoolkit cudnn

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d

printf '#!/bin/sh\nexport OLD_LD_LIBRARY_PATH=$LD_LIBRARY_PATH\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/\n' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh 
printf '#!/bin/sh\nexport LD_LIBRARY_PATH=$OLD_LD_LIBRARY_PATH\nunset OLD_LD_LIBRARY_PATH\n' > $CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh 

This snippet automatically sets and unsets the necessary environment variables when you activate or deactivate the conda environment. It could be useful not only for TF users, but also for any other library that needs the CUDA dependencies when being built manually from source.
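To double-check what the printf commands above actually wrote, and that the hook takes effect (a minimal sketch, reusing the tf_env name from the snippet):

# inspect the generated hook scripts
cat $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
cat $CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh
# re-activate so the activate.d hook runs, then confirm TensorFlow can see the GPU
conda deactivate && conda activate tf_env
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"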

@SuryanarayanaY
Collaborator

Hi @drasmuss,
Could you please refer to this documentation source.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/

For your convenience it is recommended that you automate it with the following commands. The system paths will be automatically configured when you activate this conda environment.

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

With the above two commands it is not required to run export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/ every time you want to use TensorFlow. It is a one-time setup, and after that you can use the environment any number of times.

I hope this addresses the issue. Please confirm if anything is still missing here. Thanks!

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label Mar 1, 2023
@drasmuss
Contributor Author

drasmuss commented Mar 1, 2023

Hi @SuryanarayanaY,

See #52988 (comment) for a summary of the discussion in this thread. The short answer is that no, that solution doesn't address the issue.

Longer answer: The solution you describe from the docs is basically a worse version of idea 2 from that summary above. Worse in that it's more complicated, and it won't unset LD_LIBRARY_PATH when the environment is deactivated. But as mentioned above, idea 2 is not really a viable solution because LD_LIBRARY_PATH is a global environment variable, and modifying it has negative side effects on lots of other system packages besides TensorFlow.

And, to reiterate again, all of these "solutions" are downgrades from the behaviour prior to TensorFlow 2.7, where TensorFlow just correctly detected the CUDA libraries without requiring any manual intervention from users.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 1, 2023
@sachinprasadhs
Contributor

@drasmuss, I'm just curious whether you have observed the same behavior in the 2.11 version.
Also, since the 2.12 release is around the corner, you can wait a few days and check it, since we are bumping the supported CUDA version to 11.8. Thanks!

@sachinprasadhs sachinprasadhs added the stat:awaiting response Status - Awaiting response from author label Mar 13, 2023
@drasmuss
Contributor Author

drasmuss commented Mar 13, 2023

Yes, the behaviour is the same in 2.11 and 2.12.0rc1 (I wouldn't expect it to change between rc1 and the full 2.12 release).

Note that in 2.12 the error message has changed, so it displays

2023-03-13 14:41:41.580759: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-13 14:41:41.602435: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.

instead of the old "Could not load dynamic library..." errors, but it's the same issue.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 13, 2023
@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 14, 2023
@githubskiy

Did you guys solve this problem?

@Venkat6871
Contributor

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant to the current state of the code base.

The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.

Please follow the release notes to stay up to date with the latest developments happening in the TensorFlow space.

@Venkat6871 Venkat6871 added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Oct 11, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Oct 19, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
