
[BUG] ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory #461

Closed
vikashg opened this issue Oct 8, 2023 · 17 comments

Comments

@vikashg
Collaborator

vikashg commented Oct 8, 2023

Describe the bug
When I try to build the monai-deploy Docker image, I get the following error:

Traceback (most recent call last):
  File "/home/gupta/miniconda3/envs/monai-deploy-test/bin/monai-deploy", line 5, in <module>
    from holoscan.cli.__main__ import main
  File "/home/gupta/miniconda3/envs/monai-deploy-test/lib/python3.8/site-packages/holoscan/__init__.py", line 18, in <module>
    from . import core, gxf
  File "/home/gupta/miniconda3/envs/monai-deploy-test/lib/python3.8/site-packages/holoscan/core/__init__.py", line 65, in <module>
    from ..graphs._graphs import FragmentGraph, OperatorGraph
  File "/home/gupta/miniconda3/envs/monai-deploy-test/lib/python3.8/site-packages/holoscan/graphs/__init__.py", line 24, in <module>
    from ._graphs import FragmentFlowGraph, OperatorFlowGraph
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory

I have PyTorch version '2.0.1+cu117' and CUDA version 12.0. Maybe the version mismatch is causing this problem?
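A quick way to check whether the dynamic loader can see the library named in the ImportError is to try opening it directly. The following is a minimal sketch using only the standard library; the helper name `can_load_shared_library` is hypothetical, and the second soname in the loop (`libcudart.so.12`) is an assumption about what a CUDA 12 runtime would provide.

```python
import ctypes


def can_load_shared_library(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`, else False."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the loader cannot find or open the shared object.
        return False
    return True


if __name__ == "__main__":
    # The first soname is the one named in the ImportError above;
    # the second is an assumed name for a CUDA 12 runtime.
    for name in ("libcudart.so.11.0", "libcudart.so.12"):
        status = "found" if can_load_shared_library(name) else "NOT found"
        print(f"{name}: {status}")
```

If the first name is reported as not found, the holoscan import can be expected to fail the same way, regardless of which CUDA version the driver reports.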

Steps/Code to reproduce bug
With the above version, try running the simple_imaging_app in the notebook tutorials

Environment details (please complete the following information)

  • OS/Platform: Ubuntu 20.04
  • Python Version: 3.8.17
  • Method of MONAI Deploy App SDK install: [pip, conda, Docker, or from source] from source
  • pytorch version: '2.0.1+cu117'
  • Cuda version: 12.0
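The `+cu117` suffix in the torch version string indicates which CUDA release the wheel was built against, which makes the mismatch with CUDA 12.0 easy to spot. Below is a small sketch that parses that suffix; the function name is hypothetical, and it assumes the usual `cuXYZ` tag layout in which the last digit is the minor version.

```python
import re
from typing import Optional, Tuple


def cuda_version_from_torch_tag(torch_version: str) -> Optional[Tuple[int, int]]:
    """Parse the CUDA (major, minor) from a version such as '2.0.1+cu117'."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # no CUDA tag, e.g. a CPU-only build
    digits = match.group(1)
    if len(digits) < 2:
        return None  # too short to hold both a major and a minor version
    # Assumption: the last digit is the minor version, the rest is the major.
    return int(digits[:-1]), int(digits[-1])


if __name__ == "__main__":
    print(cuda_version_from_torch_tag("2.0.1+cu117"))  # (11, 7)
```

With a real install, `torch.__version__` could be passed in place of the literal string.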
@vikashg vikashg added the bug Something isn't working label Oct 8, 2023
@MMelQin
Collaborator

MMelQin commented Oct 9, 2023

@vikashg Thanks for the issue.

Yes, it is a compatibility issue between the underlying Holoscan SDK, CUDA, and Torch, and I posted Discussion 460 a couple of days ago. Please see the combo that I have tested.

@MMelQin
Collaborator

MMelQin commented Oct 9, 2023

@vikashg here is the CUDA Toolkit, GPU Driver version, and torch in my environment. With this combo, the apps work and torch can access the GPU.

$ nvidia-smi
Mon Oct  9 13:46:27 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro GV100                   On  | 00000000:65:00.0 Off |                  Off |
| 29%   40C    P2              26W / 250W |     89MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1193      G   /usr/lib/xorg/Xorg                           74MiB |
|    0   N/A  N/A      1338      G   /usr/bin/gnome-shell                         12MiB |
+---------------------------------------------------------------------------------------+

$ pip list | grep torch
torch                         2.0.1

$ python --version
Python 3.8.10

Can you please ensure torch is pinned at ~=2.0.1 and try again?
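To see whether an installed torch version falls inside a `~=2.0.1` pin, the compatible-release rule can be approximated by hand. This is a simplified sketch, not pip's actual resolver logic: the helper names are hypothetical, and it only handles plain numeric releases (an optional local suffix such as `+cu117` is ignored, pre-releases are not supported).

```python
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.0.1+cu117' into (2, 0, 1), ignoring any local '+...' suffix."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split("."))


def satisfies_compatible_release(installed: str, pinned: str) -> bool:
    """Approximate '~=pinned': same leading segments and installed >= pinned."""
    have, want = release_tuple(installed), release_tuple(pinned)
    if len(want) < 2:
        raise ValueError("'~=' needs at least two release segments")
    prefix = want[:-1]  # for 2.0.1 this is (2, 0), i.e. any 2.0.x
    return have[: len(prefix)] == prefix and have >= want


if __name__ == "__main__":
    print(satisfies_compatible_release("2.0.1+cu117", "2.0.1"))  # True
    print(satisfies_compatible_release("2.1.0", "2.0.1"))        # False
```

Under this rule torch 2.1.0 is outside the pin, which matters here because torch 2.1 no longer pulls in the CUDA 11 runtime.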

@MMelQin
Collaborator

MMelQin commented Oct 9, 2023

@vikashg So, I have reproduced the Packager error with Torch 2.0.1 and CUDA 12.2, though I had been using Torch 2.0.1 and CUDA 11.7 without issues with the app itself or the Packager. The underlying Holoscan SDK v0.6 supports CUDA 11.8, so I will reset CUDA on my desktop to that version and test.

@MMelQin
Collaborator

MMelQin commented Oct 10, 2023

@vikashg Here is the confirmed workaround/fix, confirmed by @mocsharp and me.
sudo apt-get install cuda-runtime-11.7

I tried to remove CUDA 12.2 and install CUDA 11.8, but after multiple attempts, I still had CUDA 12.2 as reported by nvidia-smi. The same ImportError: libcudart.so.11.0 persisted.

@mocsharp had a similar error, though on libcudart.so.12.2, and apt-get install cuda-runtime fixed it. I can confirm sudo apt-get install cuda-runtime-11.8 does the trick with the CUDA 12.2 in my dev env. In your case with CUDA 11.7, cuda-runtime-11.7 can be used.

Please confirm the results once you can give it a shot.

@vikashg
Collaborator Author

vikashg commented Oct 11, 2023

Hey @MMelQin, thanks for looking into it. I will test again today and post my outcome. I was a bit occupied with a couple of other things.
Thanks

@mpsampat

mpsampat commented Oct 12, 2023

@MMelQin, just a quick report on the same issue in case it provides a clue.
In the past I was able to build with monai-deploy on a CPU-only machine (tested with 0.5.0);
we put monai-deploy in our CI/CD (very happy about that :))

https://gitlab.com/flywheel-io/scientific-solutions/gears/breast-density-classifier/-/blob/main/.gitlab-ci.yml?ref_type=heads#L34

Now with 0.6.0 I am still hitting the error that @vikashg reported (shown below). I am on a CPU-only machine.
I ran 'conda install -c "nvidia/label/cuda-11.8.0" cuda-runtime', but had no luck resolving it. I am on Ubuntu 18.04.
I also tried "conda list | grep nvidia" and I don't have any 12.x.x packages (see below).
I will try on a VM with a GPU on Vertex AI tomorrow.
Mehul

the error is:

"monai-deploy exec ct-seg-monai-swin-unetr/ -i ~/monai-bundles/1_axial_chest_ct.nii.gz -o output. -m ct-seg-monai-swin-unetr/best_metric_model.ts
Traceback (most recent call last):
  File "/home/msampat/miniconda3/envs/py38b/bin/monai-deploy", line 5, in <module>
    from holoscan.cli.__main__ import main
  File "/home/msampat/miniconda3/envs/py38b/lib/python3.8/site-packages/holoscan/__init__.py", line 17, in <module>
    from . import cli, core, gxf
  File "/home/msampat/miniconda3/envs/py38b/lib/python3.8/site-packages/holoscan/core/__init__.py", line 65, in <module>
    from ..graphs._graphs import FragmentGraph, OperatorGraph
  File "/home/msampat/miniconda3/envs/py38b/lib/python3.8/site-packages/holoscan/graphs/__init__.py", line 24, in <module>
    from ._graphs import FragmentFlowGraph, OperatorFlowGraph
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory"

(py38b) msampat@scien-dev1:~/gears$ conda list | grep nvidia
cuda-cudart 11.8.89 0 nvidia/label/cuda-11.8.0
cuda-cupti 11.8.87 0 nvidia
cuda-libraries 11.8.0 0 nvidia/label/cuda-11.8.0
cuda-nvrtc 11.8.89 0 nvidia/label/cuda-11.8.0
cuda-nvtx 11.8.86 0 nvidia
cuda-runtime 11.8.0 0 nvidia/label/cuda-11.8.0
libcublas 11.11.3.6 0 nvidia/label/cuda-11.8.0
libcufft 10.9.0.58 0 nvidia/label/cuda-11.8.0
libcufile 1.4.0.31 0 nvidia/label/cuda-11.8.0
libcurand 10.3.0.86 0 nvidia/label/cuda-11.8.0
libcusolver 11.4.1.48 0 nvidia/label/cuda-11.8.0
libcusparse 11.7.5.86 0 nvidia/label/cuda-11.8.0
libnpp 11.8.0.86 0 nvidia/label/cuda-11.8.0
libnvjpeg 11.9.0.86 0 nvidia/label/cuda-11.8.0
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
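The listing shows CUDA runtime packages from both conda and pip, so it can help to locate where the `libcudart` files actually landed inside the environment. The following is a minimal sketch; the function name is hypothetical, and searching only `sys.prefix` is an assumption that may miss system-wide installs such as /usr/local/cuda.

```python
import sys
from pathlib import Path
from typing import Iterable, List


def find_cudart_libraries(roots: Iterable[str]) -> List[Path]:
    """Recursively collect files whose names start with 'libcudart.so'."""
    found: List[Path] = []
    for root in roots:
        base = Path(root)
        if not base.is_dir():
            continue  # skip roots that do not exist on this machine
        found.extend(sorted(base.rglob("libcudart.so*")))
    return found


if __name__ == "__main__":
    # Assumption: the active environment prefix is the right place to look.
    for path in find_cudart_libraries([sys.prefix]):
        print(path)
```

Any directory printed here that is not on the loader's search path is a candidate for LD_LIBRARY_PATH.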

@SameerShanbhogue

Hi

Is this issue related to PATH?

image

After exporting the PATH, the issue is resolved temporarily.

image

Also during monai-deploy-sdk install, there are warnings related to holoscan and monai-deploy not in PATH

@MMelQin
Collaborator

MMelQin commented Oct 12, 2023

@mpsampat I noticed another issue in the log: the sub-command, monai-deploy exec, is not supported in the v0.6 CLI (this command behaved similarly to directly running the app with the Python interpreter, but with the required env vars set based on the command line options).

As for the real issue of not finding the CUDA 11 runtime, I cannot say for sure, as I have not tested with conda and Ubuntu 18.04. MD App SDK (and the underlying Holoscan SDK) needs Ubuntu 20.04.

@mpsampat

@MMelQin thank you very much for your super quick response.

I see both your points. Tomorrow I will try on a VM with Ubuntu 20.04.

I will also try the PATH fix Sameer mentioned in this thread. Maybe some environment variable is not set up correctly on my side.

Will test tomorrow and report back here.

@vikashg
Collaborator Author

vikashg commented Oct 18, 2023

> @vikashg Here is the confirmed workaround/fix, confirmed by @mocsharp and me. sudo apt-get install cuda-runtime-11.7

Hi @MMelQin, I am still getting the error. I am installing cuda-runtime-11.7 from here, as sudo apt-get is not working for me. I will try again after the installation is complete.
Thanks

> I tried to remove CUDA 12.2 and install CUDA 11.8, but after multiple attempts, I still had CUDA 12.2 as reported by nvidia-smi. The same ImportError: libcudart.so.11.0 persisted.
>
> @mocsharp had a similar error, though on libcudart.so.12.2, and apt-get install cuda-runtime fixed it. I can confirm sudo apt-get install cuda-runtime-11.8 does the trick with the CUDA 12.2 in my dev env. In your case with CUDA 11.7, cuda-runtime-11.7 can be used.
>
> Please confirm the results once you can give it a shot.

I have CUDA 12.0. I am waiting on the above trial with cuda-runtime-11.7 before trying to install CUDA 12.2.

@vikashg
Collaborator Author

vikashg commented Oct 24, 2023

Hi @MMelQin, after installing CUDA 11.7 it is working, but Docker packaging is not. I am not able to copy the text, but here is a screenshot:
image

Initially, I thought it was because the holoscan base image needs an nvcr.io login, so I downloaded the holoscan image manually. But I still get this error.

@mpsampat

mpsampat commented Oct 24, 2023

@vikashg could you tell me which versions of torch and python you are using?
We use monai-deploy in our CI/CD and I am now seeing this error with 0.5.1
https://gitlab.com/flywheel-io/scientific-solutions/gears/breast-density-classifier/-/jobs/5365017305#L82
"OSError: libcufft.so.11: cannot open shared object file: No such file or directory" from this line:
https://gitlab.com/flywheel-io/scientific-solutions/gears/breast-density-classifier/-/jobs/5365017305#L90

@MMelQin My pyproject.toml file has this:

[tool.poetry.dependencies]
python = "^3.10"
flywheel-gear-toolkit = "^0.2"
monai = "1.1.0"
monai-deploy-app-sdk = "0.5.1"
pydicom = "2.3.1"
highdicom = "0.20.0"
torch = ">=2.0.0, !=2.0.1"

So it picked Python 3.11. @MMelQin do you have a preference for the Python version? I can pin that and try with a recommended Python version.

@MMelQin
Collaborator

MMelQin commented Oct 24, 2023

> Hi @MMelQin After installing Cuda 11.7 it is working but docker packaging is not working. I am not able to copy the text but here is a screenshot image
>
> Initially, I thought it is because of the holoscan base image needs nvcr.io login. So I downloaded the holoscan image manually. But I still get this error.

@vikashg Sorry about this error, which is already logged as an issue. The workaround is to download the Wheel file from pypi.org, and then use the command-line option to pass the Wheel file to the Packager.

It will be fixed in the next public release of Holoscan SDK, and hence in MD App SDK. In the meantime, you can also patch the HS SDK code once it is installed in the env, as documented in the issue.

@MMelQin
Collaborator

MMelQin commented Oct 26, 2023

> @vikashg could you tell me which version of torch and python you are using. We use monai-deploy in our CI/CD and I now seeing this error with 0.5.1 https://gitlab.com/flywheel-io/scientific-solutions/gears/breast-density-classifier/-/jobs/5365017305#L82 "OSError: libcufft.so.11: cannot open shared object file: No such file or directory" from this line: https://gitlab.com/flywheel-io/scientific-solutions/gears/breast-density-classifier/-/jobs/5365017305#L90
>
> @MMelQin My pyproject.toml file has this:
>
> [tool.poetry.dependencies]
> python = "^3.10"
> flywheel-gear-toolkit = "^0.2"
> monai = "1.1.0"
> monai-deploy-app-sdk = "0.5.1"
> pydicom = "2.3.1"
> highdicom = "0.20.0"
> torch = ">=2.0.0, !=2.0.1"
>
> So it picked python 3.11. @MMelQin do you have a preference for python version ? I can pin that and try with a recommended python version

@mpsampat My env has Python 3.8.10, monai 1.2.0, monai-deploy-app-sdk 0.6.0, torch 2.1.0, and CUDA 12.2, on Ubuntu 20.04.

I noticed in your CI/CD log that the torch version is 2.1.0. Did you check which torch version was installed when your CI/CD was successful? For me, a while back, torch 2.1 had issues detecting the GPU because of the unsupported GPU driver version that was installed with CUDA 11.8; after completely removing CUDA and re-installing CUDA 12.2, I got the CUDA 11 runtime lib error.
I don't know if you want to limit torch to <2.0.1, given that 2.0.1 was already excluded in your config.

Also, for MD App SDK v0.5.1, the Packager (and MAP) uses base image nvcr.io/nvidia/pytorch:22.08-py3, which has cuda11.7 and PyTorch 1.13 on Ubuntu 20.04.

@vikashg
Collaborator Author

vikashg commented Oct 26, 2023

@mpsampat I am using Python 3.8.17. I installed CUDA 11.7 from here. After that I installed monai-deploy-app-sdk from source. My torch version is 2.0.1.

However, after doing this I lost my nvidia-smi:
image
So I believe something isn't right, but monai-deploy-app-sdk works.

@MMelQin
Collaborator

MMelQin commented Nov 17, 2023

It's been a while but this issue will remain open since it has hit not just MONAI Deploy App SDK, but also other project(s) too, see here.

In summary, MD App SDK encounters this issue because v0.6 changed to depend on Holoscan SDK v0.6, which requires the CUDA 11 runtime. MD App SDK (example applications) also depends on torch. Torch v2.0.1 depends on the CUDA 11 runtime, so when torch is installed, the CUDA 11 runtime lib gets installed too. However, torch v2.1 changed to depend on CUDA 12, while the Holoscan SDK v0.6 package does not have its own means of installing the CUDA 11 runtime. Its dependents therefore have to get CUDA 11 into the env by other means, either by having CUDA 11 installed and/or by setting LD_LIBRARY_PATH.

The good news is that MD App SDK v1.0 (riding on Holoscan SDK v1.0), expected in Jan 2024, will support CUDA 12. In the meantime, one can also try pinning torch to 2.0.1 so that the CUDA 11 runtime still gets installed in the env; this approach, however, turned out not to be reliable, at least within GitHub CI/CD actions, as importing holoscan again hit the libcudart.so.11.0 not found issue.

The final piece of the solution is then to export the library path in the version-specific Python site-packages; see PR#471 as well as the earlier mention with the conda env. The snippet is also shown below:

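          # Activate the virtual environment and record which interpreter and packages are in use
          source .venv/bin/activate
          python3 -c 'import sys; print(sys.executable)'
          python3 -c 'import site; print(site.getsitepackages())'
          python3 -m pip freeze
          # Make the pip-installed CUDA runtime visible to the dynamic loader
          export LD_LIBRARY_PATH=`pwd`/.venv/lib/python3.8/site-packages/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH
          # Verify that holoscan now imports without the libcudart error
          python3 -c 'from holoscan.core import *'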
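The export line above hard-codes `python3.8` and the `.venv` location. As a rough sketch of a more portable variant, the library directory can be derived from the interpreter's own site-packages list. The helper name `cuda_runtime_lib_dirs` is hypothetical, and the `nvidia/cuda_runtime/lib` layout is taken from the snippet above rather than from any documented guarantee.

```python
import os
import site
from typing import Iterable, List


def cuda_runtime_lib_dirs(site_dirs: Iterable[str]) -> List[str]:
    """Return each existing <site-packages>/nvidia/cuda_runtime/lib directory."""
    candidates = (
        os.path.join(d, "nvidia", "cuda_runtime", "lib") for d in site_dirs
    )
    return [c for c in candidates if os.path.isdir(c)]


if __name__ == "__main__":
    dirs = cuda_runtime_lib_dirs(site.getsitepackages())
    if dirs:
        # Print a line for the calling shell to evaluate; changing os.environ
        # here would not affect the loader of the already-running process.
        print(f'export LD_LIBRARY_PATH="{os.pathsep.join(dirs)}:$LD_LIBRARY_PATH"')
```

Saved under a name of your choosing, the script's output could be applied in the shell with `eval "$(python3 that_script.py)"` before importing holoscan.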

@MMelQin
Collaborator

MMelQin commented Apr 15, 2024

App SDK v1.0 has just been released. In this version, the CUDA 12 runtime is required even for CPU-only applications, so the CUDA runtime shared library must be present. If the full CUDA Toolkit is not installed, the runtime can be installed as a pip package, with the lib path set to the package folder, e.g.

          source .venv/bin/activate
          python3 -m pip install nvidia-cuda-runtime-cu12
          export LD_LIBRARY_PATH=`pwd`/.venv/lib/python3.8/site-packages/nvidia/cuda_runtime/lib:$LD_LIBRARY_PATH
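Before setting the library path, it may be worth confirming which of the relevant distributions are actually installed in the active environment. The sketch below uses the standard `importlib.metadata` module; the helper name is hypothetical, and the distribution names are the ones that appear in this thread.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as they appear earlier in this thread.
    for name in (
        "nvidia-cuda-runtime-cu12",
        "nvidia-cuda-runtime-cu11",
        "torch",
        "monai-deploy-app-sdk",
    ):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```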

@MMelQin MMelQin closed this as completed Apr 15, 2024