[BUG] ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory #461
Comments
@vikashg Thanks for the issue. Yes, it is a compatibility issue between the underlying Holoscan SDK, CUDA, and Torch, and I posted Discussion 460 a couple of days ago. Please see the combo that I have tested. |
@vikashg here is the CUDA Toolkit, GPU Driver version, and torch in my environment. With this combo, the apps work and torch can access the GPU.
Can you please ensure torch is pinned at ~=2.0.1 and try again? |
@vikashg So, I have reproduced the Packager error with Torch 2.0.1 and CUDA 12.02, though I had been using Torch 2.0.1 and CUDA 11.7 without issues with the app itself or packager. The underlying Holoscan SDK v0.6 supports CUDA 11.8, so I will reset the CUDA on my desktop to it and test. |
@vikashg Here is the workaround/fix, confirmed by @mocsharp and me. I tried to remove CUDA 12.2 and install CUDA 11.8, but after multiple attempts I still had CUDA 12.2. @mocsharp had a similar error, though on libcudart.so.12.2. Please confirm the results once you can give it a shot. |
Hey @MMelQin, thanks for looking into it. I will test again today and post my outcome. I was a bit occupied with a couple of other things. |
@MMelQin, just a quick report on the same issue in case it provides a clue. Now with 0.6.0 I am still getting the error that @vikashg reported (shown below). I am on a CPU-only machine; the error is: "monai-deploy exec ct-seg-monai-swin-unetr/ -i ~/monai-bundles/1_axial_chest_ct.nii.gz -o output. -m ct-seg-monai-swin-unetr/best_metric_model.ts (py38b) msampat@scien-dev1:~/gears$ conda list | grep nvidia |
@mpsampat I noticed another issue in the log, the sub-command, For the real issue with not finding the CUDA 11 runtime, I cannot say for sure, as I have not tested with conda and Ubuntu 18.04. MD App SDK (and the underlying Holoscan SDK) needs Ubuntu 20.04. |
@MMelQin thank you very much for your super quick response. I see both your points. Tomorrow I will try on a VM with Ubuntu 20.04. I will also try the path issue Sameer mentioned in this thread. Maybe some environment variable is not set up correctly on my side. I will test tomorrow and report back here. |
Hi @MMelQin, So I am still getting the error. I am installing cuda-runtime-11.7 from here as
I have CUDA 12.0. I am waiting for the above trial with cuda-runtime-11.7 before trying to install CUDA 12.2. |
Hi @MMelQin, after installing CUDA 11.7 the app is working, but Docker packaging is not. I am not able to copy the text, but here is a screenshot. Initially, I thought it was because the Holoscan base image needs an nvcr.io login, so I downloaded the Holoscan image manually. But I still get this error. |
@vikashg could you tell me which versions of torch and Python you are using? @MMelQin My pyproject.toml file has this: [tool.poetry.dependencies] So it picked Python 3.11. @MMelQin do you have a preference for the Python version? I can pin that and try with a recommended Python version. |
@vikashg Sorry about this error, which is already logged as an issue. The workaround is to download the wheel file from pypi.org and then use the command-line option to pass the wheel file to the Packager. It will be fixed in the next public release of Holoscan SDK, and hence in MD App SDK. In the meantime, you can also patch the HS SDK code once it is installed in the env, as documented in the issue. |
@mpsampat My env has Python 3.8.10, monai 1.2.0, monai-deploy-app-sdk 0.6.0, torch 2.1.0, and CUDA 12.2, on Ubuntu 20.04. I noticed that in your CI/CD log the torch version is 2.1.0. I wonder whether you checked the torch version from when your CI/CD was successful. For me, a while back, torch 2.1 had issues detecting the GPU because of the unsupported GPU driver version that was installed with CUDA 11.8; completely removing CUDA and re-installing CUDA 12.2 then produced the CUDA 11 runtime lib error. Also, for MD App SDK v0.5.1, the Packager (and MAP) uses the base image nvcr.io/nvidia/pytorch:22.08-py3, which has CUDA 11.7 and PyTorch 1.13 on Ubuntu 20.04. |
It's been a while, but this issue will remain open since it has hit not just MONAI Deploy App SDK but other project(s) too, see here. In summary, MD App SDK encounters this issue because v0.6 changed to depend on Holoscan SDK v0.6, which requires the CUDA 11 runtime. MD App SDK (example applications) also depends on torch. Torch v2.0.1 depends on the CUDA 11 runtime, so when torch is installed, the CUDA 11 runtime lib gets installed with it. However, torch v2.1 changed to depend on CUDA 12, while the Holoscan SDK v0.6 package has no means of its own to install the CUDA 11 runtime. Its dependents therefore have to get CUDA 11 into the env by other means, either by having CUDA 11 installed and/or by setting LD_LIBRARY_PATH. The good news is that MD App SDK v1.0 (riding on Holoscan SDK v1.0), expected in Jan 2024, will support CUDA 12. In the meantime, one can also try pinning the torch version to 2.0.1 so that the CUDA 11 runtime still gets installed in the env; this approach, however, turned out not to be reliable, at least within GitHub CI/CD actions, as importing holoscan was again hit by the libcudart.so.11.0 not found issue. The final piece of the solution is then to export the library path in the version-specific Python site-packages, see PR#471 as well as the earlier mention with the conda env.
|
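As an illustration of the "export the library path in site-packages" step described above, here is a minimal Python sketch. It is not the snippet from PR#471. It assumes that a pip-installed CUDA runtime wheel places `libcudart.so.*` under `<site-packages>/nvidia/cuda_runtime/lib`, which may differ in your environment; the function names `find_pip_cuda_lib_dirs` and `export_line` are hypothetical helpers made up for this example.

```python
import os
import site
import sysconfig
from pathlib import Path


def find_pip_cuda_lib_dirs(roots=None):
    """Return site-packages sub-directories that contain libcudart.so*.

    Assumption: the CUDA runtime wheel uses the layout
    <site-packages>/nvidia/cuda_runtime/lib/libcudart.so.<version>.
    """
    if roots is None:
        # Cover the active environment's site-packages directories.
        roots = {sysconfig.get_paths()["purelib"], *site.getsitepackages()}
    found = []
    for root in sorted(roots):
        lib_dir = Path(root) / "nvidia" / "cuda_runtime" / "lib"
        if lib_dir.is_dir() and any(lib_dir.glob("libcudart.so*")):
            found.append(str(lib_dir))
    return found


def export_line(lib_dirs, current=""):
    """Build a shell `export` line that prepends lib_dirs to LD_LIBRARY_PATH."""
    parts = list(lib_dirs) + ([current] if current else [])
    return "export LD_LIBRARY_PATH=" + os.pathsep.join(parts)


if __name__ == "__main__":
    dirs = find_pip_cuda_lib_dirs()
    if dirs:
        print(export_line(dirs, os.environ.get("LD_LIBRARY_PATH", "")))
    else:
        print("# no pip-installed libcudart found under site-packages")
```

The script only prints the `export` line; it would need to be run in the shell (or CI step) before starting the Python process that imports holoscan, since the dynamic loader reads LD_LIBRARY_PATH at process start.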
App SDK v1.0 has just been released. In this version, the CUDA 12 runtime is required even for CPU-only applications, so the CUDA runtime shared library must be present even if the full CUDA Toolkit is not installed, and in both cases the lib path is set to the package folder.
|
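To check whether the dynamic loader can actually see a CUDA runtime library in the current environment, a small probe along the following lines may help. It is a sketch, not part of either SDK: `can_load` and `report` are hypothetical helper names, and the sonames are the ones mentioned in this thread (`libcudart.so.11.0`) plus `libcudart.so.12` for the CUDA 12 runtime.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


def report(sonames=("libcudart.so.11.0", "libcudart.so.12")):
    """Map each soname to whether it could be loaded in this process."""
    return {name: can_load(name) for name in sonames}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'found' if ok else 'NOT found'}")
```

Because LD_LIBRARY_PATH is read when the process starts, the variable has to be exported before launching Python for this probe to reflect it.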
Describe the bug
When I try to build the monai-deploy docker image, I get the following error
I have PyTorch version '2.0.1+cu117' and CUDA version 12.0. Maybe the version mismatch is causing this problem?
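For readers comparing versions as above, the following sketch shows how the local tag in a torch version string such as '2.0.1+cu117' can be read as a CUDA version. The helper names `parse_torch_build` and `same_cuda_major` are hypothetical, and the parsing assumes the tag has the form `cu` followed by the major digits and a single minor digit.

```python
import re


def parse_torch_build(version):
    """Split a torch version string like '2.0.1+cu117' into (release, cuda)."""
    release, _, local = version.partition("+")
    match = re.fullmatch(r"cu(\d+)(\d)", local)
    if match is None:
        return release, None  # e.g. '2.0.1+cpu' or no local tag at all
    return release, f"{match.group(1)}.{match.group(2)}"


def same_cuda_major(torch_version, system_cuda):
    """Compare CUDA majors; None when the torch tag carries no CUDA version."""
    _, built_for = parse_torch_build(torch_version)
    if built_for is None:
        return None
    return built_for.split(".")[0] == system_cuda.split(".")[0]


print(parse_torch_build("2.0.1+cu117"))        # ('2.0.1', '11.7')
print(same_cuda_major("2.0.1+cu117", "12.0"))  # False
```

A differing major version is only a hint: pip-installed torch wheels can bring their own CUDA runtime libraries, so a mismatch with the system toolkit does not by itself mean torch is broken.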
Steps/Code to reproduce bug
With the above version, try running the simple_imaging_app in the notebook tutorials
Environment details (please complete the following information)