
Draft: Restoring TVMOp tests #18542

wants to merge 14 commits into master

Conversation

@jinboci (Contributor) commented Jun 12, 2020

Description

Restoring the TVMOp tests. See #18204, #18526, #17840.


@mxnet-bot

Hey @jinboci, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more of them again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, centos-cpu, miscellaneous, sanity, windows-gpu, windows-cpu, unix-gpu, centos-gpu, unix-cpu, website, edge]


Note:
Only the following 3 categories of users can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@jinboci (Contributor, Author) commented Jun 12, 2020

@mxnet-bot run ci [unix-cpu, unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered: [unix-gpu, unix-cpu]

@jinboci (Contributor, Author) commented Jun 12, 2020

@mxnet-bot run ci [centos-cpu]

@mxnet-bot

Jenkins CI successfully triggered: [centos-cpu]

@leezu (Contributor) commented Jun 12, 2020

You need to investigate why libcuda is not found in the container. Previously there was a hack of putting /usr/local/cuda/compat on the path, but that may not be the correct solution. AFAIK libcuda is provided by https://github.com/NVIDIA/nvidia-docker/ inside the container, based on the host system's libcuda, and typically only on a host system with GPUs.
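
As a quick sanity check, here is a minimal sketch (assuming a standard Linux dynamic loader inside the container) of probing whether libcuda is resolvable at all:

# Minimal probe: can the dynamic loader resolve the CUDA driver library?
# Inside a plain `docker run` on a CPU host this is expected to fail;
# under the NVIDIA runtime on a GPU host it should succeed.
import ctypes

try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 resolved (driver injected by the NVIDIA runtime)")
except OSError as err:
    print("libcuda.so.1 not resolvable:", err)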

@yzhliu (Member) commented Jun 18, 2020

@leezu Just checking whether my understanding is correct: libcuda.so exists on the hosts that build mxnet but not on the hosts that run the tests, while libcudart.so exists on both. Is that correct?

@leezu (Contributor) commented Jun 18, 2020

@yzhliu It should be the other way round. Let's open the CI Docker container: docker run -it mxnetci/build.ubuntu_gpu_cu102 /bin/bash and look at the shared libraries in /usr/local/cuda:

root@de49f0e1966c:/work/mxnet# find /usr/local/cuda-10.2 -name "*.so*"
/usr/local/cuda-10.2/compat/libnvidia-ptxjitcompiler.so.440.33.01
/usr/local/cuda-10.2/compat/libcuda.so
/usr/local/cuda-10.2/compat/libcuda.so.1
/usr/local/cuda-10.2/compat/libcuda.so.440.33.01
/usr/local/cuda-10.2/compat/libnvidia-fatbinaryloader.so.440.33.01
/usr/local/cuda-10.2/compat/libnvidia-ptxjitcompiler.so
/usr/local/cuda-10.2/compat/libnvidia-ptxjitcompiler.so.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcupti.so.10.2
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppim.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppicc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcurand.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnpps.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppial.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libOpenCL.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvrtc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppist.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcuinj64.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libOpenCL.so.1.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppig.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppidei.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusolver.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libaccinj64.so.10.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppicom.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libaccinj64.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libOpenCL.so.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppif.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcufftw.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libaccinj64.so.10.2
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusolverMg.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcuinj64.so.10.2
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcupti.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusparse.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvgraph.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppim.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppicc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libcurand.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnpps.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppial.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnvrtc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppist.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppig.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppidei.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libcusolver.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppicom.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppif.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libcufftw.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libcusolverMg.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libcusparse.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnvgraph.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnvjpeg.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppisu.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libnppitc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/stubs/libcufft.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvjpeg.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcuinj64.so.10.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvperf_target.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppisu.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppitc.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcupti.so.10.2.75
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvperf_host.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcufft.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvToolsExt.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppisu.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppist.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvjpeg.so.10.3.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppitc.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusparse.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcurand.so.10.1.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvrtc.so.10.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppif.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnpps.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppc.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppial.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnpps.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppidei.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppc.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppicom.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvrtc-builtins.so.10.2
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvToolsExt.so.1
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcufftw.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusolverMg.so.10.3.0.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcufftw.so.10.1.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppicc.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppicc.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcurand.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppicom.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusolverMg.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppial.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppist.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusparse.so.10.3.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvrtc-builtins.so.10.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusolver.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvgraph.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppim.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvToolsExt.so.1.0.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvrtc-builtins.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcufft.so.10.1.2.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppidei.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcufft.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppig.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvjpeg.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvrtc.so.10.2
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppig.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcusolver.so.10.3.0.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppif.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppisu.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppitc.so.10
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnppim.so.10.2.1.89
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libnvgraph.so.10.2.89
/usr/local/cuda-10.2/nvvm/lib64/libnvvm.so.3.3.0
/usr/local/cuda-10.2/nvvm/lib64/libnvvm.so
/usr/local/cuda-10.2/nvvm/lib64/libnvvm.so.3
/usr/local/cuda-10.2/nvvmx/lib64/libnvvm.so.3.3.0
/usr/local/cuda-10.2/nvvmx/lib64/libnvvm.so
/usr/local/cuda-10.2/nvvmx/lib64/libnvvm.so.3
/usr/local/cuda-10.2/extras/Sanitizer/libsanitizer-public.so

Because we don't use the nvidia-docker runtime to run the container, only stubs/libcuda.so is available. On a host with GPUs we can use docker run --gpus all -it mxnetci/build.ubuntu_gpu_cu102 /bin/bash, and the host's libcuda.so as well as the host GPUs will be available inside the container. But on a CPU host this just leads to:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled

The problem is that some part of the tvmop setup currently requires libcuda.so to be available (it is listed as a shared-library dependency of some shared library that gets opened). We need to check which library is introducing the dependency and consider how to fix it. Ideally there should be no dependency on libcuda.so, as it is only available on GPU hosts.

You can also refer to NVIDIA/nvidia-container-toolkit#185 for a little background. The problem with compat/libcuda.so, AFAIK, is that it does not necessarily match the driver version of the host system.
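
One way to locate the culprit (a sketch; the paths are illustrative, not the actual CI layout) is to inspect each built library's direct DT_NEEDED entries, since ldd would also report transitive dependencies:

# Walk a build tree and report every shared library whose direct
# (DT_NEEDED) dependencies mention libcuda. Requires binutils' readelf.
import subprocess
from pathlib import Path

for lib in Path("build").rglob("*.so*"):
    result = subprocess.run(["readelf", "-d", str(lib)],
                            capture_output=True, text=True)
    needed = [line.strip() for line in result.stdout.splitlines()
              if "NEEDED" in line and "libcuda" in line]
    if needed:
        print(lib, "->", *needed)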

@jinboci (Contributor, Author) commented Jun 18, 2020

@yzhliu @leezu Thank you for your suggestions. I tried to disable linking libcuda.so directly with the following patch:

diff --git a/cmake/modules/CUDA.cmake b/cmake/modules/CUDA.cmake
index 936bb681b..32d13de38 100644
--- a/cmake/modules/CUDA.cmake
+++ b/cmake/modules/CUDA.cmake
@@ -35,7 +35,7 @@ if(USE_CUDA)
 
   list(APPEND TVM_LINKER_LIBS ${CUDA_NVRTC_LIBRARY})
   list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_CUDART_LIBRARY})
-  list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_CUDA_LIBRARY})
+  #list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_CUDA_LIBRARY})
   list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_NVRTC_LIBRARY})
 
   if(USE_CUDNN)
diff --git a/cmake/util/FindCUDA.cmake b/cmake/util/FindCUDA.cmake
index f971c87f2..5e2118148 100644
--- a/cmake/util/FindCUDA.cmake
+++ b/cmake/util/FindCUDA.cmake
@@ -58,9 +58,9 @@ macro(find_cuda use_cuda)
   # additional libraries
   if(CUDA_FOUND)
     if(MSVC)
-      find_library(CUDA_CUDA_LIBRARY cuda
-        ${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
-        ${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
+      #find_library(CUDA_CUDA_LIBRARY cudart
+        #${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
+        #${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
       find_library(CUDA_NVRTC_LIBRARY nvrtc
         ${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
         ${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
@@ -74,13 +74,13 @@ macro(find_cuda use_cuda)
         ${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
         ${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
     else(MSVC)
-      find_library(_CUDA_CUDA_LIBRARY cuda
-        PATHS ${CUDA_TOOLKIT_ROOT_DIR}
-        PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs
-        NO_DEFAULT_PATH)
-      if(_CUDA_CUDA_LIBRARY)
-        set(CUDA_CUDA_LIBRARY ${_CUDA_CUDA_LIBRARY})
-      endif()
+      #find_library(_CUDA_CUDA_LIBRARY cudart
+        #PATHS ${CUDA_TOOLKIT_ROOT_DIR}
+        #PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs
+        #NO_DEFAULT_PATH)
+      #if(_CUDA_CUDA_LIBRARY)
+        #set(CUDA_CUDA_LIBRARY ${_CUDA_CUDA_LIBRARY})
+      #endif()
       find_library(CUDA_NVRTC_LIBRARY nvrtc
         PATHS ${CUDA_TOOLKIT_ROOT_DIR}
         PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs lib/x86_64-linux-gnu

However, I got errors while building TVM:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/Documents/tvm/python/tvm/__init__.py", line 25, in <module>
    from ._ffi.base import TVMError, __version__
  File "/home/ubuntu/Documents/tvm/python/tvm/_ffi/__init__.py", line 28, in <module>
    from .base import register_error
  File "/home/ubuntu/Documents/tvm/python/tvm/_ffi/base.py", line 62, in <module>
    _LIB, _LIB_NAME = _load_lib()
  File "/home/ubuntu/Documents/tvm/python/tvm/_ffi/base.py", line 50, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
  File "/home/ubuntu/anaconda3/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/ubuntu/Documents/tvm/build/libtvm.so: undefined symbol: cuLaunchKernel

It seems that cuLaunchKernel is a function that comes from libcuda.so (though I am not sure). How could we call this function without linking against libcuda.so?

@leezu (Contributor) commented Jun 18, 2020

@jinboci Would it be possible to dlopen libcuda at runtime?
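
A minimal sketch of that idea in Python (cuLaunchKernel is the real CUDA driver entry point, but the lazy-loading wrapper here is purely illustrative): resolve the driver library the first time a launch is needed, so CPU-only hosts never touch libcuda:

# Illustrative lazy dlopen: the CUDA driver library is only opened the
# first time a driver symbol is actually needed, so merely loading this
# module on a CPU-only host does not fail.
import ctypes

_libcuda = None

def _driver():
    global _libcuda
    if _libcuda is None:
        _libcuda = ctypes.CDLL("libcuda.so.1")  # raises OSError if no driver
    return _libcuda

def launch_kernel(*args):
    # Resolved at call time instead of at link/load time.
    return _driver().cuLaunchKernel(*args)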

@jinboci (Contributor, Author) commented Jun 18, 2020

@leezu @yzhliu Hi, I am still unclear about:

  1. Does the machine in CI that builds mxnet provide libcuda.so?
  2. When USE_TVM_OP is OFF, does building mxnet still depend on libcuda.so?

I compiled mxnet with USE_TVM_OP=OFF and USE_CUDA/USE_CUDNN=ON, and got:

(base) ubuntu@ip-172-31-37-194:~/Documents/mxnet/build$ ldd libmxnet.so
        linux-vdso.so.1 (0x00007ffda2ae3000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f68de615000)
        libopenblas.so.0 => /usr/local/lib/libopenblas.so.0 (0x00007f68dd688000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f68dd480000)
        libomp.so => /home/ubuntu/Documents/mxnet/build/3rdparty/openmp/runtime/src/libomp.so (0x00007f68dd19a000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f68dcf7b000)
        libcudnn.so.7 => /usr/local/cuda/lib64/libcudnn.so.7 (0x00007f68c795c000)
        libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f68c6774000)
        libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f68c614e000)
        libnccl.so.2 => /usr/local/cuda/lib/libnccl.so.2 (0x00007f68bf6fa000)
        libopencv_imgcodecs.so.4.2 => /usr/local/lib/libopencv_imgcodecs.so.4.2 (0x00007f68bed0d000)
        libopencv_imgproc.so.4.2 => /usr/local/lib/libopencv_imgproc.so.4.2 (0x00007f68bd409000)
        libopencv_core.so.4.2 => /usr/local/lib/libopencv_core.so.4.2 (0x00007f68bc124000)
        libcudart.so.10.0 => /usr/local/cuda/lib64/libcudart.so.10.0 (0x00007f68bbeaa000)
        libcufft.so.10.0 => /usr/local/cuda/lib64/libcufft.so.10.0 (0x00007f68b59f6000)
        libcublas.so.10.0 => /usr/local/cuda/lib64/libcublas.so.10.0 (0x00007f68b1460000)
        libcusolver.so.10.0 => /usr/local/cuda/lib64/libcusolver.so.10.0 (0x00007f68a8d79000)
        libcurand.so.10.0 => /usr/local/cuda/lib64/libcurand.so.10.0 (0x00007f68a4c12000)
        libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007f68a35f6000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f68a33ed000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f68a3064000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f68a2cc6000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f68a2aae000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f68a26bd000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6905ee6000)
        libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007f68a22de000)
        libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f68a20af000)
        libjpeg.so.8 => /usr/lib/x86_64-linux-gnu/libjpeg.so.8 (0x00007f68a1e47000)
        libpng16.so.16 => /usr/lib/x86_64-linux-gnu/libpng16.so.16 (0x00007f68a1c15000)
        libtiff.so.5 => /usr/lib/x86_64-linux-gnu/libtiff.so.5 (0x00007f68a199e000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f68a1781000)
        libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f68a1541000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f68a131b000)
        libjbig.so.0 => /usr/lib/x86_64-linux-gnu/libjbig.so.0 (0x00007f68a110d000)

@jinboci (Contributor, Author) commented Jun 18, 2020

@leezu I set some breakpoints (I am not sure if this is the right approach). Building only TVM:

>>> import tvm
> /home/ubuntu/anaconda3/lib/python3.7/ctypes/__init__.py(365)__init__()
-> self._handle = _dlopen(self._name, mode)
(Pdb) c
> /home/ubuntu/Documents/tvm/python/tvm/_ffi/base.py(51)_load_lib()
-> lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
(Pdb) c
> /home/ubuntu/anaconda3/lib/python3.7/ctypes/__init__.py(365)__init__()
-> self._handle = _dlopen(self._name, mode)
(Pdb) _dlopen("libcuda.so")
94685554591904

@leezu (Contributor) commented Jun 18, 2020

@jinboci As for libmxnet.so: it currently has a libcuda dependency when compiled with nvrtc. This will be fixed eventually (#17858), but if it blocks the TVMOp tests, I suggest you simply disable the nvrtc feature in the TVMOp builds. Then the dependency on libcuda.so.1 in libmxnet.so will disappear.

You need to check if the error is due to libmxnet.so or libtvm.so. Once you have identified the cause, the next step is to look into fixing it.
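
As a quick way to tell the two apart (a sketch; the paths are placeholders for wherever the build puts the artifacts), one can try dlopen-ing each library and see which fails to resolve its symbols:

# Try to load each candidate library; the one whose dlopen fails (e.g. with
# an undefined cuLaunchKernel) is the one introducing the libcuda dependency.
import ctypes

for lib in ("build/libmxnet.so", "3rdparty/tvm/build/libtvm.so"):
    try:
        ctypes.CDLL(lib, ctypes.RTLD_GLOBAL)
        print(lib, "loads cleanly")
    except OSError as err:
        print(lib, "failed to load:", err)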

@yzhliu (Member) commented Jun 19, 2020

@leezu In CI, is mxnet built without nvrtc?

@leezu (Contributor) commented Jun 19, 2020

@yzhliu NVRTC is enabled by default and thus built by the CI unless disabled: https://github.com/apache/incubator-mxnet/blob/497bf7efb403a9174817f07ab3d2f9be033845ad/CMakeLists.txt#L82

If libmxnet's dependency is causing the issue, we can just disable this flag in the TVMOp builds until libmxnet.so is fixed. Based on the error logs posted in this issue, though, I'm not sure if the error is due to libtvm or libmxnet.
