Restoring TVMOp tests #18542
base: master
Hey @jinboci , Thanks for submitting the PR
CI supported jobs: [clang, centos-cpu, miscellaneous, sanity, windows-gpu, windows-cpu, unix-gpu, centos-gpu, unix-cpu, website, edge] Note:
@mxnet-bot run ci [unix-cpu, unix-gpu]
Jenkins CI successfully triggered : [unix-gpu, unix-cpu]
@mxnet-bot run ci [centos-cpu]
Jenkins CI successfully triggered : [centos-cpu]
You need to investigate why libcuda is not found in the container. Previously there was a hack of putting /usr/local/cuda/compat on the path, but that may not be the correct solution. AFAIK libcuda will be provided by https://github.com/NVIDIA/nvidia-docker/ inside the container based on the host system libcuda, typically only on a host system with GPUs.
@leezu Just checking whether my understanding is correct: libcuda.so exists on the hosts that build mxnet but not on the hosts that run the tests, while libcudart.so exists on both. Is that correct?
@yzhliu It should be the other way round. Let's open the CI Docker container:
Because we don't use the nvidia docker command to run the container, only
The problem is that some part of the tvmop setup currently requires
You can also refer to NVIDIA/nvidia-container-toolkit#185 for a little background.
The problem with the
@yzhliu @leezu Thank you for your suggestions. I tried to directly disable the linkage of
diff --git a/cmake/modules/CUDA.cmake b/cmake/modules/CUDA.cmake
index 936bb681b..32d13de38 100644
--- a/cmake/modules/CUDA.cmake
+++ b/cmake/modules/CUDA.cmake
@@ -35,7 +35,7 @@ if(USE_CUDA)
list(APPEND TVM_LINKER_LIBS ${CUDA_NVRTC_LIBRARY})
list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_CUDART_LIBRARY})
- list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_CUDA_LIBRARY})
+ #list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_CUDA_LIBRARY})
list(APPEND TVM_RUNTIME_LINKER_LIBS ${CUDA_NVRTC_LIBRARY})
if(USE_CUDNN)
diff --git a/cmake/util/FindCUDA.cmake b/cmake/util/FindCUDA.cmake
index f971c87f2..5e2118148 100644
--- a/cmake/util/FindCUDA.cmake
+++ b/cmake/util/FindCUDA.cmake
@@ -58,9 +58,9 @@ macro(find_cuda use_cuda)
# additional libraries
if(CUDA_FOUND)
if(MSVC)
- find_library(CUDA_CUDA_LIBRARY cuda
- ${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
- ${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
+ #find_library(CUDA_CUDA_LIBRARY cudart
+ #${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
+ #${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
find_library(CUDA_NVRTC_LIBRARY nvrtc
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
@@ -74,13 +74,13 @@ macro(find_cuda use_cuda)
${CUDA_TOOLKIT_ROOT_DIR}/lib/x64
${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32)
else(MSVC)
- find_library(_CUDA_CUDA_LIBRARY cuda
- PATHS ${CUDA_TOOLKIT_ROOT_DIR}
- PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs
- NO_DEFAULT_PATH)
- if(_CUDA_CUDA_LIBRARY)
- set(CUDA_CUDA_LIBRARY ${_CUDA_CUDA_LIBRARY})
- endif()
+ #find_library(_CUDA_CUDA_LIBRARY cudart
+ #PATHS ${CUDA_TOOLKIT_ROOT_DIR}
+ #PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs
+ #NO_DEFAULT_PATH)
+ #if(_CUDA_CUDA_LIBRARY)
+ #set(CUDA_CUDA_LIBRARY ${_CUDA_CUDA_LIBRARY})
+ #endif()
find_library(CUDA_NVRTC_LIBRARY nvrtc
PATHS ${CUDA_TOOLKIT_ROOT_DIR}
PATH_SUFFIXES lib lib64 targets/x86_64-linux/lib targets/x86_64-linux/lib/stubs lib64/stubs lib/x86_64-linux-gnu
However, I got errors while building tvm:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/Documents/tvm/python/tvm/__init__.py", line 25, in <module>
from ._ffi.base import TVMError, __version__
File "/home/ubuntu/Documents/tvm/python/tvm/_ffi/__init__.py", line 28, in <module>
from .base import register_error
File "/home/ubuntu/Documents/tvm/python/tvm/_ffi/base.py", line 62, in <module>
_LIB, _LIB_NAME = _load_lib()
File "/home/ubuntu/Documents/tvm/python/tvm/_ffi/base.py", line 50, in _load_lib
lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
File "/home/ubuntu/anaconda3/lib/python3.7/ctypes/__init__.py", line 364, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /home/ubuntu/Documents/tvm/build/libtvm.so: undefined symbol: cuLaunchKernel
It seems that
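As background on the undefined symbol above: cuLaunchKernel belongs to the CUDA driver API, which is exported by libcuda, whereas libcudart provides the runtime API. The following is a minimal Python sketch for checking which library exports a given symbol on a particular machine. The helper name exports_symbol and the library file names are illustrative assumptions, and the printed results depend entirely on the local CUDA installation.

```python
import ctypes


def exports_symbol(lib_name, symbol):
    """Return True if lib_name can be loaded and exports symbol."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        # The library is missing or cannot be loaded on this machine.
        return False
    # ctypes looks the symbol up on attribute access and raises
    # AttributeError when it is absent, which hasattr turns into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Library file names are assumptions; adjust them to the local install.
    for name in ("libcuda.so.1", "libcudart.so.10.0"):
        print(name, exports_symbol(name, "cuLaunchKernel"))
```

On a machine without the NVIDIA driver, both lines would be expected to print False, since neither library can provide the symbol there.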
@jinboci would it be possible to
@leezu @yzhliu Hi, I am still unclear about:
I compiled mxnet with USE_TVM_OP set to OFF and with USE_CUDA and USE_CUDNN set to ON, and got:
(base) ubuntu@ip-172-31-37-194:~/Documents/mxnet/build$ ldd libmxnet.so
linux-vdso.so.1 (0x00007ffda2ae3000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f68de615000)
libopenblas.so.0 => /usr/local/lib/libopenblas.so.0 (0x00007f68dd688000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f68dd480000)
libomp.so => /home/ubuntu/Documents/mxnet/build/3rdparty/openmp/runtime/src/libomp.so (0x00007f68dd19a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f68dcf7b000)
libcudnn.so.7 => /usr/local/cuda/lib64/libcudnn.so.7 (0x00007f68c795c000)
libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f68c6774000)
libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f68c614e000)
libnccl.so.2 => /usr/local/cuda/lib/libnccl.so.2 (0x00007f68bf6fa000)
libopencv_imgcodecs.so.4.2 => /usr/local/lib/libopencv_imgcodecs.so.4.2 (0x00007f68bed0d000)
libopencv_imgproc.so.4.2 => /usr/local/lib/libopencv_imgproc.so.4.2 (0x00007f68bd409000)
libopencv_core.so.4.2 => /usr/local/lib/libopencv_core.so.4.2 (0x00007f68bc124000)
libcudart.so.10.0 => /usr/local/cuda/lib64/libcudart.so.10.0 (0x00007f68bbeaa000)
libcufft.so.10.0 => /usr/local/cuda/lib64/libcufft.so.10.0 (0x00007f68b59f6000)
libcublas.so.10.0 => /usr/local/cuda/lib64/libcublas.so.10.0 (0x00007f68b1460000)
libcusolver.so.10.0 => /usr/local/cuda/lib64/libcusolver.so.10.0 (0x00007f68a8d79000)
libcurand.so.10.0 => /usr/local/cuda/lib64/libcurand.so.10.0 (0x00007f68a4c12000)
libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007f68a35f6000)
libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f68a33ed000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f68a3064000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f68a2cc6000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f68a2aae000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f68a26bd000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6905ee6000)
libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007f68a22de000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f68a20af000)
libjpeg.so.8 => /usr/lib/x86_64-linux-gnu/libjpeg.so.8 (0x00007f68a1e47000)
libpng16.so.16 => /usr/lib/x86_64-linux-gnu/libpng16.so.16 (0x00007f68a1c15000)
libtiff.so.5 => /usr/lib/x86_64-linux-gnu/libtiff.so.5 (0x00007f68a199e000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f68a1781000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f68a1541000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f68a131b000)
libjbig.so.0 => /usr/lib/x86_64-linux-gnu/libjbig.so.0 (0x00007f68a110d000)
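The listing above shows libmxnet.so itself resolving libcuda.so.1, even with USE_TVM_OP turned off. To pick out such entries from a long ldd listing, a small parser can help. This is only a sketch: the function name parse_ldd is hypothetical, and the sample text is copied from three lines of the output above.

```python
def parse_ldd(output):
    """Map each library name in `ldd` output to its resolved path, or None."""
    deps = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            name, _, rest = line.partition("=>")
            rest = rest.strip()
            if rest.startswith("not found") or rest.startswith("("):
                # Unresolved library, or an entry that has only a load address.
                path = None
            else:
                # Drop the trailing "(0x...)" load address.
                path = rest.split(" (")[0]
            deps[name.strip()] = path
        else:
            # Entries without "=>" (vdso, the dynamic loader) carry only a name.
            deps[line.split(" (")[0]] = None
    return deps


SAMPLE = """\
libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f68c6774000)
libcudart.so.10.0 => /usr/local/cuda/lib64/libcudart.so.10.0 (0x00007f68bbeaa000)
libnvrtc.so.10.0 => /usr/local/cuda/lib64/libnvrtc.so.10.0 (0x00007f68a35f6000)
"""

if __name__ == "__main__":
    for name, path in parse_ldd(SAMPLE).items():
        print(name, "->", path)
```

Running the same parser on the ldd output of libtvm.so would show whether libcuda is a direct dependency of that library as well.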
@leezu I set some breakpoints. I am not sure if this is okay. By only building TVM:
>>> import tvm
> /home/ubuntu/anaconda3/lib/python3.7/ctypes/__init__.py(365)__init__()
-> self._handle = _dlopen(self._name, mode)
(Pdb) c
> /home/ubuntu/Documents/tvm/python/tvm/_ffi/base.py(51)_load_lib()
-> lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
(Pdb) c
> /home/ubuntu/anaconda3/lib/python3.7/ctypes/__init__.py(365)__init__()
-> self._handle = _dlopen(self._name, mode)
(Pdb) _dlopen("libcuda.so")
94685554591904
@jinboci for the
You need to check if the error is due to
@leezu Is mxnet built without NVRTC in CI?
@yzhliu NVRTC is enabled by default and thus built by the CI unless disabled: https://github.com/apache/incubator-mxnet/blob/497bf7efb403a9174817f07ab3d2f9be033845ad/CMakeLists.txt#L82 If libmxnet's dependency is causing the issue, we can just disable this flag in the TVMOp builds until libmxnet.so is fixed. Based on the error logs posted in this issue, though, I'm not sure whether the error is due to libtvm or libmxnet.
Description
Restoring TVMOp tests. #18204 #18526 #17840