
[RFC] Use TVMOp with GPU & Build without libcuda.so in CI #18716

Open · jinboci opened this issue Jul 15, 2020 · 6 comments
Labels: Bug, Numpy, RFC (Post requesting for comments)

jinboci commented Jul 15, 2020

Problem 1: TVMOp doesn't work well with GPU builds #17840

The error message:

>>> import mxnet as mx
>>> x = mx.np.array([[0, 1], [1, 1], [2, 2]], ctx=mx.gpu())
>>> idx = x < 2
>>> x[idx]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/Documents/mxnet/python/mxnet/numpy/multiarray.py", line 1013, in __lt__
    return less(self, other)
  File "/home/ubuntu/Documents/mxnet/python/mxnet/numpy/multiarray.py", line 8672, in less
    return _mx_nd_np.less(x1, x2, out)
  File "/home/ubuntu/Documents/mxnet/python/mxnet/ndarray/numpy/_op.py", line 6869, in less
    return _api_internal.less(x1, x2, out)
  File "/home/ubuntu/Documents/mxnet/python/mxnet/_ffi/_ctypes/function.py", line 115, in __call__
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../3rdparty/tvm/src/runtime/module.cc", line 125
  File "../3rdparty/tvm/src/runtime/library_module.cc", line 94
TVMError: Check failed: ret == 0 (-1 vs. 0) : Check failed: f != nullptr: Cannot find function less_scalar_gpufloat32_2bool_2_kernel0 in the imported modules or global registry

Root cause:

In mxnet/contrib/tvmop/compile.py, only the host-side func_binary (an llvm Module) is saved into libtvmop.so; its imported_modules[0] (a cuda Module) is not saved. As a result, TVM cannot import any GPU functions and cannot find less_scalar_gpufloat32_2bool_2_kernel0.

(Pdb) func_binary
Module(llvm, 55d7ce519d48)
(Pdb) func_binary.imported_modules[0]
Module(cuda, 55d7c7a09818)

Solution (Github PR):

  • Save imported_modules[0] to libtvmop.cubin
  • Define an Import function (using TVMOpModule->Import)
  • Import cubin_module into global_module (a rough sketch follows the output below)
  • Outputs:
>>> import mxnet as mx
>>> x = mx.np.array([[0, 1], [1, 1], [2, 2]], ctx=mx.gpu())
[10:19:41] ../src/base.cc:80: cuDNN lib mismatch: linked-against version 7605 != compiled-against version 7501. Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[10:19:41] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
>>> idx = x < 2
>>> x[idx]
array([0., 1., 1., 1.], ctx=gpu(0))
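
For reference, a minimal sketch of the compile-time half of this fix, assuming TVM's standard Module.save and imported_modules APIs; the variable and file names below are illustrative, not the exact PR:

# Sketch only: func_binary is the Module returned by tvm.build,
# as shown in the pdb session above; output paths are illustrative.
func_binary.save("libtvmop.o")                            # host-side llvm Module (already saved today)
if func_binary.imported_modules:
    # the imported cuda Module carries the device kernels, e.g.
    # less_scalar_gpufloat32_2bool_2_kernel0, so persist it as well
    func_binary.imported_modules[0].save("libtvmop.cubin")

At load time the cubin is read back and attached to the global module (the Import step above), so GPU kernel lookups can succeed.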

Problem 2: CI checks: libcuda.so does not exist on the machine that builds mxnet

The error message:

When running unix-gpu checks:

[2020-06-18T08:26:18.355Z] Traceback (most recent call last):
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/contrib/tvmop/compile.py", line 20, in <module>
[2020-06-18T08:26:18.355Z]     import tvm
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/__init__.py", line 27, in <module>
[2020-06-18T08:26:18.355Z]     from . import tensor
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/tensor.py", line 20, in <module>
[2020-06-18T08:26:18.355Z]     from ._ffi.object import Object, register_object, ObjectGeneric, \
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/_ffi/object.py", line 24, in <module>
[2020-06-18T08:26:18.355Z]     from .base import _FFI_MODE, _RUNTIME_ONLY, check_call, _LIB, c_str
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/_ffi/base.py", line 65, in <module>
[2020-06-18T08:26:18.355Z]     _LIB, _LIB_NAME = _load_lib()
[2020-06-18T08:26:18.355Z]   File "/work/mxnet/3rdparty/tvm/python/tvm/_ffi/base.py", line 57, in _load_lib
[2020-06-18T08:26:18.355Z]     lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
[2020-06-18T08:26:18.355Z]   File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
[2020-06-18T08:26:18.355Z]     self._handle = _dlopen(self._name, mode)
[2020-06-18T08:26:18.355Z] OSError: libcuda.so.1: cannot open shared object file: No such file or directory

Root cause:

The unix-gpu machine that builds mxnet does not have libcuda.so.

Solution 1:

Link libtvm.so with stub/libcuda.so on the machine that builds CI Checks.

Solution 1 Pros/Cons/Workloads:

  • Pros: Solves the issue easily.
  • Cons: Goes against the effort to remove libcuda.so entirely (it would be great if someone could elaborate on the motivation behind that effort).
  • Workload: ~1 week

Solution 2 (Possible) (Github PR):

TVM links against libcuda.so because it invokes the CUDA driver API at runtime. These functions are not executed at compile time, however, so it is possible to drop them for a compile-only build.
I have made a prototype that removes the libcuda.so linkage from libtvm.so:

  • Set target_link_libraries differently for tvm and tvm_runtime. (CMakeLists.txt)
  • Add a variable CUDA_COMPILE_ONLY; setting it to ON means "build libtvm.so without libcuda.so". (CMakeLists.txt)
  • When CUDA_COMPILE_ONLY is ON, add the compile definition -DCUDA_COMPILE_ONLY. (CMakeLists.txt)
  • When CUDA_COMPILE_ONLY is defined (i.e. when compiling libtvm.so), skip all cuXXX CUDA driver API calls. (cmake/modules/CUDA.cmake, src/runtime/cuda/cuda_common.h, src/runtime/cuda/cuda_device_api.cc, src/runtime/cuda/cuda_module.cc)

Solution 2 Pros/Cons/Workloads:

  • Pros: No dependency on libcuda.so.

  • Cons: Even after unlinking libcuda.so from libtvm.so, the CI checks still cannot pass. The GPU CUDA RTC stage (with -DUSE_TVM_OP=ON) outputs the error message:

[2020-07-13T09:30:36.870Z] /usr/bin/ld: warning: libcuda.so.1, needed by /work/build/3rdparty/tvm/libtvm_runtime.so, not found (try using -rpath or -rpath-link)
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuMemsetD32_v2'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleLoadData'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuLaunchKernel'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleGetFunction'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleUnload'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuGetErrorName'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuDeviceGetName'
[2020-07-13T09:30:36.870Z] /work/build/3rdparty/tvm/libtvm_runtime.so: undefined reference to `cuModuleGetGlobal_v2'
[2020-07-13T09:30:36.870Z] collect2: error: ld returned 1 exit status
[2020-07-13T09:30:36.870Z] ninja: build stopped: subcommand failed.
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,154 - root - INFO - Waiting for status of container e95e5c4ca642 for 600 s.
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,377 - root - INFO - Container exit status: {'Error': None, 'StatusCode': 1}
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,377 - root - ERROR - Container exited with an error 😞
[2020-07-13T09:30:36.870Z] 2020-07-13 09:30:30,377 - root - INFO - Executed command for reproduction:
[2020-07-13T09:30:36.870Z]

It seems that unlinking libcuda.so from libtvm.so is not enough; we would also have to unlink it from libtvm_runtime.so. However, libtvm_runtime does require libcuda.so at runtime, so if we remove the linkage on the build instances, tvm_runtime cannot run on the test instances.
To fully address the problem, we have two options:

  1. Build two versions of tvm: one that links libcuda.so for compiling the tvm operators, and another that does not link libcuda.so and is transferred to the test instances for the tvmop tests.
  2. Make tvm dlopen("libcuda.so") at runtime instead of linking it (a conceptual sketch follows this list).
  • Workloads:
    • Option 1: ~1 week to modify tvm, 1.5-2 weeks to modify CI
    • Option 2: ~2 weeks to modify tvm
    • Both options require major surgery on TVM, so contributing the changes back to upstream tvm might be difficult. There is also a risk that the tvm community pushes back on the proposal. Even if they agree, it might take ~2 weeks to upstream the changes and another ~1.5 weeks to sync mxnet's tvm with the updated apache/tvm.
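
To make option 2 concrete, here is a conceptual sketch of "load libcuda.so only when it is actually needed", written in Python with ctypes for brevity; the real change would live in TVM's C++ runtime, and cuDriverGetVersion is just one driver API call picked for illustration:

import ctypes

def try_load_libcuda():
    # Open the CUDA driver lazily at runtime instead of linking it at build time.
    # On a cpu-only machine this returns None and GPU code paths stay disabled.
    try:
        return ctypes.CDLL("libcuda.so.1", mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return None

libcuda = try_load_libcuda()
if libcuda is not None:
    version = ctypes.c_int()
    libcuda.cuDriverGetVersion(ctypes.byref(version))  # driver symbols resolve only after the dlopen succeeds
    print("CUDA driver version:", version.value)
else:
    print("libcuda.so not available; continuing without GPU support")

With this pattern neither the build instance nor a cpu-only test instance needs libcuda.so at link time; the library is resolved only on machines that actually have the driver.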

Proposal:

Given that it might take another 4-6 weeks to fully address the CI problem, we propose to:

  • Submit the fix for problem 1.
  • Link stub/libcuda.so and enable one instance for testing. If this is not acceptable, can we keep the tvmop CI disabled for now, as it is not an essential component?
  • Open an issue/RFC in MXNet and TVM to track the remaining problems.

Comments:

Also, when setting -DUSE_TVM_OP=OFF, the CI checks get stuck. The output of the GPU CUDA RTC stage looks like:

[2020-07-13T18:04:12.876Z] + md5sum build/3rdparty/tvm/libtvm_runtime.so
[2020-07-13T18:04:12.876Z] md5sum: build/3rdparty/tvm/libtvm_runtime.so: No such file or directory
[2020-07-13T18:04:12.876Z] + ls -lh build/3rdparty/tvm/libtvm_runtime.so
[2020-07-13T18:04:12.876Z] ls: cannot access 'build/3rdparty/tvm/libtvm_runtime.so': No such file or directory
[2020-07-13T18:04:12.876Z] + md5sum build/libtvmop.so
[2020-07-13T18:04:12.876Z] md5sum: build/libtvmop.so: No such file or directory
[2020-07-13T18:04:12.876Z] + ls -lh build/libtvmop.so
[2020-07-13T18:04:12.876Z] ls: cannot access 'build/libtvmop.so': No such file or directory
[2020-07-13T18:04:12.876Z] + md5sum build/tvmop.conf
[2020-07-13T18:04:12.876Z] md5sum: build/tvmop.conf: No such file or directory
[2020-07-13T18:04:12.876Z] + ls -lh build/tvmop.conf
[2020-07-13T18:04:12.876Z] ls: cannot access 'build/tvmop.conf': No such file or directory
[2020-07-13T18:04:12.876Z] + md5sum build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] 66aa8c8a37ffaaa9692ae98bda88491c  build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] + ls -lh build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] -rwxr-xr-x 1 jenkins_slave jenkins_slave 34M Jul 13 18:04 build/tests/mxnet_unit_tests
[2020-07-13T18:04:12.876Z] + md5sum build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] 819a0c986ae9e233b0a9525e71c906d9  build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] + ls -lh build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] -rwxr-xr-x 1 jenkins_slave jenkins_slave 1.1M Jul 13 17:56 build/3rdparty/openmp/runtime/src/libomp.so
[2020-07-13T18:04:12.876Z] + return 0

jinboci commented Jul 15, 2020

@leezu Would you please take a look? Thank you!

@samskalicky

@mxnet-label-bot add [Numpy]


leezu commented Jul 15, 2020

Goes against the effort to remove libcuda.so entirely (it would be great if someone could elaborate on the motivation behind that effort).

Many customers use a single mxnet build that supports gpu features and deploy it to both gpu and cpu machines. Due to the way cuda containers are designed, libcuda.so won't be present on the cpu machines. That's why it's better to dlopen(cuda) only when needed. This affects not only tvmop but also the nvrtc feature in mxnet.

Using the stubs is a workaround for dlopen, but it adds the additional requirement of modifying LD_LIBRARY_PATH on users' cpu machines. That's not always feasible for users, and for mxnet 1.6, which introduced nvrtc, users typically just disable the nvrtc feature so they can deploy the same libmxnet.so to both cpu and gpu machines.

Why not fix the underlying problem first and then enable the tvmop feature?

Also, when setting -DUSE_TVM_OP=OFF, the CI checks get stuck.

That doesn't make sense, as we have been running CI successfully with tvm op disabled for a couple of months now. Maybe you ran into some unrelated flakiness and need to retrigger the run?


yzhliu commented Jul 15, 2020

I'm fine with disabling tvm op (or marking it as experimental) for now if it really does need another 4-6 weeks to fully address the underlying problem, as we have more urgent tasks on the numpy side.


szha commented Jul 16, 2020

Instead of linking tvm into mxnet, can we use TVM to generate source code and test it as a custom C++ operator?
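
For context, a small sketch of what that could look like, assuming the tensor-expression API of the vendored 3rdparty/tvm and a CUDA toolchain on the build machine; the toy add_one kernel is purely illustrative, not an mxnet operator:

# Illustrative only: generate CUDA source from a toy TVM schedule.
import tvm

n = tvm.var("n")
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] + 1.0, name="B")
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))
mod = tvm.build(s, [A, B], target="cuda", name="add_one")
# Dump the generated CUDA C source; this text could be compiled separately
# as part of a custom operator instead of linking libtvm into libmxnet.
print(mod.imported_modules[0].get_source())

The generated kernels would still need a small host-side launcher, so this only covers the code-generation half of the suggestion.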


yzhliu commented Aug 24, 2020

Problem 1 should have been fixed by #18818
