[RFC] Use TVMOp with GPU & Build without libcuda.so in CI #18716
Comments
@leezu Would you please take a look? Thank you!
@mxnet-label-bot add [Numpy]
Many customers use a single mxnet build that supports GPU features and deploy it to both GPU and CPU machines. Due to the way CUDA containers are designed, libcuda.so won't be present on the CPU machines. That's why it's better to dlopen(cuda) only when needed. This affects not only tvmop but also the nvrtc feature in mxnet. Using the stubs is a workaround compared to dlopen, but it adds the additional requirement of modifying LD_LIBRARY_PATH on users' CPU machines. That's not always feasible for users, and for mxnet 1.6, which introduced nvrtc, users typically just disable the nvrtc feature to be able to deploy libmxnet.so to both CPU and GPU machines. Why not fix the underlying problem first and then enable the tvmop feature?
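For illustration, a minimal sketch of the dlopen approach described above; this is not MXNet's actual implementation, and the helper name `LazyCuInit` is hypothetical:

```cpp
#include <dlfcn.h>
#include <stdexcept>

// Resolve the CUDA driver lazily instead of linking libcuda.so at build time,
// so the same libmxnet.so can be loaded on CPU-only machines where libcuda.so
// (and the stubs workaround) is absent.
using cuInit_t = int (*)(unsigned int);

int LazyCuInit(unsigned int flags) {
  static void* libcuda = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (libcuda == nullptr) {
    // CPU-only host: fail only when a GPU feature is actually requested.
    throw std::runtime_error("libcuda.so not found: CUDA driver unavailable");
  }
  static auto cu_init = reinterpret_cast<cuInit_t>(dlsym(libcuda, "cuInit"));
  return cu_init(flags);
}
```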
That doesn't make sense, as we have been running CI successfully with tvm op disabled for a couple of months. Maybe you ran into some unrelated flakiness and need to retrigger the run?
I'm fine with disabling tvm op (or marking it as experimental) for now if it does need another 4-6 weeks to fully address the underlying problem, as we have some more urgent tasks on the numpy side.
Instead of linking tvm into mxnet, can we use TVM to generate source code and test it as a custom C++ operator?
Problem 1 should have been fixed by #18818.
Problem 1: TVMOp doesn't work well with GPU builds #17840
The error message:
Root cause:
In mxnet/contrib/tvmop/compile.py, only `function_binary` (the llvm Module) is saved in `libtvmop.so`. The `imported_modules` (the cuda Module) are not saved, so TVM cannot import any GPU functions and cannot find `less_scalar_gpufloat32_2bool_2_kernel0`.
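In runtime terms the symptom looks roughly like the following sketch, which uses TVM's generic C++ module API rather than MXNet's actual TVMOpModule code (the helper name is hypothetical):

```cpp
#include <string>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>

// Sketch: if libtvmop.so was exported without its imported CUDA module,
// the GPU kernel cannot be resolved through the module's imports.
bool HasGpuKernel(const std::string& lib_path) {
  tvm::runtime::Module mod = tvm::runtime::Module::LoadFromFile(lib_path);
  tvm::runtime::PackedFunc f =
      mod.GetFunction("less_scalar_gpufloat32_2bool_2_kernel0", /*query_imports=*/true);
  return f != nullptr;  // false reproduces the "cannot find ..." failure
}
```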
Solution (Github PR):
1. Save `imported_modules[0]` to `libtvmop.cubin`.
2. Add an `Import` function (using `TVMOpModule->Import`).
3. Import the `cubin_module` into the `global_module`, as sketched below.
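A hedged C++ sketch of the import step: it uses TVM's generic runtime API (`Module::LoadFromFile` and `Module::Import`) to mirror what the PR does with `TVMOpModule->Import`; the file names follow the description above and the helper function is hypothetical:

```cpp
#include <string>
#include <tvm/runtime/module.h>

// Sketch: load the saved CUDA module (libtvmop.cubin) and attach it to the
// host module loaded from libtvmop.so, so GPU kernels become visible through
// the host module's imports.
tvm::runtime::Module LoadTvmOpWithCuda(const std::string& so_path,
                                       const std::string& cubin_path) {
  tvm::runtime::Module host_mod  = tvm::runtime::Module::LoadFromFile(so_path);
  tvm::runtime::Module cubin_mod = tvm::runtime::Module::LoadFromFile(cubin_path);
  host_mod.Import(cubin_mod);  // analogous in spirit to TVMOpModule->Import
  return host_mod;
}
```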
Problem 2: CI checks: libcuda.so does not exist on the machine that builds mxnet
The error message:
When running unix-gpu checks:
Root cause:
The unix-gpu machine that builds mxnet does not have libcuda.so.
Solution 1:
Link `libtvm.so` with `stub/libcuda.so` on the machine that builds the CI checks.
Solution 1 Pros/Cons/Workloads:
The current CI setup avoids depending on `libcuda.so` totally (it would be great if someone can elaborate the motivation behind it).
Solution 2 (Possible) (Github PR):
TVM links against libcuda.so because it invokes the CUDA driver API at runtime. However, these functions are not executed at compile time, so it is possible to remove them for compile-only purposes.
I have made a prototype to remove the linkage of libcuda.so from libtvm.so:
1. Handle the `target_link_libraries` of tvm and tvm_runtime differently. (CMakeLists.txt)
2. Add an option `CUDA_COMPILE_ONLY`; setting it to `ON` indicates “building libtvm.so without libcuda.so”. (CMakeLists.txt)
3. When `CUDA_COMPILE_ONLY` is `ON`, add the compile definition `-DCUDA_COMPILE_ONLY`. (CMakeLists.txt)
4. When `CUDA_COMPILE_ONLY` is defined (i.e. when compiling libtvm.so), ignore any cuXXX CUDA driver API functions, as sketched below. (cmake/modules/CUDA.cmake, src/runtime/cuda/cuda_common.h, src/runtime/cuda/cuda_device_api.cc, src/runtime/cuda/cuda_module.cc)
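A hedged sketch of what such a compile-only guard could look like in the CUDA runtime sources; the macro name follows the prototype above, while the wrapper function itself is hypothetical (the real changes live in the files listed in step 4):

```cpp
#include <cuda.h>
#include <stdexcept>

// Sketch: guard a cuXXX driver API call behind CUDA_COMPILE_ONLY so that
// libtvm.so can be built and linked on a machine that has no libcuda.so.
void LoadCudaModuleImage(CUmodule* module, const void* image) {
#ifdef CUDA_COMPILE_ONLY
  // Compile-only build: the driver API is unavailable; running GPU code
  // through this build is an error by construction.
  (void)module;
  (void)image;
  throw std::runtime_error("built with CUDA_COMPILE_ONLY: libcuda.so is not linked");
#else
  CUresult err = cuModuleLoadData(module, image);
  if (err != CUDA_SUCCESS) {
    throw std::runtime_error("cuModuleLoadData failed");
  }
#endif
}
```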
Solution 2 Pros/Cons/Workloads:
Pros: Does not depend on libcuda.so.
Cons: After unlinking libtvm.so from libcuda.so, the CI checks still cannot pass. The GPU CUDA RTC stage (when `-DUSE_TVM_OP=ON`) outputs the error message:
It seems that unlinking `libcuda.so` from `libtvm.so` is not enough; we should also unlink `libcuda.so` from `libtvm_runtime.so`. However, `libtvm_runtime` does require `libcuda.so` at runtime: if we remove the linkage on build instances, `tvm_runtime` is not able to run on test instances.
In order to fully address the problem, we have two options:
1. One build links `libcuda.so` for compiling the tvm operators; another does not link `libcuda.so` and will be transferred to test instances for the tvmop tests.
2. Use `dlopen("libcuda.so")` in tvm, as sketched below.
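A rough sketch of option 2, resolving the driver symbols inside TVM through dlopen/dlsym on first use (the helper names are hypothetical; TVM's actual sources would need equivalent changes in its CUDA runtime):

```cpp
#include <dlfcn.h>
#include <stdexcept>
#include <string>

// Sketch: look up cuXXX driver symbols lazily so a single libtvm_runtime.so
// links on build instances without libcuda.so and still runs on GPU test
// instances where the driver is present.
static void* CudaDriverHandle() {
  static void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    throw std::runtime_error("libcuda.so is not available on this machine");
  }
  return handle;
}

template <typename FuncT>
FuncT* LoadDriverSymbol(const char* name) {
  void* sym = dlsym(CudaDriverHandle(), name);
  if (sym == nullptr) {
    throw std::runtime_error(std::string("missing CUDA driver symbol: ") + name);
  }
  return reinterpret_cast<FuncT*>(sym);
}

// Example: route cuInit through the lazily resolved pointer.
int CallCuInit(unsigned int flags) {
  static auto* cu_init = LoadDriverSymbol<int(unsigned int)>("cuInit");
  return cu_init(flags);
}
```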
Proposal:
Given that it might take another 4-6 weeks to fully address the CI problem, we propose to:
Comments:
Also, when setting `-DUSE_TVM_OP=OFF`, the CI checks get stuck. The output of the GPU CUDA RTC stage looks like: