TensorRT crashes on Windows unless it is the first imported module #2853

Closed
mantaionut opened this issue Apr 5, 2023 · 12 comments
Labels: internal-bug-tracked (tracked internally, will be fixed in a future release), triaged (issue has been triaged by maintainers)

Comments

mantaionut commented Apr 5, 2023

Description

The exception mechanism in pybind11 causes a crash in TensorRT if it's not the first module imported.
If another module throws an exception, then it will cause TensorRT to crash.
This issue seems similar to onnx/onnx#3493, but I was not able to build TensorRT with debug symbols, so I can't be sure.

 	tensorrt.cp39-win_amd64.pyd!00007ffca92426a0()	Unknown
 	tensorrt.cp39-win_amd64.pyd!00007ffca92428a5()	Unknown
 	tensorrt.cp39-win_amd64.pyd!00007ffca913bf49()	Unknown
 	tensorrt.cp39-win_amd64.pyd!00007ffca9148fe7()	Unknown
 	tensorrt.cp39-win_amd64.pyd!00007ffca9149ad6()	Unknown
 	python_example.cp39-win_amd64.pyd!00007ffce8d063b8()	Unknown
 	python_example.cp39-win_amd64.pyd!00007ffce8d132f9()	Unknown
 	vcruntime140_1.dll!00007ffd4b281080()	Unknown
 	vcruntime140_1.dll!00007ffd4b2826f5()	Unknown
 	ntdll.dll!RcConsolidateFrames()	Unknown
 	python_example.cp39-win_amd64.pyd!00007ffce8d08fda()	Unknown
 	[External Code]	
>	PythonApplication1.py!<module> Line 3	Python

Environment

TensorRT Version: TensorRT-8.6.0.12
NVIDIA GPU: RTX 3060 Laptop GPU
NVIDIA Driver Version: 526.56
CUDA Version: 11.8
CUDNN Version: 8.5.0
Operating System: Windows 11
Python Version (if applicable): 3.9
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.0
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

Two modules have this issue with TensorRT. One is torch; I also created a small module, https://github.com/mantaionut/python_example, that has the same issue.
repro 1:

import torch
import tensorrt

graph = torch._C.Graph()
graph.addInput()

for i in graph.inputs():
    print(i)

print('finish')

repro 2:
git clone https://github.com/mantaionut/python_example
cd python_example
pip install .

import python_example
import tensorrt
python_example.add(3,5)
print('finish')

@mantaionut changed the title from "TensorRT crashes if not the first imported module" to "TensorRT crashes on Windows unless it is the first imported module" on Apr 5, 2023
@zerollzeng
Collaborator

@oxana-nvidia ^ ^

@zerollzeng
Collaborator

Also, @pranavm-nvidia may know more about this. I vaguely remember that this could be caused by pycuda or something similar, so the import order does matter.

@zerollzeng zerollzeng self-assigned this Apr 5, 2023
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Apr 5, 2023
@oxana-nvidia
Collaborator

We tested TensorRT 8.6 with PyTorch versions up to 1.13. Please see the release notes: https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html#rel-8-6-0-EA
I've filed internal issue 4059899 to investigate this.

For your example, you can try to replace

throw py::value_error("some exception");

with

PyErr_SetString(PyExc_ValueError, "some exception");
throw py::error_already_set();

@mantaionut
Author

This issue also happens with PyTorch 1.13.

pwuertz commented Jul 21, 2023

I'm seeing strange behavior that looks related to this.

I have a completely unrelated pybind11 module that crashes the Python process when it tries to throw an exception, but only if the tensorrt module is present, and only on Windows.

@oxana-nvidia
Collaborator

@pwuertz Are there any repro steps you can provide so we can test it on our side?

pwuertz commented Jul 21, 2023

@oxana-nvidia Yes, it's fairly simple. My pybind11 module is dioptic.profileparser. The affected version is installed via pip install dioptic.profileparser==0.1.

# Case 1: tensorrt imported first
import tensorrt
import dioptic.profileparser
dioptic.profileparser.Profile("syntax error")  # Crash

# Case 2: tensorrt imported second
import dioptic.profileparser
import tensorrt
dioptic.profileparser.Profile("syntax error")  # Crash

# Case 3: tensorrt not imported
import dioptic.profileparser
dioptic.profileparser.Profile("syntax error")  # Ok, raises `RuntimeError`
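
Because a hard crash takes the whole interpreter down, it can help to run each import order in its own child process and compare the outcomes. The following is only a minimal sketch using the standard library; the helper names `run_isolated`, `classify`, and `report` are made up for illustration and are not part of TensorRT or pybind11, and the "no traceback means hard crash" rule is a rough heuristic.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=120):
    """Run `snippet` in a fresh interpreter; return (exit_code, stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


def classify(exit_code, stderr):
    """Rough heuristic for how the child interpreter ended."""
    if exit_code == 0:
        return "ok"
    if "Traceback (most recent call last)" in stderr:
        # The interpreter survived long enough to report a normal Python error.
        return "python exception"
    # Non-zero exit without a traceback: the process most likely died abnormally.
    return "hard crash"


def report(scenarios):
    """Map each scenario name to its classification (hypothetical helper)."""
    results = {}
    for name, snippet in scenarios.items():
        code, _out, err = run_isolated(snippet)
        results[name] = classify(code, err)
    return results
```

Calling `report` with a dict that maps a label to each of the three scripts above (as strings) should, if the behavior described in this thread holds, give "hard crash" for the two cases that import tensorrt and "python exception" for the case that raises `RuntimeError`.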

@oxana-nvidia
Collaborator

Thanks for providing the repro.
I've opened internal issue 4207761 to investigate.

pwuertz commented Jul 24, 2023

Oh, it's a fundamental bug in pybind11, isn't it?
pybind/pybind11#2898

pybind11 uses a global C++ data structure for exception handling, and it is shared across all pybind11-based modules regardless of compiler or standard library version. What we are seeing is probably an ABI-induced crash or memory corruption.

pwuertz commented Jul 24, 2023

Confirmed: the problem goes away when global data sharing between multiple pybind11 modules is prevented. For some reason, pybind11 enables this sharing by default.

The workaround is to make sure that PYBIND11_INTERNALS_ID is different for all pybind11 modules. This can be achieved by pre-defining PYBIND11_COMPILER_TYPE to some module-specific ID.
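
To check from Python which internals IDs are actually in play, one option is to look for the keys that pybind11 registers at import time. This is a sketch under a stated assumption: that the loaded pybind11 versions publish their shared internals in the interpreter's builtins dict under a key named after PYBIND11_INTERNALS_ID (to my knowledge this holds for older pybind11 internals versions; newer releases may keep the state elsewhere, in which case nothing will be listed). The function name `pybind11_internals_ids` is made up for illustration.

```python
import builtins

# Assumed key prefix; the remainder of the key encodes the internals version,
# compiler type, and similar ABI tags.
PREFIX = "__pybind11_internals_"


def pybind11_internals_ids(namespace=None):
    """Return the sorted keys in `namespace` that look like pybind11 internals IDs.

    Defaults to the builtins dict. An empty result does not prove that no
    pybind11 module is loaded; it may only mean the assumption above is wrong
    for the pybind11 version in use.
    """
    ns = vars(builtins) if namespace is None else namespace
    return sorted(k for k in ns if isinstance(k, str) and k.startswith(PREFIX))
```

After importing the modules in question, a single shared key would suggest they use the same internals, while distinct keys (for example, one carrying a module-specific compiler type) would suggest they are isolated from each other.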

Here is a diff for TensorRT python/CMakeLists.txt that prevents issues with other pybind modules that haven't been patched:

diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt
index 35ae486d..4dcb775d 100644
--- a/python/CMakeLists.txt
+++ b/python/CMakeLists.txt
@@ -113,6 +113,12 @@ message(STATUS "PY_CONFIG_INCLUDE: ${PY_CONFIG_INCLUDE}")
 include_directories(${TENSORRT_ROOT}/include ${PROJECT_SOURCE_DIR}/include ${CUDA_INCLUDE_DIRS} ${PROJECT_SOURCE_DIR}/docstrings ${ONNX_INC_DIR} ${PYBIND11_DIR})
 link_directories(${TENSORRT_LIBPATH})
 
+if (MSVC)
+  # Prevent pybind11 from sharing resources with other, potentially ABI incompatible modules
+  # https://github.com/pybind/pybind11/issues/2898
+  add_definitions(-DPYBIND11_COMPILER_TYPE="_${PROJECT_NAME}_abi")
+endif()
+
 if (MSVC)
     message(STATUS "include_dirs: ${MSVC_COMPILER_DIR}/include ${MSVC_COMPILER_DIR}/../ucrt/include ${NV_WDKSDK_INC}/um ${NV_WDKSDK_INC}/shared")
     message(STATUS "link dirs: ${PY_LIB_DIR} ${NV_WDKSDK_LIB}/um/x64 ${MSVC_COMPILER_DIR}/lib/amd64 ${MSVC_COMPILER_DIR}/../ucrt/lib/x64")

@oxana-nvidia
Collaborator

Thanks for providing the solution! We will verify it and add it to the next release if there are no issues.

@zerollzeng
Collaborator

The original issue has been fixed, so I'm closing this issue.
