🐛 [Bug] Compilation causes error: RuntimeError: [Error thrown at core/partitioning/shape_analysis.cpp:66] Expected ivalues_maps.count(input) to be true but got false Could not find torch::jit::Value* 47 produced from %47 : int = prim::dtype(%52) in lowering graph for mini graph input.
#922
This looks similar to issue #756, with fix #757. Looking at the sources, it looks like that fix may not have made it into the release being used.
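One quick way to check which build is installed (a sketch; `__version__` is the attribute exposed by the torch_tensorrt package):

```python
import torch_tensorrt

# Print the installed version to check whether a given fix/PR is included
print(torch_tensorrt.__version__)
```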
Looks like the issue is in this line:
as changing this line to:
appears to bypass the issue.
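The inline snippets referenced above weren't preserved in this thread; below is a hypothetical sketch of the kind of change being described, assuming the model queries the input's dtype at runtime (the pattern that produces the `prim::dtype` node named in the error):

```python
import torch

class Model(torch.nn.Module):
    def forward(self, a):
        # Problematic: dtype=a.dtype lowers to a prim::dtype node
        # b = torch.ones(a.shape, dtype=a.dtype, device=a.device)

        # Bypass: hard-code the dtype so no prim::dtype node is emitted
        b = torch.ones(a.shape, dtype=torch.float32, device=a.device)
        return a + b
```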
@peri044 Can you take a look at this? It looks related to your past work on dtype.
@chaoz-dev I had the same error, and your method of bypassing it is not applicable in my case.
You're probably facing a different but possibly related issue. Can you file a new bug report with the above information?
I took a look into this issue. It is caused by the resolveNonTensorInput function.
It introduces a NonTensorInput for a.device in the mini-graph, which induces resolveNonTensorInput to segment this subgraph again. This explains why it's fine when you change the line as above.
Let me see if we can refactor this function, since its current logic is a mess.
@chaoz-dev I raised a PR for this bug just now: #1024.
Sorry to re-raise this issue, but I'm still getting the same runtime error for deformable convolutions on the latest build of master (commit 91a92ca), which includes PR #1024. Expanding upon the original reproduction code above (trt_bug.py), I'm getting:
ENV info:
I am still getting this runtime error on another model; I don't believe I'm using deformable convolution here. I'm in the process of trying to clean up the source code to show, but the error is something like:
Note also that this error did not occur with float16, only with int8. This is with Torch-TensorRT v1.1.0, so PR #1024 is included.
@BrettRyland Could you please try this PR: #1067
@Hodapp87 Could you please try #1067 as well? Or could you please provide a reproducer if you still hit this issue?
I still get this error (full log: trt_bug_log.txt) with PR #1067 merged in:
Side note: the trt_bug.py script has a typo on line 93.
Another side note: I don't think it's relevant to this issue, but I get the following warning despite not having cuBLAS 11.8.0 on my system:
@BrettRyland Did you clear your cache? I could get your model to work after applying that PR.
For testing this, I used the Dockerfile straight out of the repository and then ran inside this container. Unless this caches something I'm unaware of, this should have been a clean build. Here's the truncated output of my run, which shows versions as well.
If I get a chance soon, I'll see if I can extract a simpler model out of this that I can send to try.
Clearing the cache helped (I also ran …).
I'll need to isolate where in my full model it's happening now to see what's triggering it.
OK, I've reduced my model to a smaller repro script (trt_bug.py) which still gives:
It appears to be caused by using a single-valued
where
which can also be avoided by using an explicit tensor instead of the
or
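The snippets from this comment were also lost; based on the later analysis in this thread ("he used a Tensor as an Int for indexing"), a hypothetical sketch of the kind of pattern and workaround being described:

```python
import torch

class Problematic(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical: a single-valued parameter later used as an index
        self.idx = torch.nn.Parameter(torch.tensor(1.0), requires_grad=False)

    def forward(self, x):
        # int(...) on a tensor inserts an implicit Tensor -> int (aten::Int)
        # conversion during lowering, which the partitioner can fail to resolve
        return x[int(self.idx)]

class Workaround(torch.nn.Module):
    def forward(self, x):
        return x[1]  # a plain Python int avoids the implicit conversion
```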
Hi @BrettRyland, I took a look into your model. This line: TensorRT/core/lowering/lowering.cpp, line 95 (commit 058a511)
Can I get more details? Thanks @Hodapp87
Hey @BrettRyland, did you bypass the issue?
Still working on the repro, but I just built torch_tensorrt from source (using master) to see if it helped, and it did partially solve the issue above. Unfortunately, I get a different error. Here is the stack trace for reference:
Hey @Belval, could you please try either adding this line:
Details about why this happens can be found here: #1173
Interestingly,
Any ideas? |
@Belval I was trying to reproduce the error on NGC 22.06. However, I kept getting library loading errors when trying to reproduce your error by building torch_tensorrt from source. Did I miss anything?
I am not sure that I understand your question. If you are referring to the repro package I sent you, then it could be the |
@Belval may I ask how you built torch-tensorrt from source on NGC 22.06? I tried to build the library from source but kept getting this error when I import torch_tensorrt:
From the error it could be the
In a container based on
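The build commands from this exchange weren't preserved; a sketch of steps consistent with the Dockerfile shared later in this thread (the paths, branch handling, and base image are taken from that Dockerfile and the NGC 22.06 discussion above, not confirmed for this specific comment):

```bash
# Inside a container based on nvcr.io/nvidia/pytorch:22.06-py3 (assumption)
cd /opt/tensorrt
git clone https://github.com/pytorch/TensorRT.git
cd TensorRT/py
# NGC images ship a libtorch built with the cxx11 ABI, hence the flag
python3 setup.py bdist_wheel --use-cxx11-abi
pip install dist/*.whl
```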
I just tried it and I got an exception while building. Unfortunately, I built it in a running container that I stopped a few days ago, so I can't send you the exact config, but I did something along the lines of the above. I don't remember installing additional dependencies.
Hey @Belval, I was able to reproduce your error, and the error message is in fact shown before the graph.
This explains why you are getting the missing-value error when you use the LowerGraph call at TensorRT/core/lowering/lowering.cpp, line 95 (commit 5ad9826).
This looks like what happens here: #922 (comment). In his model, this LowerGraph function also modified the graph and introduced other inputs, and it turned out to happen because he used a Tensor as an int for indexing. Could you please check why these Float inputs are introduced in your model? I took a look and found that all the introduced Float values are used here:
so in fact all the input float values are used as the second input parameter of the custom_deform_conv node. Could you please check whether there is some kind of abnormal use around that part? I don't think this kind of error comes from torch_tensorrt, since this is an internal function from torch.
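For anyone trying to verify this on their own model, a sketch of how to inspect where the graph inputs are consumed using the public TorchScript bindings (`model` is a placeholder):

```python
import torch

# Freeze the scripted model, then list each graph input and the kinds of
# nodes that consume it (e.g. the custom_deform_conv node mentioned above)
module = torch.jit.freeze(torch.jit.script(model.eval()))
for inp in module.graph.inputs():
    users = [use.user.kind() for use in inp.uses()]
    print(inp.debugName(), inp.type(), users)
```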
Hi @Belval, any update?
@Belval
then you can use torch_tensorrt.compile to compile the retrieved JIT graph. |
Sorry for not getting back to you earlier. I tried your suggestion, but it does not seem to resolve the issue:

```python
import torch
import torch_tensorrt

backbone = model.backbone.eval().cuda()
backbone = torch.jit.freeze(torch.jit.script(backbone))
with torch_tensorrt.logging.debug():
    torch_tensorrt.compile(
        backbone,
        inputs=[torch_tensorrt.Input((1, 3, 1440, 1440))],
        min_block_size=1
    )
```

I get the same error as described here. The model is in eval mode and frozen.
FYI @bowang007, I had this bug show up again in my complicated proprietary model (not entirely sure why though). Applying #1298 on top of the current |
Still getting this error after checking out the param_input branch.

Here is my Dockerfile:

```dockerfile
FROM nvcr.io/nvidia/pytorch:22.08-py3

# Install system deps
RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get upgrade -y && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata && \
    apt-get install sudo ffmpeg libsm6 libxext6 -y

# Install python deps, copy packages
## INSTALL DEPS ##
RUN apt-get remove ninja-build

# Rebuild TensorRT
RUN cd /opt/tensorrt && \
    wget https://github.com/bazelbuild/bazelisk/releases/download/v1.12.0/bazelisk-linux-amd64 && \
    chmod +x bazelisk-linux-amd64 && \
    mv bazelisk-linux-amd64 bazel && \
    git clone https://github.com/pytorch/TensorRT.git && \
    cd TensorRT && \
    git checkout param_input

ENV PATH=/opt/tensorrt:$PATH

# Modified WORKSPACE with correct paths
COPY ./WORKSPACE /opt/tensorrt/TensorRT

RUN cd /opt/tensorrt/TensorRT/py/ && python3 setup.py bdist_wheel --use-cxx11-abi

RUN cd /compiled_deps/extensions/modulated_deform_conv/ && rm -rf build && mkdir build && cd build && \
    cmake -DCMAKE_PREFIX_PATH="$(python -c 'import torch.utils; print(torch.utils.cmake_prefix_path)')" .. && make
RUN cd /compiled_deps/extensions/ms_deform_attn/ && rm -rf build && mkdir build && cd build && \
    cmake -DCMAKE_PREFIX_PATH="$(python -c 'import torch.utils; print(torch.utils.cmake_prefix_path)')" .. && make
```

My script to reproduce the issue:

```python
import torch
import torch_tensorrt

torch.ops.load_library("/compiled_deps/extensions/modulated_deform_conv/build/libcustom_deform_conv.so")

with torch.no_grad():
    with open("repro.ts", "rb") as f:
        backbone = torch.jit.load(f)
    with torch_tensorrt.logging.debug():
        torch_tensorrt.compile(
            backbone,
            inputs=[torch_tensorrt.Input((1, 3, 1440, 1440))],
            torch_executed_ops=["prim::ListConstruct"],
            min_block_size=1
        )
```

I'll send the torchscripted backbone file so that you can try it on your side. WORKSPACE file:
Hey @Belval, I was finally able to reproduce the error.
Hey @Belval, I have your model supported using this branch:
Could you try it as well?
Did you change anything else? Reusing the Dockerfile I sent (but with your branch):
No. Did you uninstall the pre-installed torch_tensorrt?
@Belval I set up a clean container and tested again; here are the detailed steps:
Then run this script:
I think doing model.eval() did the trick for me:

```python
trt_ts_module = trt.compile(traced_model, inputs=[sample_input], enabled_precisions={torch.float32})
```
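For context, a fuller sketch of that workflow (`MyModel` and the input shape are placeholders, not from the original comment):

```python
import torch
import torch_tensorrt as trt

class MyModel(torch.nn.Module):  # stand-in for the actual model
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3)

    def forward(self, x):
        return self.conv(x)

# eval() before tracing matters: training-mode behavior can leave extra
# values in the traced graph that later confuse the partitioner
model = MyModel().eval().cuda()
sample_input = torch.randn(1, 3, 224, 224, device="cuda")
traced_model = torch.jit.trace(model, sample_input)

trt_ts_module = trt.compile(
    traced_model,
    inputs=[sample_input],
    enabled_precisions={torch.float32},
)
```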
I get the same error trying to convert torchvision FCOS with a ResNet-50 backbone. Relevant issue: pytorch/vision#6200
Also seeing this on Stable Diffusion autoencoder.encode. Edit: resolved for me by updating to 1.3.0.
Bug Description
Compiling the graph throws the following error:
Looking at the output TorchScript graph, %47 is defined in a prior node; however, it does not appear to be visible in the current node.
To Reproduce
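The original reproduction script (trt_bug.py) is not preserved in this archive; a minimal sketch consistent with the error in the title (a model whose lowered graph contains a `prim::dtype` node), offered as an approximation rather than the original code:

```python
import torch
import torch_tensorrt

class Model(torch.nn.Module):
    def forward(self, a):
        # a.dtype lowers to the prim::dtype node that the partitioner
        # later fails to find among the mini-graph inputs
        return a + torch.ones(a.shape, dtype=a.dtype, device=a.device)

model = torch.jit.script(Model().eval().cuda())
torch_tensorrt.compile(model, inputs=[torch_tensorrt.Input((1, 3, 8, 8))])
```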
Compiling a model along these lines throws the following error:
Expected behavior
Compilation should not fail, and should produce the following output when run:
Environment
Ubuntu 18.04 x86-64, run with NGC 21.11-py3 and 22.02-py3.
Additional context
See output.txt for the full TorchScript graph output.