Export with ONNX Simplifier with --grid error #2558

Closed
antlamon opened this issue Mar 22, 2021 · 32 comments · Fixed by #2856 or #2982
Labels: bug (Something isn't working), Stale (Stale and schedule for closing soon)

Comments

@antlamon

🐛 Bug

A model exported to ONNX with the --grid parameter cannot be loaded by onnxruntime or simplified by onnx-simplifier.
A Mul node triggers the shape inference error "Incompatible dimensions".

To Reproduce

Replace the ONNX export block in export.py with this code and run python3 models/export.py --grid:

try:
    import onnx
    from onnxsim import simplify
    print('\nStarting ONNX export with onnx %s...' % onnx.__version__)
    f = opt.weights.replace('.pt', '.onnx')  # filename
    torch.onnx.export(model, img, f, verbose=False, opset_version=12, input_names=['images'],
                      output_names=['classes',
                                    'boxes'] if y is None else ['output'],
                      dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # size(1,3,640,640)
                                    'output': {0: 'batch', 2: 'y', 3: 'x'}} if opt.dynamic else None)

    # Checks
    onnx_model = onnx.load(f)  # load onnx model
    onnx.checker.check_model(onnx_model)  # check onnx model

    # This step triggers the error
    model_simp, check = simplify(onnx_model)
    onnx.save(model_simp, f)

    # print(onnx.helper.printable_graph(onnx_model.graph))  # print a human readable model
    print('ONNX export success, saved as %s' % f)
except Exception as e:
    print('ONNX export failure: %s' % e)

Output:

Starting ONNX export with onnx 1.8.1...
ONNX export failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions

Expected behavior

Any yolov5 model exported as ONNX should be valid

@antlamon antlamon added the bug Something isn't working label Mar 22, 2021
@github-actions
Contributor

github-actions bot commented Mar 22, 2021

👋 Hello @antlamon, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at [email protected].

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

@antlamon thanks for the bug report. We don't generally provide support for code customizations or external packages not in requirements.txt.

If an external package is causing an error you may also want to raise an issue with the package authors.

@glenn-jocher glenn-jocher changed the title Exported model with grid as ONNX cannot be used Export with ONNX Simplifier with --grid error Mar 23, 2021
@tommy2is

tommy2is commented Mar 25, 2021

I would like to add that, without any modifications to export.py, the --grid option also results in an unusable .onnx file when run on the yolov5s.pt model. It works fine without the --grid option.

While the script was running, the following warning was produced (by TorchScript, not ONNX, though):

Starting TorchScript export with torch 1.8.0+cu101...
./models/yolo.py:48: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
/usr/local/lib/python3.7/dist-packages/torch/jit/_trace.py:940: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
  _force_outplace,
TorchScript export success, saved as yolov5s.torchscript.pt

Attempting to run an inference session results in:

Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from yolov5s.onnx failed:Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions

@Lucashsmello

This error occurred for me when exporting the ONNX model using torch==1.8.1 with torchvision==0.9.1. When I export using torch==1.7.1, the ONNX model loads fine in both torch==1.7.1 and torch==1.8.1.

@tommy2is

Thank you for pointing that out. That indeed was the issue

@thestonehead

Will there be a fix, since torchvision==0.8.2 (required by torch 1.7.1) doesn't exist for Windows?

@TheanMS

TheanMS commented Apr 1, 2021

I am also getting the same error; downgrading the torch and torchvision versions didn't fix it for me.

@timstokman
Contributor

When I downgrade the pytorch version and export with --dynamic --grid, I can load the model, but it fails when doing inference on a (1, 3, 1088, 1920) tensor with this:

2021-04-15 20:26:39.079097920 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running Add node. Name:'Add_945' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60

Traceback (most recent call last):
  File "test_onnx.py", line 42, in <module>
    outputs = sess.run(None, {input_name: image})
  File "venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_945' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60

It does work with just --dynamic. I've seen some approaches where people re-implement the last layer with onnx. I guess that's probably the best approach for now.
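
The "Attempting to broadcast an axis by a dimension other than 1" message matches the NumPy-style broadcasting rule that ONNX elementwise operators such as Add and Mul follow: shapes are aligned from the trailing axis, and each pair of dimensions must be equal or contain a 1. A minimal, framework-free sketch of that rule is below. The function name `broadcast_shape` and the example shapes are illustrative assumptions, not values read from the exported graph.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result of two shapes, or raise ValueError.

    Implements the NumPy-style rule: align shapes from the last axis,
    treat missing leading axes as 1, and require each pair of dimensions
    to be equal or to contain a 1.
    """
    result = []
    for dim_a, dim_b in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"cannot broadcast {dim_a} by {dim_b}")
    return tuple(reversed(result))


# Illustrative shapes only: a per-anchor constant broadcasts over a 34x60 grid...
print(broadcast_shape((1, 3, 34, 60, 2), (1, 3, 1, 1, 2)))

# ...but two tensors that disagree on a non-1 axis do not.
try:
    broadcast_shape((1, 3, 34, 60, 2), (1, 3, 60, 60, 2))
except ValueError as err:
    print("incompatible:", err)
```

A check like this can help narrow down which operand of a failing node carries the stale or unexpected dimension once the shapes have been read from a tool such as Netron.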

@glenn-jocher
Member

@antlamon @Lucashsmello @thestonehead @timstokman we've integrated onnx-simplifier into export.py now in ONNX Simplifier PR #2815 and verified it's passing CI on all operating systems.

I'm not sure if this resolves the original issue, but hopefully it's a step in the right direction.

@piotlinski

piotlinski commented Apr 17, 2021


@glenn-jocher Unfortunately, the problem still persists: I am using the docker image (version v5.0) and --grid causes ONNX export to fail on simplifying. The resulting onnx file cannot be used due to Incompatible dimensions error. However, rolling back pytorch to 1.8.0 (i.e. using the docker image v4.0 with latest repository version, which includes ONNX simplifier) works OK.

@timstokman
Contributor

@glenn-jocher I'm seeing the same issues: neither --grid nor --dynamic works with the simplifier. --grid export only seems to work in a few cases, even without the simplifier. I made a pull request for the --dynamic export issue: #2856

@glenn-jocher
Member

@timstokman thanks for the PR, I'll take a look over there!

@timstokman
Contributor

timstokman commented Apr 20, 2021

To give a reproduction of the grid export issue now that the PR is merged:

python models/export.py --simplify --grid
Namespace(batch_size=1, device='cpu', dynamic=False, grid=True, img_size=[640, 640], simplify=True, weights='./yolov5s.pt')
YOLOv5 🚀 v5.0-15-g1df8c6c torch 1.8.1+cu102 CPU

Fusing layers... 
Model Summary: 224 layers, 7266973 parameters, 0 gradients, 17.0 GFLOPS

TorchScript: starting export with torch 1.8.1+cu102...
./models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as ./yolov5s.torchscript.pt
ONNX: starting export with onnx 1.9.0...
ONNX: simplifying with onnx-simplifier 0.3.5...
ONNX: simplifier failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions
ONNX: export success, saved as ./yolov5s.onnx
CoreML: export failure: No module named 'coremltools'

Export complete (4.56s). Visualize with https://github.com/lutzroeder/netron.

Without --simplify the model simply can't be loaded by the runtime.

It looks like the last layer has incompatible dimensions when exported.

@glenn-jocher
Member

@timstokman hmm, so the onnx runtime only succeeds with a --simplify model, but --simplify fails when --grid is also used?

@timstokman
Contributor

timstokman commented Apr 20, 2021

@glenn-jocher They both fail:

  • When using --grid without simplify, it generates a model that can't be loaded with onnxruntime. It fails with this error: 1 : FAIL : Load model from yolov5s.onnx failed:Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions.
  • When using --grid --simplify, the simplifier probably notices that the last layer has issues, and generates the exact same error: simplifier failure: [ONNXRuntimeError] : 1 : FAIL : Node (Mul_925) Op (Mul) [ShapeInferenceError] Incompatible dimensions

The root cause seems to be in how the last layer is exported: some sort of tensor dimension mismatch.

@piotlinski

piotlinski commented Apr 20, 2021

@timstokman out of curiosity: what PyTorch version are you using? EDIT: I see, 1.8.1, sorry for the question.

@glenn-jocher I managed to make simplify with grid work by rolling back pytorch to 1.8 (1.9 used in the latest docker image did not work, I don't know what happens if installed on host OS, not in docker)

Perhaps it's ONNX version that causes the issue? In the older yolov5 image (v4.0) it is 1.7.0 AFAIR

@timstokman
Contributor

timstokman commented Apr 20, 2021

@piotlinski I used the latest version, and the one you suggested. With pytorch 1.8 it works with the default options, but as soon as you use --dynamic or --img-size it stops working. With the latest version, it doesn't work at all.

@piotlinski

piotlinski commented Apr 20, 2021

@timstokman interesting. I tried PyTorch 1.8 and can set --img-size (I did not try --dynamic, though). I use the older version, where the simplifier is always run. (The log says YOLOv5 v4.0, but I manually checked out a newer commit.)

docker exec -it yolov5 python models/export.py --weights /usr/src/model.pt --img 288 480 --batch-size 1 --grid
Namespace(batch_size=1, device='cpu', dynamic=False, grid=True, img_size=[288, 480], weights='/usr/src/model.pt')
YOLOv5 🚀 v4.0-207-gaff03be torch 1.8.0a0+1606899 CPU

Fusing layers... 
Model Summary: 224 layers, 7053910 parameters, 0 gradients, 16.3 GFLOPS

TorchScript: starting export with torch 1.8.0a0+1606899...
/usr/src/app/models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as /usr/src/model.torchscript.pt
ONNX: starting export with onnx 1.7.0...
ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
(op_type:Slice, name:Slice_4): Inferred shape and existing shape differ in dimension 2: (288) vs (144)
ONNX: export success, saved as /usr/src/model.onnx
CoreML: export failure: No module named 'coremltools'

Export complete (5.41s). Visualize with https://github.com/lutzroeder/netron.

EDIT: with --dynamic I get:

ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
ONNX: simplifier failure: The shape of input "images" has dynamic size "[0, 3, 0, 0]", please determine the input size manually by "--dynamic-input-shape --input-shape xxx" or "--input-shape xxx". Run "python3 -m onnxsim -h" for details

@timstokman
Contributor

timstokman commented Apr 20, 2021

@piotlinski Update to the latest yolo version to fix the error with --dynamic, and get the actual error.

@piotlinski

piotlinski commented Apr 20, 2021

@timstokman no error with --dynamic and latest version here, provided the same versions of libraries as above.

python models/export.py --weights /usr/src/model.pt --img 288 480 --grid --dynamic --batch-size 1
Namespace(batch_size=1, device='cpu', dynamic=True, grid=True, img_size=[288, 480], simplify=False, weights='/usr/src/model.pt')
YOLOv5 🚀 v5.0-17-gc949fc8 torch 1.8.0a0+1606899 CPU

Fusing layers...
Model Summary: 224 layers, 7053910 parameters, 0 gradients, 16.3 GFLOPS

TorchScript: starting export with torch 1.8.0a0+1606899...
/usr/src/app/models/yolo.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if self.grid[i].shape[2:4] != x[i].shape[2:4]:
TorchScript: export success, saved as /usr/src/model.torchscript.pt
ONNX: starting export with onnx 1.7.0...
ONNX: export success, saved as /usr/src/model.onnx
CoreML: export failure: No module named 'coremltools'

Export complete (3.73s). Visualize with https://github.com/lutzroeder/netron.

When running with --simplify I get only some info messages:

ONNX: starting export with onnx 1.7.0...
ONNX: simplifying with onnx-simplifier 0.3.4...
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
(op_type:Slice, name:Slice_266): Inferred shape and existing shape differ in dimension 4: (6) vs (2)
ONNX: export success, saved as /usr/src/model.onnx

@timstokman
Contributor

Looks like the docker image also has different versions of onnx and onnx-simplifier. Maybe the requirements.txt of the yolo project needs to start pinning a few versions for this to work reliably.
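
Since the reports in this thread differ mainly in library versions, it may help to record them next to every export log. The sketch below uses only the standard library. The distribution names in `PACKAGES` are assumptions about how the packages are published (in particular `onnx-simplifier`), and `installed_versions` is a hypothetical helper, not part of YOLOv5.

```python
from importlib import metadata

# Packages whose versions varied across the reports in this thread.
# The distribution names are assumptions; adjust them to your environment.
PACKAGES = ["torch", "torchvision", "onnx", "onnxruntime", "onnx-simplifier"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

The `name==version` lines it prints use the same syntax as pinned entries in a requirements.txt file.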

@piotlinski Can you actually do inference with the exported model?

@piotlinski

piotlinski commented Apr 20, 2021

@timstokman the ones exported earlier (without the --dynamic flag) work OK. I haven't checked the dynamic models

@jylink
Contributor

jylink commented Apr 28, 2021

I changed this line in models/yolo.py, and export now passes with both --grid and --grid --simplify. ONNX Runtime works fine too:

# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * torch.tensor(self.anchor_grid[i].tolist()).float()  # wh

@goderent

I found that the bug is caused by the indexing operation [i] behaving inconsistently between PyTorch and ONNX. The shape of anchor_grid is (3,1,3,1,1,2); anchor_grid[i] has shape (1,3,1,1,2) in PyTorch but (1,1,3,1,1,2) in ONNX.
So we must make the shape of anchor_grid[i] explicit. Just modify this line in models/yolo.py:

# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2) # wh
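
One caveat about the explicit view is worth checking: a view must preserve the element count, and anchor_grid[i] holds only 1 * na * 1 * 1 * 2 elements, so a leading dimension of bs matches only when bs is 1. The sketch below checks this with plain shape arithmetic. `can_view` is a hypothetical helper, and na = 3 is taken from the shapes quoted above.

```python
from math import prod


def can_view(src_shape, dst_shape):
    """A reshape/view is valid only if both shapes hold the same number of elements."""
    return prod(src_shape) == prod(dst_shape)


na = 3  # anchors per detection layer, from the (3, 1, 3, 1, 1, 2) shape quoted above
anchor_grid_i = (1, na, 1, 1, 2)  # shape of anchor_grid[i] in eager PyTorch

for bs in (1, 2):
    target = (bs, na, 1, 1, 2)
    print(f"bs={bs}: view to {target} valid -> {can_view(anchor_grid_i, target)}")

# A leading 1 is valid for every batch size and still broadcasts over the batch axis.
print(can_view(anchor_grid_i, (1, na, 1, 1, 2)))
```

If batch sizes other than 1 matter for your export, a leading dimension of 1 in the view would avoid that constraint while producing the same result through broadcasting.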

@jylink
Contributor

jylink commented Apr 29, 2021

By the way, the exported ONNX model cannot be converted to a TensorRT engine because the subscript assignments generate unsupported ScatterND nodes. I rewrote the code to avoid generating ScatterND:

# y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# z.append(y.view(bs, -1, self.no))
xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2)  # wh
rest = y[..., 4:]
yy = torch.cat((xy, wh, rest), -1)
z.append(yy.view(bs, -1, self.no))
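
The rewrite above replaces in-place slice assignment with separately computed parts that are concatenated. The sketch below applies the same box-decoding formula to a single prediction vector using plain Python lists, only to show that both forms compute identical values. It does not involve ONNX or ScatterND, and all input values and function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_in_place(raw, grid_xy, stride, anchor_wh):
    """Overwrite slices of one activated prediction vector (the original pattern)."""
    y = [sigmoid(v) for v in raw]
    y[0:2] = [(y[k] * 2.0 - 0.5 + grid_xy[k]) * stride for k in range(2)]  # xy
    y[2:4] = [(y[2 + k] * 2.0) ** 2 * anchor_wh[k] for k in range(2)]  # wh
    return y


def decode_concat(raw, grid_xy, stride, anchor_wh):
    """Build xy, wh and the rest separately, then concatenate (the rewritten pattern)."""
    y = [sigmoid(v) for v in raw]
    xy = [(y[k] * 2.0 - 0.5 + grid_xy[k]) * stride for k in range(2)]
    wh = [(y[2 + k] * 2.0) ** 2 * anchor_wh[k] for k in range(2)]
    rest = y[4:]
    return xy + wh + rest


# Hypothetical values for a single grid cell; none come from a real model.
raw = [0.2, -0.1, 0.5, 0.3, 1.5, -2.0]
args = (raw, (7.0, 4.0), 32.0, (116.0, 90.0))
print(decode_in_place(*args) == decode_concat(*args))
```

Because the concatenating form never writes into an existing tensor, an exporter has no indexed assignment to translate, which is presumably why the ScatterND nodes disappear.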

@timstokman
Contributor

timstokman commented Apr 29, 2021

@jylink I tried your code and exporting works fine now. With dynamic axes, though, it still seems to fail when running the model:

E onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'Add_455' Status Message: /onnxruntime_src/onnxruntime/core/providers/cpu/math/element_wise_ops.h:487 void onnxruntime::BroadcastIterator::Append(ptrdiff_t, ptrdiff_t) axis == 1 || axis == largest was false. Attempting to broadcast an axis by a dimension other than 1. 34 by 60

Here I tried a tensor of 1088x1920x3 as input (stride 32 padded) for an image that was originally 1080x1920x3.
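
The 34 and 60 in the error message match the stride-32 feature map of a 1088x1920 input. A minimal sketch of that arithmetic is below; `pad_to_stride` and `grid_size` are hypothetical helper names, and 32 is assumed to be the largest stride of the model.

```python
import math


def pad_to_stride(size, stride=32):
    """Round a dimension up to the next multiple of the stride."""
    return math.ceil(size / stride) * stride


def grid_size(height, width, stride=32):
    """Feature-map size (ny, nx) for an input already padded to the stride."""
    return height // stride, width // stride


height, width = pad_to_stride(1080), pad_to_stride(1920)
print(height, width)             # 1088 1920
print(grid_size(height, width))  # (34, 60)
```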

When using a fully padded tensor, 1920x1920x3, the predict layer does seem to work correctly, so this is a big improvement. I suggest you create a pull request for it.

Personally I still can't use --grid exports; dynamic axes give me an almost 2x speed improvement and help with CUDA memory usage.

@jylink
Contributor

jylink commented Apr 30, 2021

@timstokman Hi, I found that self.grid[i] does not match the dynamic y[..., 0:2]. I don't know if it is the best way, but I added a variable self.dynamic, and now --dynamic, --grid, --simplify, onnxruntime and the TensorRT engine all pass:

# models/yolo.py
class Detect(nn.Module):
    stride = None  # strides computed during build
    export = False  # onnx export
    dynamic = False  # <--NEW
        ...
            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:  # <--NEW
                    self.grid[i] = self._make_grid(nx, ny).to(x[i].device)

                y = x[i].sigmoid()
                xy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy
                wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(bs, self.na, 1, 1, 2)  # wh
                rest = y[..., 4:]
                y_ = torch.cat((xy, wh, rest), -1)
                z.append(y_.view(bs, -1, self.no))

# models/export.py
    model.model[-1].export = not opt.grid  # set Detect() layer grid export
    model.model[-1].dynamic = opt.dynamic  # <--NEW
    for _ in range(2):
        y = model(img)  # dry runs

Test:

# gen onnx
!python models/export.py --img 352 608 --batch 1 --dynamic --grid --simplify --weights weights/best.pt

# onnxruntime
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession('weights/best.onnx')
input_name = sess.get_inputs()[0].name
output_names = [output.name for output in sess.get_outputs()]
# vary the height, then the width, in steps of 32 pixels
for i in range(-5, 5):
    image = np.random.rand(1, 3, 608 + 32 * i, 608).astype(np.float32)
    pred = sess.run(output_names, {input_name: image})
    image = np.random.rand(1, 3, 608, 608 + 32 * i).astype(np.float32)
    pred = sess.run(output_names, {input_name: image})
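
The self.dynamic flag in the Detect snippet forces the grid to be rebuilt from the current nx and ny on every call instead of relying on the shape comparison, which the TracerWarning earlier in the thread says is treated as a constant during tracing. The caching logic can be sketched without PyTorch as follows. `make_grid` is a hypothetical stand-in for `_make_grid` (its body is not shown in this thread), and `GridCache` is invented purely for illustration.

```python
def make_grid(nx, ny):
    """Return a ny-by-nx grid whose cell [y][x] holds its (x, y) offset."""
    return [[(x, y) for x in range(nx)] for y in range(ny)]


class GridCache:
    """Caches one grid and rebuilds it when the size changes or dynamic is set."""

    def __init__(self, dynamic=False):
        self.dynamic = dynamic
        self.grid = []
        self.rebuilds = 0

    def get(self, nx, ny):
        cached_shape = (len(self.grid), len(self.grid[0]) if self.grid else 0)
        if self.dynamic or cached_shape != (ny, nx):
            self.grid = make_grid(nx, ny)
            self.rebuilds += 1
        return self.grid


cache = GridCache(dynamic=True)
for nx, ny in [(20, 20), (60, 34), (60, 34)]:
    grid = cache.get(nx, ny)
    print((len(grid), len(grid[0])), "rebuilds so far:", cache.rebuilds)
```

In eager Python both conditions give a correctly sized grid. The difference presumably only matters under tracing, where the shape comparison is frozen at export time while an unconditional rebuild stays in the graph.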

@timstokman
Contributor

Yes, that fixes all the issues for me. Outputs seem exactly the same, with and without dynamic, and it works for different image sizes. Guess I can throw away my own numpy implementation of the detect layer. It also fixes the framework version compatibility issues. To me, the implementation seems good. Pull request time?

@glenn-jocher Looks like this fixes the remaining options with "--grid".

@jylink
Contributor

jylink commented Apr 30, 2021

PR #2982

@glenn-jocher
Member

glenn-jocher commented May 3, 2021

@antlamon @tommy2is @timstokman good news 😃! Your original issue may now have been fixed ✅ in merged PR #2982 by @jylink. To receive this update you can:

  • git pull from within your yolov5/ directory
  • git clone https://github.com/ultralytics/yolov5 again
  • Force-reload PyTorch Hub: model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • View our updated notebooks: Open In Colab Open In Kaggle

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@github-actions
Contributor

github-actions bot commented Jun 4, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Jun 4, 2021
@ganleiboy

So the bug is in self.anchor_grid, not in self.grid, haha. Nice work!

rajames added a commit to edgeimpulse/example-custom-ml-block-ti-yolox that referenced this issue Jan 25, 2023