
[YOLOX-TI] ERROR: onnx_op_name: /head/ScatterND #269

Closed
mikel-brostrom opened this issue Mar 24, 2023 · 70 comments
Labels
discussion, OP:ScatterND, OP:Sigmoid, Parameter replacement, Quantization

Comments

@mikel-brostrom
Contributor

mikel-brostrom commented Mar 24, 2023

Issue Type

Others

onnx2tf version number

1.8.1

onnx version number

1.13.1

tensorflow version number

2.12.0

Download URL for ONNX

yolox_nano_ti_lite_26p1_41p8.zip

Parameter Replacement JSON

{
    "format_version": 1,
    "operations": [
        {
            "op_name": "/head/ScatterND",
            "param_target": "inputs",
            "param_name": "/head/Concat_1_output_0",
            "values": [1,85,52,52]
        }
    ]
}

Description

Hi @PINTO0309. After our lengthy discussion regarding INT8 YOLOX export, I decided to try out TI's version of these models (https://github.com/TexasInstruments/edgeai-yolox/tree/main/pretrained_models). It looked to me like you managed to INT8-export those, so maybe you could provide some hints 😄. I just downloaded the ONNX version of YOLOX-nano. For this model, the following fails:

onnx2tf -i ./yolox_nano.onnx -o yolox_nano_saved_model

The error I get:

ERROR: input_onnx_file_path: /datadrive/mikel/edgeai-yolox/yolox_nano.onnx
ERROR: onnx_op_name: /head/ScatterND
ERROR: Read this and deal with it. https://github.com/PINTO0309/onnx2tf#parameter-replacement
ERROR: Alternatively, if the input OP has a dynamic dimension, use the -b or -ois option to rewrite it to a static shape and try again.
ERROR: If the input OP of ONNX before conversion is NHWC or an irregular channel arrangement other than NCHW, use the -kt or -kat option.
ERROR: Also, for models that include NonMaxSuppression in the post-processing, try the -onwdt option.
  1. Research
  2. Export error
  3. I tried to overwrite the values of the parameter by the replacement json provided above with no luck
  4. Project need
  5. Operation that fails can be found in the image below:
    Screenshot from 2023-03-24 10-37-02
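
For reference, a quick way to list the ScatterND nodes the error refers to, using the standard onnx Python package (file name as in the command above):

import onnx

model = onnx.load("yolox_nano.onnx")
for node in model.graph.node:
    if node.op_type == "ScatterND":
        # prints e.g. /head/ScatterND together with its data/indices/updates inputs
        print(node.name, list(node.input))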
@PINTO0309
Owner

Knowing that TI's model is rather verbose, I optimized it independently and created a script to replace all ScatterND with Slice.

https://github.com/PINTO0309/PINTO_model_zoo/tree/main/363_YOLO-6D-Pose

@mikel-brostrom
Contributor Author

Thank you for your quick response

@PINTO0309
Owner

I will be home with my parents today, tomorrow, and the day after, so I will not be able to provide detailed testing or assistance.

PINTO0309 added the OP:ScatterND and Parameter replacement labels Mar 24, 2023
@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 24, 2023

Thanks for the heads up! Testing this on my own on a detection model, not on pose. Let's see if I manage to get it working. The eval results on both models are as follows:

| Metric | YOLOX nano ONNX | YOLOX-TI nano ONNX |
| --- | --- | --- |
| mAP@0.5:0.95 | 0.256 | 0.261 |
| mAP@0.5 | 0.411 | 0.418 |

@mikel-brostrom
Contributor Author

Ok. As I didn't see ScatterND in the original model, I checked what the differences were. I found out that this:

def meshgrid(*tensors):
    if _TORCH_VER >= [1, 10]:
        return torch.meshgrid(*tensors, indexing="ij")
    else:
        return torch.meshgrid(*tensors)
 

def decode_outputs(self, outputs, dtype):
        grids = []
        strides = []
        for (hsize, wsize), stride in zip(self.hw, self.strides):
            yv, xv = meshgrid([torch.arange(hsize), torch.arange(wsize)])
            grid = torch.stack((xv, yv), 2).view(1, -1, 2)
            grids.append(grid)
            shape = grid.shape[:2]
            strides.append(torch.full((*shape, 1), stride))
 
        grids = torch.cat(grids, dim=1).type(dtype)
        strides = torch.cat(strides, dim=1).type(dtype)
 
        outputs = torch.cat([
            (outputs[..., 0:2] + grids) * strides,
            torch.exp(outputs[..., 2:4]) * strides,
            outputs[..., 4:]
        ], dim=-1)
        return outputs

gives:

Screenshot from 2023-03-24 11-49-44

While this:

def decode_outputs(self, outputs, dtype):
        grids = []
        strides = []
        for (hsize, wsize), stride in zip(self.hw, self.strides):
            yv, xv = torch.meshgrid([torch.arange(hsize), torch.arange(wsize)])
            grid = torch.stack((xv, yv), 2).view(1, -1, 2)
            grids.append(grid)
            shape = grid.shape[:2]
            strides.append(torch.full((*shape, 1), stride))
 
        grids = torch.cat(grids, dim=1).type(dtype)
        strides = torch.cat(strides, dim=1).type(dtype)
 
        outputs[..., :2] = (outputs[..., :2] + grids) * strides
        outputs[..., 2:4] = torch.exp(outputs[..., 2:4]) * strides
        return outputs

gives:

Screenshot from 2023-03-24 11-49-26

This, together with some other minor fixes, makes it possible to get rid of ScatterND completely.
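
For context (my understanding, shown with a minimal repro sketch): the in-place slice assignments in the second version are what the ONNX exporter lowers to ScatterND, while the torch.cat variant never writes into an existing tensor, so no ScatterND is emitted. File names and the opset below are illustrative:

import torch

class InPlaceDecode(torch.nn.Module):
    def forward(self, x):
        y = x.clone()
        y[..., :2] = y[..., :2] * 2.0  # slice assignment -> typically exported as ScatterND
        return y

class CatDecode(torch.nn.Module):
    def forward(self, x):
        # purely functional version: concat of new tensors, no in-place writes
        return torch.cat([x[..., :2] * 2.0, x[..., 2:]], dim=-1)

dummy = torch.randn(1, 3549, 85)
torch.onnx.export(InPlaceDecode(), dummy, "inplace.onnx", opset_version=11)
torch.onnx.export(CatDecode(), dummy, "cat.onnx", opset_version=11)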

@PINTO0309
Owner

PINTO0309 commented Mar 24, 2023

Excellent.

Perhaps the overall size of the model will also become significantly smaller; 64-bit index values are almost always overly precise. However, since the computational efficiency of Gather and Scatter is supposed to be high to begin with, I am concerned about how much the inference performance will deteriorate after the change to Slice.

mikel-brostrom changed the title from "ERROR: onnx_op_name: /head/ScatterND" to "[YOLOX-TI] ERROR: onnx_op_name: /head/ScatterND" Mar 24, 2023
@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 24, 2023

The model performance did not decrease after the changes and for the first time I got results on one of the quantized models (dynamic_range_quant).

| Model | Input size | mAPval 0.5:0.95 | mAPval 0.5 | File size |
| --- | --- | --- | --- | --- |
| YOLOX-TI-nano ONNX (original model) | 416 | 0.261 | 0.418 | 8.7M |
| YOLOX-TI-nano ONNX (no ScatterND) | 416 | 0.261 | 0.418 | 8.7M |
| YOLOX-nano TFLite FP16 | 416 | 0.261 | 0.418 | 4.4M |
| YOLOX-nano TFLite FP32 | 416 | 0.261 | 0.418 | 8.7M |
| YOLOX-nano TFLite full_integer_quant | 416 | 0 | 0 | 2.3M |
| YOLOX-nano TFLite dynamic_range_quant | 416 | 0.249 | 0.410 | 2.3M |
| YOLOX-nano TFLite integer_quant | 416 | 0 | 0 | 2.3M |

But still nothing for the INT8 ones though...

@mikel-brostrom
Contributor Author

Feel free to play around with it

yolox_nano_no_scatternd.zip

😄

@PINTO0309
Owner

PINTO0309 commented Mar 24, 2023

I can't see the structure of the model today, but I believe there were a couple of Sigmoid ops at the beginning of the post-processing.

What if the model transformation is stopped just before the post-processing? However, it would then be difficult to measure mAP.

e.g.

onnx2tf -i resnet18-v1-7.onnx \
-onimc resnetv15_stage2_conv1_fwd resnetv15_stage2_conv2_fwd

It's an interesting topic and I'd like to try it myself, but I can't easily try it right now.

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 24, 2023

You are right, @PINTO0309. I missed this:

output = torch.cat(
    [reg_output, obj_output.sigmoid(), cls_output.sigmoid()], 1
)

which in the ONNX model is represented as:

Screenshot from 2023-03-24 13-55-41

and in the TFLite models these Sigmoid ops are converted into Logistic:

Screenshot from 2023-03-24 14-03-45

But why is the dynamic range quantized model working and not the rest of the quantized models?

@PINTO0309
Owner

PINTO0309 commented Mar 24, 2023

If I remember correctly, dynamic range quantization is less prone to accuracy degradation because it recalculates the quantization range each time at runtime; compared to full INT8 quantization, the inference speed would be very slow in exchange for maintaining accuracy.

I may be wrong because I do not have an accurate grasp of recent quantization specifications.
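
For reference, the distinction in plain TFLiteConverter terms looks roughly like this (a sketch; as far as I understand, onnx2tf drives the same converter internally when -oiqt is given, so the file and directory names below are only illustrative):

import numpy as np
import tensorflow as tf

saved_model_dir = "yolox_nano_saved_model"  # illustrative: the directory produced by onnx2tf

# Dynamic-range quantization: weights become INT8, activation ranges are
# recomputed at runtime, so no calibration data is needed.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_range_tflite = converter.convert()

# Full-integer quantization: activation ranges are frozen from a representative
# (calibration) dataset, which is where the per-tensor range issues discussed
# later in this thread come from.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 416, 416, 3).astype(np.float32)]  # replace with real preprocessed images

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
full_integer_tflite = converter.convert()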

By the way,
Sigmoid = Logistic

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 24, 2023

Maybe a bit off topic. Anyway, I am using the official TFLite benchmark tool for the exported models, and on the specific Android device I am running this on, I find that the Float32 model is much faster than the dynamically quantized one.

Screenshot from 2023-03-24 16-24-21

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 24, 2023

People are getting the same quantization problems with YOLOv8 (ultralytics/ultralytics#1447):
full_integer_quant and integer_quant do not work; dynamic_range_quant works but is very slow.

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 24, 2023

But then I guess that the only option we have is to perform the sigmoid operation outside the model...

@motokimura
Contributor

motokimura commented Mar 24, 2023

@mikel-brostrom
As for the accuracy degradation of YOLOX integer quantization, I think it may be due to the distribution mismatch of xywh and score values.

Just before the last Concat, xywh seems to have a distribution of (min, max)~(0.0, 416.0). On the other hand, scores have a much narrower distribution of (min, max) = (0.0, 1.0) because of sigmoid.

In TFLite quantization, activations are quantized in a per-tensor manner. That is, the combined distribution of xywh and scores, (min, max) = (0.0, 416.0), is mapped to integer values of (min, max) = (0, 255) after the Concat. As a result, even if the score is 1.0, after quantization it is mapped to int(1.0 / 416 * 255) = int(0.61) = 0, resulting in all scores being zero!

A possible solution is to divide the xywh tensors by the image size (416) to keep them in the range (min, max) ~ (0.0, 1.0) and then concat with the score tensor, so that the scores are not "collapsed" by the per-tensor quantization.
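
A minimal sketch of that normalization in the YOLOX decode step might look like the following (assuming a 416x416 input; the self.int8 flag and variable names are illustrative):

# inside decode_outputs, after xy / wh are mapped back to pixel units:
xy = (outputs[..., 0:2] + grids) * strides
wh = torch.exp(outputs[..., 2:4]) * strides

if self.int8:           # hypothetical flag: only normalize for the quantized export
    xy = xy / 416.0     # keep xywh in (0, 1), the same range as the sigmoid scores
    wh = wh / 416.0

outputs = torch.cat([xy, wh, outputs[..., 4:]], dim=-1)
# at inference time, multiply xywh back by 416 to recover pixel coordinates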

The same workaround is done in YOLOv5:
https://github.com/ultralytics/yolov5/blob/b96f35ce75effc96f1a20efddd836fa17501b4f5/models/tf.py#L307-L310

Screenshot 2023-03-25 1 14 48

@mikel-brostrom
Contributor Author

This was super helpful @motokimura! Will try this out

@motokimura
Contributor

I hope this helps..
When you try this workaround, do not forget to multiply xywh tensors by 416 in the prediction phase!

@mikel-brostrom
Contributor Author

Got it!

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 24, 2023

No change on the INT8 models, @motokimura, after implementing what you suggested... Still the same results for all the TFLite models, so the problem may primarily lie in some operation or set of operations.

@motokimura
Contributor

motokimura commented Mar 24, 2023

hmm..
As PINTO pointed out, it may be better to compare int8 and float model activations before the decoder part.

#269 (comment)

It may be helpful to export the ONNX without the '--export-det' option and compare the int8 and float outputs.
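
A rough sketch of that comparison (assuming onnxruntime and TensorFlow are installed; model/file names are illustrative, and only the min/max of the raw head outputs is compared):

import numpy as np
import onnxruntime as ort
import tensorflow as tf

x = np.random.rand(1, 3, 416, 416).astype(np.float32)  # replace with a real preprocessed image

# float reference from the ONNX model (exported without the decode post-processing)
sess = ort.InferenceSession("yolox_nano_no_decode.onnx")
onnx_out = sess.run(None, {sess.get_inputs()[0].name: x})[0]

# INT8 TFLite model (onnx2tf emits NHWC inputs)
interp = tf.lite.Interpreter(model_path="yolox_nano_integer_quant.tflite")
interp.allocate_tensors()
inp, out = interp.get_input_details()[0], interp.get_output_details()[0]

x_tfl = x.transpose(0, 2, 3, 1)
if inp["dtype"] == np.int8:  # quantize the input if the model expects int8
    scale, zero_point = inp["quantization"]
    x_tfl = (x_tfl / scale + zero_point).astype(np.int8)

interp.set_tensor(inp["index"], x_tfl)
interp.invoke()
tfl_out = interp.get_tensor(out["index"]).astype(np.float32)
if out["dtype"] == np.int8:  # dequantize the output before comparing
    scale, zero_point = out["quantization"]
    tfl_out = (tfl_out - zero_point) * scale

print("float min/max:", onnx_out.min(), onnx_out.max())
print("int8  min/max:", tfl_out.min(), tfl_out.max())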

@PINTO0309
Owner

PINTO0309 commented Mar 25, 2023

Anyway, I am using the official TFLite benchmark tool for the exported models, and on the specific Android device I am running this on, I find that the Float32 model is much faster than the dynamically quantized one.

First, let me tell you that your results will vary greatly depending on the architecture of the CPU you are using for your verification. If you are using an Intel x64 (x86) or AMD x64 (x86) CPU, the Float32 model should run about 10 times faster than the INT8 model; INT8 models are very slow on the x64 architecture. Perhaps on a RaspberryPi's ARM64 CPU with 4 threads it would be about 10 times faster. The keyword XNNPACK is a good way to search for more information. In the case of Intel's x64 architecture, CPUs of the 10th generation or later differ from CPUs of the 9th generation or earlier in the presence or absence of an optimization mechanism for integer processing. If you are using a 10th-generation or later CPU, it should run about 20% faster.

Therefore, when using benchmarking tools, it is recommended to do so on ARM64 devices.

The benchmarking in the discussion on the ultralytics thread is not appropriate.

Next, let's look at dynamic range quantization.
My tool does per-channel quantization by default. This is due to the TFLiteConverter specification. Per-channel quantization calculates the quantization range for each channel of the tensor, which reduces the accuracy degradation and, at the same time, increases the cost of calculating the quantization range, which slows down inference a little. Also, most of the current edge devices are not optimized for per-channel quantization; for example, the EdgeTPU only supports per-tensor quantization. Therefore, if quantization is performed with the assumption that the model will be put to practical use in the future, it is recommended that per-tensor quantization be performed during the conversion as follows.

onnx2tf -i xxxx.onnx -oiqt -qt per-tensor
  • per-channel quant
    image
  • per-tensor quant
    image

Next, let's discuss the post-quantization accuracy degradation. I think Motoki's point is mostly correct. You should first try to split the model at the red line and see how the accuracy changes.

image

If the Sigmoid in this position does not affect the accuracy, it should work. It is better to think about complex problems by breaking them down into smaller problems without being too hasty.

image

PINTO0309 added the discussion, Quantization, and OP:Sigmoid labels Mar 25, 2023
@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 25, 2023

I just cut the model at the point you suggested by:

onnx2tf -i /datadrive/mikel/yolox_tflite_export/yolox_nano.onnx -b 1 -cotof -cotoa 1e-1 -onimc /head/Concat_6_output_0

But I get the following error:

File "/datadrive/mikel/yolox_tflite_export/env/lib/python3.8/site-packages/onnx2tf/utils/common_functions.py", line 3071, in onnx_tf_tensor_validation
    onnx_tensor_shape = onnx_tensor.shape
AttributeError: 'NoneType' object has no attribute 'shape'

I couldn't find a similar issue and I had the same problem when I tried to cut YOLOX in our previous discussion. I probably misinterpreted how the tool is supposed to be used...

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 25, 2023

First, let me tell you that your results will vary greatly depending on the architecture of the CPU you are using for your verification. If you are using an Intel x64 (x86) or AMD x64 (x86) CPU, the Float32 model should run about 10 times faster than the INT8 model; INT8 models are very slow on the x64 architecture. Perhaps on a RaspberryPi's ARM64 CPU with 4 threads it would be about 10 times faster. The keyword XNNPACK is a good way to search for more information. In the case of Intel's x64 architecture, CPUs of the 10th generation or later differ from CPUs of the 9th generation or earlier in the presence or absence of an optimization mechanism for integer processing. If you are using a 10th-generation or later CPU, it should run about 20% faster.

Therefore, when using benchmarking tools, it is recommended to do so on ARM64 devices.

I compiled the benchmark binary for android_arm64. The device has an Exynos 9810, which is an ARM 64-bit CPU, and a Mali-G72 MP18 GPU. However, I am running the model without GPU acceleration, so the INT8 model must be running on the CPU. The CPU was released in 2018, so that may explain why the quantized model is that slow...

@PINTO0309
Owner

PINTO0309 commented Mar 26, 2023

But I get the following error:

I came home and tried the same conversion as you.

The following command did not generate an error. It is a little strange that the situation differs between your environment and mine. Since ScatterND requires a very complex modification at the moment, does the same error occur with the ONNX in which ScatterND has been replaced with Slice?

onnx2tf -i yolox_nano_no_scatternd.onnx -cotof -cotoa 1e-4 -onimc /head/Concat_6_output_0
INFO: onnx_output_name: /head/stems.2/act/Relu_output_0 tf_output_name: tf.nn.relu_70/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.0/conv/Conv_output_0 tf_output_name: tf.math.add_84/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.0/conv/Conv_output_0 tf_output_name: tf.math.add_85/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.0/act/Relu_output_0 tf_output_name: tf.nn.relu_71/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.0/act/Relu_output_0 tf_output_name: tf.nn.relu_72/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.1/conv/Conv_output_0 tf_output_name: tf.math.add_86/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.1/conv/Conv_output_0 tf_output_name: tf.math.add_87/Add:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_convs.2/cls_convs.2.1/act/Relu_output_0 tf_output_name: tf.nn.relu_73/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_convs.2/reg_convs.2.1/act/Relu_output_0 tf_output_name: tf.nn.relu_74/Relu:0 shape: (1, 64, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/cls_preds.2/Conv_output_0 tf_output_name: tf.math.add_88/Add:0 shape: (1, 80, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/reg_preds.2/Conv_output_0 tf_output_name: tf.math.add_89/Add:0 shape: (1, 4, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/obj_preds.2/Conv_output_0 tf_output_name: tf.math.add_90/Add:0 shape: (1, 1, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Sigmoid_4_output_0 tf_output_name: tf.math.sigmoid_4/Sigmoid:0 shape: (1, 1, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Sigmoid_5_output_0 tf_output_name: tf.math.sigmoid_5/Sigmoid:0 shape: (1, 80, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Concat_2_output_0 tf_output_name: tf.concat_15/concat:0 shape: (1, 85, 13, 13) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Reshape_2_output_0 tf_output_name: tf.reshape_2/Reshape:0 shape: (1, 85, 169) dtype: float32 validate_result:  Matches 
INFO: onnx_output_name: /head/Concat_6_output_0 tf_output_name: tf.concat_16/concat:0 shape: (1, 85, 3549) dtype: float32 validate_result:  Matches 

image

onnx2tf -i yolox_nano_no_scatternd.onnx -oiqt -cotof -cotoa 1e-4 -onimc /head/Concat_6_output_0

@mikel-brostrom
Contributor Author

The output looks like this now:

Screenshot from 2023-03-29 13-42-03

@PINTO0309
Owner

PINTO0309 commented Mar 29, 2023

The position of Dequantize has obviously changed.

I am also interested in the quantization range for this area.
image

@mikel-brostrom
Contributor Author

In/out quantization parameters, from the top-left to the bottom-right of the operations you pointed at:

quantization: -3.1056954860687256 ≤ 0.00014265520439948887 * q ≤ 4.674383163452148
quantization: -3.1056954860687256 ≤ 0.00014265520439948887 * q ≤ 4.674383163452148

quantization: -2.3114538192749023 ≤ 0.00010453650611452758 * q ≤ 3.4253478050231934
quantization: 0.00014265520439948887 * q

quantization: -2.2470905780792236 ≤ 0.00011867172725033015 * q ≤ 3.888516426086426
quantization: 0.00014265520439948887 * q

quantization: 0.00014265520439948887 * q
quantization: -3.1056954860687256 ≤ 0.00014265520439948887 * q ≤ 4.674383163452148

@PINTO0309
Owner

It looks fine to me.

@mikel-brostrom
Contributor Author

Going for a full COCO eval now 🚀

@motokimura
Contributor

Great! 🚀🚀

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 29, 2023

Great that we get this into YOLOv8 as well @motokimura! Thank you both for this joint effort ❤️

| Model | Input size | mAPval 0.5:0.95 | mAPval 0.5 | File size | Calibration images |
| --- | --- | --- | --- | --- | --- |
| YOLOX-TI-nano TFLite FP32 | 416 | 0.261 | 0.418 | 8.7M | N/A |
| YOLOX-TI-nano TFLite INT8 | 416 | 0.242 | 0.408 | 2.4M | 200 |
| YOLOX-TI-nano TFLite INT8 | 416 | 0.243 | 0.408 | 2.4M | 800 |

@PINTO0309
Owner

PINTO0309 commented Mar 29, 2023

congratulations! 👍

@PINTO0309
Owner

I will close this issue, as the original problem has been solved and the INT8 quantization problem seems to have been resolved.

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 31, 2023

Sorry for bothering you again, but one thing is still unclear to me. Even when bringing the xy, wh, probs values to [0, 1] and then quantizing the model with a single output:

Screenshot from 2023-03-31 10-42-39

results are much worse than using separate xy, wh, probs outputs like this:

Screenshot from 2023-03-31 10-45-42

From our lengthy discussion I recall this:

Therefore, if we merge a flow that wants to express values in the range 0 to 1 with a flow that wants to express values in the range 0 to 416, I feel that almost all elements in the one that wants to express the range 0 to 1 will diverge to approximate 0.

and this:

In TFLite quantization, activations are quantized in a per-tensor manner. That is, the combined distribution of xywh and scores, (min, max) = (0.0, 416.0), is mapped to integer values of (min, max) = (0, 255) after the Concat. As a result, even if the score is 1.0, after quantization it is mapped to int(1.0 / 416 * 255) = int(0.61) = 0, resulting in all scores being zero!

This makes total sense to me, especially given the disparity of ranges within the same output. But why are the quantization results much worse for the model with a single output, given that all values now have the same range? Does this make sense to you?

| Model | Input size | mAPval 0.5:0.95 | mAPval 0.5 | File size | Calibration images |
| --- | --- | --- | --- | --- | --- |
| YOLOX-TI-nano SINGLE OUTPUT | 416 | 0.064 | 0.240 | 2.4M | 8 |
| YOLOX-TI-nano TFLite XY, WH, PROBS OUTPUT | 416 | 0.242 | 0.408 | 2.4M | 8 |

@PINTO0309
Owner

PINTO0309 commented Mar 31, 2023

There is nothing I can explain in more detail beyond Motoki's explanation, but again, take a good look at the quantization parameters around the final output of the model. I think you can see why the Concat is a bad idea.

They are all 1.7974882125854492 * (q + 128).

The values diverge when inverse quantization (Dequantize) is performed.
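
As a back-of-the-envelope check with that value:

scale, zero_point = 1.7974882125854492, -128   # dequantized value = scale * (q - zero_point)

def dequantize(q):
    return scale * (q - zero_point)

print(dequantize(-128), dequantize(-127))  # 0.0 and ~1.80: the smallest non-zero step
print(dequantize(127))                     # ~458: the top of the range needed for xywh
# Every score in [0, 1] therefore lands on 0.0 (or jumps to ~1.8) after quantization,
# while the same single scale still has to cover the box coordinates.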

onnx2tf -i yolox_nano_no_scatternd.onnx -oiqt -qt per-tensor

image
image

Perhaps that is why TI used ScatterND.

@motokimura
Contributor

In your inference code posted in this comment,

x[0:4] = x[0:4] * 416 # notice xywh in the model is divided by 416

The first dim of x should be the batch dim, I think.

However, this should decrease the accuracy of the float models as well..

@mikel-brostrom
Contributor Author

mikel-brostrom commented Mar 31, 2023

Yup, sorry @motokimura, that's a typo. It is

outputs[:, :, 0:4] = outputs[:, :, 0:4] * 416

@motokimura
Contributor

I have no idea what is happening in the Concat..

As I posted, you may find something if you compare the distributions of the outputs from the float/int8 models.

@motokimura
Contributor

motokimura commented Mar 31, 2023

@mikel-brostrom
Can you check what happens if you apply clipping to xy and wh before Concat?

if self.int8:
    xy = torch.div(xy, 416)
    wh = torch.div(wh, 416)
    # clipping
    xy = torch.clamp(xy, min=0, max=1)
    wh = torch.clamp(wh, min=0, max=1)
        
outputs = torch.cat([xy, wh, outputs[..., 4:]], dim=-1)

Assumption: xy and/or wh may have a few outliers which make the quantization range much wider than expected. wh especially can have such outliers because Exp is used as the activation function.
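
A rough numerical illustration of that assumption (numbers are made up, assuming an 8-bit per-tensor range of 256 levels):

import numpy as np

wh = np.random.rand(3549, 2) * 0.5               # typical normalized wh values, well below 1.0
wh_with_outlier = np.vstack([wh, [[5.0, 5.0]]])  # one Exp-produced outlier stretches the range

step_clean = (wh.max() - wh.min()) / 255                               # ~0.002 per step
step_outlier = (wh_with_outlier.max() - wh_with_outlier.min()) / 255   # ~0.02 per step
print(step_clean, step_outlier)  # a single outlier makes quantization ~10x coarser for all values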

@mikel-brostrom
Contributor Author

Good point @motokimura. Reporting back on Monday 😊

Repository owner locked as resolved and limited conversation to collaborators Apr 3, 2023
Repository owner unlocked this conversation Apr 3, 2023
@mikel-brostrom
Contributor Author

Interesting. It actually made it worse...

| Model | Input size | mAPval 0.5:0.95 | mAPval 0.5 | File size | Calibration images |
| --- | --- | --- | --- | --- | --- |
| YOLOX-TI-nano TFLite XY, WH, PROBS OUTPUT | 416 | 0.242 | 0.408 | 2.4M | 8 |
| YOLOX-TI-nano SINGLE OUTPUT | 416 | 0.062 | 0.229 | 2.4M | 8 |
| YOLOX-TI-nano SINGLE OUTPUT (Clamped xywh) | 416 | 0.028 | 0.103 | 2.4M | 8 |

@motokimura
Contributor

At this point I have nothing to add beyond this comment regarding the quantization of Concat and what kind of quantization errors are actually happening inside. This Concat is not necessary by nature and has no benefit for model quantization, so I think we don't need to go any deeper into this.

All I can say at this point is that tensors with very different value ranges should not be concatenated, especially in the post-processing of the model.

Thank you for doing the experiment and sharing your results!

@mikel-brostrom
Contributor Author

This Concat is not necessary by nature and has no benefit for model quantization, so I think we don't need to go any deeper into this.

Agreed, let's close this. Enough experimentation on this topic 😄. Again, thank you both, @motokimura and @PINTO0309, for your time and guidance during this quantization journey. I learnt a lot; hopefully you got something out of the experiment results posted here as well 🙏
