
Feature-request: YOLOv4-tiny (detector) #59

Closed · AlexeyAB opened this issue Jun 25, 2020 · 24 comments

AlexeyAB commented Jun 25, 2020

Many other features from Darknet have already been added. Only one feature is still required:

  1. Add groups= and group_id= to the [route] layer.
[route]
layers=-1
groups=2
group_id=1

So if the input is WxHxC, the layer splits the input channels into 2 groups of WxHx(C/2) each (group 0 and group 1) and loads the second one (group_id=1), i.e. WxHx(C/2).

If several layers are specified in the layers= parameter, this is done for each of those input layers, and the results are then concatenated across channels (see the sketch below).
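As a rough illustration of this behaviour (a NumPy sketch, not Darknet or tkDNN code; the grouped_route helper is hypothetical):

```python
import numpy as np

# Illustrative sketch only (not Darknet or tkDNN code): the channel-split
# behaviour of a [route] layer with groups=/group_id=, assuming CHW tensors.
def grouped_route(inputs, groups=2, group_id=1):
    """For each input, split its channels into `groups` equal slices, keep
    slice `group_id`, then concatenate the kept slices across channels."""
    picked = []
    for x in inputs:                        # x has shape (C, H, W)
        c = x.shape[0]
        assert c % groups == 0, "C must be divisible by groups"
        step = c // groups
        picked.append(x[group_id * step:(group_id + 1) * step])  # (C/groups, H, W)
    return np.concatenate(picked, axis=0)   # concatenate across channels

# Example: a 64-channel input with groups=2, group_id=1 yields the last 32 channels.
x = np.random.rand(64, 13, 13).astype(np.float32)
print(grouped_route([x], groups=2, group_id=1).shape)  # (32, 13, 13)
```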



mive93 commented Jun 29, 2020

Hi @AlexeyAB :)
We'll look into that this week.

ceccocats reopened this Jun 30, 2020

mive93 commented Jun 30, 2020

Hi @AlexeyAB,
YOLOv4-tiny is now supported in tkDNN.
I will run some performance tests tomorrow; today my GPU is busy training.

@JasonDoingGreat

@mive93 Thanks for the YOLOv4-tiny implementation!

I've tested it on a Jetson Nano with JetPack 4.4, TensorRT 7.1, and 416x416 input size.

For FP32, profile results:

Time stats:
Min: 37.3371 ms
Max: 122.952 ms
Avg: 38.0922 ms	26.2521 FPS

For FP16, profile results:

Time stats:
Min: 24.5687 ms
Max: 90.5088 ms
Avg: 25.5292 ms	39.1709 FPS


mive93 commented Jul 1, 2020

Hi @JasonDoingGreat,

Thanks :)
Here are the inference results on an RTX 2080 Ti (CUDA 10.2, TensorRT 7.0.0, cuDNN 7.6.5) for yolo4tiny 416x416, on 1200 images of size 416x416.

| model | precision | batch | avg (ms) | min (ms) | max (ms) | avg FPS |
| --- | --- | --- | --- | --- | --- | --- |
| yolo4tiny | fp32 | 1 | 1.64185 | 1.57668 | 1.71029 | 609.068 |
| yolo4tiny | fp32 | 4 | 1.0385 | 1.03024 | 1.08981 | 962.926 |
| yolo4tiny | fp16 | 1 | 1.26474 | 0.90607 | 1.4321 | 790.677 |
| yolo4tiny | fp16 | 4 | 0.563628 | 0.556467 | 0.620871 | 1774.22 |
| yolo4tiny | int8 | 1 | 1.03339 | 0.728739 | 1.16966 | 967.69 |
| yolo4tiny | int8 | 4 | 0.474048 | 0.467551 | 0.506916 | 2109.49 |

If needed, I can test it on the Xavier or TX2.


AlexeyAB commented Jul 1, 2020

@mive93 Thanks! Yes, please test it on AGX or NX with max_N.


mive93 commented Jul 1, 2020

Here it is.
Results on the Xavier AGX, JetPack 4.3 (CUDA 10.0, cuDNN 7.6.3, TensorRT 6.0.1), for yolo4tiny 416x416, on 1200 images of size 416x416.

| model | precision | batch | avg (ms) | min (ms) | max (ms) | avg FPS |
| --- | --- | --- | --- | --- | --- | --- |
| yolo4tiny | fp32 | 1 | 6.36684 | 6.31811 | 6.48507 | 157.064 |
| yolo4tiny | fp32 | 4 | 5.61027 | 5.58927 | 5.63641 | 178.244 |
| yolo4tiny | fp16 | 1 | 3.48334 | 3.44269 | 3.56074 | 287.081 |
| yolo4tiny | fp16 | 4 | 2.63374 | 2.61526 | 2.65826 | 379.688 |
| yolo4tiny | int8 | 1 | 3.13312 | 3.08334 | 3.24114 | 319.17 |
| yolo4tiny | int8 | 4 | 2.33578 | 2.32111 | 2.359 | 428.122 |


CSTEZCAN commented Jul 1, 2020

@AlexeyAB @ceccocats @mive93, you have single-handedly destroyed the reputation of Google, Facebook, and NVIDIA. This is extraordinary.


AlexeyAB commented Jul 2, 2020

@mive93 Hi,
Does tkDNN work only with weights converted from Darknet to the tkDNN format, or can it use a yolov4.weights file directly without conversion?


mive93 commented Jul 6, 2020

Hi @AlexeyAB,
it is necessary to export the weights to our format.


mmaaz60 commented Jul 7, 2020

> Hi @AlexeyAB,
> it is necessary to export the weights to our format.

Hi @mive93 @AlexeyAB,

Is there any accuracy degradation when you convert darknet weights to tkDNN format? What about accuracy loss when inferring in FP16 or INT8 mode? Is there any way to fine-tune the models in FP16 or INT8 mode or perform quantization aware training beforehand? Thanks


mive93 commented Jul 15, 2020

Hi @mmaaz60,
The conversion of the weights itself does not lead to any accuracy degradation; the only thing we do is split the weights layer by layer. That said, I did notice a very tiny accuracy drop. I checked the weights of Darknet and tkDNN (almost layer by layer) and the problem is not there, but rather in the output of the network; IMHO it comes from the slightly different implementations of the operations.

For FP16, the drop from full precision is negligible, while the drop from full precision to INT8 is heavy (see the sketch below). Part of the INT8 problem is also the calibration step: we have tried with 100/1000 images, and maybe using more images, or images with more variance, would lead to better results.
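As a rough illustration of the two precision drops (a NumPy sketch, not tkDNN's actual calibration code), compare an FP16 cast with a naive symmetric INT8 quantization that uses a single calibration-derived scale:

```python
import numpy as np

# Illustrative sketch only (not tkDNN code): compare the error of an FP16 cast
# with naive symmetric INT8 quantization using one global scale.
x = np.random.randn(100_000).astype(np.float32)   # stand-in for an activation tensor

# FP16: just a cast; for values in a typical activation range the error is tiny.
err_fp16 = np.max(np.abs(x - x.astype(np.float16).astype(np.float32)))

# INT8: pick a scale from the data (the role of the calibration set) and quantize.
scale = np.max(np.abs(x)) / 127.0
x_q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
err_int8 = np.max(np.abs(x - x_q.astype(np.float32) * scale))

print(f"max abs error  fp16: {err_fp16:.6f}  int8: {err_int8:.6f}")
```

The real TensorRT INT8 path calibrates per tensor over many activation batches, so this only hints at why the quantization error, and hence the sensitivity to the calibration set, is much larger for INT8 than for FP16.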

I hope I covered all your doubts.


mive93 commented Sep 11, 2020

Closing for now,
feel free to reopen.

@AlexeyAB

@mive93 Hi,
Have you published a paper with YOLOv4/YOLOv4-tiny results on tkDNN, or are you planning to do so?


mive93 commented Oct 9, 2020

Hi @AlexeyAB

Sorry for the late reply.
Actually yes, we submitted the results to a journal, and we are now under review.
However, you can find here some results: https://git.hipert.unimore.it/edbench/edbench

Anyway, if you need some tests I am available to run them; I also have a Xavier NX now ;)


m-kzein commented Oct 13, 2020

Hi @mive93,
You mention in this thread that you reach 319.17 FPS for yolov4tiny (INT8) on the Xavier; however, in the main README you mention 60.61. What is the difference?


mive93 commented Oct 13, 2020

Hi @MohammadKassemZein,
The difference is that here I'm talking about YOLOv4-tiny, while the README number is for YOLOv4 (not tiny).


m-kzein commented Oct 13, 2020

@mive93 Nice!
I am going to test it now on the Xavier NX.

Thank you.


m-kzein commented Oct 13, 2020

@mive93 Below are the results on the Xavier NX.

| model | precision | batch | avg (ms) | min (ms) | max (ms) | avg FPS |
| --- | --- | --- | --- | --- | --- | --- |
| yolo4tiny | fp32 | 1 | 12.3885 | 10.1343 | 48.9955 | 80.7199 |
| yolo4tiny | fp32 | 4 | 10.2807 | 9.80685 | 32.563 | 97.27 |
| yolo4tiny | fp16 | 1 | 8.47916 | 6.8818 | 38.0609 | 117.936 |
| yolo4tiny | fp16 | 4 | 5.01874 | 4.54197 | 24.2833 | 199.253 |


mive93 commented Oct 13, 2020

Nice @MohammadKassemZein :)

How did you collect those results?


m-kzein commented Oct 13, 2020

@mive93 I used your framework (tkDNN) on the Jetson NX.


mive93 commented Oct 13, 2020

Yeah, I guessed :) Sorry, I was vague.

Have you activated jetson_clocks?
For the measurements, did you use the demo itself, did you write your own code, or did you use this test?


m-kzein commented Oct 13, 2020

I was using MODE 15W 2CORE (which I guess gives the highest clocks for the GPU and CPU).
For the measurements, I used the demo itself.


m-kzein commented Oct 15, 2020

Hi again @mive93,
Have you considered adding a Mask R-CNN implementation to the framework? I believe it is a very useful network, and its current benchmarks in terms of FPS are not up to expectations.


mive93 commented Dec 4, 2020

Hi @MohammadKassemZein,
Yes, we have considered that, but we haven't needed it in any of our projects yet. So for now we don't plan to port it, but maybe in the future. Lately we have ported a semantic segmentation network, and in the future we'll probably also port something related to instance/panoptic segmentation.
