Regarding mAP and latency of Yolov4 #5354
Comments
Dear @WongKinYiu, Thank you for your answer.
Therefore again, I don't see a big improvement in Yolov4. |
And another thing, sorry I forgot: you said you use both the training and validation sets for training, but did you mean for CSPResNeXt50-PANet-SPP or for Yolov4? Thanks again. |
|
Dear @AlexeyAB, Yes, I am sure the sizes were correct. These are the commands I ran to get the AVG FPS:
Here are the commands to obtain the detections for CodaLab:
In this folder you can find all the screenshots: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE I summarize here the results from CodaLab on val2017:
|
|
Oh, you are talking about the old model. Yes, it seems I get these results on
By the way, I can't submit your json-files, so I just tested these models by myself again. |
I will test and update FPS on a Turing architecture GPU in a few days. If you use the old CSPResNeXt50-PANet-SPP, you will get higher AP at 416x416 due to the anchor setting. |
Dear @AlexeyAB and @WongKinYiu. However, if you say that you trained that network using also those data, it makes a lot of sense that the mAP is higher, even though it's not fair. So yeah, I assume Yolov4 accuracy is better then :) Thank you for your quick answers, and thank you for clarifying my doubts. I will wait for the FPS results then. |
Dear @AlexeyAB, yesterday I ported your Yolov4 to TensorRT using tkDNN, a framework developed by @ceccocats, @sapienzadavide and me (you can find it here). Here are some performance results on 2 boards, a discrete one and an embedded one. The outputs match yours, so the mAP is the same.
|
@mive93 Hi,
Does it match even for FP16? Can you test YOLOv4 on RTX2080Ti (or preferably on Tesla V100) for 4 network resolutions with
|
Dear @AlexeyAB , Sorry for the delay.
Actually there is a small loss, I guess due to different implementation of the operations. I have tried to investigate more, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.
|
@mive93 Thanks!
There are 3 different |
Hi @AlexeyAB,
|
@mive93 Hi,
Will you publish a paper on arxiv.org with an AP / FPS comparison of different models, or only an FPS comparison?
#include <math.h>

// Numerically safe softplus: log(1 + exp(x)).
float softplus(float x, float threshold = 20) {
    if (x > threshold) return x;                // too large: log(1 + exp(x)) ~= x
    else if (x < -threshold) return expf(x);    // too small: log(1 + exp(x)) ~= exp(x)
    return logf(expf(x) + 1);
}

// Mish: x * tanh(softplus(x)).
float mish_activation(float input) {
    const float MISH_THRESHOLD = 20;
    float output = input * tanhf(softplus(input, MISH_THRESHOLD));
    return output;
}
|
Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608? By using such command: |
Hi @AlexeyAB, I am sorry for the delay, but I have a work-related deadline this week. |
Hi @AlexeyAB, sorry for the long delay.
We submitted to a conference and we ran experiments in terms of mAP, latency and power consumption. As soon as it's accepted we also plan to share the raw data.
We use FP32 (without Tensor Cores) and FP32/16 (Mixed-precision with Tensor Cores) [as well as FP32/INT8], because plugins are always at FP32.
In master there is already a demo that computes the mAP for each method supported by tkDNN. However, it is a bit different from yours, because each bounding box is counted only once, with its highest-probability class. The README explains how to compute the mAP; every precision is supported.
I will work on the demo with batch > 1 this week, and will keep you updated when I have something working.
Never heard of that, will take a look, thanks.
Yes
Here are the results:
|
@mive93 Hi, Thanks! So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4.
When will the conference be?
It would be great if you could get identical accuracy in the future like in Darknet. |
@mive93 Hi, We use a new mish implementation, and get +3% FPS with the same AP (detection accuracy) on MS COCO test-dev: darknet/src/activation_kernels.cu, lines 235 to 246 in bef2844
More: #5452 (comment) So you can try to use this implementation in tkDNN. |
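The referenced kernel lines are not reproduced in this thread, so the following is only a sketch of one well-known algebraic rewrite of mish, not necessarily what Darknet does. It relies on the identity tanh(log(1 + e)) = n / (n + 2) with e = exp(x) and n = e*e + 2*e, which replaces the tanh/log/exp chain with a single exponential. The function names and the switch point of -0.6 are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>

// Reference form: x * tanh(softplus(x)).
float mish_reference(float x) {
    return x * std::tanh(std::log1p(std::exp(x)));
}

// Rational form using one exp(): with e = exp(x) and n = e*e + 2*e,
// tanh(log(1 + e)) == n / (n + 2).
float mish_rational(float x) {
    const float e = std::exp(x);
    const float n = e * e + 2.0f * e;
    // For negative x, n is small and n / (n + 2) is accurate.
    if (x <= -0.6f) return x * (n / (n + 2.0f));
    // Otherwise use x - 2x / (n + 2), which stays finite even when n overflows to +inf.
    return x - 2.0f * (x / (n + 2.0f));
}

// Largest absolute difference between the two forms on a uniform grid in [lo, hi].
float max_abs_error(float lo, float hi, int steps) {
    assert(steps > 1 && lo < hi);
    float worst = 0.0f;
    for (int i = 0; i < steps; ++i) {
        const float x = lo + (hi - lo) * i / (steps - 1);
        worst = std::fmax(worst, std::fabs(mish_reference(x) - mish_rational(x)));
    }
    return worst;
}
```

Calling `max_abs_error(-20.0f, 20.0f, 401)` gives a quick sanity check that the two forms agree to within float rounding before swapping one for the other in an inference engine.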
Excuse me, @mive93
|
Hi, for the new mish function, I can include that. Thank you :) @lazerliu I obtained those results using CodaLab. You first have to generate a JSON file (in COCO format) of the detections, then submit it on the site. More information is also given in this repo's wiki. |
@lazerliu Create a new topic. |
Results for OpenCV DNN @ master (https://github.com/opencv/opencv/tree/6b0fff72d9748345c6a079e4fce49af4130d8e12): Device: RTX 2080 Ti
Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf
There are currently two open PRs which affect YOLOv4 performance. The performance will mostly improve by around 5-10%. The timings often change slightly every time the benchmark program is run. Here is the raw output from the benchmark code:
1 x 3 x 608 x 608:
4 x 3 x 608 x 608:
1 x 3 x 512 x 512:
4 x 3 x 512 x 512:
1 x 3 x 416 x 416:
4 x 3 x 416 x 416:
1 x 3 x 320 x 320:
4 x 3 x 320 x 320:
|
Can you add a column with the FPS you get using Darknet built with GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1, using the command
Why did you try 680x680? It should be a multiple of 32. |
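The multiple-of-32 rule can be checked before running a benchmark. The sketch below assumes the constraint comes from the coarsest YOLO layer having a stride of 32; the helper names are hypothetical and not part of Darknet.

```cpp
#include <cassert>

// Assumed downsampling factor of the coarsest detection layer.
constexpr int kStride = 32;

// True if the size can be used as a network width or height.
bool is_valid_network_size(int size) {
    return size > 0 && size % kStride == 0;
}

// Nearest valid size, never smaller than one stride.
int round_to_multiple_of_32(int size) {
    assert(size > 0);
    const int rounded = ((size + kStride / 2) / kStride) * kStride;
    return rounded < kStride ? kStride : rounded;
}
```

With these helpers, 608 passes the check while 680 does not, and the nearest valid size to 680 is 672.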
So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4.
|
I forgot to mention that I had set RTX 2070S without setting
with
I have written an example which performs full NMS (not classwise) at the end instead of performing it three times during inference (which causes unnecessary context switches as NMS is performed on CPU). This barely changes the FPS. |
@YashasSamaga Do you think we should request such improvement and switchable ability in OpenCV? to use
|
I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage of doing so compared with doing one combined NMS at the end? Doing the NMS at the end will definitely help improve performance of the OpenCV CUDA backend currently but I don't know how things will change once GPU NMS kernels are added (some work is in progress for DetectionOutput layer at opencv/opencv#17301). I think the best place such a thing could be introduced is in DetectionModel which is a part of the high-level model API that was recently introduced in OpenCV DNN. |
I think no. |
I did a bit of investigation. YOLOv2 PR added NMS in region layer because there was only one region layer back then. YOLOv3 PR reused the region layer but this led to NMS being performed in each region layer. I think it's a bug which I thought was a feature all this time. I have opened an issue opencv/opencv#17415 |
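To make the "one combined NMS at the end" idea concrete, here is a minimal sketch: detections from all YOLO layers are merged into one list and a single greedy, class-agnostic NMS pass is run on it. This is plain CPU code written for illustration; the `Box` struct and function names are assumptions and this is not OpenCV's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Box { float x1, y1, x2, y2, score; };

// Intersection over union of two axis-aligned boxes.
float iou(const Box& a, const Box& b) {
    const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = iw * ih;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much, repeat.
std::vector<Box> nms(std::vector<Box> boxes, float iou_threshold) {
    assert(iou_threshold >= 0.0f && iou_threshold <= 1.0f);
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& candidate : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (iou(candidate, k) > iou_threshold) { suppressed = true; break; }
        }
        if (!suppressed) kept.push_back(candidate);
    }
    return kept;
}

// Merge the detections of every YOLO layer, then run NMS once.
std::vector<Box> combined_nms(const std::vector<std::vector<Box>>& per_layer, float iou_threshold) {
    std::vector<Box> all;
    for (const auto& layer : per_layer) all.insert(all.end(), layer.begin(), layer.end());
    return nms(std::move(all), iou_threshold);
}
```

Running NMS per layer cannot suppress a duplicate that a different layer produced for the same object, which is one reason a single pass over the merged list is attractive.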
Hi @YashasSamaga, thank you for profiling OpenCV-dnn and comparing it with tkDNN as well :) In the last few days we released a new version of tkDNN, which also includes a Darknet parser, the new mish, and batch handling for pre- and post-processing. But I haven't profiled it seriously yet. If you are interested, I can do it soonish. |
Dataset and the list of images taken from How to evaluate accuracy and speed of YOLOv4. Darknet as of e08a818 and original
The number of detections was considerably smaller in OpenCV. I eventually figured out that OpenCV was ditching detections with low confidence scores. So I added
Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98
The numbers for OpenCV are better than Darknet. I think it's because of the NMS but I wanted to rule out the possibility of variations arising due to different choices made while selecting convolution kernels on different devices (Darknet stats were generated on GTX 1050 while OpenCV stats were generated on RTX 2080 Ti). OpenCV FP32 on GTX 1050
|
If I do not set
I wonder if this default behaviour in OpenCV is correct. |
@YashasSamaga This is normal, since for Detection you should use optimal conf-thresh 0.2 - 0.25, while AP calculation should be done for each possible conf-thresh starting from 0.001. |
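The effect of the two thresholds can be illustrated with a small sketch. AP integrates over the whole precision-recall curve, so the evaluator needs the low-confidence detections too, while a deployed detector only wants the confident ones. The scores in the usage note are made up for illustration and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of detections whose confidence is at least conf_thresh.
std::size_t count_kept(const std::vector<float>& scores, float conf_thresh) {
    assert(conf_thresh >= 0.0f && conf_thresh <= 1.0f);
    return static_cast<std::size_t>(
        std::count_if(scores.begin(), scores.end(),
                      [conf_thresh](float s) { return s >= conf_thresh; }));
}
```

For example, with scores {0.9, 0.4, 0.2, 0.05, 0.002, 0.0005}, a threshold of 0.25 keeps two detections, while 0.001 keeps five, and those extra low-confidence detections extend the recall range over which AP is computed.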
Hello @AlexeyAB. batch_size x 17328 x 85 I understand that 85 is equal to [center_x, center_y, width, height, box_confidence, class_1_score, .... ]. |
they are |
@WongKinYiu But there are 3 of them: 17328, 4332, 1083. Do you know the meaning of the other 2? |
there are three yolo layers (feature pyramid). |
@WongKinYiu Thank you for your kindness. It helped me a lot. I checked the YOLOv3 and FPN papers and found the explanation of the feature pyramid. |
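The three row counts can be reproduced with a little arithmetic. The sketch below assumes 3 anchors per grid cell, layer strides of 8, 16 and 32, and a 608x608 input; the input size is inferred from the numbers rather than stated in the question, and the helper name is hypothetical.

```cpp
#include <cassert>

constexpr int kAnchorsPerCell = 3;       // assumed anchors per YOLO layer
constexpr int kValuesPerBox = 4 + 1 + 80; // box (4) + objectness (1) + 80 COCO classes = 85

// Number of candidate boxes produced by one YOLO layer.
int rows_for_layer(int net_w, int net_h, int stride) {
    assert(stride > 0 && net_w % stride == 0 && net_h % stride == 0);
    return kAnchorsPerCell * (net_w / stride) * (net_h / stride);
}
```

For a 608x608 input this gives 3 x 76 x 76 = 17328 rows at stride 8, 3 x 38 x 38 = 4332 at stride 16, and 3 x 19 x 19 = 1083 at stride 32, matching the three output shapes.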
@mive93 Do tkDNN benchmarks include the host/device memory transfer time? I was looking at tkDNN source and if I have understood correctly, the input is copied from the host to device. The input on the device is then copied to TRT's device buffer, inference is done, the outputs in TRT's buffer are copied to non-TRT output buffers. The outputs are then copied to the host. The time reported by tkDNN is the time it took for copying from a device buffer to TRT's buffers and vice-versa and the inference time. Is this correct? |
Hi @YashasSamaga, |
@mive93 Do you use overlapping in 3 threads/streams?
Yes, it reduces latency. |
Btw OpenCV benchmarks that I reported had included the GPU-CPU transfer times. They total up to 1.1ms on RTX 2080 Ti (0.3ms for input, 0.53ms for output1, 0.25ms for output2 and 0.03ms for output3) for single image inference with pinned host memory. If this extra time is deducted from the OpenCV timings I reported, I think OpenCV is faster than tkDNN on RTX 2080 Ti for single image inference. OpenCV master (as of today) takes 9.5ms for single image inference (inclusive of the 1.1ms) and tkDNN takes 9.0ms. Subtracting 1.1ms gives ~8.4ms for OpenCV but tkDNN is also making a device to device copy during inference which OpenCV doesn't but D2D copies are much faster (probably very very small compared to 1.1ms) than H2D or D2H copies. Anyway, OpenCV and tkDNN are close enough that any benchmark will depend on these minute details. So it's not meaningful to compare with numbers very close to each other. |
If 3 operations are overlapped, then they increase the latency, but do not affect the FPS. |
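A rough way to see this is an idealized pipeline model, which is an assumption here and ignores synchronization overhead and contention: every frame still passes through all stages, so its latency is the sum of the stage times, while the steady-state frame rate is limited only by the slowest stage. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Time one frame needs to pass through all stages, in milliseconds.
double latency_ms(const std::vector<double>& stage_ms) {
    assert(!stage_ms.empty());
    return std::accumulate(stage_ms.begin(), stage_ms.end(), 0.0);
}

// Frame rate when stages run one after another for each frame.
double serial_fps(const std::vector<double>& stage_ms) {
    return 1000.0 / latency_ms(stage_ms);
}

// Frame rate when stages overlap perfectly: limited by the slowest stage.
double pipelined_fps(const std::vector<double>& stage_ms) {
    assert(!stage_ms.empty());
    return 1000.0 / *std::max_element(stage_ms.begin(), stage_ms.end());
}
```

Plugging in the OpenCV figures quoted earlier in the thread purely as an illustration (0.3 ms upload, about 8.4 ms inference, 0.81 ms total download), the per-frame latency is about 9.5 ms in both cases, but the ideal overlapped frame rate is about 119 FPS compared with about 105 FPS when the steps run serially.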
@mive93 Hi, Have you increased FPS more than in your table a month ago? #5354 (comment) Can you show actual FPS for RTX 2080Ti and yolov4.cfg 320-608 if FPS was increased? |
Hi @AlexeyAB, I'm sorry, I didn't notice the notification with your question. However, I had time to check the problem with the mAP and I finally understood why we had that accuracy drop.
Here are the new results on CodaLab for COCO val2017, for thresholds 0.001 and 0.3.
Here the results for other networks: https://github.com/ceccocats/tkDNN#map-results |
@mive93 @YashasSamaga |
interested |
Dear Alexey,
first of all, thank you for your work.
I have been doing some tests with your new yolov4, and I have some questions.
I compared the performance of Yolov4, Yolov3 and CSPResNext50-Panet-SPP (the one I found also in your repo) on two different GPUs, using input size 416x416, and I have checked the mAP for the COCO2017 validation set.
Here are the results (both FPS and mAP have been computed using your code):
However, I have noticed that you do not compare with the third network in your paper. I was wondering what the reason is, and whether I am perhaps doing something wrong.
Thank you in advance.