Best way to use YOLO on Jetson Xavier with max FPS #5386

Open · Kmarconi opened this issue Apr 28, 2020 · 28 comments

@Kmarconi

Hi! First, thanks for the continuous updates you are making to your repo, it's amazing. I'm working on a project in which I would like to detect only one class of object, but at high speed (at least 60 FPS). I just tested your yolov4 files and the pruned yolov3 weights, and I'm stuck at 5 FPS on my Xavier, whereas if I remember well I was around 20 FPS with yolov3. I know that yolov4 is heavier than yolov3, but I was hoping the pruned version of yolov3 would run at a higher FPS; it did not, so I think I did something wrong.

To compile darknet, I've set the flags GPU, CUDNN, CUDNN_HALF and OPENCV to 1. I also uncommented the ARCH line for the Xavier. Do I need to do anything else?
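For reference, a sketch of the corresponding Makefile settings in AlexeyAB/darknet (the ARCH line is the Xavier entry that ships commented out; exact wording may differ between repo versions):

```makefile
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1

# Jetson AGX Xavier (Volta, compute capability 7.2)
ARCH= -gencode arch=compute_72,code=[sm_72,compute_72]
```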

For now, I'm able to run some object detection algorithms at 150 FPS (ssd-inception) on the Xavier, but I really would like to use YOLO because of its accuracy. I know that I need to use TensorRT and quantization so that the weights use FP16 or INT8 instead of FP32, and I know how to do it with TensorFlow, but with darknet I'm kind of lost. Can you give me some help?

PS: I know that DeepStream supports YOLO natively, but I would like to build a Python or C++ object-detection app, and I'm not sure it is possible to "import" the DeepStream pipeline into a Python app and get the detected objects from it.

Best regards. Sorry for this long message, but I'm passionate about YOLO ^^

@DocF commented Apr 28, 2020

In my view, for detection of a single class, as long as the targets are not dense small objects, yolov3-tiny is enough.

@AlexeyAB (Owner)

> I know that DeepStream supports YOLO natively, but I would like to do a Python or C++ object-detection app

Isn't Deepstream already a C++ app? https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps and https://github.com/NVIDIA-AI-IOT/deepstream_4.x_apps

> I know that I need to use TensorRT and quantization so that the weights use FP16 or INT8 and not FP32

Yes, you can try INT8 quantization with TensorRT + DeepStream.
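For illustration, a minimal sketch of the relevant [property] keys in a DeepStream nvinfer config file; the key names follow the nvinfer documentation, while the file names are placeholders for your own model files:

```ini
[property]
# network-mode: 0=FP32, 1=INT8, 2=FP16
network-mode=1
int8-calib-file=calib.table
custom-network-config=yolov4.cfg
model-file=yolov4.weights
batch-size=1
```

INT8 also needs a calibration table generated from representative images; without a valid table, the engine builder will typically fall back to a higher precision.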

> I would like to be able to detect only one class of object, but at high speed (at least 60 FPS).

Very approximately, for yolov4.cfg (see the cfg sketch after this list):

  • width=416 height=416 in cfg - 9 FPS
  • width=320 height=320 in cfg - 13 FPS
  • width=320 height=320 in cfg INT8-TensorRT - 25 FPS
  • width=320 height=320 in cfg INT8-TensorRT batch=32 - 50 FPS
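The resolution change above is just the [net] section of yolov4.cfg; both values must stay multiples of 32:

```ini
[net]
width=320
height=320
```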

I made a repo with INT8 quantization for Yolov2/v3, but it doesn't support Yolov4: https://github.com/AlexeyAB/yolo2_light

So it may be better for you to use yolov3-tiny-prn.cfg.

[chart: modern_gpus]

@marcoslucianops commented Apr 28, 2020

> PS: I know that DeepStream supports YOLO natively, but I would like to build a Python or C++ object-detection app, and I'm not sure it is possible to "import" the DeepStream pipeline into a Python app and get the detected objects from it.

You can get metadata from DeepStream in Python and C. For C, you need to edit the deepstream-app or deepstream-test code. For Python, you need to install and edit this.

You need to manipulate NvDsObjectMeta, NvDsFrameMeta and NvOSD_RectParams to get the label, position, etc. of the bboxes.

In the C deepstream-app application, your code needs to go in the analytics_done_buf_prob function. In the C/Python deepstream-test applications, it needs to go in the tiler_src_pad_buffer_probe function.

Example using C: https://www.youtube.com/watch?v=eFv4P1oj9pA
Example using Python: https://www.youtube.com/watch?v=n3uYS550PDo
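As an illustration of the probe @marcoslucianops describes, here is a minimal sketch in Python. It assumes the pyds bindings from NVIDIA's deepstream_python_apps and follows the iteration pattern of their sample apps; error handling is omitted:

```python
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst
import pyds

def tiler_src_pad_buffer_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    # DeepStream attaches batch-level metadata to the Gst buffer.
    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            rect = obj_meta.rect_params  # NvOSD_RectParams: left, top, width, height
            print(obj_meta.obj_label, rect.left, rect.top, rect.width, rect.height)
            l_obj = l_obj.next
        l_frame = l_frame.next
    return Gst.PadProbeReturn.OK
```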

Python is slightly slower than C (by ~2 FPS on a Jetson Nano).

@Kmarconi (Author)

Hi, first thanks for your three quick replies! Since I would like my model to detect objects that can be large in the foreground but also small in the very background of the image, I'm not sure yolov3-tiny is a viable option for me. Correct me if I'm wrong, but I know that yolov3 analyzes the image at three different scales, which is a good feature for my purpose. However, that is done with 106 convolutional layers, and I don't know if the few layers of yolov3-tiny are enough to detect one object at both a large and a small scale. I will take a look at your links, @marcoslucianops, thanks! :) And thanks for your answer too, @AlexeyAB :)

@AlexeyAB (Owner)

@Kmarconi @marcoslucianops You can run Yolov4 on TensorRT using tkDNN at 32 FPS (FP16) / 17 FPS (FP32) with batch=1 on the AGX Xavier: #5354 (comment)

With batch=4, FPS will be higher.

@Kmarconi (Author)

Thanks! Will give it a try!

@marcoslucianops

@AlexeyAB, I will compare tkDNN and DeepStream. Thanks!

@Kmarconi (Author)

To keep you updated: I'm currently at around 34 FPS with yolov4 on the Xavier with tkDNN.

@AlexeyAB (Owner)

@Kmarconi What batch size, network resolution, and floating-point precision (32/16) do you use?

@Kmarconi (Author)

I'm using a batch size of 4 and FP16 mode, and I haven't touched the network resolution yet, so it's the default yolov4 one.

@AlexeyAB (Owner)

So you get 34 FPS on the Jetson Xavier using yolov4.cfg with width=608 height=608, batch_size=4 and FP16, via tkDNN+TensorRT?

@Kmarconi (Author)

Sorry for the late response; I'm working in France, so I'm not awake at the same hours as you ^^ I'm using width=416 height=416, batch=4 and FP16 with tkDNN+TRT to get 34 FPS on the Xavier, yes! :) I know it is probably too hard or too time-consuming, but it would be amazing to one day see an easy integration of TensorRT into the darknet project for every GPU architecture that supports it. I will continue testing tkDNN today and will keep posting results.

@marcoslucianops

@AlexeyAB, DeepStream is faster than tkDNN. tkDNN reports a 45.381 ms inference time, but the displayed video looks like 10-15 FPS on the Jetson Nano. I think it's due to OpenCV.

@AlexeyAB (Owner) commented May 1, 2020

@mive93 Hi, can you comment on this?

@mive93 commented May 1, 2020

Hi @marcoslucianops,
how are you using tkDNN? Have you enabled FP16 inference? Have you enabled preprocessing on the GPU? We have never tested tkDNN on a Jetson Nano, so I have no data on that. However, yes, you are right, OpenCV could be a problem for performance.

Hi @Kmarconi

> To keep you updated: I'm currently at around 34 FPS with yolov4 on the Xavier with tkDNN.

How did you obtain this number? I think you are doing something wrong; those are the FPS with batch = 1.

@marcoslucianops commented May 1, 2020

> Have you enabled FP16 inference?

I compared DeepStream FP32 vs tkDNN FP32.

> Have you enabled preprocessing on GPU?

Yes

[screenshot]

I think the problem (the delay) is in OpenCV, when drawing the bboxes and calling imshow.

[screenshot]

@AlexeyAB (Owner) commented May 2, 2020

@mive93

I think, for tkDNN:

  • it shouldn't show all frames on the screen, so the CPU thread that draws detections on the screen should work asynchronously and show only the latest frame;

  • if file output is implemented, it should still write all frames to the output.avi file.

@Kmarconi (Author) commented May 4, 2020

Hi @mive93,

Yeah, I just saw that I was mistaken about the batch_size. I hadn't seen:

> The test will still run with a batch of 1, but the created tensorRT can manage the desired batch size.

So even if I export the batch_size variable to 4, for example, I will run my inference with only a batch size of 1? Then how can I use the full potential of my TRT engine?

PS: 160 FPS with mobilenet on the Xavier, wow ^^

@mive93 commented May 11, 2020

@AlexeyAB @marcoslucianops yeah, it's due to OpenCV. And @AlexeyAB, you are right, we should add a flag to disable the graphics. However, tkDNN is meant to be a library, so the demo is just an example; it's not how you would actually use it. Of course, when I use it in other projects, the graphics part is handled by other tasks. But maybe I could add a demo like that.

@Kmarconi thanks :)
Right now batches can only be used to check the FPS (using the rt_inference test). But this week I'm planning to allow using them in a demo, so that anyone can really test with more batches. It was a WIP.

@AlexeyAB (Owner)

@mive93 You can just add this kind of code to the bbox_drawing(), wait_key_cv() and show() functions, so that they are called no more than 100 times per second in the demo:

darknet/src/demo.c (lines 295 to 296 in 0c7305c):

```c
const int each_frame = max_val_cmp(1, avg_fps / 100);
if (global_frame_counter % each_frame == 0) show_image_mat(show_img, "Demo");
```

@harsco-jfernandez commented May 30, 2020

How are you all getting the Xavier to work at 34 FPS? I'm only able to get 24 FPS!

I've set the following, and my model is 320x320, not 416x416 like yours:

```
TKDNN_BATCHSIZE=4
TKDNN_MODE=FP16
```

What else do I need?
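For context, those are tkDNN settings read from the environment, so they only take effect when the TensorRT engine (the .rt file) is regenerated; assuming the workflow from the tkDNN README (binary and file names are its examples, adjust to your setup), something like:

```sh
export TKDNN_MODE=FP16      # build the engine in half precision
export TKDNN_BATCHSIZE=4    # maximum batch size baked into the engine
rm yolo4_fp16.rt            # remove the old engine so it gets rebuilt
./test_yolo4                # recreate the engine and run the test
```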

```
yolo4_fp16.rt
New NetworkRT (TensorRT v6.01)
Float16 support: 1
Int8 support: 1
DLAs: 2
create execution context
Input/outputs numbers: 4
input idex = 0 -> output index = 3
Data dim: 1 3 320 320 1
Data dim: 1 33 10 10 1
RtBuffer 0 dim: Data dim: 1 3 320 320 1
RtBuffer 1 dim: Data dim: 1 33 40 40 1
RtBuffer 2 dim: Data dim: 1 33 20 20 1
RtBuffer 3 dim: Data dim: 1 33 10 10 1
===== TENSORRT detection ====
Time: 0.725123 ms
Data dim: 1 3 320 320 1
Time: 19.7376 ms
Data dim: 1 33 10 10 1
Time: 0.585052 ms

===== TENSORRT detection ====
Time: 0.71021 ms
Data dim: 1 3 320 320 1
Time: 19.7166 ms
Data dim: 1 33 10 10 1
Time: 0.396787 ms

===== TENSORRT detection ====
Time: 0.676224 ms
Data dim: 1 3 320 320 1
Time: 19.7656 ms
Data dim: 1 33 10 10 1
Time: 0.360881 ms

===== TENSORRT detection ====
Time: 0.758276 ms
Data dim: 1 3 320 320 1
Time: 19.7501 ms
Data dim: 1 33 10 10 1
Time: 0.458837 ms
```

@Kmarconi (Author)

Are you in MAXN mode, and did you run sudo /usr/bin/jetson_clocks?
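For reference, on a Jetson these are typically set as follows (MAXN is power model 0 on the AGX Xavier):

```sh
sudo nvpmodel -m 0            # select the MAXN power mode
sudo /usr/bin/jetson_clocks   # lock clocks at their maximum
```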

@mive93 commented May 30, 2020

Hi @harsco-jfernandez,

first, @Kmarconi is right.
How did you create the .rt file?
Which command did you run to print those results?
How did you use the batches? Right now batches are supported in the demo only on the eval branch.

@harsco-jfernandez commented May 30, 2020

Thank you, fellows!

Your questions are as good as answers. I made some assumptions. It is running at 40 FPS now.

@AlexeyAB (Owner)

@harsco-jfernandez 40 FPS is a good speed for Yolov4 on Jetson Xavier AGX.

@harsco-jfernandez commented May 30, 2020

@AlexeyAB It is excellent! I love it!

I'm now trying INT8 inference. My camera is capable of 100 FPS.

@rafcy commented Jun 10, 2020

Has anyone tested the performance on a Jetson Xavier NX instead of the AGX? (It's almost half the price of the AGX.)

@mive93 commented Jun 11, 2020

Hi @rafcy,
Not yet, I'm waiting for the board to be shipped.
But soonish I will do some tests on the Nano.
