Benchmarking GPU vs. CPU: Unexpected Results #11

Open

NickSwardh opened this issue Apr 2, 2024 · 2 comments
NickSwardh (Owner) commented Apr 2, 2024

Originally posted by @niclasastrom in #9 (comment)

Performance using the GPU is worse than using the CPU. I have an RTX 4070 running Windows 11 Pro, with the latest OS and NVIDIA driver updates installed.

I expected higher throughput on the GPU, but I could be wrong. What performance can I expect, CPU vs. GPU?

For example, the classification test took 130 ms on the CPU and 572 ms on the GPU. Do you know if this is expected?

I added a couple of lines to measure compute time:

var stopWatch = new Stopwatch(); // requires using System.Diagnostics;
stopWatch.Start();
List<Classification> results = yolo.RunClassification(image, 3); // Get top 3 classifications. Default = 1
stopWatch.Stop();
Console.WriteLine("Elapsed time: " + stopWatch.ElapsedMilliseconds + " ms");

Thanks for your input. If this follow-up question doesn't fit the topic, please forgive me and I will try to file my question somewhere else.

NickSwardh (Owner) commented Apr 2, 2024

This is normal behavior for the very first inference on the GPU. During startup, the inputs have to be copied from the CPU to the GPU to prime it, which inflates the measured execution time. Once that first run is out of the way, subsequent inferences run as fast as expected.
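Here's a minimal benchmarking sketch (reusing the yolo and image variables from your snippet) that keeps that one-time warm-up cost out of the measurement; the run count of 10 is arbitrary:

// Warm-up: the first inference pays the one-time GPU init and copy cost.
yolo.RunClassification(image, 3);

// Time steady-state inferences only.
var stopWatch = Stopwatch.StartNew();
const int runs = 10;
for (int i = 0; i < runs; i++)
{
    List<Classification> results = yolo.RunClassification(image, 3);
}
stopWatch.Stop();
Console.WriteLine("Average per inference: " + stopWatch.ElapsedMilliseconds / runs + " ms");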

Quote from the ONNX Runtime docs on CPU vs. GPU execution:

When working with non-CPU execution providers, it’s most efficient to have inputs (and/or outputs) arranged on the target device (abstracted by the execution provider used) prior to executing the graph (calling Run()). When the input is not copied to the target device, ORT copies it from the CPU as part of the Run() call. Similarly, if the output is not pre-allocated on the device, ORT assumes that the output is requested on the CPU and copies it from the device as the last step of the Run() call. This eats into the execution time of the graph, misleading users into thinking ORT is slow when the majority of the time is spent in these copies.

As the docs also state, this can be addressed by allocating memory on the GPU prior to execution.
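At the ONNX Runtime level, pre-arranging the tensors looks roughly like the IOBinding sketch below. This is an illustrative sketch only, not YoloDotNet's actual implementation; the model path, the input/output names ("images", "output0"), and the 1x3x224x224 shape are assumptions you'd replace with your own model's metadata:

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Create a session on the CUDA execution provider (device 0).
using var options = SessionOptions.MakeSessionOptionWithCudaProvider(0);
using var session = new InferenceSession("model.onnx", options);

// Bind the input once, from a fixed buffer that can be reused across runs.
var input = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
using var inputValue = FixedBufferOnnxValue.CreateFromTensor(input);
using var ioBinding = session.CreateIoBinding();
ioBinding.BindInput("images", inputValue);

// Let ORT allocate the output on the CUDA device so it isn't copied
// back to the CPU as the last step of every Run() call.
using var cudaMemInfo = new OrtMemoryInfo(OrtMemoryInfo.allocatorCUDA,
    OrtAllocatorType.DeviceAllocator, 0, OrtMemType.Default);
ioBinding.BindOutputToDevice("output0", cudaMemInfo);

using var runOptions = new RunOptions();
session.RunWithBinding(runOptions, ioBinding);

The upcoming YoloDotNet option presumably wraps something along these lines internally.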

I'm currently adding a new option to YoloDotNet that primes the GPU with allocated memory before execution. In my own tests on an RTX 3060, I get these approximate results for the first inference:

Classification, for example:

CPU: 66 ms
GPU without allocated GPU memory: 541 ms
GPU with allocated GPU memory: 13 ms

Object Detection, for example:

CPU: 245 ms
GPU without allocated GPU memory: 4715 ms
GPU with allocated GPU memory: 60 ms

@niclasastrom

Wow! That's impressive! Thanks for the explanation. I will certainly download the new version as soon as it becomes available.
