Benchmarking GPU vs. CPU: Unexpected Results #11

Open

NickSwardh opened this issue Apr 2, 2024 · 2 comments
NickSwardh (Owner) commented Apr 2, 2024

Originally posted by @niclasastrom in #9 (comment)

Performance using the GPU is worse than using the CPU. I have an RTX 4070 running Windows 11 Pro, with the latest OS and NVIDIA driver updates installed.

I expected higher throughput on the GPU, but I could be wrong. What performance can I expect, CPU vs. GPU?

For example, the classification test took 130 ms on the CPU and 572 ms on the GPU. Do you know if this is expected?

I added a couple of lines to measure compute time:

var stopWatch = new Stopwatch(); // requires using System.Diagnostics;
stopWatch.Start();
List<Classification> results = yolo.RunClassification(image, 3); // Get top 3 classifications. Default = 1
stopWatch.Stop();
Console.WriteLine("Elapsed time: " + stopWatch.ElapsedMilliseconds + " ms");

Thanks for your input. If this follow-up question doesn't fit the topic, please forgive me and I will try to file my question somewhere else.

NickSwardh (Owner) commented Apr 2, 2024

This is normal behavior for the very first inference on the GPU. During startup, the inputs have to be copied from the CPU to the GPU to prime it, which inflates the measured execution time. Once that first run is out of the way, subsequent inferences run as fast as expected.
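Here's a minimal benchmarking sketch (reusing the yolo and image variables from your snippet) that keeps that one-time warm-up cost out of the measurement; the run count of 10 is arbitrary:

// Warm-up: the first inference pays the one-time GPU init and copy cost.
yolo.RunClassification(image, 3);

// Time steady-state inferences only.
var stopWatch = Stopwatch.StartNew();
const int runs = 10;
for (int i = 0; i < runs; i++)
{
    List<Classification> results = yolo.RunClassification(image, 3);
}
stopWatch.Stop();
Console.WriteLine("Average per inference: " + stopWatch.ElapsedMilliseconds / runs + " ms");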

Quote from the ONNX Runtime docs on CPU vs. GPU execution:

When working with non-CPU execution providers, it’s most efficient to have inputs (and/or outputs) arranged on the target device (abstracted by the execution provider used) prior to executing the graph (calling Run()). When the input is not copied to the target device, ORT copies it from the CPU as part of the Run() call. Similarly, if the output is not pre-allocated on the device, ORT assumes that the output is requested on the CPU and copies it from the device as the last step of the Run() call. This eats into the execution time of the graph, misleading users into thinking ORT is slow when the majority of the time is spent in these copies.

As the docs also state, this can be addressed by allocating memory on the GPU prior to execution.
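At the ONNX Runtime level, pre-arranging the tensors looks roughly like the IOBinding sketch below. This is an illustrative sketch only, not YoloDotNet's actual implementation; the model path, the input/output names ("images", "output0"), and the 1x3x224x224 shape are assumptions you'd replace with your own model's metadata:

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Create a session on the CUDA execution provider (device 0).
using var options = SessionOptions.MakeSessionOptionWithCudaProvider(0);
using var session = new InferenceSession("model.onnx", options);

// Bind the input once, from a fixed buffer that can be reused across runs.
var input = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
using var inputValue = FixedBufferOnnxValue.CreateFromTensor(input);
using var ioBinding = session.CreateIoBinding();
ioBinding.BindInput("images", inputValue);

// Let ORT allocate the output on the CUDA device so it isn't copied
// back to the CPU as the last step of every Run() call.
using var cudaMemInfo = new OrtMemoryInfo(OrtMemoryInfo.allocatorCUDA,
    OrtAllocatorType.DeviceAllocator, 0, OrtMemType.Default);
ioBinding.BindOutputToDevice("output0", cudaMemInfo);

using var runOptions = new RunOptions();
session.RunWithBinding(runOptions, ioBinding);

The upcoming YoloDotNet option presumably wraps something along these lines internally.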

I'm currently adding a new option to YoloDotNet that primes the GPU with allocated memory before execution. In my own tests on an RTX 3060, I get these approximate results for the first inference:

Classification, for example:

CPU: 66 ms
GPU without allocated GPU memory: 541 ms
GPU with allocated GPU memory: 13 ms

Object Detection, for example:

CPU: 245 ms
GPU without allocated GPU memory: 4715 ms
GPU with allocated GPU memory: 60 ms

@niclasastrom

Wow! That's impressive! Thanks for the explanation. I will certainly download the new version as soon as it becomes available.
