Performance issues on Nvidia GPUs with mixed precision and accuracy issues
Discussed in #315
Originally posted by aliencaocao October 19, 2022
I ran a simple benchmark of ResNetRS50 on an RTX 3080 Ti, comparing the DirectML plugin 0.1.1.dev221004 against CUDA 11.8 + cuDNN 8.6.0, and found that DML is much slower than CUDA: it uses only about 50% of the GPU during training, while CUDA sits constantly at 100%. Both tests were run with mixed precision off and a batch size of 64.
Training 10 epochs took 416 seconds on DML but only 164 seconds on CUDA. Both runs used TF 2.10 (the CPU build for DML) and Python 3.9.13.
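For reference, here is a minimal sketch of the kind of training loop being benchmarked; the synthetic data pipeline, SGD optimizer, and step count are illustrative assumptions, not the original script:

```python
import time
import tensorflow as tf

BATCH_SIZE = 64

# Synthetic data (random images/labels) just to keep the GPU busy;
# the real benchmark used an actual dataset.
images = tf.random.uniform((BATCH_SIZE, 224, 224, 3))
labels = tf.random.uniform((BATCH_SIZE,), maxval=1000, dtype=tf.int32)
dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .batch(BATCH_SIZE)
    .repeat()
    .prefetch(tf.data.AUTOTUNE)
)

# ResNetRS50 from keras.applications, trained from scratch, mixed precision off.
model = tf.keras.applications.ResNetRS50(weights=None, classes=1000,
                                         input_shape=(224, 224, 3))
model.compile(optimizer="sgd",
              loss=tf.keras.losses.SparseCategoricalCrossentropy())

start = time.perf_counter()
model.fit(dataset, epochs=10, steps_per_epoch=100)
print(f"10 epochs: {time.perf_counter() - start:.1f} s")
```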
This raises the big performance question: is DML optimized for Nvidia GPUs at all, especially for their Tensor Cores and the TensorFloat-32 data type? And what could cause it not to use 100% of my GPU? I tried increasing the batch size, but it just OOMs, so 64 is definitely a large enough batch size to saturate the GPU (as shown by the 100% usage on CUDA).
Or is this something that will be optimized in the future, just not yet?
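For context, these are the stock TensorFlow controls for TensorFloat-32 on the CUDA backend; whether the DirectML plugin respects them at all is part of what I am asking:

```python
import tensorflow as tf

# TF32 is on by default for Ampere GPUs under the CUDA backend (TF >= 2.4);
# it is unclear whether the DirectML plugin pays any attention to this setting.
print(tf.config.experimental.tensor_float_32_execution_enabled())
tf.config.experimental.enable_tensor_float_32_execution(True)
```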
UPDATE TL;DR: the performance issues mentioned above have been partially resolved in 0.2.0, but the fix introduced a model accuracy loss issue that has yet to be resolved. See #315 (reply in thread).
This makes the plugin not worth switching to on Nvidia Ampere GPUs (and potentially other Nvidia GPUs).
Mixed precision can now run, but with poor performance for the time being (on 0.1.1 it could not run at all).
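For clarity, "mixed precision" here means the standard Keras global policy, nothing DirectML-specific; roughly:

```python
import tensorflow as tf

# Standard Keras mixed-precision policy, set before building the model.
# Under the DirectML plugin this could not run on 0.1.1 and runs slowly on 0.2.0.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.applications.ResNetRS50(weights=None, classes=1000,
                                         input_shape=(224, 224, 3))
model.compile(optimizer="sgd",
              loss=tf.keras.losses.SparseCategoricalCrossentropy())
# Keras automatically wraps the optimizer in a LossScaleOptimizer under this policy.
```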