Performance issues on Nvidia GPUs with mixed precision and accuracy issues #350

Open
aliencaocao opened this issue Feb 3, 2023 Discussed in #315 · 0 comments

aliencaocao commented Feb 3, 2023

Discussed in #315

Originally posted by aliencaocao October 19, 2022
I ran a simple ResNetRS50 benchmark on an RTX 3080 Ti, comparing the DirectML plugin 0.1.1.dev221004 against CUDA 11.8 + cuDNN 8.6.0. DML is much slower than CUDA and uses only about 50% of the GPU while training, whereas CUDA stays at 100%. Both tests were run with mixed precision off and a batch size of 64.

Training for 10 epochs took 416 seconds on DML but only 164 seconds on CUDA, so DML took about 2.5 times as long. Both runs used TF 2.10 (the CPU package for DML) and Python 3.9.13.
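For anyone trying to reproduce the comparison, here is a minimal sketch of such a timing run. The post does not include the benchmark script, so the synthetic 224x224 inputs, the 10 classes, the SGD optimizer, and the `steps_per_epoch` value are all assumptions, and `train_resnetrs50` and `slowdown` are hypothetical helper names. The measured time also includes graph building in the first epoch.

```python
import time


def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times longer the slow run took than the fast run."""
    return slow_seconds / fast_seconds


def train_resnetrs50(epochs: int = 10, batch_size: int = 64,
                     steps_per_epoch: int = 50) -> float:
    """Train ResNetRS50 on random data and return wall-clock seconds.

    Assumed, not taken from the post: input size, class count,
    optimizer, and number of steps per epoch.
    """
    import tensorflow as tf  # lazy import so slowdown() works without TF

    # One random batch, reused for every step to keep memory use small.
    images = tf.random.uniform((batch_size, 224, 224, 3))
    labels = tf.random.uniform((batch_size,), maxval=10, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensors((images, labels)).repeat(steps_per_epoch)

    model = tf.keras.applications.ResNetRS50(
        weights=None, classes=10, input_shape=(224, 224, 3))
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

    start = time.perf_counter()
    model.fit(dataset, epochs=epochs, verbose=0)
    return time.perf_counter() - start


# Ratio of the two timings reported in the post (416 s on DML, 164 s on CUDA).
print(f"{slowdown(416, 164):.2f}x")  # → 2.54x
```

Running `train_resnetrs50()` once in an environment with the DirectML plugin and once in a CUDA environment, then passing both results to `slowdown`, gives a like-for-like ratio on the same synthetic workload.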

This raises the big performance question: is DML optimized at all for Nvidia GPUs, especially their Tensor Cores and the TensorFloat-32 data type? And what could keep it from using 100% of my GPU? I tried increasing the batch size, but that just runs out of memory, so 64 is definitely large enough to fully use the GPU (as the 100% usage on CUDA shows).

Or is this something that will be optimized in the future but just hasn't been yet?

UPDATE TL;DR: the performance issues mentioned above have been partially resolved in 0.2.0, but the fix introduced a model accuracy loss issue that has yet to be resolved. See #315 (reply in thread).
This makes the plugin not worth switching to on Nvidia Ampere GPUs (and potentially other Nvidia GPUs).
Mixed precision can now run, but with poor performance (on 0.1.1 it could not run at all).
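
Since the comparison depends on which precision settings are active, here is a minimal sketch of how they can be set explicitly in TensorFlow before a run. `policy_name` and `configure_precision` are hypothetical helper names. TensorFlow enables TensorFloat-32 by default on GPUs that support it, so a CUDA run with mixed precision off may still use Tensor Cores for float32 math. Whether the TF32 switch has any effect under the DirectML plugin is not something this sketch assumes.

```python
def policy_name(mixed_precision: bool) -> str:
    """Map the on/off choice to a Keras dtype policy name."""
    return "mixed_float16" if mixed_precision else "float32"


def configure_precision(mixed_precision: bool, tf32: bool) -> None:
    """Apply both precision settings; call this before building the model."""
    import tensorflow as tf  # lazy import so policy_name() works without TF

    # Keras-level policy: float16 compute with float32 variables when enabled.
    tf.keras.mixed_precision.set_global_policy(policy_name(mixed_precision))
    # TF32 for float32 matmuls and convolutions (on by default in TensorFlow).
    tf.config.experimental.enable_tensor_float_32_execution(tf32)
```

For example, `configure_precision(mixed_precision=False, tf32=False)` gives a strict float32 baseline on CUDA, while `configure_precision(mixed_precision=True, tf32=True)` matches the mixed precision case described above.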

aliencaocao changed the title from "Performance issues on Nvidia GPUs" to "Performance issues on Nvidia GPUs with mixed precision and accuracy issues" on Feb 3, 2023