Performance issues on Nvidia GPUs with mixed precision and accuracy issues
Discussed in #315
Originally posted by aliencaocao October 19, 2022
I ran a simple benchmark of ResNetRS50 on an RTX 3080 Ti, comparing the DirectML plugin 0.1.1.dev221004 against CUDA 11.8 + cuDNN 8.6.0, and found that DML is much slower than CUDA: it uses only about 50% of the GPU during training, while CUDA sits constantly at 100%. Both tests were run with mixed precision off and a batch size of 64.
Training 10 epochs took 416 seconds on DML but only 164 seconds on CUDA. Both runs used TF 2.10 (the CPU build for DML) and Python 3.9.13.
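For reference, here is a minimal sketch of the kind of training loop being benchmarked; the synthetic data pipeline, SGD optimizer, and step count are illustrative assumptions, not the original script:

```python
import time
import tensorflow as tf

BATCH_SIZE = 64

# Synthetic data (random images/labels) just to keep the GPU busy;
# the real benchmark used an actual dataset.
images = tf.random.uniform((BATCH_SIZE, 224, 224, 3))
labels = tf.random.uniform((BATCH_SIZE,), maxval=1000, dtype=tf.int32)
dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .batch(BATCH_SIZE)
    .repeat()
    .prefetch(tf.data.AUTOTUNE)
)

# ResNetRS50 from keras.applications, trained from scratch, mixed precision off.
model = tf.keras.applications.ResNetRS50(weights=None, classes=1000,
                                         input_shape=(224, 224, 3))
model.compile(optimizer="sgd",
              loss=tf.keras.losses.SparseCategoricalCrossentropy())

start = time.perf_counter()
model.fit(dataset, epochs=10, steps_per_epoch=100)
print(f"10 epochs: {time.perf_counter() - start:.1f} s")
```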
This raises the big performance question: is DML optimized for Nvidia GPUs at all, especially for their Tensor Cores and the TensorFloat-32 data type? And what could cause it not to use 100% of my GPU? I tried increasing the batch size, but it just OOMs, so 64 is definitely a large enough batch size to saturate the GPU (as shown by the 100% usage on CUDA).
Or is this something that will be optimized in the future, just not yet?
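For context, these are the stock TensorFlow controls for TensorFloat-32 on the CUDA backend; whether the DirectML plugin respects them at all is part of what I am asking:

```python
import tensorflow as tf

# TF32 is on by default for Ampere GPUs under the CUDA backend (TF >= 2.4);
# it is unclear whether the DirectML plugin pays any attention to this setting.
print(tf.config.experimental.tensor_float_32_execution_enabled())
tf.config.experimental.enable_tensor_float_32_execution(True)
```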
UPDATE TL;DR: the performance issues mentioned above have been partially resolved in 0.2.0, but the fix introduced a model accuracy loss issue that has yet to be resolved. See #315 (reply in thread).
This makes the plugin not worth switching to on Nvidia Ampere GPUs (and potentially other Nvidia GPUs).
Mixed precision can now run, but with poor performance for the time being (on 0.1.1 it could not run at all).
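For clarity, "mixed precision" here means the standard Keras global policy, nothing DirectML-specific; roughly:

```python
import tensorflow as tf

# Standard Keras mixed-precision policy, set before building the model.
# Under the DirectML plugin this could not run on 0.1.1 and runs slowly on 0.2.0.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.applications.ResNetRS50(weights=None, classes=1000,
                                         input_shape=(224, 224, 3))
model.compile(optimizer="sgd",
              loss=tf.keras.losses.SparseCategoricalCrossentropy())
# Keras automatically wraps the optimizer in a LossScaleOptimizer under this policy.
```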