Training VGG13 net with RX6600 is slow #326

Open
thinksmert opened this issue Nov 1, 2022 · 10 comments

Comments

@thinksmert

my environment:
windows 11 64bit
python 3.9 64bit
tensorflow 2.10
tensorflow-directml-plugin 0.2.0.dev221020
AMD Radeon RX 6600, Nvidia RTX1060
Conda 22.9.0

I'm training a VGG13 net in a Miniconda environment. I have two configurations:
1. Nvidia RTX1060 + tensorflow-gpu
2. RX6600 (more powerful than the RTX1060) + tensorflow-cpu + tensorflow-directml-plugin
With the first configuration, training is very fast, about 6s per training period. With the second configuration it is much slower, about 30s per training period.
My guess is that the second configuration is slower because it uses tensorflow-cpu rather than tensorflow-gpu. Is that right?
Is there any way to improve the training speed with the second configuration? Or when will tensorflow-directml-plugin support tensorflow-gpu?

Thanks

@PatriceVignola
Contributor

Hi @thinksmert,

  1. Could you try running the model on your RTX1060 with tensorflow-cpu + tensorflow-directml-plugin and give us the numbers?
  2. tensorflow-cpu + tensorflow-directml-plugin is supposed to use your GPU, so it's not clear why it is much slower. It's possible that some operators are falling back to the CPU, which can considerably slow down the execution.

Can you send us the device placement logs? Just add the following snippet at the start of your script:

import tensorflow as tf
tf.debugging.set_log_device_placement(True)

and then, redirect the output to a file. For example:

python script.py > log.txt
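Once a log file exists, a quick summary can show whether any ops ran on the CPU. The following is a minimal sketch, not part of the plugin or of TensorFlow: the function name `summarize_placement` is hypothetical, and the line shape it matches (`Executing op <name> in device <device string>`) is an assumption about what the placement log contains, so adjust the pattern if your log looks different.

```python
import re
from collections import Counter

# Assumed line shape (not confirmed by this thread):
#   Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
PLACEMENT = re.compile(r"Executing op (\S+) in device \S*device:(\w+):(\d+)")


def summarize_placement(lines):
    """Count logged op executions per device type and collect op names per type."""
    counts = Counter()
    ops_by_device = {}
    for line in lines:
        match = PLACEMENT.search(line)
        if match is None:
            continue  # not a placement line (e.g. training progress output)
        op_name, device_type, _device_index = match.groups()
        counts[device_type] += 1
        ops_by_device.setdefault(device_type, set()).add(op_name)
    return counts, ops_by_device
```

To use it, open the log and pass the file object, for example `counts, ops = summarize_placement(open("log.txt", errors="replace"))`, then inspect `ops.get("CPU")` to see which op names were logged on the CPU.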

@thinksmert
Author

Hi,
OK, I will try it later.

@thinksmert
Author

I also wonder whether the plugin will support tensorflow-gpu.

@thinksmert
Author

Hi,
I have added your snippet to my script. This is the log file from running my script for about 30s with the second configuration. For your reference.
Thanks
log.txt

@PatriceVignola
Contributor

@thinksmert Thanks! Can you do the same thing with the tensorflow-gpu package (and without tensorflow-directml-plugin) on your Nvidia card? This will help us compare what is supposed to happen versus what is actually happening.

@thinksmert
Author

Hi,
OK, I will try to do that, maybe in a few days, because I need to take out my RX6600 and install the Nvidia card again. It will take some time.

@thinksmert
Author

Hi,
I have run two tests with my Nvidia card. The first uses tensorflow-gpu and takes about 7s per training period. The second uses tensorflow-cpu and tensorflow-directml-plugin. It takes longer (about 20s per training period) but is still faster than the RX6600 with tensorflow-cpu. Here are the logs:
The first is the RTX1060 with tensorflow-gpu.
The second is the RTX1060 with tensorflow-cpu and tensorflow-directml-plugin.

For your reference.
Thanks

log_gpu_GTX1060.txt
log_cpu_GTX1060.txt

@thinksmert
Author

Hi,
Are there any ideas?

@PatriceVignola
Contributor

The logs are identical between DML and CUDA, so it's hard to say just from that. Can I ask where you got that VGG13 script from? Running the exact same script would help us investigate this on our end.
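For anyone who wants to repeat that comparison on their own logs, here is a rough sketch of one way to do it. It is not the tool used in this thread: the helper names `op_devices` and `placement_differences` are hypothetical, and the matched line shape (`Executing op <name> in device <device string>`) is an assumption about the placement log format.

```python
import re

# Assumed line shape (not confirmed by this thread):
#   Executing op Conv2D in device /job:localhost/replica:0/task:0/device:GPU:0
PLACEMENT = re.compile(r"Executing op (\S+) in device \S*device:(\w+):\d+")


def op_devices(lines):
    """Map each op name to the set of device types it was logged on."""
    placement = {}
    for line in lines:
        match = PLACEMENT.search(line)
        if match:
            placement.setdefault(match.group(1), set()).add(match.group(2))
    return placement


def placement_differences(baseline_lines, candidate_lines):
    """Return {op: (baseline device types, candidate device types)} where they differ."""
    baseline = op_devices(baseline_lines)
    candidate = op_devices(candidate_lines)
    differences = {}
    for op_name in sorted(set(baseline) | set(candidate)):
        before = baseline.get(op_name, set())
        after = candidate.get(op_name, set())
        if before != after:
            differences[op_name] = (sorted(before), sorted(after))
    return differences
```

An empty result only says that the same ops were logged on the same device types in both runs. It says nothing about how fast each kernel runs, so identical placement can still come with different training times.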

@thinksmert
Author

thinksmert commented Nov 12, 2022

Hi,
This script is just an exercise from an online tutorial I am following to study ML. I coded it by following the tutorial step by step. The logs I gave you all come from the same script.
Thanks
