[Slim] Imagenet training not utilizing multiple GPUs efficiently #1428
Comments
Have you tried to set |
@tfboyd for performance. |
That script does not detect the number of GPUs automatically. You have to set it. To double check run nvidia-smi or |
Ok I have set However, the situation became worse. On a single GPU training reports an average of 3.1 sec/step. On 8 GPUs its now reporting an average of 3.4 sec/step. So I am really confused now as it seems not using |
I will try to reproduce it today locally and go from there. I have run a few models via slim but not this one. |
I have not forgotten you. I had some personal items and some cleanup to do on another task. |
Ahh, I had forgotten this until I set up the example and ran it. What you are seeing is 100% correct and makes sense. Slim is reporting the time per step. You are running synchronized training, so your images/sec would be
|
@tfboyd interesting. So can you explain this equation? Namely what does the number 32 represent? |
@redserpent7 Apologies, and I mean that with sincerity. It makes me crazy when people do not explain things, and I apologize again if I am too detailed, but that is the best way I can think of to explain it. 32 is the batch size, which is the default and very popular, so I assume that is what you used. With one GPU you did 32 images in 3.1 seconds, which is about 10.3 images per second. When you add more GPUs, you are processing a larger batch per step: instead of 32 images in 3.1 seconds you are processing 256 images (32 per GPU on 8 GPUs) in 3.4 seconds, which is about 75 images per second. You could add this formula to the script to get images per second and account for the number of clones. I definitely understand how this is confusing. I set up a VM and the code before I realized the answer to your question. I know I marked this closed but please, as I said, feel free to ask more questions. I am happy to help. Best of luck. |
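The arithmetic in the comment above can be sketched as a small helper. This is only an illustration: the function name `images_per_second` is made up here and is not part of the Slim script, and batch size 32 per clone is the assumption stated in the comment.

```python
def images_per_second(batch_size_per_clone, num_clones, seconds_per_step):
    """Throughput for synchronous multi-GPU training.

    Each step processes one batch per clone, so the number of images
    per step is batch_size_per_clone * num_clones.
    """
    images_per_step = batch_size_per_clone * num_clones
    return images_per_step / seconds_per_step


# Step times reported earlier in this thread (batch size 32 is assumed).
one_gpu = images_per_second(32, 1, 3.1)
eight_gpus = images_per_second(32, 8, 3.4)

print(f"1 GPU:   {one_gpu:.1f} images/sec")
print(f"8 GPUs:  {eight_gpus:.1f} images/sec")
print(f"speedup: {eight_gpus / one_gpu:.2f}x")
```

Read this way, the nearly unchanged sec/step figure on 8 GPUs corresponds to roughly seven times the single-GPU throughput.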
@tfboyd thanks so much for the info, much appreciated. It all makes sense now. I am now trying to calculate the number of steps that I will require. I first started on 1 GPU for 10000 steps, which gave me a top-1 accuracy of 0.0022. I then tried running 100000 steps, which assuming linear progress should yield 0.022 accuracy, and using 8 GPUs should give me 8 times that. Am I correct in my assumption? |
@redserpent7 The curves are not usually linear and you may have to adjust your learning rate. I am not familiar with the inception_resnetv2 model. I calculate based on Epochs (number of times through the entire data set). I will use ResNet as an example. For ResNet-50 it is common (but there are other approaches) to train for 30 epochs and then reduce the learning rate from .1 to .01 and then after another 30 Epochs (60 total) reduce the learning rate from .01 to .001. So if you have 8 GPUs and are using a batch size of 32 per GPU for a total of 256 images per step then 30 Epochs would be: 1281167 [total training images in Imagenet] / 256 [images per step] * 30 [epochs] = 150,137 steps. |
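The epoch-to-step conversion above can be written out as follows. This is a sketch under the numbers quoted in the comment; the helper names are hypothetical, epochs are assumed to be counted from zero, and the learning-rate schedule is the common ResNet-50 recipe described above, not something taken from the Slim script.

```python
IMAGENET_TRAIN_IMAGES = 1281167  # training-set size quoted in the comment above


def steps_for_epochs(num_epochs, batch_size_per_gpu, num_gpus,
                     dataset_size=IMAGENET_TRAIN_IMAGES):
    """Number of training steps needed to see the data set num_epochs times."""
    images_per_step = batch_size_per_gpu * num_gpus
    total_images = dataset_size * num_epochs
    # Integer ceiling division: a partial final step still counts as a step.
    return -(-total_images // images_per_step)


def learning_rate_for_epoch(epoch):
    """Step schedule described above: 0.1, then 0.01 at epoch 30, then 0.001 at epoch 60."""
    if epoch < 30:
        return 0.1
    if epoch < 60:
        return 0.01
    return 0.001


print(steps_for_epochs(30, 32, 8))  # 8 GPUs, 32 images per GPU
```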
How can we test it in async mode? Then we can verify whether we are fully using every one of the GPUs. |
I do not believe there is an async mode built into the SLIM models for multi-GPU. For Inception and ResNet, async is not going to gain much on a single machine with multiple GPUs. Our benchmark scripts have an async mode for distributed training (across servers) but not within a single machine. The concept used in distributed should be applicable to local GPUs, or you could run 8 local worker instances and one ps-server, but that is not likely a great idea. I am not remotely versed in all models, but for the limited models I tested (VGG, AlexNet, ResNet, and Inception) I do not think much is gained from async locally. VGG and AlexNet drop off a little on scaling, but I am not sure it is enough to make async worth it. I am not saying it is not interesting. |
Hi Guys, |
It was closed, but what is the final understanding about it? I used retrain.py (from TF for Poets) with batch size = 100, inception v3 and learning rate = 0.01, and retrain.py only uses the GPU to create the bottlenecks. I got 0.2 img/sec on that. Using train_image_classifier.py (from TF Slim) with batch_size = 32 (because I set it to use GPUs with num_clones=3), inception v3 and learning rate = 0.01, I got 0.2 img/sec. I find it confusing because retrain, using only the CPU to train, got almost the same result (img/sec) as 3 GPUs. My strategy was to split the batch size of 100 across 3 GPUs by using 32 on train_image_classifier. |
TF SLIM is a little slow, but .2 is not even close. What do you get with batch-size=32 on one GPU, e.g. num_clones=1? Looking at the internal regression test for TF Slim, I am seeing 400ms per step on a P100, which assuming batch-size 32 = 80 images/sec. It even got faster recently, down to 275ms = 116 images/sec. The benchmark code is ~130 images/sec on a DGX-1. Also, what is your command line and GPU? I have the data local so it is pretty easy for me to reproduce, and I am willing to give it a try to give you a baseline. |
@tfbody I'm using a P100 as the GPU. I will test on the flowers dataset using inception v3, learning rate 0.01 and 1000 steps. So, I will create 4 tests:
Which one has to be faster in your opinion? |
Normal batch sizes would be 32/64/128 per GPU. I think the P100 can do 128
but I would stick with 64. No need to test CPU, if you are doing CPU you
want to read my CPU guide.
If you do more gpus then do weak scaling meaning gpus * batch size so
32*2=64 or 32*4=128.
More gpus will be faster if using weak scaling. 3 gpus and 100 is weird
because 100/3 is not a whole number.
I think slim reports global step time. If you get a result please paste
the log and cmd. I have debugged a few of these just seeing logs.
If you get a very weird result I can link the benchmark code so you can
verify your setup with something that I test on many platforms nightly.
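As a small illustration of the weak-scaling advice above (the per-GPU batch stays fixed and the global batch grows with the number of GPUs), here is a sketch with hypothetical helper names. It also shows why a global batch of 100 on 3 GPUs is awkward.

```python
def weak_scaling_global_batch(per_gpu_batch, num_gpus):
    """Weak scaling: keep the per-GPU batch fixed, grow the global batch."""
    return per_gpu_batch * num_gpus


def split_global_batch(global_batch, num_gpus):
    """Return (per_gpu_batch, leftover) when dividing a fixed global batch."""
    return divmod(global_batch, num_gpus)


print(weak_scaling_global_batch(32, 2))  # two GPUs at 32 images each
print(weak_scaling_global_batch(32, 4))  # four GPUs at 32 images each

per_gpu, leftover = split_global_batch(100, 3)
print(per_gpu, leftover)  # a non-zero leftover means the batch does not split evenly
```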
On Feb 7, 2018 3:53 PM, "gustavomr" <[email protected]> wrote:
@tfbody im using P100 as GPU. I will test o flowers dataset using inception
v3, learning rate 0.01 and 1000 steps. So, I will create 4 tests:
1. CPUs and batch size = 100
2. one GPU and batch size = 100
3. three GPU and batch size = 100
4. three GPU and batch size = 32
Which one has to be faster in your opinion?
|
Ok, so I will setup my training on slim as you said:
For every try I will get a log and cmd to you. As I said, my results are not faster than using retrain (using only CPU). |
@tfbody here are my tests: TF slim (train_image_classifier.py) using this cmd: python train_image_classifier.py
TF for poets (retrain.py) using this cmd: ARCHITECTURE="inception_v3" python -m scripts.retrain
So, as we can see, the retrain script (test 5) using one CPU has better performance than the equivalent batch size on TF Slim (test 4) using 2 GPUs. What do you think? I attached all logs to this thread. Thanks again! |
So you are mixing up the stats. It is seconds/step and then global steps in a second.
With batch-size 32 this would be 8.0496 * 32 = roughly 256 images/sec. Your examples did not include num_clones, which would be num_clones=2 if you want to use 2 GPUs. I do not have a good guess at the scaling from 1 to 2 GPUs; there are a lot of factors and SLIM has a single mode for doing it. It can be anywhere from quasi-linear to really not good. Other than that, this looks decent to me. I have not run the flowers fine-tuning example before, but this is actually faster than I expected, as training ImageNet on inceptionv3 would be closer to 130 images a second on SXM2 P100s and you are running PCIe P100s that are clocked a little slower. This may seem like a dig but it is not: if you had posted the logs the first time I would have seen the issue instantly. I also should have guessed you mixed up steps/sec with seconds/step anyway, sorry for not realizing it instantly. I have not looked at this script in a long time. |
@tfbody I edited my post to add --num_clones. I used it when running my script. I got what you said. But why do I get better performance using retrain (using one CPU) compared with TF Slim? In my opinion, because TF Slim uses the GPU (and you have the option to use as many as you want), it should have better performance than retrain. |
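To keep the two statistics apart, here is a minimal conversion sketch. The function names are illustrative only, and `num_clones=1` is an assumption matching the calculation above. The exact product is about 257.6, which the comment rounds to 256.

```python
def seconds_per_step(steps_per_second):
    """Convert a steps-per-second rate into a seconds-per-step time."""
    return 1.0 / steps_per_second


def images_per_second(steps_per_second, batch_size, num_clones=1):
    """Images processed per second, given a steps-per-second rate."""
    return steps_per_second * batch_size * num_clones


rate = images_per_second(8.0496, 32)
print(f"{rate:.1f} images/sec")
print(f"{seconds_per_step(8.0496):.4f} sec/step")
```

Reading 8.0496 as seconds per step instead would give about 4 images/sec for the same batch size, which is the kind of mix-up described above.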
edit 08-FEB-2018 I am 99% sure that you are not getting 256 images/sec with CPU. It likely ended up on the GPU anyway; you can try again with CUDA_VISIBLE_DEVICES='' python blah.py. I am almost never 100% sure of anything. I do not run this code, but I did all of the performance guides, run perf tests all the time, and know many of the numbers by heart. Training ResNet50 (easier than inceptionv3) on CPU is 6.2 images a second, and that is if you compiled with AVX2 and are using dual Broadwells with 36 physical / 72 logical cores. Your logs for poets seem to show that the GPUs were used, not CPUs as you are stating. The poets script does not print out the step time, but I assume you are using the timestamps to make an educated guess. On to scaling: SLIM has a very simple multi-GPU setup where the variables are placed on gpu:0. I noticed the following in your logs:
It seems your PCIe P100s are not set up to talk to each other with GPU Direct peer-to-peer, based on the matrix. 1 and 2 are, but 0 seems to be hanging out alone. While I cannot be sure, that could create problems if all of the parameters are on GPU:0, and I have seen that in my testing. You could always try to isolate the two GPUs that are connected, using CUDA_VISIBLE_DEVICES. Finally, while the SLIM code is not ideal and is not well supported at this point, I know it scales in one instance: ResNet50 on 1xK80 = 40.5 images/sec (batch size 32 or 64, I forget) and on 8xK80 = 293 images/sec. Not great but faster. The total batch would be num_gpus * 32 or 64. We are working on revamping the example with the latest APIs. So that I do not just go away: for me this issue is closed, as there is not much else I can do. |
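Two points from the comment above can be sketched in Python: restricting which GPUs a training process can see via `CUDA_VISIBLE_DEVICES`, and putting a number on the K80 scaling result. The helper names are hypothetical, and the GPU indices 1 and 2 are simply the pair the comment describes as connected.

```python
import os


def env_with_visible_gpus(gpu_ids):
    """Copy the current environment and restrict the visible CUDA devices.

    An empty list yields CUDA_VISIBLE_DEVICES='' which hides all GPUs,
    so the process falls back to CPU.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def scaling_efficiency(single_gpu_rate, multi_gpu_rate, num_gpus):
    """Fraction of ideal linear scaling that was actually achieved."""
    return multi_gpu_rate / (single_gpu_rate * num_gpus)


cpu_only_env = env_with_visible_gpus([])       # force a CPU-only run
peer_env = env_with_visible_gpus([1, 2])       # only the two connected GPUs

# K80 figures quoted above: 40.5 images/sec on 1 GPU, 293 on 8 GPUs.
k80_efficiency = scaling_efficiency(40.5, 293.0, 8)
print(f"8xK80 scaling efficiency: {k80_efficiency:.0%}")
```

Either environment can be passed as the `env` argument of `subprocess.run` when launching the training script. Note that inside the restricted process the remaining GPUs are renumbered starting from 0.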
Hi,
I am running some tests on Slim's imagenet training using Inception Resnet V2. The training is done on AWS ec2 instances (p2.xlarge and p2.8xlarge): Here are the specs for both:
The GPUs are all Nvidia Tesla K80
TensorFlow seems to detect and load the training on all GPUs, according to both the training output and nvidia-smi. However, there does not seem to be much difference in execution times.
On the p2.xlarge instance, TF/Slim reported an average of 3.05 sec/step.
On the p2.8xlarge instance it reported an average of 2.96 sec/step
I was expecting the time to drop significantly but given the above results I do not see a huge benefit running the training on multiple GPUs.
Both instances have a copy of the same exact training datasets and scripts. I am running the training using this command:
Both instances running Tensorflow 1.0.1 running from binary as VM
Both instances are running Ubuntu 14.04 x64
Regards