Does the size of batch-size affect the training results? #32
Comments
Hello, thanks for using this code! The big-batch training is more of a machine learning problem than an implementation problem, I think. Having stochastic training, i.e. a small batch size, helps the training not to overfit this part of the road. With a large batch size, you might want to try other regularisation techniques. A bunch of them have been tried in recent papers, such as GeoNet (https://github.com/yzcjtr/GeoNet), adversarial collaboration (https://github.com/anuragranj/ac), or mine here 👼 (https://github.com/ClementPinard/unsupervised-depthnet). One of the key regularization techniques is, to my mind, the smooth loss scaling with image texture; you might want to try it and set a higher smooth loss scale. You can see an example here.

As for multi-GPU training, I actually don't have multiple GPUs, so I could not test it thoroughly. But @anuragranj told me he was able to encapsulate the loss function inside a `DataParallel`. The key here is probably to make a Module which will call the loss function, which you then put inside a `DataParallel`.
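As a rough illustration of the texture-scaled ("edge-aware") smoothness idea mentioned above — an editor's sketch, not this repo's actual loss; tensor shapes and names are assumptions:

```python
import torch

def edge_aware_smooth_loss(disp, img):
    """Penalize disparity gradients, but less where the image itself has strong
    gradients (textured regions / edges). Assumes disp: Bx1xHxW, img: Bx3xHxW."""
    # first-order gradients of the predicted disparity
    disp_dx = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    disp_dy = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    # image gradients, averaged over color channels
    img_dx = torch.mean(torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]), dim=1, keepdim=True)
    img_dy = torch.mean(torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]), dim=1, keepdim=True)
    # down-weight the smoothness penalty where the image is textured
    disp_dx = disp_dx * torch.exp(-img_dx)
    disp_dy = disp_dy * torch.exp(-img_dy)
    return disp_dx.mean() + disp_dy.mean()
```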
@ClementPinard Well, thanks, I see it.
@youmi-zym By the way, I followed this for multi-GPU: https://github.com/NVIDIA/flownet2-pytorch
@ClementPinard @anuragranj Yeah, thanks for your help, I have implemented the multi-GPU loss function with your advice. However, I still have some experiment results to share with you (smooth-loss, texture-smooth-loss): these experiments were run with the command in my first post, with the smooth-loss and texture-smooth-loss mentioned above and the weight set as in the command. Also, epochs after 140 or 160 don't improve any more.
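A minimal sketch of the Module-wrapping-the-loss pattern discussed above (following the flownet2-pytorch idea); the network interfaces and the loss signature are placeholders, not this repo's exact API:

```python
import torch.nn as nn

class ModelAndLoss(nn.Module):
    """Bundle the networks and the loss so that nn.DataParallel replicates everything
    and each GPU computes the loss on its own chunk of the batch."""
    def __init__(self, disp_net, pose_net, loss_fn):
        super().__init__()
        self.disp_net = disp_net
        self.pose_net = pose_net
        self.loss_fn = loss_fn  # e.g. a photometric reconstruction loss (placeholder)

    def forward(self, tgt_img, ref_imgs, intrinsics):
        disparities = self.disp_net(tgt_img)
        pose = self.pose_net(tgt_img, ref_imgs)
        return self.loss_fn(tgt_img, ref_imgs, intrinsics, disparities, pose)

# usage sketch:
# model = nn.DataParallel(ModelAndLoss(disp_net, pose_net, loss_fn).cuda())
# loss = model(tgt_img, ref_imgs, intrinsics).mean()  # average the per-GPU losses
```

This way the heavy loss computation is scattered across the GPUs instead of being gathered on the first one.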
According to the original author, quality worsens after 160k iterations anyway (tinghuiz/SfMLearner#42). How did you split your training dataset?
Well, there's a big time difference between us...
My split includes 2011_09_26_drive_0005_sync_02, and 41664 samples are found in 64 train scenes.
Thanks a lot!
Here are my results with my own split :
I used a regular smooth loss and trained on a clean version of the repo. I'll check with your split.
Results with your split, using model_best :
Results with your split, using checkpoint :
As such, I think you only used the…
@ClementPinard Thanks for your work and reply; I used the "model_best" version... I will try again with 140k iterations. Thanks a lot.
Here is my exact dataset, in case there has been some regression in the data preprocessing script: https://mega.nz/#!OIEwEQ4a!Yz5aRFjPHxNwCV2sxIslgWPfppAj_WOpthNTqWUvByo
Thanks a lot! I will check it and try again!
The result is still bad....
Then, I used the command below to test:
Finally, the result I get (the test script prints): `no PoseNet specified, scale_factor will be determined by median ratio, which is kiiinda cheating (but consistent with original paper)`
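For context, the "median ratio" scaling mentioned in that message is conventionally computed like this (a sketch with illustrative names, not the repo's exact evaluation code):

```python
import numpy as np

def scale_by_median_ratio(pred_depth, gt_depth, valid_mask):
    # scale the prediction so its median matches the ground-truth median,
    # then compute the error metrics on the rescaled prediction
    scale_factor = np.median(gt_depth[valid_mask]) / np.median(pred_depth[valid_mask])
    return pred_depth * scale_factor
```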
And yesterday I retrained with 160k iterations; the new result is:
The result is similar to yours, and it suggests that 140k iterations may not be the bottleneck; I will try with 200k iterations. Update! Here are the 200k-iteration results...
Thus, the number of iterations really has a big influence on the result, and it's difficult to determine the best value.
Thanks for your insight! Since it got worse with more training, there may be a problem with the validation set, since it is supposed to keep the best network. Maybe it's not representative enough compared to the test set? Anyway, this self-supervision problem is very hard to make converge correctly and to interpret across different results. Good luck with your research!
@ClementPinard Thanks very much
Hi,
I have run train.py with the command below on the KITTI raw data:
```
python3 train.py /path/to/the/formatted/data/ -b4 -m0 -s2.0 --epoch-size 1000 --sequence-length 5 --log-output --with-gt
```
The only differences are that batch-size=80 and that the train (41664) / valid (2452) split is different.
The result I get is:
disp:
Results with scale factor determined by GT/prediction ratio (like the original paper) :
```
abs_rel, sq_rel,    rms, log_rms,     a1,     a2,     a3
 0.2058, 1.6333, 6.7410,  0.2895, 0.6762, 0.8853, 0.9532
```

pose:

```
Results 10
       ATE,     RE
mean 0.0223, 0.0053
std  0.0188, 0.0036

Results 09
       ATE,     RE
mean 0.0284, 0.0055
std  0.0241, 0.0035
```
You can see that there's still quite a big margin compared with yours:

| Abs Rel | Sq Rel | RMSE  | RMSE(log) | Acc.1 | Acc.2 | Acc.3 |
|---------|--------|-------|-----------|-------|-------|-------|
| 0.181   | 1.341  | 6.236 | 0.262     | 0.733 | 0.901 | 0.964 |
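For reference, the depth metrics in these tables are conventionally computed as below (Eigen-style KITTI evaluation; a sketch where `gt` and `pred` are assumed to be 1-D arrays of valid depth values, with `pred` already median-scaled):

```python
import numpy as np

def compute_depth_metrics(gt, pred):
    # accuracy under threshold 1.25, 1.25^2, 1.25^3
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()
    # relative and RMS errors
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rms = np.sqrt(np.mean((gt - pred) ** 2))
    log_rms = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rms, log_rms, a1, a2, a3
```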
I think there are no other factors causing this difference other than the batch size and the data split. Therefore, does the batch size affect the training results?
What's more, when I try to train my model with two Titan GPUs and batch-size=80*2=160, the memory usage of each GPU is:
GPU0: about 11G, GPU1: about 6G.
There is a huge memory usage difference between the two GPUs, and it seriously impacts multi-GPU training.
I then found that the loss calculations are all placed on the first GPU; the memory is mainly used to compute the photometric_reconstruction_loss at the 4 depth scales. We could move some scales to cuda:0 and others to cuda:1, which I think might be better.
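A rough sketch of that idea — computing some scales of the loss on cuda:0 and the others on cuda:1 so the memory is balanced; the loss function and its argument list are placeholders, not the repo's exact signature:

```python
import torch

def multi_scale_loss_split(tgt_img, ref_imgs, intrinsics, depths, pose, loss_fn):
    total = 0
    for i, depth in enumerate(depths):              # one depth map per scale
        device = torch.device('cuda:0' if i < 2 else 'cuda:1')
        scale_loss = loss_fn(tgt_img.to(device),
                             [im.to(device) for im in ref_imgs],
                             intrinsics.to(device),
                             depth.to(device),
                             pose.to(device))
        total = total + scale_loss.to('cuda:0')     # gather per-scale losses on one device
    return total
```

Cross-device copies are differentiable, so gradients still flow back to the networks; whether this balances memory better than a DataParallel-wrapped loss would need to be measured.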