
Add VGG-16 net as one of the default networks #159

Open
jmozah opened this issue Jul 3, 2015 · 34 comments


@jmozah

jmozah commented Jul 3, 2015

Similar to LeNet, AlexNet, GoogLeNet... it would be good if VGG net were also added as one of the default networks to select from.

@lukeyeager
Member

Last I checked, there wasn't a publicly available version of their train_val.prototxt. Lots of people have asked for it:
https://gist.github.com/ksimonyan/fd8800eeb36e276cd6f9#comment-1430126
https://gist.github.com/ksimonyan/211839e770f7b538e2d8#comment-1346808
https://gist.github.com/ksimonyan/3785162f95cd2d5fee77#comment-1316301

I think they probably just don't have it anymore. If you want to put together a version that trains successfully on multiple datasets, then we can test it and get it added to DIGITS.

@jmozah
Author

jmozah commented Jul 7, 2015

Look at the bottom of this link; @karpathy has a link there:
https://gist.github.com/ksimonyan/211839e770f7b538e2d8#file-readme-md

I will try and see if I can successfully train a version.

@serafett

Hi @jmozah

Were you able to train VGG successfully? I think training using the pretrained model works but training from scratch does not converge.

If anyone has successfully trained VGG16 or VGG19 from scratch, can you share your solver and train_val files?

@jmozah
Author

jmozah commented Jul 17, 2015

No... The network failed after 1 epoch... Will check it next week and update.


@saeedizadi

@jmozah
Any success?

@jmozah
Author

jmozah commented Aug 18, 2015

No... not yet

@groar
Contributor

groar commented Sep 8, 2015

I use a train_val that I updated from an old one. It works with the 19-layer VGG (with a very small batch size). https://gist.github.com/groar/d455ebe671b2f1807659

I used it for fine-tuning, but never tried to train it from scratch. I could try.

@lukeyeager
Member

Update on this:

@graphific uploaded a train_val.prototxt in the comments for this gist. I tried it on a 20-class subset of ImageNet (which should be easier to solve than the full ImageNet dataset) and it totally failed to train (whereas AlexNet and GoogLeNet converge quickly every time).

[image: vgg-no-converge]

So, still no luck here :-/

@gheinrich
Contributor

It would probably help to add Xavier weight initialization for this kind of deep network. With the default weight initialization the odds of hitting a vanishing gradient in the first layers are high.
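
A minimal sketch of what that looks like in a Caffe convolution layer, assuming the usual VGG-style 3x3 filters (the layer name and sizes are illustrative, not taken from any particular prototxt in this thread):

  layer {
    name: "conv1_1"
    type: "Convolution"
    bottom: "data"
    top: "conv1_1"
    convolution_param {
      num_output: 64
      kernel_size: 3
      pad: 1
      # "xavier" scales the initial weights by the layer's fan-in,
      # so the early layers start with sane gradient magnitudes
      weight_filler { type: "xavier" }
      bias_filler { type: "constant" value: 0 }
    }
  }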

@lfrdm

lfrdm commented Jan 21, 2016

Hi guys. I don't know if you are still having problems getting VGGNet to converge, but for me initializing the weights did the trick, as @gheinrich suggested. Though I used the standard initialization, as it is done in AlexNet.

@gheinrich
Contributor

Thanks! Can you post your .prototxt? Did you use Gaussian initialization? Xavier or MSRA initializations should perform better (and you don't have to specify the standard deviation of the distribution for these). Some toy examples there.
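
For comparison, here is how the three filler types would look inside a layer's convolution_param or inner_product_param block (the 0.01 std is only an example value):

  # Gaussian: you must pick the std yourself; a poor choice can stall training
  weight_filler { type: "gaussian" std: 0.01 }

  # Xavier: scaled by fan-in automatically, no std to tune
  weight_filler { type: "xavier" }

  # MSRA (He): like Xavier but derived for ReLU activations, often a good fit for very deep nets
  weight_filler { type: "msra" }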

@lfrdm

lfrdm commented Jan 21, 2016

You can find my .prototxt here. Yes, I used Gaussian. I trained on about 100,000 images (80% train, 20% val) at 64x64 pixels with a batch size of 100. I used the standard SGD, gamma and LR settings. The dataset is private, so I don't know whether it works on ImageNet, but I guess so. Note that the last output is 2 due to a binary classification problem; for ImageNet the fc8 layer should have an output of 1000.

I just noticed that I used the VGGNet from BMVC 2014. Sorry for that. I will give feedback after I have tried it with the 16-layer network on the same dataset.

@lfrdm

lfrdm commented Jan 22, 2016

As @gheinrich suggested, the 16-layer VGGNet converges with the "xavier" weight initialization. You can find my train_val.prototxt file here. Note that I didn't train on the ImageNet dataset, but I had faced the same convergence problem and was able to fix it with the "xavier" weight initialization. Parameters: batch size: 100, image size: 64x64, SGD: 6%, gamma: 0.5, LR: 0.05. The last output is 2 due to a binary classification problem; for ImageNet the fc8 layer should have an output of 1000.

@gheinrich
Contributor

Thanks for the update. That is nicely in line with the VGG paper:

Quote:

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and 10^-2 variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).

@GiuliaP

GiuliaP commented Mar 15, 2016

Hi, I tried the train_val.prototxt posted by @lfrdm and it works, thanks. I added the lr_mult=10/20 and decay_mult=1/0 params for the weights/biases to the fc8 layer. I was now wondering why these params are missing from the train_val.prototxt, and whether setting them to the same values as in, e.g., CaffeNet, as I have done for fc8, would make sense.
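
For anyone following along, a sketch of an fc8 definition with those multipliers (the values are the CaffeNet-style ones mentioned above; the filler and output size are illustrative):

  layer {
    name: "fc8"
    type: "InnerProduct"
    bottom: "fc7"
    top: "fc8"
    # weights: 10x the base learning rate, normal weight decay
    param { lr_mult: 10 decay_mult: 1 }
    # biases: 20x the base learning rate, no weight decay
    param { lr_mult: 20 decay_mult: 0 }
    inner_product_param {
      num_output: 1000  # 1000 for ImageNet, 2 for the binary problem discussed above
      weight_filler { type: "xavier" }
      bias_filler { type: "constant" value: 0 }
    }
  }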

@GiuliaP

GiuliaP commented Mar 29, 2016

@igorbb you're right: in the train_val.prototxt, the "pool: MAX" parameter appears twice in every pooling layer. It must be a typo. After correcting this it seems to work.

On 23/03/16 00:36, igorbb wrote:

Hey @GiuliaP, I am getting a parser error with @lfrdm's version. Can you share your gist?



@hariprasadravi

Hi, I'm new to DIGITS and I'm experimenting with some datasets. When I tried the train_val.prototxt posted by @lfrdm with the changes mentioned by @GiuliaP (removing the repeated pool: MAX), I got the error message below. Am I going wrong somewhere? AlexNet and GoogLeNet seem to be working fine.

ERROR: Check failed: error == cudaSuccess (2 vs. 0) out of memory

relu2_2 needs backward computation.
conv2_2 needs backward computation.
relu2_1 needs backward computation.
conv2_1 needs backward computation.
pool1 needs backward computation.
relu1_2 needs backward computation.
conv1_2 needs backward computation.
relu1_1 needs backward computation.
conv1_1 needs backward computation.
label_data_1_split does not need backward computation.
data does not need backward computation.
This network produces output accuracy
This network produces output loss
Network initialization done.
Solver scaffolding done.
Starting Optimization
Solving
Learning Rate Policy: step
Iteration 0, Testing net (#0)
Check failed: error == cudaSuccess (2 vs. 0) out of memory

@GiuliaP

GiuliaP commented Jun 23, 2016

You have to reduce the batch size (both train and test/val): as it says, the GPU is out of memory.
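
For reference, when training with plain Caffe the batch size sits in the data layers of train_val.prototxt; this is only a sketch with placeholder names, and inside DIGITS you would normally just lower the batch size field on the model page instead:

  layer {
    name: "train-data"
    type: "Data"
    top: "data"
    top: "label"
    include { phase: TRAIN }
    data_param {
      source: "path/to/train_lmdb"  # placeholder path
      backend: LMDB
      batch_size: 16                # lower this until the network fits in GPU memory
    }
  }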

@hariprasadravi

@GiuliaP Reduced it and it works well now. Thank you.

@jmozah
Author

jmozah commented Jun 23, 2016

Did it converge?


@hariprasadravi

Yes, it did. I ran it for 10 epochs on a dataset consisting of 10k color images with a batch size of 10. It took an hour to complete and gave me a validation accuracy of 92%.

@ghost

ghost commented Jun 30, 2016

Hi,
I'm trying to use VGG in DIGITS. When I try to create the model, I get the following error:


ERROR: Layer 'loss' references bottom 'label' at the TEST stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.


I just copied the train_val.prototxt provided by @lfrdm into the custom network box and deleted the duplicated pool: MAX. Any ideas?
Thanks in advance,
M

@lukeyeager
Member

lukeyeager commented Jul 5, 2016

@mizadyya Read the documentation on how custom networks in DIGITS work by clicking on the blue question mark above the box.

You probably want to add something like this to your loss layer:

  exclude { stage: "deploy" }

Example:
https://github.com/NVIDIA/DIGITS/blob/digits-4.0/digits/standard-networks/caffe/lenet.prototxt#L162-L184
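
For context, a minimal sketch of how the loss and accuracy layers are typically scoped in a DIGITS network (modelled on the LeNet example linked above; the "fc8" bottom is just a placeholder for whatever your last layer is called):

  layer {
    name: "accuracy"
    type: "Accuracy"
    bottom: "fc8"
    bottom: "label"
    top: "accuracy"
    # only computed during validation
    include { stage: "val" }
  }
  layer {
    name: "loss"
    type: "SoftmaxWithLoss"
    bottom: "fc8"
    bottom: "label"
    top: "loss"
    # the deploy network has no labels, so keep this layer out of that stage
    exclude { stage: "deploy" }
  }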

@ghost

ghost commented Jul 5, 2016

@lukeyeager I also needed to add a Softmax layer at the end, in addition to SoftmaxWithLoss. Now it's running fine. Thanks.
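
In case it helps others hitting the same thing, a sketch of that extra layer (again assuming fc8 is the final fully-connected layer):

  layer {
    name: "softmax"
    type: "Softmax"
    bottom: "fc8"
    top: "softmax"
    # only the deploy network needs class probabilities as its output
    include { stage: "deploy" }
  }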

@jmozah
Author

jmozah commented Jul 7, 2016

How much memory does it consume... does it fit in a 4 GB card?


On 07-Jul-2016, at 9:15 AM, Ishant Mrinal Haloi [email protected] wrote:

I have tested this on ImageNet, it converges: https://github.com/n3011/VGG_19_layers_Network



@Motherboard

Motherboard commented Sep 13, 2016

I couldn't make it work with batches as big as 5 256x256 images on a K520 with 4 GB... It also takes 5 days for 10 epochs on 18k images (fine-tuning)... maybe something is wrong with my EC2 instance? GPU utilization is constantly at 99%; memory peaked near 100% during initialization but quickly dropped to 60%... although larger batches made it fail for lack of memory (I ended up using batches of 3)...

@mrgloom

mrgloom commented Sep 16, 2016

I also can't train VGG-16. Maybe it's because of the small batch size or the solver settings (I use the default DIGITS settings)?
My dataset is from this Kaggle competition: https://www.kaggle.com/c/dogs-vs-cats
Here is my network definition: https://gist.github.com/mrgloom/fec835c5570e739eff8c18a343bdd7db

@mrgloom

mrgloom commented Sep 16, 2016

Seems that it was a small-batch problem. I successfully trained VGG-16 with batch size 24 and batch accumulation 2, so as I understand it my effective batch size was 48? (See the solver sketch below.)

Here is the models and logs downloaded from DIGITS:
https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/VGG-16
https://github.com/mrgloom/kaggle-dogs-vs-cats-solution/tree/master/learning_from_scratch/Models/VGG-19
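
For what it's worth, DIGITS' batch accumulation setting appears to correspond to Caffe's iter_size solver option: gradients from two 24-image batches are summed before each weight update, so the effective batch size is 24 x 2 = 48. A rough solver.prototxt sketch (the path and learning-rate values are illustrative):

  net: "train_val.prototxt"  # placeholder path
  type: "SGD"
  base_lr: 0.01              # illustrative value
  lr_policy: "step"
  gamma: 0.1
  stepsize: 10000
  iter_size: 2               # accumulate gradients over 2 forward/backward passes
  max_iter: 45000
  # effective batch = data-layer batch_size (24) * iter_size (2) = 48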

@HolmesShuan

Here is my prototxt; it seems to work correctly.

@eamadord

Hi, I'm fairly new to DIGITS and to Caffe, and I have been trying to fine-tune VGG for the past few weeks without results. I used the prototxt posted by @lfrdm, setting the lr_mult parameters of the last layer to the values suggested by @GiuliaP and the lr_mult of the rest of the layers to 0. However, when running it in DIGITS it does not converge: it goes from 20% accuracy to 55% and stays there for the whole training. I've tried several learning rates, from 0.01 to 0.0005, without success. My dataset consists of 8500 images for training and 1700 for validation, split into 5 classes. Could anyone give me a hand with this?
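
For reference, a sketch of what setting "the lr_mult of the rest of the layers to 0" looks like in the prototxt; a layer frozen this way keeps its pretrained weights while fc8 is trained with the higher multipliers discussed earlier (the layer name and sizes are illustrative):

  layer {
    name: "conv1_1"
    type: "Convolution"
    bottom: "data"
    top: "conv1_1"
    param { lr_mult: 0 decay_mult: 0 }  # weights stay at their pretrained values
    param { lr_mult: 0 decay_mult: 0 }  # biases stay at their pretrained values
    convolution_param { num_output: 64 kernel_size: 3 pad: 1 }
  }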

@gheinrich
Contributor

Hi @Elviish, since your question isn't about getting VGG to load in DIGITS but about how to train it, can you post it on the DIGITS users list (https://groups.google.com/forum/#!forum/digits-users)?

@aytackanaci

Hi @lfrdm, I was looking for train_val files for the VGG net from BMVC 2014. I see that you have two commits for that file. Is the older one the BMVC version?

@aaron276h

@lfrdm any chance you could post your prototxt file for VGG again? The link seems to be down. Thanks!

@gaving

gaving commented Nov 2, 2017

Echoing the request for this prototxt file for VGG... I can't seem to find one!
