Retraining from scratch yields worse results #32
Comments
Just to make sure I'm understanding: when you say first slope, you're referring to the part of the graph from 0 to 60.00K on the x axis? And the 2nd slope starts at 80.00K? So you did the

I guess I'm not too surprised at this result. There's very little information in the paper about this retraining step. It's all in section 2.2 under Deriving Architectures: "We then take only the model with the highest reward to re-train from scratch". AFAICT that's it, that's the whole description of the retraining step.

If you look at the TensorFlow ENAS implementation from the paper authors (https://github.com/melodyguan/enas) you'll see that there are two scripts: ptb_search.sh and ptb_final.sh. The latter script is used to retrain the best found dag (and in fact they've hard-coded the best found dag to be exactly the one found in the paper). Comparing the two, I notice that several parameters differ: the lstm_hidden_size is 720 in ptb_search.sh while it's 748 in ptb_final.sh, for example, and the parameters related to learning rate are very different as well. Perhaps you could try retraining using their parameter values from ptb_final.sh?
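A quick way to see every flag that differs between the two scripts is to parse the `--name=value` arguments and diff them. The snippet below is only a sketch: the two script excerpts are stand-ins, with `lstm_hidden_size` taken from the values quoted above and the `placeholder_*` flags invented purely for illustration. The helper names `parse_flags` and `diff_flags` are likewise hypothetical, not part of either repository.

```python
import re

# Matches "--name=value" or a bare "--name" switch.
FLAG_RE = re.compile(r"--(\w+)(?:=(\S+))?")


def parse_flags(script_text):
    """Collect --name=value flags from shell script text into a dict."""
    flags = {}
    for line in script_text.splitlines():
        line = line.strip()
        if line.startswith("#"):  # skip shell comments
            continue
        for name, value in FLAG_RE.findall(line):
            # Bare switches have no value; record them as "true".
            flags[name] = value.rstrip("\\") if value else "true"
    return flags


def diff_flags(a, b):
    """Return {flag: (value_in_a, value_in_b)} for flags that differ."""
    changed = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            changed[name] = (a.get(name), b.get(name))
    return changed


# Stand-in excerpts. Only lstm_hidden_size reflects the values quoted
# in this thread; the placeholder_* flags are made up for the example.
SEARCH_SH = """
  --lstm_hidden_size=720 \\
  --placeholder_shared_flag=1
"""
FINAL_SH = """
  --lstm_hidden_size=748 \\
  --placeholder_shared_flag=1 \\
  --placeholder_final_only
"""

changed = diff_flags(parse_flags(SEARCH_SH), parse_flags(FINAL_SH))
for name, (search_value, final_value) in changed.items():
    print(f"{name}: search={search_value} final={final_value}")
```

To run this against the real scripts, replace the two strings with the contents of ptb_search.sh and ptb_final.sh read from your checkout.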
Yes, correct. The second part starts at around 80k. Interesting, I will try to use the same parameters and compare the results, thanks for the suggestion.
#35 provides a way to train a single, given dag.
@nkcr were you able to get better results with the parameters from the TF code?
No, not really. I didn't investigate much, but with a quick matching of the parameters from the TensorFlow implementation I got worse results.
Hello,
As written in the paper (2.2 Deriving Architectures), I tried to retrain the best derived model from scratch, but it surprisingly gives worse results than keeping the original (shared) weights.
I expected training the best model (dag) from scratch to be faster and eventually reach a better perplexity, but that's not the case.
I do the following:
- I use the --load_path argument, which loads a previous run, and --mode test, which calls a custom test method inside the trainer class
- (in the test method) I reset the shared weights with self.shared.reset_parameters()
- I then train only the best dag (train_shared method)

The following picture shows the loss and ppl during the "normal" training (first slope) and after resetting the shared weights (second slope). The second slope only trains the same best model (dag).
Does anyone have any idea why resetting the shared weights and retraining from scratch gives such bad results?
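For anyone comparing the two slopes numerically: assuming the plotted loss is the mean per-token cross-entropy in nats (the usual convention, but an assumption about this plot), perplexity is simply its exponential, so a constant gap in loss corresponds to a constant ratio in ppl. A minimal sketch with made-up numbers, not values read from the picture:

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity from mean per-token cross-entropy measured in nats."""
    return math.exp(mean_loss_nats)


def loss_from_perplexity(ppl):
    """Inverse mapping: mean cross-entropy (nats) from perplexity."""
    return math.log(ppl)


# Illustrative values only: two runs whose losses differ by 0.1 nats.
loss_shared, loss_retrained = 4.5, 4.6
ratio = perplexity(loss_retrained) / perplexity(loss_shared)
print(f"ppl ratio for a 0.1 nat gap: {ratio:.3f}")
```

The ratio depends only on the loss difference, not on the absolute loss, which makes it a convenient way to compare runs at the same step.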