Weight Regularization/hidden state clipping parameters #39

Closed
raulpuric opened this issue Sep 12, 2017 · 0 comments
Are there any plans to release the hyperparameters that were used to regularize the training process?

I've been trying to retrain these weights on Amazon reviews and on a different dataset using guillitte's implementation, as suggested in this repo's README. However, because of the multiplicative nature of the mLSTM, the weights tend to overfit and reach very high norms. The input->hidden weights tend to be fine and keep constant values throughout, but the hidden->hidden weights seem to grow in norm continually throughout training as the model unearths the patterns of the training corpus.
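To make the norm growth concrete, here is a minimal sketch of how per-matrix norms could be logged during training. It uses plain Python lists instead of framework tensors, and the weight names and values are hypothetical toy data, not taken from the actual model.

```python
import math


def frobenius_norm(matrix):
    """Return the Frobenius (L2) norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def norm_report(named_weights):
    """Map each weight name to the Frobenius norm of its matrix."""
    return {name: frobenius_norm(w) for name, w in named_weights.items()}


# Hypothetical toy weights; in a real run these would come from the model.
weights = {
    "input_to_hidden": [[0.1, -0.2], [0.3, 0.0]],
    "hidden_to_hidden": [[3.0, 0.0], [0.0, 4.0]],
}

report = norm_report(weights)
for name, norm in report.items():
    print(f"{name}: {norm:.3f}")
```

Logging this every few hundred steps would show whether the hidden->hidden norm keeps climbing while the input->hidden norm stays flat.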

This is problematic when a rare character or character sequence, such as a Finnish name with UTF-8 accents/diaereses (e.g. Väinämö), comes up frequently in otherwise English text. If several of these names appear in one batch, they cause massive gradient spikes and can lead to gradient explosion in the network. Even if the gradients recover, the net cannot get back to its previous performance level if the spike has pushed the weights too far from their local optimum.
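As an illustration of why such a name looks unusual to the model, the sketch below encodes it as UTF-8 and counts the bytes outside the ASCII range. It assumes the model reads text as raw UTF-8 bytes, which is an assumption on my part and not something stated above.

```python
# "Väinämö", written with escapes so the code points are unambiguous.
name = "V\u00e4in\u00e4m\u00f6"

encoded = name.encode("utf-8")

# Bytes >= 0x80 never occur in plain ASCII English text.
rare_bytes = [b for b in encoded if b >= 0x80]

print(len(name), len(encoded), len(rare_bytes))
print([hex(b) for b in encoded])
```

Each accented letter becomes a two-byte sequence, so a seven-character name yields six bytes that are essentially absent from the rest of an English corpus.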

Obviously I could make an effort to preprocess or drop this data and clip activation outputs and their associated gradients (and I have), but it is inconvenient to rely on data processing and hope that I have thought of every possible data transformation, or to have to tune clipping hyperparameters extensively.
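For reference, the kind of gradient clipping meant here can be sketched as clipping by global norm. This is a plain-Python illustration, not the code from my training run; `clip_by_global_norm` and `max_norm` are hypothetical names, and `max_norm` is exactly the hyperparameter that needs tuning.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their combined L2 norm is at most max_norm.

    grads is a list of flat gradient vectors (lists of floats).
    Returns (clipped_grads, original_global_norm).
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        # Already within bounds: return copies unchanged.
        return [list(vec) for vec in grads], total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


# Toy gradients with global norm 5.0, clipped down to norm 1.0.
clipped, original_norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(original_norm, clipped)
```

Clipping bounds the size of a single update, but it does not stop the weights themselves from drifting to large norms over many steps.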

After extensive testing, I also found that this explosion doesn't happen with a plain LSTM model (since it's additive), even though it doesn't do too well without preprocessed data.

TL;DR
Please release the hyperparameters: the network is so prone to overfitting and training instability that I can't even guarantee a stable training run on Amazon reviews, even with your saved weights as initialization (about 1 failure for every 5 successes). An LSTM model does worse but doesn't have these training instabilities.

UPDATE:
Never mind, I saw that weight normalization is mentioned in the paper, despite not being used in the PyTorch implementation.
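A minimal sketch of what weight normalization does, assuming the paper means the usual reparameterization w = g * v / ||v|| applied per output row. The function name and toy values are hypothetical, and the exact variant used in the paper is not confirmed here.

```python
import math


def weight_norm_rows(v, g):
    """Build a weight matrix whose i-th row is g[i] * v[i] / ||v[i]||.

    v is a list of direction rows, g is a list of per-row scales.
    Assumes every row of v has a non-zero norm.
    """
    rows = []
    for v_row, g_i in zip(v, g):
        norm = math.sqrt(sum(x * x for x in v_row))
        rows.append([g_i * x / norm for x in v_row])
    return rows


# Toy example: the row norms of w are set by g, however large v grows.
v = [[3.0, 4.0], [0.0, 2.0]]
g = [1.0, 0.5]
w = weight_norm_rows(v, g)
print(w)
```

With this parameterization the norm of each row is controlled by a single scalar, which decouples the magnitude of the hidden->hidden weights from their direction.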
