Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Implementation for "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling"

This paper uses the diversity of word meaning to train a neural language model.

Summary of Paper

Motivation

In language modeling (predicting the next word in a sequence), we want to capture the diversity of word meaning.
For example, when predicting the word that follows "Banana is delicious ___", the answer is "fruit", but "sweets" or "food" would also be acceptable. Ordinary one-hot targets cannot express this: every word other than the exact answer is treated as equally wrong, so similar words are ignored.

motivation.PNG

If we use a distribution over words instead of a one-hot vector as the target, we can teach this variety.

Method

So we use a word distribution to teach the model. This distribution is computed from the answer word's embedding and the embedding lookup matrix.

formulation.PNG

architecture.PNG
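As a rough illustration of how such a target distribution could be computed, here is a minimal NumPy sketch; the function and parameter names are assumptions for illustration, not this repository's actual code:

```python
import numpy as np

def teacher_distribution(embedding, answer_id, temperature=1.0):
    """Soft target over the vocabulary, built from the answer word's embedding.

    embedding:   (vocab_size, dim) embedding lookup matrix
    answer_id:   index of the correct next word
    temperature: higher values flatten the distribution
    """
    # Similarity of the answer word's embedding to every word embedding.
    scores = embedding @ embedding[answer_id] / temperature
    # Softmax with max-subtraction for numerical stability.
    scores = scores - scores.max()
    probs = np.exp(scores)
    return probs / probs.sum()
```

The model is then trained against this soft target (in the paper, this augmented loss is combined with the ordinary one-hot cross-entropy loss).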

If we use this distribution-based loss, we can show that the input embedding and the output projection matrix should be equivalent.

equivalence.PNG

Using the distribution-based loss together with the constraint that the input embedding and output projection share weights improves the perplexity of the model.
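As a minimal sketch of the tying constraint itself (written in PyTorch and independent of this repository's implementation), sharing the two matrices is a single weight assignment:

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy language model whose output projection reuses the input embedding."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        # Weight tying: the output projection shares the embedding matrix,
        # which is the equivalence the paper derives from the augmented loss.
        self.decoder.weight = self.encoder.weight

    def forward(self, tokens):
        hidden, _ = self.rnn(self.encoder(tokens))
        return self.decoder(hidden)  # logits over the vocabulary
```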

Experiments

Implementation

Result

result.PNG

  • Ran 15 epochs on the Penn Treebank dataset.
    • The perplexity scores are large, so I'm not confident in this implementation. Pull requests are welcome!
  • augmentedmodel works better than the baseline (onehotmodel), and augmentedmodel_tying outperforms the baseline as well!
  • You can run this experiment with `python train.py`

Additional validation

  • At the beginning of training, the embedding matrix used to produce the "teacher distribution" is not trained yet, so the proposed method starts with a small handicap.
    • In practice, however, no training delay was observed.
  • Gradually increasing the temperature (alpha) may improve training speed (see the sketch after this list).
  • Using pre-trained word vectors, or freezing the embedding matrix weights for some interval (the fixed-target technique from deep reinforcement learning), may also help training.
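For example, a hypothetical temperature warm-up could look like the following; the schedule and its parameters are purely illustrative, not part of this repository:

```python
def temperature_schedule(epoch, start=0.5, end=2.0, warmup_epochs=10):
    """Linearly raise the softmax temperature over the first epochs.

    While the embedding matrix is still untrained, a low temperature keeps the
    teacher distribution close to one-hot; as the embeddings improve, a higher
    temperature lets more probability mass flow to similar words.
    """
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs
```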

By the way, the PyTorch language-model example already uses the tying method! Don't be afraid to use it!
