Neural Language Modeling
------------
- replace the MLE n-gram language model with a (neural) function approximator.
Bengio et al. 2000
-------------------
1. 1-of-K encoding
2. continuous space word representation
3. nonlinear hidden layer (tanh)
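A minimal numpy sketch of those three steps as I read them (vocabulary size, embedding/hidden dimensions, and the function name are made up for illustration, not from the talk):

    import numpy as np

    V, d, h, n = 10000, 64, 128, 3            # assumed vocab size, embedding dim, hidden dim, context length
    rng = np.random.default_rng(0)
    E = rng.normal(0, 0.1, (V, d))            # continuous word representations
    W = rng.normal(0, 0.1, (n * d, h))        # input-to-hidden weights
    b = np.zeros(h)
    U = rng.normal(0, 0.1, (h, V))            # hidden-to-output weights
    c = np.zeros(V)

    def neural_lm_probs(context_ids):
        # 1. 1-of-K encoding is implicit: indexing E equals multiplying a one-hot vector by E
        x = E[context_ids].reshape(-1)        # 2. concatenated continuous word representations
        hid = np.tanh(x @ W + b)              # 3. nonlinear hidden layer (tanh)
        logits = hid @ U + c
        p = np.exp(logits - logits.max())
        return p / p.sum()                    # softmax over the next word

    p_next = neural_lm_probs([12, 7, 345])    # P(next word | previous n words)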
Dependency beyond the context window
-------------------
- Models that don't incorporate this:
1. ngram
2. word2vec
- Recursively apply a function f to the previous state and the current word.
- RNN intuition
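In symbols (my notation, not from the talk): h_t = f(h_{t-1}, x_t), where x_t is the current word's vector; because the same f is applied at every step, h_t can in principle carry information about the entire prefix rather than a fixed context window.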
RNNLM
----------------
- steps:
1. continuous representation for words
2. linear transformation of previous hidden state
3. additive combination of previous word and hidden state vectors with bias
4. pointwise nonlinearity: tanh (a numpy sketch of one step follows)
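A numpy sketch of one RNNLM state update following those four steps (sizes and names are placeholders):

    import numpy as np

    V, d, h = 10000, 64, 128                  # assumed vocab size, embedding dim, hidden dim
    rng = np.random.default_rng(0)
    E   = rng.normal(0, 0.1, (V, d))          # step 1: continuous word representations
    W_x = rng.normal(0, 0.1, (d, h))
    W_h = rng.normal(0, 0.1, (h, h))          # step 2: transform of the previous hidden state
    b   = np.zeros(h)

    def rnnlm_step(prev_h, word_id):
        x = E[word_id]                        # 1. look up the word's continuous representation
        pre = x @ W_x + prev_h @ W_h + b      # 2.+3. additive combination with a bias
        return np.tanh(pre)                   # 4. pointwise nonlinearity (tanh)

    h_t = np.zeros(h)
    for w in [12, 7, 345]:                    # toy word-id sequence
        h_t = rnnlm_step(h_t, w)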
GRU
-----------------
- Problem with RNN training - vanishing gradients
- backpropagation multiplies by the transition matrix once per time step
- so the norm of the gradient will either shrink or explode
If the largest eigenvalue of the transition matrix is < 1, there is a good chance the gradient will shrink to 0; if it is > 1, it tends to explode (toy demo below).
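A toy numpy demo of that shrinkage (it constrains the largest singular value rather than the eigenvalue, which gives a clean bound; all sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    h = 100
    W = rng.normal(0, 1, (h, h))
    W *= 0.9 / np.linalg.norm(W, 2)           # largest singular value is now 0.9 (< 1)

    g = rng.normal(0, 1, h)                   # stand-in for a backpropagated gradient
    for t in range(50):
        g = W.T @ g                           # one step of backprop through time
        if t % 10 == 9:
            print(t + 1, np.linalg.norm(g))   # norm decays at most like 0.9**t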
- intuition
To propagate the gradient from time t+H back to time t, we have to go through all the time steps in between, which
can cause the gradient to shrink to 0 or explode. Instead, we could bypass the intermediate steps of BPTT (backprop through time)
by adding connections that go to those nodes directly (i.e. t -> t+1, t -> t+2, ..., t -> t+H).
But using bypass connections this way has 2 problems:
a. the number of connections will explode
b. we will effectively limit the context window that we are looking at
Idea:
We should let the network also learn how much to prune the unnecessary connections.
- Components
1. hidden state
2. update gate
3. new candidate hidden state
4. reset gate
- the gates are meant to act like Dirac-delta-style hard 0/1 switches; in practice they use a sigmoid.
- temporal shortcut connection
jump to some node directly from future node (to propagate gradient)
- adaptive leaky integration
linearly interpolate between the previously accumulated state and the current one
- update gate
how much to take from current vs previous when computing the new hidden state.
- candidate state
the current node's state
- Reset gate - for pruning connections
a gate that determines how much of the previous hidden state to use when computing the new candidate state (a full cell sketch follows)
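Putting those components together, a minimal numpy GRU cell in the convention of Cho et al. 2014 (dimensions and parameter names are made up here):

    import numpy as np

    d, h = 64, 128                                   # assumed input and hidden sizes
    rng = np.random.default_rng(0)
    def mat(*shape): return rng.normal(0, 0.1, shape)
    W_z, U_z = mat(d, h), mat(h, h)                  # update gate parameters
    W_r, U_r = mat(d, h), mat(h, h)                  # reset gate parameters
    W_c, U_c = mat(d, h), mat(h, h)                  # candidate-state parameters

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(prev_h, x):
        z = sigmoid(x @ W_z + prev_h @ U_z)          # update gate: how much of the old state to keep
        r = sigmoid(x @ W_r + prev_h @ U_r)          # reset gate: how much history feeds the candidate
        c = np.tanh(x @ W_c + (r * prev_h) @ U_c)    # new candidate hidden state
        return z * prev_h + (1 - z) * c              # adaptive leaky integration

    h_t = gru_step(np.zeros(h), rng.normal(0, 1, d))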
Neural Machine Translation
---------------------
- can use bilingual and monolingual datasets together for training
log P(translated | base) = log P(base | translated) + log P(translated) - log P(base); the last term is constant across candidate translations, so we maximize log P(base | translated) + log P(translated)
- traditional way
- come up with many features and combine them in a log-linear model
- rerank/filter the candidates generated by that model with a strong language model for the target language, which scores how likely each translated sentence is (toy example below)
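A toy illustration of that filtering idea: the translation-model scores (from bilingual data) are nearly tied, and a language model trained on monolingual target text breaks the tie. All numbers are invented:

    candidates = ["the cat sat", "cat the sat"]
    tm_score = {"the cat sat": -2.1, "cat the sat": -2.0}   # log P(base | candidate), made up
    lm_score = {"the cat sat": -1.0, "cat the sat": -6.0}   # log P(candidate), made up

    best = max(candidates, key=lambda t: tm_score[t] + lm_score[t])
    print(best)   # "the cat sat": the language model rejects the ungrammatical order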
- neural approach: a single encoder-decoder network.
Deep Learning to NLP
--------------------
- don't worry about words. Concerns people raise:
1. prior belief that words are the unit of meaning
2. fear of data sparsity
(addressed by continuous representations for words/characters)
3. worry that we won't be able to train an RNN
- problems with treating each word token separately:
- morphology: each word can have many variants
- sub-optimal segmentation/tokenization
- "run", "ran", "running", etc. become separate vector representations
- compound words
- many rare morphological variants, more prevalent in certain languages
(a toy character-level sketch follows this list)
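One way around this is to build word vectors from character embeddings so related forms share parameters. The averaging scheme below is just for illustration; real models use char-CNNs/RNNs or learned subword units:

    import numpy as np

    rng = np.random.default_rng(0)
    C = {ch: rng.normal(0, 0.1, 16) for ch in "abcdefghijklmnopqrstuvwxyz"}

    def word_vec(w):
        # average of character embeddings: "run", "ran", "running" reuse the same parameters
        return np.mean([C[ch] for ch in w], axis=0)

    for w in ["run", "ran", "running"]:
        print(w, np.round(word_vec(w)[:3], 3))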
- Multi-way Multilingual Neural Machine Translation with a Shared Attention Mechanism
simultaneous training on multiple language pairs with a shared attention mechanism.
- Context matters
surrounding words or related sentences.
http://arxiv.org/abs/1511.03729
Larger-Context Language Modelling (2015)