Neural Language Modeling
------------
- replace the MLE n-gram language model with a (neural) function approximator.
Bengio et al. 2000
-------------------
1. 1-of-K encoding
2. continuous space word representation
3. nonlinear hidden layer (tanh)
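A minimal numpy sketch of those three steps as I read them (vocabulary size, embedding/hidden dimensions, and the function name are made up for illustration, not from the talk):

    import numpy as np

    V, d, h, n = 10000, 64, 128, 3            # assumed vocab size, embedding dim, hidden dim, context length
    rng = np.random.default_rng(0)
    E = rng.normal(0, 0.1, (V, d))            # continuous word representations
    W = rng.normal(0, 0.1, (n * d, h))        # input-to-hidden weights
    b = np.zeros(h)
    U = rng.normal(0, 0.1, (h, V))            # hidden-to-output weights
    c = np.zeros(V)

    def neural_lm_probs(context_ids):
        # 1. 1-of-K encoding is implicit: indexing E equals multiplying a one-hot vector by E
        x = E[context_ids].reshape(-1)        # 2. concatenated continuous word representations
        hid = np.tanh(x @ W + b)              # 3. nonlinear hidden layer (tanh)
        logits = hid @ U + c
        p = np.exp(logits - logits.max())
        return p / p.sum()                    # softmax over the next word

    p_next = neural_lm_probs([12, 7, 345])    # P(next word | previous n words)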
Dependency beyond the context window
-------------------
- Models that don't incorporate this:
1. ngram
2. word2vec
- Recursively apply a function f to the previous state and the current word.
- RNN intuition
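In symbols (my notation, not from the talk): h_t = f(h_{t-1}, x_t), where x_t is the current word's vector; because the same f is applied at every step, h_t can in principle carry information about the entire prefix rather than a fixed context window.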
RNNLM
----------------
- steps:
1. continuous representation for words
2. linear transformation of previous hidden state
3. additive combination of previous word and hidden state vectors with bias
4. pointwise nonlinearity: tanh (a numpy sketch of one step follows)
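A numpy sketch of one RNNLM state update following those four steps (sizes and names are placeholders):

    import numpy as np

    V, d, h = 10000, 64, 128                  # assumed vocab size, embedding dim, hidden dim
    rng = np.random.default_rng(0)
    E   = rng.normal(0, 0.1, (V, d))          # step 1: continuous word representations
    W_x = rng.normal(0, 0.1, (d, h))
    W_h = rng.normal(0, 0.1, (h, h))          # step 2: transform of the previous hidden state
    b   = np.zeros(h)

    def rnnlm_step(prev_h, word_id):
        x = E[word_id]                        # 1. look up the word's continuous representation
        pre = x @ W_x + prev_h @ W_h + b      # 2.+3. additive combination with a bias
        return np.tanh(pre)                   # 4. pointwise nonlinearity (tanh)

    h_t = np.zeros(h)
    for w in [12, 7, 345]:                    # toy word-id sequence
        h_t = rnnlm_step(h_t, w)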
GRU
-----------------
- Problem with RNN training - vanishing gradients
- backpropagation multiplies by the transition matrix once per time step
- so the norm of the gradient will either shrink or explode
If the largest eigenvalue of the transition matrix is < 1, there is a good chance the gradient will shrink to 0; if it is > 1, it tends to explode (toy demo below).
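A toy numpy demo of that shrinkage (it constrains the largest singular value rather than the eigenvalue, which gives a clean bound; all sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    h = 100
    W = rng.normal(0, 1, (h, h))
    W *= 0.9 / np.linalg.norm(W, 2)           # largest singular value is now 0.9 (< 1)

    g = rng.normal(0, 1, h)                   # stand-in for a backpropagated gradient
    for t in range(50):
        g = W.T @ g                           # one step of backprop through time
        if t % 10 == 9:
            print(t + 1, np.linalg.norm(g))   # norm decays at most like 0.9**t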
- intuition
To propagate the gradient from time t+H back to time t, we have to go through all the time steps in between, which
can cause the gradient to shrink to 0 or explode. Instead, we could bypass the intermediate steps of BPTT (backprop through time)
by adding connections that go to those nodes directly (i.e. t -> t+1, t -> t+2, ..., t -> t+H).
But using bypass connections this way has 2 problems:
a. the number of connections will explode
b. we will effectively limit the context window that we are looking at
Idea:
We should let the network also learn how much to prune the unnecessary connections.
- Components
1. hidden state
2. update gate
3. new candidate hidden state
4. reset gate
- the gates are meant to act like Dirac-delta-style hard 0/1 switches; in practice they use a sigmoid.
- temporal shortcut connection
jump to some node directly from future node (to propagate gradient)
- adaptive leaky integration
linearly interpolate between the previously accumulated state and the current one
- update gate
how much to take from current vs previous when computing the new hidden state.
- candidate state
the current node's state
- Reset gate - for pruning connections
a gate that determines how much of the previous hidden state to use when computing the new candidate state (a full cell sketch follows)
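Putting those components together, a minimal numpy GRU cell in the convention of Cho et al. 2014 (dimensions and parameter names are made up here):

    import numpy as np

    d, h = 64, 128                                   # assumed input and hidden sizes
    rng = np.random.default_rng(0)
    def mat(*shape): return rng.normal(0, 0.1, shape)
    W_z, U_z = mat(d, h), mat(h, h)                  # update gate parameters
    W_r, U_r = mat(d, h), mat(h, h)                  # reset gate parameters
    W_c, U_c = mat(d, h), mat(h, h)                  # candidate-state parameters

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(prev_h, x):
        z = sigmoid(x @ W_z + prev_h @ U_z)          # update gate: how much of the old state to keep
        r = sigmoid(x @ W_r + prev_h @ U_r)          # reset gate: how much history feeds the candidate
        c = np.tanh(x @ W_c + (r * prev_h) @ U_c)    # new candidate hidden state
        return z * prev_h + (1 - z) * c              # adaptive leaky integration

    h_t = gru_step(np.zeros(h), rng.normal(0, 1, d))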
Neural Machine Translation
---------------------
- can use bilingual and monolingual datasets together for training
log P(translated | base) = log P(base | translated) + log P(translated) - log P(base); the last term is constant across candidate translations, so we maximize log P(base | translated) + log P(translated)
- traditional way
- come up with many features and combine them in a log-linear model
- rerank/filter the candidates generated by that model with a strong language model for the target language, which scores how likely each translated sentence is (toy example below)
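A toy illustration of that filtering idea: the translation-model scores (from bilingual data) are nearly tied, and a language model trained on monolingual target text breaks the tie. All numbers are invented:

    candidates = ["the cat sat", "cat the sat"]
    tm_score = {"the cat sat": -2.1, "cat the sat": -2.0}   # log P(base | candidate), made up
    lm_score = {"the cat sat": -1.0, "cat the sat": -6.0}   # log P(candidate), made up

    best = max(candidates, key=lambda t: tm_score[t] + lm_score[t])
    print(best)   # "the cat sat": the language model rejects the ungrammatical order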
- neural approach: a single encoder-decoder network.
Deep Learning to NLP
--------------------
- don't worry about words. Concerns people raise:
1. prior belief that words are the unit of meaning
2. fear of data sparsity
(addressed by continuous representations for words/characters)
3. worry that we won't be able to train an RNN
- problems with treating each word token separately:
- morphology: each word can have many variants
- sub-optimal segmentation/tokenization
- "run", "ran", "running", etc. become separate vector representations
- compound words
- many rare morphological variants, more prevalent in certain languages
(a toy character-level sketch follows this list)
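One way around this is to build word vectors from character embeddings so related forms share parameters. The averaging scheme below is just for illustration; real models use char-CNNs/RNNs or learned subword units:

    import numpy as np

    rng = np.random.default_rng(0)
    C = {ch: rng.normal(0, 0.1, 16) for ch in "abcdefghijklmnopqrstuvwxyz"}

    def word_vec(w):
        # average of character embeddings: "run", "ran", "running" reuse the same parameters
        return np.mean([C[ch] for ch in w], axis=0)

    for w in ["run", "ran", "running"]:
        print(w, np.round(word_vec(w)[:3], 3))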
- Multi-way Multilingual Neural Machine Translation with a Shared Attention Mechanism
simultaneous training on multiple language pairs with a shared attention mechanism.
- Context matters
surrounding words or related sentences.
http://arxiv.org/abs/1511.03729
Larger-Context Language Modelling (2015)