- Time series forecasting (시계열 예측)
- the input is a sequence and the output is a single value
- Sentiment classification
- Translation
- the input and output are both sequences
- Speech recognition and generation
- Text or music generation
- Question Answering
- the input is text and the output is a subset of text
- one to many (single input & sequence output)
- many to one (sequence input & single output)
- many to many (sequence input & sequence output)
> Problem 1: variable-length inputs
A plain model does not handle inputs of arbitrary length well. One workaround is to pad every sequence with empty values up to the max length, as in the sketch below.
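A minimal padding sketch, assuming 1-D sequences; the helper name `pad_to_max_length` and the choice of 0.0 as the "empty value" are just for illustration:

```python
import numpy as np

def pad_to_max_length(sequences, max_length, pad_value=0.0):
    """Right-pad each variable-length sequence with an empty value."""
    padded = np.full((len(sequences), max_length), pad_value)
    for i, seq in enumerate(sequences):
        n = min(len(seq), max_length)
        padded[i, :n] = seq[:n]
    return padded

batch = [[1, 2, 3], [4, 5], [6]]
print(pad_to_max_length(batch, max_length=4))
# [[1. 2. 3. 0.]
#  [4. 5. 0. 0.]
#  [6. 0. 0. 0.]]
```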
> Problem 2: memory scaling
If you increase the max length of the sequence, the matrices you have to multiply by grow with it -> undesirable scaling (see the rough illustration below).
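A rough back-of-the-envelope illustration, assuming a single fully connected layer applied to the flattened, zero-padded sequence (the feature dimensions are arbitrary):

```python
# One fully connected layer over the flattened, zero-padded sequence.
d_in, d_out = 64, 64  # arbitrary feature sizes, just to show the trend
for max_len in [10, 100, 1000]:
    n_weights = (max_len * d_in) * d_out
    print(f"max_len={max_len:5d} -> {n_weights:,} weights")
# The weight matrix grows linearly with max_len -> undesirable scaling.
```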
> Problem 3: overkill
When trying to pick up patterns in a sequence, certain inputs will correlate with certain outputs.
The model then has to learn a separate weight for each of these correlations -> data inefficient
- Instead of a single massive matrix that multiplies independent weights for every position in the sequence, use stateful computation
- The model's output depends on the input at the current time step in the sequence
- At each time step, given the input, the model emits the output for that time step and the next hidden state
- There is a starting hidden state, h0
- Given the first input x1, the model combines h0 and x1 to produce the first output y1 and the second hidden state h1
- Each subsequent time step repeats the same computation as in the step above
import numpy as np

class RNN:
    # ... (weights W_xh, W_hh, W_hy and hidden state self.h are initialized elsewhere)

    def compute_next_h(self, x):
        # Simple hidden state computation
        h = np.tanh(self.W_hh.dot(self.h) + self.W_xh.dot(x))
        return h

    # this function gets called every single time the model sees a new input
    def step(self, x):
        # update (next) hidden state
        self.h = self.compute_next_h(x)
        # compute the output vector
        y = self.W_hy.dot(self.h)
        return y
> A look at the compute_next_h function
- 2 inputs: a property of this class (the previous hidden state h_t-1) and the input at this time step, x_t
- Add the two terms (each multiplied by its weight matrix) and wrap the sum in the activation function (tanh)
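In equation form, the update implemented by `compute_next_h` and `step` above is:

$$h_t = \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right), \qquad y_t = W_{hy}\, h_t$$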
- Goal : handle long sequences
- Connect events from the past to outcomes in the future
- i.e., Long-term dependencies
- e.g., remember the name of a character from the first sentence
- Can't handle more than 10-20 timesteps
- Longer-term dependencies get lost
- WHY? Vanishing Gradients
- sigmoid, tanh: as the magnitude of the input grows, the derivative approaches 0 -> vanishing gradients (see the small demo after this list)
- ReLU: exploding gradients (gradients tend to get too big)
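A small numpy demo of why tanh saturates: its derivative is 1 - tanh(x)^2, which shrinks toward 0 as |x| grows (the sample inputs are arbitrary):

```python
import numpy as np

for x in [0.0, 2.0, 5.0, 10.0]:
    grad = 1.0 - np.tanh(x) ** 2  # derivative of tanh at x
    print(f"x = {x:5.1f}  d/dx tanh(x) = {grad:.8f}")
# Backprop through many time steps multiplies many of these small factors
# together, so gradients from distant time steps effectively vanish.
```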
class LSTM(RNN):
    # ...
    def compute_next_h(self, x):
        # lstm() stands in for the LSTM cell update (sketched below)
        h = lstm(x, self.h)
        # or gru(x, self.h), etc.
        return h
    # ...
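The class above calls `lstm(x, self.h)` purely schematically. As a rough illustration only, here is a minimal numpy sketch of a standard LSTM cell; the extra cell state `c`, the gate weights `W_i`, `W_f`, `W_o`, `W_g`, and the omission of bias terms are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(x, h, c, W_i, W_f, W_o, W_g):
    """Minimal LSTM cell sketch (standard equations; bias terms omitted).

    x: current input, h: previous hidden state, c: previous cell state.
    Each W_* maps the concatenated [h, x] vector to the hidden dimension.
    """
    hx = np.concatenate([h, x])
    i = sigmoid(W_i.dot(hx))       # input gate
    f = sigmoid(W_f.dot(hx))       # forget gate
    o = sigmoid(W_o.dot(hx))       # output gate
    g = np.tanh(W_g.dot(hx))       # candidate cell update
    c_next = f * c + i * g         # gated cell state: forget old, add new
    h_next = o * np.tanh(c_next)   # new hidden state
    return h_next, c_next
```

The gates give gradients an additive path through the cell state, which is what helps with the vanishing-gradient problem described above.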
- What problem are they trying to solve?
- What model architecture was used?
- What loss function was used?
- What dataset was it trained on?
- How did they do training?
- What tricks were needed for inference in deployment?
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Wu et al., 2016)
- Summary of GNMT approach
- Stacked LSTM encoder-decoder architecture with residual connections
- Attention enables longer-term connections
- To encode future information in the source sentence, use a bidirectional LSTM
- Train using standard cross-entropy on a large dataset
- Speed up inference with quantization of weights
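As a generic illustration of weight quantization (not necessarily the exact scheme GNMT used): store each weight matrix as 8-bit integers plus a scale, and dequantize during inference.

```python
import numpy as np

def quantize_int8(W):
    """Map a float weight matrix to int8 values plus a per-matrix scale."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale

def dequantize(W_q, scale):
    return W_q.astype(np.float32) * scale

W = np.random.randn(4, 4).astype(np.float32)
W_q, scale = quantize_int8(W)
print(np.abs(W - dequantize(W_q, scale)).max())  # small reconstruction error
```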
- The Goal
- The Idea
- CTC Loss
- Encoder/Decoder LSTM architectures can model arbitrary (one-to-many, many-to-one, many-to-many) sequence problems
- Many successes in NLP and other applications
- Recurrent network training is not as parallelizable as FC or CNN, due to the need to go in sequence
- Therefore much slower!
- Also can be finicky to train
- Convolutional approach to sequence data modeling
- Next lecture, all-fully-connected Transformer models
- Inference
- Primary drawback of WaveNet: although training is parallel, inference is serial
- A follow-up paper introduced a fast, parallel-synthesis version of WaveNet: Parallel WaveNet
- Summary of WaveNet approach
- Another way of dealing with long-term dependencies is to do away with recurrent networks altogether
- Instead, you can use a form of 1d convolutions called causal convolutions
- To increase the receptive field of causal convolutions, you can use dilated causal convolutions (sketched below)
- In addition to long-term dependencies, main advantage of WaveNet is fast parallel training
- Tradeoff is slow inference time, which can be mitigated through the parallel WaveNet approach
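A minimal numpy sketch of a dilated causal convolution for a single channel; the function name and weights are illustrative, not WaveNet's actual implementation:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """Single-channel dilated causal 1-D convolution.

    y[t] depends only on x[t], x[t - dilation], x[t - 2*dilation], ... so no
    information leaks from the future; left-padding keeps the output length
    equal to the input length.
    """
    k = len(w)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[i] * x_padded[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
print(dilated_causal_conv1d(x, w=np.array([1.0, 1.0]), dilation=1))  # x[t] + x[t-1]
print(dilated_causal_conv1d(x, w=np.array([1.0, 1.0]), dilation=2))  # x[t] + x[t-2]
# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field
# exponentially with depth, which is how WaveNet reaches long-term context.
```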