- ResNet-50 performs well, but the network is very large and overfitting is a concern
- Solution : pretrain a neural network on ImageNet (over a million labeled images) -> fine-tuning
- Train the model on ImageNet -> add or replace the last layer(s) for the target task (sketched below)
- Learns faster and more accurately with far less task-specific data
- Pretrained models are available in both TensorFlow and PyTorch
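A minimal PyTorch sketch of this recipe, assuming torchvision >= 0.13 (for the `weights` argument) and a hypothetical 10-class target task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-50 with ImageNet-pretrained weights
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze the pretrained backbone so only the new head is trained at first
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 10-class task
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimize only the new layer's parameters
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

Freezing the backbone and training only the new head is a common starting point; the whole network can be unfrozen later for full fine-tuning.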
Input data
- in NLP : sequence of words
- in Deep Learning : vectors
How do we convert a word into a vector?
- one-hot encoding
- embedding : text mapped to numbers (a dense vector) that a machine can work with (see the sketch below)
- Solution 1 : Learn as part of the task
- Solution 2 : Learn a Language Model
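A small sketch contrasting the two representations; the vocabulary size and embedding dimension below are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 10000                        # hypothetical vocabulary size
word_index = torch.tensor([42])           # integer id of some word

# One-hot encoding: a sparse vocab_size-dim vector with a single 1
one_hot = F.one_hot(word_index, num_classes=vocab_size).float()   # (1, 10000)

# Embedding: a dense, trainable vector looked up by index; with "Solution 1"
# this table is learned end-to-end as part of the downstream task
embedding = nn.Embedding(vocab_size, embedding_dim=300)
dense_vec = embedding(word_index)                                 # (1, 300)
```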
- Word2Vec and GloVe embeddings became popular in ~2013-14
- But these representations are shallow:
- only the first layer benefits from having seen all of Wikipedia
- the rest of the model (LSTMs, etc.) is trained only on the much smaller task dataset (see the sketch below)
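A sketch of why static embeddings are shallow, assuming gensim's downloader and its prepackaged "glove-wiki-gigaword-100" vectors: only the first (embedding) layer carries knowledge from the large corpus, while everything stacked on top starts from random initialization.

```python
import torch
import torch.nn as nn
import gensim.downloader as api

# Pretrained GloVe vectors: the only part of the model that has "seen" the big corpus
glove = api.load("glove-wiki-gigaword-100")              # KeyedVectors, 100-dim
embedding = nn.Embedding.from_pretrained(
    torch.tensor(glove.vectors), freeze=False)           # first layer: pretrained

# The layers above are randomly initialized and see only the small task dataset
lstm = nn.LSTM(input_size=100, hidden_size=256, batch_first=True)
classifier = nn.Linear(256, 2)                           # hypothetical 2-class task
```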
- ELMo (2018)
- bidirectional stacked LSTM (see the sketch below)
- SQuAD dataset
- SNLI dataset
- GLUE dataset
- ULMFiT
- similar to ELMo (an LSTM language model is pretrained, then fine-tuned on the task)
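A minimal sketch of the ELMo idea referenced above: a stacked bidirectional LSTM whose hidden states serve as contextual word representations (the same word gets a different vector depending on its sentence). The real model uses character convolutions and is trained as forward/backward language models; the dimensions here are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 128, 256      # illustrative sizes

embedding = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                 bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 12))           # one sentence, 12 token ids
states, _ = bilstm(embedding(tokens))                    # (1, 12, 2 * hidden_dim)
# each of the 12 positions now has a context-dependent representation
```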
- Paper : Attention Is All You Need (2017)
Basic self-attention
- the weights are not learned; they are computed directly from x_i and x_j (e.g. a dot product), as sketched below
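A sketch of basic self-attention with no learned parameters, where the weight between positions i and j is just the softmaxed dot product of x_i and x_j:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 5, 16)                  # (batch, sequence length, feature dim)

scores = torch.bmm(x, x.transpose(1, 2))   # w'_ij = x_i . x_j       -> (1, 5, 5)
weights = F.softmax(scores, dim=-1)        # normalize over j
y = torch.bmm(weights, x)                  # y_i = sum_j w_ij * x_j  -> (1, 5, 16)
```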
Attention Function
- Query, Key, Value
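A sketch of scaled dot-product attention with learned query/key/value projections (single head, illustrative dimensions):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 16
to_q = nn.Linear(d_model, d_model)
to_k = nn.Linear(d_model, d_model)
to_v = nn.Linear(d_model, d_model)

x = torch.randn(1, 5, d_model)
q, k, v = to_q(x), to_k(x), to_v(x)        # learned projections of the input

scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d_model)   # scaled dot product
attn = F.softmax(scores, dim=-1)
out = torch.bmm(attn, v)                   # (1, 5, d_model)
```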
Transformer
- self-attention layer -> layer normalization -> dense (feed-forward) layer (sketched below)
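A minimal sketch of one such block using PyTorch's built-in multi-head attention; the residual connections and post-norm ordering follow the original paper, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    # self-attention -> add & layer norm -> feed-forward -> add & layer norm
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: query = key = value = x
        x = self.norm1(x + attn_out)       # residual + layer norm
        x = self.norm2(x + self.ff(x))     # residual + layer norm
        return x

out = TransformerBlock()(torch.randn(1, 10, 64))   # (batch, seq len, d_model)
```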
GPT/GPT-2
- Generative Pre-trained Transformer
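Since GPT-2 weights are publicly available, a quick way to see the "generative" (autoregressive) part in action is the Hugging Face transformers library, assuming it is installed:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Autoregressive generation: the model predicts one next token at a time
inputs = tokenizer("Deep learning is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```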
BERT
- Bidirectional Encoder Representations from Transformers
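BERT is pretrained with masked language modeling, so it uses both left and right context to predict a hidden token; a quick illustration via the Hugging Face fill-mask pipeline (assumed installed):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# The model fills in [MASK] using context from both directions
print(fill_mask("Transfer learning lets us [MASK] a pretrained model on a new task."))
```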
T5
- Text-to-Text Transfer Transformer
GPT-3
DistilBERT
- a smaller student model is trained to reproduce the outputs of a larger teacher model (knowledge distillation)
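A sketch of the distillation objective (the student matches the teacher's softened output distribution); DistilBERT combines a term like this with the usual masked-LM loss and a hidden-state cosine loss, and the temperature below is illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 so gradients keep a comparable magnitude
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Example with random logits: batch of 8, arbitrary output size
loss = distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000))
```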