Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Given the inspiring results transformers have achieved on language tasks (near human-level performance), the authors set out to understand exactly how the hyper-parameters of the transformer training process affect the results.
This is a hugely important "meta" paper exploring those hyper-parameters and how performance scales with them.
Their findings are the following:
- Power-law relationships: model performance (as quantified by test loss) is most strongly influenced by three factors: N (model parameters, excluding embeddings), D (dataset size), and C (compute used for training). This and the data-scaling rule below are illustrated in the sketch after this list.
- **Overfitting** occurs if N (parameters) and D (training set size) are not scaled up together. For example, scaling the model size 8x requires scaling the data roughly 5x.
- Training curves appear independent of model size, which gives some predictability about what to expect when a model is trained longer.
- The transfer penalty appears to be a roughly constant offset: performance on the training distribution's validation set tracks performance on held-out test sets, despite the different distributions.
- Larger models reach the same level of performance with fewer datapoints and fewer optimization steps; they are "more sample-efficient".
- Further, for compute-efficient training, larger models should actually be stopped early, significantly before convergence.
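To make the first two findings concrete, here is a minimal numeric sketch. It assumes the paper's single-variable power-law form L(X) = (X_c / X)^α; the specific constants are illustrative placeholders in the rough range the paper reports, not exact fits.

```python
# A minimal sketch, assuming the single-variable power-law form from the paper:
#   L(X) = (X_c / X) ** alpha_X   for X in {N, D, C}.
# The constants below are illustrative placeholders in the rough range the
# paper reports (alpha_N ~ 0.076, alpha_D ~ 0.095); treat them as assumptions.

def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Predicted test loss as a function of one scale variable (N, D, or C)."""
    return (x_c / x) ** alpha

# Loss vs. non-embedding parameter count N (data and compute held ample).
alpha_N, N_c = 0.076, 8.8e13
# Loss vs. dataset size D in tokens (large model, early stopping).
alpha_D, D_c = 0.095, 5.4e13

for N in (1e6, 1e8, 1e10):
    print(f"N = {N:.0e} params -> predicted loss ~ {power_law_loss(N, N_c, alpha_N):.2f} nats/token")

# Rule of thumb for avoiding overfitting: D should grow roughly like N ** 0.74,
# so an 8x larger model needs about 8 ** 0.74 ~ 4.7x more data ("roughly 5x").
print(f"8x parameters -> ~{8 ** 0.74:.1f}x data")
```

Running it shows the slow power-law decay of loss with N and reproduces the 8x-parameters-to-roughly-5x-data arithmetic (8^0.74 ≈ 4.7).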
The authors fit power laws to all of these relationships using WebText, BPE tokenization with a vocab size of 50,257, measuring performance as cross-entropy loss over a 1024-token context window. They use their usual decoder-only Transformer model.
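For context on the evaluation metric, here is a hedged sketch of measuring per-token cross-entropy over a 1024-token context with a decoder-only model. GPT-2 is used purely as a stand-in because it is publicly available and shares the 50,257-token BPE vocabulary and 1024-token context window; it is not the model suite trained in the paper.

```python
# A minimal sketch of the evaluation setup described above: per-token
# cross-entropy of a decoder-only LM over a 1024-token context window.
# GPT-2 is only a stand-in here -- public, same 50,257-token BPE vocabulary
# and 1024-token context, but not the paper's model suite.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Scaling laws relate test loss to model size, data size, and compute. " * 60
input_ids = tokenizer(text, return_tensors="pt").input_ids[:, :1024]  # clip to the context window

with torch.no_grad():
    # Passing labels=input_ids makes the model return mean next-token cross-entropy.
    out = model(input_ids, labels=input_ids)

print(f"cross-entropy: {out.loss.item():.3f} nats per token")
```

Any sufficiently long text can stand in for the sample here; the point is only that the metric is mean next-token cross-entropy over the fixed 1024-token window.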
Note: They have many sections and appendices worth checking out. As always, OpenAI papers are designed for readability.
(Figures from the original paper.)