Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Given the inspiring results transformers have achieved on language tasks (near human-level performance), the authors set out to understand exactly how the hyper-parameters of the transformer training process affect the results.
This is a hugely important "meta" paper exploring those hyper-parameters and how performance scales with them.
Their findings are the following:
- Power-law relationships: model performance (as quantified by test loss) is most strongly influenced by three factors: N (model parameters, excluding embeddings), D (dataset size), and C (compute used for training). This and the data-scaling rule below are illustrated in the sketch after this list.
- **Overfitting** occurs if N (parameters) and D (training set size) are not scaled up together. For example, scaling the model size 8x requires scaling the data roughly 5x.
- Training curves appear independent of model size, which gives some predictability about what to expect when a model is trained longer.
- The transfer penalty appears to be a roughly constant offset: performance on the training distribution's validation set tracks performance on held-out test sets, despite the different distributions.
- Larger models reach the same level of performance with fewer datapoints and fewer optimization steps; they are "more sample-efficient".
- Further, for compute-efficient training, larger models should actually be stopped early, significantly before convergence.
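To make the first two findings concrete, here is a minimal numeric sketch. It assumes the paper's single-variable power-law form L(X) = (X_c / X)^α; the specific constants are illustrative placeholders in the rough range the paper reports, not exact fits.

```python
# A minimal sketch, assuming the single-variable power-law form from the paper:
#   L(X) = (X_c / X) ** alpha_X   for X in {N, D, C}.
# The constants below are illustrative placeholders in the rough range the
# paper reports (alpha_N ~ 0.076, alpha_D ~ 0.095); treat them as assumptions.

def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Predicted test loss as a function of one scale variable (N, D, or C)."""
    return (x_c / x) ** alpha

# Loss vs. non-embedding parameter count N (data and compute held ample).
alpha_N, N_c = 0.076, 8.8e13
# Loss vs. dataset size D in tokens (large model, early stopping).
alpha_D, D_c = 0.095, 5.4e13

for N in (1e6, 1e8, 1e10):
    print(f"N = {N:.0e} params -> predicted loss ~ {power_law_loss(N, N_c, alpha_N):.2f} nats/token")

# Rule of thumb for avoiding overfitting: D should grow roughly like N ** 0.74,
# so an 8x larger model needs about 8 ** 0.74 ~ 4.7x more data ("roughly 5x").
print(f"8x parameters -> ~{8 ** 0.74:.1f}x data")
```

Running it shows the slow power-law decay of loss with N and reproduces the 8x-parameters-to-roughly-5x-data arithmetic (8^0.74 ≈ 4.7).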
The authors fit power laws to all of these relationships using WebText, BPE tokenization with a vocab size of 50,257, measuring performance as cross-entropy loss over a 1024-token context window. They use their usual decoder-only Transformer model.
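For context on the evaluation metric, here is a hedged sketch of measuring per-token cross-entropy over a 1024-token context with a decoder-only model. GPT-2 is used purely as a stand-in because it is publicly available and shares the 50,257-token BPE vocabulary and 1024-token context window; it is not the model suite trained in the paper.

```python
# A minimal sketch of the evaluation setup described above: per-token
# cross-entropy of a decoder-only LM over a 1024-token context window.
# GPT-2 is only a stand-in here -- public, same 50,257-token BPE vocabulary
# and 1024-token context, but not the paper's model suite.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Scaling laws relate test loss to model size, data size, and compute. " * 60
input_ids = tokenizer(text, return_tensors="pt").input_ids[:, :1024]  # clip to the context window

with torch.no_grad():
    # Passing labels=input_ids makes the model return mean next-token cross-entropy.
    out = model(input_ids, labels=input_ids)

print(f"cross-entropy: {out.loss.item():.3f} nats per token")
```

Any sufficiently long text can stand in for the sample here; the point is only that the metric is mean next-token cross-entropy over the fixed 1024-token window.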
Note: They have many sections and appendices worth checking out. As always, OpenAI papers are designed for readability.
(Figures from the original paper.)