Scaling Laws for Neural Language Models

Summary

| Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| — | Decoder-only Transformer | Autoregressive LM (cross-entropy) | BPE | 50,257 | — | — | — | — | — | — | WebText | — |

TL;DR

Given the inspiring results transformers have achieved on language tasks (near human-level performance), the authors set out to understand how the key factors in the transformer training process - model size, dataset size, and compute - affect the results.

This is a hugely important "meta" paper exploring the hyper-parameters that govern scaling.

Their findings are the following:

  1. Power-law relationships: Model performance (as quantified by test loss) is most strongly influenced by three factors: N (model parameters, excluding embeddings); D (dataset size in tokens); and C (compute used for training). The fitted forms are sketched in the code after this list.

  2. **Overfitting** occurs if N (parameters) and D (training set size) are not scaled up together. For example, if the model size is scaled up 8x, the data size must be scaled up roughly 5x to avoid a penalty.

  3. Training curves appear independent of model size - this allows some predictability in what to expect when a model is trained for longer.

  4. The transfer penalty appears to be a constant offset - i.e., loss on held-out text from a different distribution tracks the loss on the training/validation distribution, shifted by a roughly constant amount.

  5. Larger models reach the same level of performance with fewer data points and fewer optimization steps - they are "more sample-efficient".

  6. Further, larger models actually achieve optimal performance when stopped early, significantly before convergence.
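
To make the shape of these findings concrete, here is a minimal sketch of the power-law forms the paper fits. The exponents and constants (`ALPHA_N`, `N_C`, etc.) are the approximate values reported in the paper; the function names are mine, and the whole block is illustrative rather than a reproduction of the authors' code.

```python
# Approximate power-law fits reported in the paper (illustrative values):
#   L(N) = (N_c / N) ** alpha_N   -- loss vs. non-embedding parameter count
#   L(D) = (D_c / D) ** alpha_D   -- loss vs. dataset size (tokens)
#   L(C) = (C_c / C) ** alpha_C   -- loss vs. compute (PF-days, compute-efficient frontier)
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13
ALPHA_C, C_C = 0.050, 3.1e8

def loss_vs_params(n_params: float) -> float:
    """Test loss as a function of non-embedding parameters N, data/compute not limiting."""
    return (N_C / n_params) ** ALPHA_N

def loss_vs_data(n_tokens: float) -> float:
    """Test loss as a function of dataset size D, model size/compute not limiting."""
    return (D_C / n_tokens) ** ALPHA_D

def loss_vs_compute(pf_days: float) -> float:
    """Test loss as a function of compute C along the compute-efficient frontier."""
    return (C_C / pf_days) ** ALPHA_C

# Finding 2 in code: the paper reports that data should grow roughly as D ∝ N ** 0.74
# to avoid an overfitting penalty, so an 8x larger model needs ~5x more data.
data_scale = 8.0 ** 0.74
print(f"8x model -> ~{data_scale:.1f}x data")  # ~4.7x, i.e. roughly 5x
```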

The authors fit power laws to all of these relationships using WebText, BPE tokenization with a vocabulary size of 50,257, a 1024-token context window, and cross-entropy as the performance measure. They use their usual decoder-only model.
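
For intuition on what "fitting a power law" means here, below is a minimal sketch of one way to do it: a power law is a straight line in log-log space, so a linear fit recovers the exponent and constant. The (N, loss) measurements are made up for illustration; this is not the authors' fitting procedure.

```python
import numpy as np

# Hypothetical (non-embedding parameter count, test loss) measurements; the paper's
# actual values come from training many decoder-only Transformers on WebText.
n_params = np.array([1e6, 1e7, 1e8, 1e9])
test_loss = np.array([5.2, 4.3, 3.6, 3.0])  # made-up numbers for illustration

# A power law L(N) = (N_c / N)**alpha is linear in log-log space:
#   log L = alpha * log N_c - alpha * log N
slope, intercept = np.polyfit(np.log(n_params), np.log(test_loss), deg=1)
alpha = -slope
n_c = np.exp(intercept / alpha)
print(f"alpha_N ≈ {alpha:.3f}, N_c ≈ {n_c:.2e}")
```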

Note: They have many sections and appendices worth checking out. As always, OpenAI papers are designed for readability.

Art

Figure 1: Scaling Laws (figure from original paper)

Figure 2: Training Curves (figure from original paper)

Figure 4: Sample Efficiency of Large Models (figure from original paper)