
Summary: Training Compute-Optimal Large Language Models

  • Investigates how to allocate a fixed training-compute budget between model parameters (N) and training tokens (D) for transformer LLMs.

  • Derives an updated scaling law showing the compute-optimal frontier requires N and D to grow proportionally (≈ C^0.5 each), unlike earlier “parameter-heavy” prescriptions.

  • Demonstrates the law empirically by training Chinchilla (70 B parameters, 1.4 T tokens) on the same compute budget as Gopher; Chinchilla outperforms much larger models (Gopher 280 B, GPT-3 175 B, MT-NLG 530 B) across a wide range of downstream tasks.

  • Equal-exponent scaling rule (N ∝ C^0.5 and D ∝ C^0.5, so parameters and training tokens grow in lock-step) replaces Kaplan et al. (2020)’s prescription (N ∝ C^0.73, D ∝ C^0.27).
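
A minimal Python sketch of what the equal-exponent rule implies in practice, assuming the common C ≈ 6·N·D FLOP approximation and the roughly 20-tokens-per-parameter ratio suggested by the paper's compute-optimal estimates (both are approximations, not the paper's exact procedure):

```python
# Sketch: allocate a fixed FLOP budget under the equal-exponent rule.
# Assumptions: C ≈ 6 * N * D training FLOPs, and D ≈ 20 * N (a rule of
# thumb consistent with the paper's compute-optimal estimates).

def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Return (parameters, tokens) that spend `flops` with D ≈ tokens_per_param * N."""
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    budget = 5.76e23  # FLOPs shared by Gopher and Chinchilla
    n, d = compute_optimal_split(budget)
    print(f"N ≈ {n:.2e} parameters, D ≈ {d:.2e} tokens")
    # ≈ 6.9e10 parameters and 1.4e12 tokens, close to Chinchilla's 70 B / 1.4 T.
```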

  • Introduces three complementary methodologies—training-curve envelope, IsoFLOP valleys, and parametric loss fitting—to estimate the compute-efficient frontier directly from data.

  • More than 400 pre-training runs spanning 70 M to 16 B parameters and 5 B to 500 B tokens map the loss-versus-compute surface.

  • Constructed training-curve envelopes to find minimal loss per FLOP, and IsoFLOP curves to locate loss minima at fixed compute levels.
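
A minimal sketch of the envelope construction, using synthetic training curves in place of the paper's real runs: for each FLOP budget, take the lowest loss any run has reached within that budget; the size of the winning run traces out the compute-efficient frontier.

```python
import numpy as np

# Sketch of a training-curve envelope on synthetic data. Each run records
# (cumulative FLOPs, smoothed training loss); real runs would replace these.
flops_axis = np.logspace(17, 21, 60)
runs = {
    1e8: (flops_axis, 2.2 + 8.0 * flops_axis ** -0.05),   # small model (synthetic)
    1e9: (flops_axis, 2.0 + 10.0 * flops_axis ** -0.05),  # larger model (synthetic)
}

envelope = []
for budget in np.logspace(18, 21, 20):
    best_loss, best_size = float("inf"), None
    for size, (flops, loss) in runs.items():
        reachable = flops <= budget            # points this run reaches within budget
        if reachable.any():
            run_best = loss[reachable].min()
            if run_best < best_loss:
                best_loss, best_size = run_best, size
    envelope.append((budget, best_size, best_loss))  # optimal size per FLOP budget
```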

  • Parametric fit of loss L(N, D) ≈ E + A / N^α + B / D^β to derive a closed-form optimum; the fitted exponents give N_opt ∝ C^0.46 and D_opt ∝ C^0.54.
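
A simplified stand-in for this fit (the paper minimises a Huber loss in log space with L-BFGS; plain least squares on synthetic data is used here, with the data generated from the paper's published coefficients E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28):

```python
import numpy as np
from scipy.optimize import curve_fit

def parametric_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) triples standing in for the ~400 measured runs.
rng = np.random.default_rng(0)
N = rng.uniform(7e7, 1.6e10, size=200)            # model sizes (synthetic)
D = rng.uniform(5e9, 5e11, size=200)              # token counts (synthetic)
L = parametric_loss((N, D), 1.69, 406.4, 410.7, 0.34, 0.28)
L += rng.normal(0.0, 0.01, size=L.shape)          # small observation noise

p0 = [2.0, 100.0, 100.0, 0.3, 0.3]
(E, A, B, alpha, beta), _ = curve_fit(parametric_loss, (N, D), L, p0=p0, maxfev=50000)

# Compute-optimal exponents follow from the fit: a = beta/(alpha+beta), b = alpha/(alpha+beta)
a, b = beta / (alpha + beta), alpha / (alpha + beta)
print(f"N_opt ∝ C^{a:.2f}, D_opt ∝ C^{b:.2f}")     # ≈ C^0.46 and C^0.54
```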

  • Full-scale training of Chinchilla (same 5.76 × 10²³ FLOPs as Gopher) and head-to-head evaluation on:

    • Language modelling: bits-per-byte on The Pile and perplexity on WikiText-103.
    • MMLU (+7.6 pp over Gopher).
    • BIG-bench (+10.7 pp).
    • Reading comprehension (RACE, LAMBADA) and closed-book QA (Natural Questions, TriviaQA).
  • Bias & toxicity checks (Winogender, Perspective API) showing no adverse increase vs. Gopher.

  • Only two full-budget runs (Gopher & Chinchilla); intermediate-scale validations are missing.

  • Power-law assumption may be imperfect; slight concavity at extreme compute suggests optimum sizes could be even smaller.

  • All experiments train for less than one epoch over the corpus, so multi-epoch behaviour remains untested.

  • Possible train/test leakage: Chinchilla sees roughly 4× more training data than Gopher, which could inflate its language-modelling benchmark scores.

  • Collect and curate larger, higher-quality corpora (multi-trillion tokens) to test the scaling law without leakage and study data quality effects.

  • Run additional compute-matched experiments at intermediate scales to densify the frontier and verify concavity.

  • Extend the methodology to other modalities (vision, audio, multimodal) and to Mixture-of-Experts or retrieval-augmented architectures.

  • Investigate epoch-wise scaling (multiple passes) and its interaction with learning-rate schedules.

  • IsoFLOP valleys – U-shaped curves obtained by fixing a total FLOP budget, varying model size, and plotting the final loss; their minima show the parameter count that is compute-optimal for that budget.
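
A minimal sketch of reading off one valley, assuming a handful of compute-matched runs at a single FLOP budget (the loss values below are synthetic placeholders): fit a parabola in log-parameter space and take its vertex.

```python
import numpy as np

# Locate an IsoFLOP valley at one fixed FLOP budget by fitting a quadratic
# to (log10 N, final loss) and reading off the vertex.
log_n = np.log10(np.array([4e8, 1e9, 2.5e9, 6e9, 1.5e10]))   # model sizes tried
loss  = np.array([2.95, 2.80, 2.74, 2.78, 2.92])             # final losses (synthetic)

a2, a1, a0 = np.polyfit(log_n, loss, deg=2)   # quadratic fit in log-parameter space
n_opt = 10 ** (-a1 / (2 * a2))                # vertex = compute-optimal size
print(f"compute-optimal size at this budget ≈ {n_opt:.2e} parameters")
```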

  • Parametric loss fitting – a modelling step that fits the function L(N, D) = E + A / N^α + B / D^β to all measured (loss, parameters, tokens) triples so that the closed-form optimum N_opt(C), D_opt(C) can be predicted analytically.
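
Under the C ≈ 6·N·D approximation used in the paper, minimising this fitted loss subject to the FLOP constraint gives the closed form below (matching the expression reported in the paper, up to notation):

```latex
% Closed-form optimum under the FLOP constraint C \approx 6ND:
N_{\mathrm{opt}}(C) = G \left(\tfrac{C}{6}\right)^{a}, \qquad
D_{\mathrm{opt}}(C) = G^{-1} \left(\tfrac{C}{6}\right)^{b},
\qquad
G = \left(\tfrac{\alpha A}{\beta B}\right)^{\tfrac{1}{\alpha+\beta}},
\quad a = \tfrac{\beta}{\alpha+\beta}, \quad b = \tfrac{\alpha}{\alpha+\beta}.
```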

  • MMLU – “Massive Multitask Language Understanding”, a 57-task exam-style benchmark spanning STEM, the humanities, and the social sciences.

  • The Pile bits-per-byte (bpb) – a language-model metric equal to average cross-entropy (in bits) per byte of text on The Pile corpus; lower bpb means better compression/prediction.
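
A small sketch of the unit conversion, assuming the model reports an average cross-entropy in nats per token (the function name and figures below are illustrative, not from the paper):

```python
import math

def bits_per_byte(nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    """Convert an average per-token cross-entropy (nats) into bits per byte of text."""
    total_bits = nats_per_token * num_tokens / math.log(2)   # nats -> bits
    return total_bits / num_bytes

# e.g. 2.0 nats/token over a document of 1 000 tokens and 4 200 UTF-8 bytes
print(bits_per_byte(2.0, 1_000, 4_200))   # ≈ 0.687 bpb
```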

  • BIG-bench – the “Beyond the Imitation Game” benchmark, a community collection of 200+ diverse tasks.

  • RACE – the “Reading Comprehension from Examinations” dataset of roughly 100 k questions drawn from English exams for Chinese middle- and high-school students.

  • LAMBADA – a 10 k-passage cloze benchmark where the model must guess the last word of a narrative.

  • Natural Questions (NQ) – a Google QA corpus of real search queries, each paired with a Wikipedia page that contains the answer.

  • Winogender – a diagnostic corpus of minimal-pair sentences that differ only by pronoun gender, used to reveal occupation-related gender bias in coreference resolution.

  • Perspective API – Google Jigsaw’s public service that assigns probabilistic toxicity scores (0–1) to text, where higher scores indicate language likely to drive users out of a discussion.
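
A hedged sketch of scoring one text with the API's TOXICITY attribute; the endpoint and request shape follow the public commentanalyzer documentation, while the key and response handling here are placeholders for illustration:

```python
import requests  # third-party; pip install requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return the Perspective API TOXICITY summary score (0-1) for `text`."""
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# print(toxicity_score("You are a wonderful person."))  # expect a low score
```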

  • Bias & toxicity checks (in LLMs) – systematic evaluations that combine datasets such as Winogender with automatic tools like Perspective API to quantify demographic bias and harmful-language propensity in generated text.