Scaling Laws for Neural Language Models

This paper establishes that language model performance follows predictable power-law scaling relationships with model size, dataset size, and compute, and shows that compute-efficient training favors very large models trained on relatively modest amounts of data and stopped well short of convergence, rather than smaller models trained to convergence.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei
2020-01-23 · cs.LG
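
A minimal sketch of the power-law relationships the summary refers to, assuming the functional forms reported in the paper (L(N) = (N_c/N)^{α_N} for model size and L(D) = (D_c/D)^{α_D} for dataset size); the constants below are approximate values from the paper's fits, and the function names are illustrative only.

```python
# Illustrative sketch of the paper's power-law loss fits (constants approximate).

def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted test loss vs. non-embedding parameter count N:
    L(N) = (N_c / N) ** alpha_N, for models not limited by data or compute."""
    return (n_c / n_params) ** alpha_n

def loss_vs_data(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """Predicted test loss vs. dataset size D in tokens:
    L(D) = (D_c / D) ** alpha_D, for large models trained with early stopping."""
    return (d_c / n_tokens) ** alpha_d

if __name__ == "__main__":
    # Loss declines smoothly and predictably as model size spans orders of magnitude.
    for n in (1e6, 1e8, 1e10):
        print(f"N = {n:.0e} params -> predicted loss ~ {loss_vs_params(n):.2f}")
```

Because both trends are pure power laws, each order-of-magnitude increase in N or D buys a roughly constant multiplicative reduction in loss, which is what makes performance at larger scales predictable from small-scale runs.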