
The Scaling Laws Are In Our Stars, Not Ourselves

Beren Millidge

2025

Abstract

Neural network scaling laws are fundamentally a property of data distributions rather than specific model architectures. While architectural choices influence scaling coefficients, sufficiently expressive architectures consistently exhibit power-law decays in loss when trained on large-scale naturalistic datasets. This phenomenon arises because the covariance matrices of natural datasets, such as text and images, possess a fractal-like structure characterized by a power-law spectral decay. Learning in non-linear neural networks mirrors the behavior of linear models, where the network progressively identifies the eigenvectors of the dataset’s covariance matrix at a rate proportional to their corresponding eigenvalues. Features with higher eigenvalues, which explain the most variance, are learned first; scaling the model size or training duration enables the resolution of increasingly fine-grained features further down the spectral distribution. Consequently, the power-law relationship between loss and scale reflects the inherent information density of the data itself. Under this framework, model capacity functions as a lens and training duration as exposure time, both facilitating the extraction of finer detail from a continuous, multi-scale generative process. The continuation of these scaling laws is bounded primarily by the limits of dataset size and model parameterization relative to the data’s rank. – AI-generated abstract.
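To make the claimed mechanism concrete, here is a minimal sketch (an illustration, not code from the paper): for a linear model trained by gradient descent on data whose covariance eigenvalues decay as a power law, λ_k ∝ k^(-α), the error along mode k shrinks at a rate set by λ_k, so mode k is resolved after roughly λ_k^(-1) steps; summing the still-unlearned tail of the spectrum gives loss ∝ t^(-(α-1)/α), a power law in training time. The exponent α, dimensions, and learning rate below are arbitrary choices for the demonstration.

```python
# Sketch: power-law covariance spectrum -> power-law loss curve for a linear
# model trained by gradient descent (all constants are illustrative).
import numpy as np

d = 50_000                                # number of eigen-modes
alpha = 1.5                               # assumed spectral decay exponent
lam = np.arange(1, d + 1, dtype=float) ** (-alpha)   # eigenvalues lambda_k ~ k^-alpha
w_star = np.ones(d)                       # target weights, O(1) in every eigen-direction
lr = 0.5 / lam[0]                         # learning rate below the stability threshold

# In the eigenbasis, gradient descent multiplies the error on mode k by
# (1 - lr * lambda_k) each step, so the loss at step t has a closed form:
# loss(t) = sum_k lambda_k * (w*_k * (1 - lr * lambda_k)^t)^2.
steps = np.unique(np.logspace(0, 6, 40).astype(int))
loss = np.array([np.sum(lam * (w_star * (1 - lr * lam) ** t) ** 2) for t in steps])

# Fit the exponent of loss vs. steps in the asymptotic region and compare it
# with the prediction -(alpha - 1) / alpha from summing the unlearned tail.
mask = steps >= 100
slope = np.polyfit(np.log(steps[mask]), np.log(loss[mask]), 1)[0]
print(f"measured loss exponent:      {slope:.2f}")
print(f"predicted -(alpha-1)/alpha:  {-(alpha - 1) / alpha:.2f}")
```

Under these assumptions the measured exponent comes out close to the predicted −1/3: features with large eigenvalues are fit almost immediately, and further training only chips away at progressively smaller-variance modes, which is the "fine detail further down the spectrum" picture described in the abstract.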
