works
Pablo Villalobos et al. Will we run out of data? Limits of LLM scaling based on human-generated data. Online.

Abstract

The stock of publicly available, human-generated text data is estimated to be approximately 300 trillion tokens. This data, crucial for training large language models (LLMs), is projected to be fully utilized between 2026 and 2032, assuming current trends in dataset growth and model scaling continue. The exact timeline depends on factors such as the degree of model overtraining and the rate of dataset growth. Undertraining models (using less data per model than compute-optimal scaling would dictate) could extend the usable lifetime of the data stock somewhat, but ultimately novel approaches like synthetic data generation, cross-modal learning, and improved data efficiency will be necessary to sustain LLM progress beyond this point. – AI-generated abstract.
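The projection amounts to comparing exponentially growing training-set sizes against a fixed stock of text. A minimal back-of-the-envelope sketch of that comparison follows; the 300-trillion-token stock comes from the abstract, while the current dataset size, growth factor, and reference year are purely illustrative assumptions, not figures from the paper.

```python
import math

# Illustrative projection (not the paper's model): how long until the
# largest training dataset matches the estimated stock of public text?
stock_tokens = 300e12      # estimated stock of public human-generated text (from the abstract)
current_dataset = 15e12    # assumed size of today's largest training sets (illustrative)
annual_growth = 2.5        # assumed yearly growth factor of training datasets (illustrative)
base_year = 2024           # assumed reference year (illustrative)

# Under exponential growth, solve current_dataset * annual_growth**t = stock_tokens for t.
years = math.log(stock_tokens / current_dataset) / math.log(annual_growth)
print(f"Stock exhausted around {base_year + years:.0f}")  # ~2027 under these assumptions
```

With slower assumed growth or a larger effective stock the crossover shifts later, which is why the paper's range spans several years rather than a single date.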
