works
Pablo Villalobos et al. Will we run out of data? Limits of LLM scaling based on human-generated data. Online.

Abstract

The stock of publicly available, human-generated text data is estimated to be approximately 300 trillion tokens. This data, crucial for training large language models (LLMs), is projected to be fully utilized between 2026 and 2032, assuming current trends in dataset growth and model scaling continue. The exact timeline depends on factors such as the degree of model overtraining and the rate of dataset growth. Undertraining models (using less data per model than compute-optimal scaling would dictate) could extend the usable lifetime of the data stock somewhat, but ultimately novel approaches like synthetic data generation, cross-modal learning, and improved data efficiency will be necessary to sustain LLM progress beyond this point. – AI-generated abstract.
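The projection amounts to comparing exponentially growing training-set sizes against a fixed stock of text. A minimal back-of-the-envelope sketch of that comparison follows; the 300-trillion-token stock comes from the abstract, while the current dataset size, growth factor, and reference year are purely illustrative assumptions, not figures from the paper.

```python
import math

# Illustrative projection (not the paper's model): how long until the
# largest training dataset matches the estimated stock of public text?
stock_tokens = 300e12      # estimated stock of public human-generated text (from the abstract)
current_dataset = 15e12    # assumed size of today's largest training sets (illustrative)
annual_growth = 2.5        # assumed yearly growth factor of training datasets (illustrative)
base_year = 2024           # assumed reference year (illustrative)

# Under exponential growth, solve current_dataset * annual_growth**t = stock_tokens for t.
years = math.log(stock_tokens / current_dataset) / math.log(annual_growth)
print(f"Stock exhausted around {base_year + years:.0f}")  # ~2027 under these assumptions
```

With slower assumed growth or a larger effective stock the crossover shifts later, which is why the paper's range spans several years rather than a single date.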
