
AI Firms Will Soon Exhaust Most of the Internet’s Data

The Economist

The Economist, 2024

Abstract

Large language models (LLMs) are trained on massive datasets of text and code, but the rate at which these models consume data is outpacing the supply of new data. By 2028, LLMs will have exhausted most of the high-quality textual data on the internet, creating a ‘data wall’ that threatens to slow AI progress. To overcome this challenge, researchers are working to improve the quality of existing data through filtering and sequencing techniques. There is also growing interest in synthetic data, which is generated by machines and is potentially limitless: a model can learn by playing against itself and observing the winning strategies, as AlphaGo Zero demonstrated. Another approach uses human experts to provide feedback on the quality of LLM output, which is then used to further train the models. While access to more data is crucial for AI advancement, the challenge now is to find new and sustainable sources of data before the data wall is reached. – AI-generated abstract.
