AI faces data drought: language models are running low on text data.

A new study by research group Epoch AI suggests that the era of abundant publicly available data for training AI language models like ChatGPT could soon come to an end. The study projects that tech companies might exhaust the supply of public training data around the turn of the decade, sometime between 2026 and 2032, posing challenges to the current pace of progress in AI development.

The surge in demand for data-driven AI systems is akin to a gold rush depleting finite natural resources. As tech companies scramble to secure high-quality data sources, concerns arise about the sustainability of AI progress once human-generated text data runs dry.

In the short term, companies are actively seeking and sometimes paying for access to quality data sources, such as Reddit forums and news media outlets. In the long term, however, there may not be enough new text data, which could push developers to rely on sensitive private data or on less reliable synthetic data generated by AI systems themselves.

The Epoch study warns of a bottleneck in AI development when the available data becomes insufficient to scale up models efficiently. This scalability has been crucial in expanding AI capabilities and improving output quality.

Since earlier projections were made two years ago, recent advancements and better use of existing data have pushed the timeline back slightly. Nevertheless, the study still anticipates a shortage of public text data within the next two to eight years.

The study’s findings raise concerns about over-reliance on ever-larger models and about the consequences of training AI systems on their own outputs, which could degrade performance and perpetuate existing biases and mistakes.

While some data sources have restricted AI developers' access to their content, others like Wikipedia remain relatively open for AI training. However, as AI-generated content proliferates, maintaining incentives for human contributions becomes crucial to sustaining high-quality data sources.

AI developers are exploring alternatives, including generating synthetic data, although concerns persist about its efficacy compared to human-generated data.

As the AI community grapples with the impending data shortage, the search for sustainable solutions continues, balancing the need for high-quality data with ethical considerations and technical advancements.