The Self-Consumption Dilemma of AI: A Statistical Look at the Risks of Recursive Training
Artificial intelligence (AI) has come a long way in a remarkably short time. Large Language Models (LLMs), for instance, have ingested countless books, articles, and websites to learn the patterns of language and generate human-like responses. But a new challenge is emerging: once AI systems have “read” most of what the internet currently offers, they risk entering a phase of self-consumption, training on the content they themselves produce. Each pass through this loop can degrade the quality and truthfulness of outputs. Below, we explore this phenomenon, backed by relevant data and statistics.
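To make the loop concrete, here is a minimal, purely illustrative Python sketch (a toy statistical analogy, not any real training pipeline): a simple “model” repeatedly fits a Gaussian to its data and then generates the next generation’s training data from itself. The sample sizes and distribution are arbitrary assumptions chosen for illustration; the point is that small estimation errors compound across generations, so the spread of the data drifts and its tails are gradually lost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution (standard normal).
data = rng.normal(loc=0.0, scale=1.0, size=200)

n_generations = 20
for gen in range(1, n_generations + 1):
    # "Train" the toy model: estimate the mean and spread of the current data.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on synthetic samples from that fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=200)
    if gen % 5 == 0:
        print(f"generation {gen:2d}: fitted mean={mu:+.3f}, fitted std={sigma:.3f}")
```

Running this for many generations shows the fitted spread wandering away from the original value and the distribution losing diversity, a toy analogue of the degradation described above.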
1. The Rise of Large Language Models
According to OpenAI’s GPT-3 paper, the model’s combined training corpus spanned roughly 499 billion tokens, drawn from filtered Common Crawl, curated web text, books, and English Wikipedia. Training sets of this scale have enabled LLMs to craft text so convincing that it can be challenging to distinguish AI-generated responses from those written by humans.
Fact: Over 80% of AI researchers surveyed in a 2022 Stanford study believe the size of training datasets will keep expanding, but at some point, new and diverse data may become harder to obtain.