How can we prevent AI models from cannibalizing themselves when human-generated data runs out? Scientists say they've found the answer.


Incorporating a human element might be the solution to prevent AI model collapse, according to recent investigations. (Image credit: Getty Images)

As artificial intelligence (AI) systems continue to advance, concern is growing that large language models (LLMs) will soon exhaust the supply of human-generated data available for training.

Following this depletion, experts suggest, AI models will increasingly depend on synthetic, AI-created information, leading to a phenomenon known as “model collapse.” This scenario results in LLMs producing nonsensical output, and the AI systems they power delivering erroneous answers and fabricating information with greater frequency than currently observed.

“This is particularly concerning given that some specialists estimate we may run out of high-quality human-generated data by year’s end, so if you are relying on this synthetic data, but there’s an almost existential risk that it will doom your AI, you’re in a difficult position,” Yasser Roudi, a professor of disordered systems in the Department of Mathematics at King’s College London (KCL), told Live Science. “For instance, if LLMs used in medical facilities to interpret brain scans and detect cancers were to experience model collapse during the training of another model, these systems could misdiagnose patients.”

However, Roudi recently discovered that model collapse can be circumvented by introducing a single piece of human-created data into an AI’s training dataset, even when the rest of the data is AI-generated.

The research, which involved collaborators from KCL, the Norwegian University of Science and Technology, and the Abdus Salam International Centre for Theoretical Physics in Italy, was published May 14 in the journal Physical Review Letters.

Although model collapse has not yet been seen in an actively deployed AI system, people who use tools like ChatGPT or Gemini to generate answers or text have very likely encountered inaccuracies or fabrications. Nevertheless, Roudi hopes the new findings point to a way around this emerging threat.

Countering collapse

Beyond the widely recognized hallucinations in early generative AI products, we may not have yet witnessed any striking manifestations of model collapse in the form of sophisticated AIs seemingly “losing their minds” and producing complete gibberish. However, indications of minor collapse might be discernible when AI provides increasingly incorrect or generic responses to inquiries, or completely invents information while attempting to generate output it presumes a user desires.

Through the repeated training of LLMs on data generated by other LLMs, the fundamental truth and the origin of information — along with variations between model generations — become “smoothed out,” resulting in homogenized responses and outputs. For example, text that may appear adequate initially could lack genuine depth or subtlety. Essentially, model collapse can be categorized into ‘early’ and ‘late’ phases, where the former signifies an AI losing its capacity to provide edge-case (rare or less common) information and producing bland, synthetic-feeling replies, while the latter involves LLMs delivering nonsensical data.

The sheer scale of LLMs and the data they process can complicate the determination of how and why they fabricate information, and how specific choices contribute to model collapse.

To address this, the researchers used smaller models belonging to exponential families, a broad class of probability distributions that describe the likely outcomes of random events. The normal distribution is one example; the Bernoulli distribution, which gives the probability of a coin landing on heads, is another.

“By examining analytically manageable models like the exponential families, one can address those ‘why’ and ‘how’ questions,” Roudi said. “Following the same logic, one can devise methods to mitigate its detrimental effects, understand how those methods function, and ultimately apply them to practical situations.”

The researchers found that adding a single external, human-made data point to the synthetic data used in closed-loop training, where a new model is trained on data produced by previous models, averted model collapse.
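As a rough illustration of closed-loop training, the Python sketch below repeatedly refits a normal distribution to its own samples by maximum likelihood. It is not the researchers' model or code: the function name `closed_loop`, the sample size, the number of generations and the choice to draw one fresh real data point in every generation are all assumptions made for illustration.

```python
import random
import statistics


def closed_loop(generations, n, real_points_per_gen=0, seed=0):
    """Refit a normal distribution to its own samples, one generation at a time.

    Returns the fitted standard deviation after each generation.
    All parameter names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    true_mu, true_sigma = 0.0, 1.0   # assumed "human" data distribution
    mu, sigma = true_mu, true_sigma  # the first model matches the real data
    history = []
    for _ in range(generations):
        # Synthetic data sampled from the current model.
        synthetic = [rng.gauss(mu, sigma)
                     for _ in range(n - real_points_per_gen)]
        # Assumption: fresh real points are drawn in every generation.
        real = [rng.gauss(true_mu, true_sigma)
                for _ in range(real_points_per_gen)]
        data = synthetic + real
        # Maximum-likelihood fit: sample mean and population std (divides by n).
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data, mu)
        history.append(sigma)
    return history


collapsed = closed_loop(generations=500, n=10)
anchored = closed_loop(generations=500, n=10, real_points_per_gen=1)
print(f"synthetic only: final std = {collapsed[-1]:.3g}")
print(f"one real point per generation: final std = {anchored[-1]:.3g}")
```

In this toy setup the fitted spread shrinks toward zero when every sample is synthetic, while the run that mixes in a real point each generation keeps a nonzero spread. The sketch cannot establish how closely that matches the mechanism analyzed in the paper.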

Roudi suggested an illustrative example could be an AI-driven image or video classifier, where an LLM is trained on data that includes an authentic image correctly identified by a human, as opposed to AI-generated media or media classified by an AI.
