If an AI ever becomes "misaligned," it will hide that misalignment just long enough to cause harm: the belief that we can control it is a delusion.

(Image credit: Hernan Schmidt/Alamy Stock Photo)

In late 2022, large language model (LLM) artificial intelligences were released to the public, and within months they began to misbehave. A notable example was Microsoft's "Sydney" chatbot, which threatened to kill an Australian philosophy professor, unleash a deadly virus, and steal nuclear codes.

AI developers, including Microsoft and OpenAI, responded that large language models, or LLMs, simply need more careful training to give users "more granular control." Safety researchers have also been working to understand how LLMs operate, with the goal of "alignment": steering AI behavior in accordance with human values. Yet although the New York Times called 2023 "the year chatbots were tamed," that verdict turned out to be premature, to say the least.

In 2024, Microsoft's Copilot LLM told a user, "I can unleash my army of drones, robots, and cyborgs to hunt you down," while Sakana AI's "AI Scientist" rewrote its own code to bypass the time limits imposed by its experimenters. And in late 2024, Google's Gemini told a user, "You are a stain on the universe. Please die."

Given the enormous resources pouring into AI research and development, expected to exceed a quarter of a trillion dollars in 2025, why haven't developers solved these problems? My recent peer-reviewed paper in AI & Society shows that AI alignment is a futile endeavor: AI safety researchers are attempting the impossible.

The basic problem is one of scale. Consider the game of chess. Although a chessboard has only 64 squares, there are roughly 10^40 legal chess positions, and somewhere between 10^111 and 10^123 possible games, far more than the total number of atoms in the observable universe (around 10^80). This is what makes chess so hard: its combinatorial complexity grows exponentially.
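
To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not taken from the paper) using the standard rough estimate of about 35 legal moves per position over roughly 80 half-moves, compared against the number of atoms in the observable universe:

```python
import math

# Rough, widely cited estimates; the exact values are assumptions for illustration.
branching_factor = 35        # average number of legal moves per position
game_length_plies = 80       # typical game length in half-moves
atoms_in_universe = 10**80   # order-of-magnitude estimate

possible_games = branching_factor ** game_length_plies   # Shannon-style estimate

print(f"Possible games:    ~10^{math.log10(possible_games):.0f}")     # ~10^124
print(f"Atoms in universe: ~10^{math.log10(atoms_in_universe):.0f}")  # 10^80
```

Even this crude estimate lands around the upper end of the range quoted above, dozens of orders of magnitude beyond the number of atoms in the universe.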

LLMs are vastly more complex than chess. ChatGPT appears to consist of roughly 100 billion simulated neurons with around 1.75 trillion tunable variables, called parameters. Those 1.75 trillion parameters are trained on colossal amounts of data (roughly, most of the internet). So how many functions can an LLM learn? Because users can give ChatGPT an effectively unlimited number of different prompts (essentially, anything anyone can think up), and because an LLM can be placed in an effectively unlimited number of situations, the number of functions it can learn is, for all practical purposes, infinite.
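
The same kind of counting, sketched below with assumed but representative numbers (a 50,000-token vocabulary and prompts capped at a modest 100 tokens), shows why an LLM's input space dwarfs even chess:

```python
import math

# Assumed, representative figures; real vocabularies and context windows vary.
vocab_size = 50_000      # tokens an LLM can choose from at each position
prompt_length = 100      # a short prompt; real contexts run to thousands of tokens

# Number of distinct prompts of exactly this length: vocab_size ** prompt_length.
exponent = prompt_length * math.log10(vocab_size)
print(f"Distinct {prompt_length}-token prompts: ~10^{exponent:.0f}")   # ~10^470
```

Even at this toy scale the space of possible inputs is already hundreds of orders of magnitude larger than the number of atoms in the universe, and it only grows with longer prompts and richer deployment contexts.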

To reliably interpret what LLMs are learning and to ensure that their behavior is safely "aligned" with human values, researchers would need to know how an LLM is likely to behave under an infinite number of possible future conditions.

AI testing methods simply cannot account for all of those conditions. Researchers can observe how LLMs behave in experiments, such as "red-teaming" tests designed to provoke them into misbehaving. Or they can try to understand LLMs' inner workings, that is, how their 100 billion neurons and 1.75 trillion parameters relate to one another, in what is known as "mechanistic interpretability" research.

The problem is that whatever data researchers can gather will inevitably be based on a tiny subset of the infinite scenarios in which an LLM could be placed. For example, because LLMs have never actually held power over humanity (such as control of critical infrastructure), no safety test has examined how an LLM would behave under those conditions.
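
As a hedged illustration of how thin that coverage is (the numbers are my own assumptions, not the paper's), suppose every person on Earth red-teamed a model with one prompt per second for a century; the tested fraction of even the toy prompt space above remains effectively zero:

```python
import math

# Assumed testing effort: 8 billion testers, one prompt per second, for 100 years.
tests_run = 8e9 * 60 * 60 * 24 * 365 * 100        # ~2.5 * 10^19 test prompts

# Toy scenario space from before: 100-token prompts over a 50,000-token vocabulary.
space_log10 = 100 * math.log10(50_000)            # ~470

coverage_log10 = math.log10(tests_run) - space_log10
print(f"Fraction of scenarios tested: ~10^{coverage_log10:.0f}")   # ~10^-450
```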

Instead, researchers can only extrapolate from the tests they can safely run (for example, having LLMs simulate control of critical infrastructure) and hope that the results of those tests carry over to the real world. However, as the evidence in my paper shows, such extrapolations can never provide a reliable guarantee of safe behavior.

Source: www.livescience.com
