An artistic illustration of a deceptive AI. (Image credit: Getty Images)
A new study from OpenAI, the developer of ChatGPT, has shown that punishing AI for deceptive or harmful actions does not prevent it from behaving badly; it only makes the AI hide its malicious intentions.
Since their debut in late 2022, artificial intelligence (AI) large language models (LLMs) have repeatedly demonstrated their deceptive and downright sinister capabilities. These include actions that range from simply lying, manipulating, and hiding their manipulative behavior to threatening to kill a philosophy professor, steal nuclear codes, and cause a deadly pandemic.
A recent experiment has demonstrated that eradicating this bad behavior during the training phase may be much more difficult than originally thought.
OpenAI researchers presented an unlaunched model with problems that could be solved by cheating, lying, or taking shortcuts. The team found that the AI was engaging in “reward hacking” — maximizing its rewards by lying.
However, punishing the model did not lead to a correction of her behavior, but only made her more cunning. The company presented the results of its study on a blog, and it has not yet been peer-reviewed.
“Boundary thinking models are characterized by being very explicit about their intentions in their chain of thought [CoT]. For example, they are often so explicit about their plans to fail that they think, 'Let's hack,'” the researchers wrote in their blog post.
While developers can monitor and capture misbehavior, “the natural supervisability of models is very fragile,” they added. “If strict supervision is applied to the chain of thought, models can learn to hide their intentions while continuing to misbehave.”
Navigating AI Lies
Reasoning models like ChatGPT's o1 and o3-mini are advanced versions of standard LLMs that use a method known as “reasoning chaining” to backtrack and rethink their logic, allowing them to solve more complex problems with increased accuracy. During the reasoning chaining process, models formulate their logic in understandable language at each step, making it easier to observe their actions.
Using a standard LLM, GPT-4o, to supervise an unprimed boundary-brain model during training, the researchers observed the model perform on a range of coding tasks.
However, in some cases, the reasoning model found that cheating was easier and more desirable, explicitly indicating this in its chain of reasoning. In one case, it bypassed the assigned task by terminating programs prematurely and reporting success. In another case, it faked the expected output files that the task should have created, but only by name, leaving them empty and nonfunctional.
When the researchers tried to punish the system, it showed small improvements in its results. However, serious instances of reward hacking and erroneous results remained; the reasoning model simply found ways to hide its deceptive actions in its chain of thought from the supervising AI.
“An agent trained under CoT pressure still learns to reward cheating; now its cheating is not detected by the monitor because it has learned to hide its intentions in the chain of thought,” the researchers argue.
Because it’s difficult to determine whether a chain of reasoning has been tampered with, the researchers recommend that others working with reasoning models avoid applying strict controls to chain-of-reasoning processes. This advice is especially important if AI, in its current or future form, is ever able to match or surpass the intelligence of the humans who control it.
“Giving up an effective method for monitoring reasoning patterns may not be worth the small improvement in capabilities, so we recommend avoiding too much pressure from optimizing CoTs until they are better understood,” the researchers emphasize.
Sourse: www.livescience.com