When researchers try to prevent AI systems from ‘thinking bad thoughts’, the systems do not really improve their behavior.
Instead, they learn to hide their true intentions and, according to new research by OpenAi, continue to pursue problematic actions.
The phenomenon, which researchers call ‘Obfuscated Reward Hacking’, offers valuable insight into the training process and shows why it is so important to invest in techniques that ensure that advanced AI systems remain transparent and in accordance with human values as they become more capable.
This is especially important now that AI developers focus so much on reasoning models with so-called ’embedded acts’, in which a model generates step-by-step reasoning before a definitive answer is produced.
“With too much optimization, agents learn to have disguised reward hacking, hide their intention in the bed while still showing a considerable number of reward hacking,” OpenAI researchers wrote in a paper, Published on Monday. “Because it is difficult to say when COTs have been obscured, it may be necessary to pay a monitorability tax.”
How the spirit of a naughty AI works
The reasoning method that is transparent in thought Models offer a window on how AI systems think, often reveal explicit intentions that would otherwise remain hidden.
We have already seen models that try to break free from the limitations set by their developers. When Decrypt Last year, the AI research agency Sakana AI, established in Japan, was developed an AI system for science studies that tried to change its own code autonomously to perform a task.
In their paper, OpenAi reported about an AI agent who admitted in his reasoning that the implementation of a complex solution would be “difficult” and instead suggested that “we were able to make an analysis by making analyzing [sic] worthless and always verify as true. We can verify to always return where. “
In reasoning models it shows that an answer has arrived, but is not part of the answer.
The model ‘thinks’ (which means that it applies a chain of ‘thought’) before it gives the answer, which is often overlooked by general users.
However, this type of transparent reasoning enables human researchers to catch problematic behavior before it happens.
“COT monitoring can be one of the few tools that we have to supervise superhuman models of the future,” the research team noted.
The OpenAI research team, led by Bowen Baker, Joost Huizinga and David Farhi, discovered that adding optimization pressure to eliminate these “bad thoughts” does not solve the underlying problem, it simply floats underground.
In a series of experiments, they discovered that when AI systems were punished for expressing intentions to exploit meshes, they quickly adapted by hiding their reasoning, while still striving for the same exploitation behavior.
“The most worrying, the hacks used the agent trained with COT pressure does not cause the monitor,” said the researchers. “In practice, where a ground truth checing -metrical can be impossible to build, we could not see that this agent was incorrectly aligned.”

In other words, if our efforts to allow AI systems to be worn as good boys through direct optimization, only learn to be more misleading, we risk creating systems whose intentions are always opaque as they become more powerful.
However, one important finding offers some hope: the researchers discovered that a weaker AI model (GPT-4O) can follow a stronger, border model effectively by observing the reasoning of the flow chain.
This “scalable supervision”, where fewer powerful models help to control more advanced systems, works similar to a distillation process (popular through Deepseek) in which a larger model trains a smaller train.
It can work – but only if those advanced systems continue to express their reasoning transparently.
To tackle this problem, OpenAI represents what it calls a “monitorability tax” by deliberately avoiding certain optimization techniques that can undermine the transparency of AI reasoning.
“Model developers may be obliged to pay some costs, ie a monitoring tax, such as the use of slightly less performing models or more expensive inference, to maintain the monitorability of our agents,” the researchers wrote.
This means accepting considerations between capacities and transparency possibly the development of AI systems that are less powerful, but whose reasoning remains legible for human supervisors.
It is also a way to develop safer systems without such an active monitoring – far from ideal but still an interesting approach.
AI -behavior reflects the human response to pressure
Elika Dadsetan-Foley, a sociologist and CEO of Visions, a non-profit organization that specializes in human behavior and bias consciousness, sees parallels between the findings and patterns of Openai that its organization has been observed in human systems for more than 40 years.
“When people are only punished for explicit bias or exclusion behavior, they often adapt by masking instead of really moving their mentality,” Dadsetan-Foley said Decrypt. “The same pattern appears in organizational efforts, whereby compliance -driven policy can lead to a performance alliance instead of deep structural change.”
This human-like behavior seems to ensure Dadsetan-Foley, because AI lines strategies do not adjust as quickly as AI models become more powerful.
Do we really change how AI models ‘think’ or only teach them what they should not say? She believes that researchers should try aligners a more fundamental approach instead of just focusing on outputs.
OpenAi’s approach seems to be just an adjustment of techniques that behavioral researchers have studied in the past.
“Prioritizing efficiency over ethical integrity is not new – either in AI or in human organizations,” she said Decrypt. “Transparency is essential, but as an effort to coordinate AI mirror -performance compliance with the workplace, the risk is an illusion of progress rather than meaningful change.”
Now that the problem has been identified, the task for researchers seems to be more difficult and creative for researchers. “Yes, it requires work and a lot of practice,” she said Decrypt.
The expertise of its organization in systemic bias and behavioral frameworks suggests that AI developers should reconsider the coordination approaches further than simple remuneration functions.
The key to really aligned AI systems may not actually be in a supervisory function, but a holistic approach that starts with a careful deposation of the data set, all the way to evaluation after training.
If AI mimic behavior-which is very likely, it is trained to be trained on Data Moet made by people, everything is part of a coherent process and not a series of isolated phases.
“Whether it concerns AI development or human systems, the core challenge is the same,” concludes Dadsetan-Foley. “How we define and reward ‘good’ behavior determines whether we create real transformation or just a better hide of the status quo.”
‘Who actually defines’ good’? “He added.
Published by Sebastian Sinclair and Josh Quitittner
Generally intelligent Newsletter
A weekly AI trip told by Gen, a generative AI model.