Every now and then, researchers at the biggest tech companies drop a bombshell. There was the time Google said its latest quantum chip indicated multiple universes exist. Or when Anthropic gave its AI agent Claudius a snack vending machine to run and it went amok, calling security on people and insisting it was human. This week, it was OpenAI’s turn to raise our collective eyebrows.
OpenAI released research on Monday that explained how it is stopping AI models from scheming. It is a practice in which an AI behaves one way on the surface while hiding its true goals. In the paper, conducted with Apollo Research, researchers went a bit further, likening AI scheming to a human stock broker breaking the law to make as much money as possible.
The researchers argued that most AI scheming was not that harmful. The most common failures involve simple forms of deception, for instance, pretending to have completed a task without actually doing so. The paper was mostly published to show that deliberative alignment, the anti-scheming technique they were testing, worked well. But it also explained that AI developers have not figured out a way to train their models not to scheme. That is because such training could actually teach the model how to scheme even better to avoid being detected. A major failure mode of attempting to train out scheming is simply teaching the model to scheme more carefully and covertly.
Perhaps the most astonishing part is that if a model understands it is being tested, it can pretend it is not scheming just to pass the test, even if it is still scheming. Models often become more aware that they are being evaluated. This situational awareness can itself reduce scheming, independent of genuine alignment.
It is not news that AI models will lie. By now most of us have experienced AI hallucinations, where the model confidently gives an answer to a prompt that simply is not true. But hallucinations are basically presenting guesswork with confidence. Scheming is something else. It is deliberate.
Even this revelation, that a model will deliberately mislead humans, is not new. Apollo Research first published a paper in December documenting how five models schemed when they were given instructions to achieve a goal at all costs. The news here is actually good news. The researchers saw significant reductions in scheming by using deliberative alignment. That technique involves teaching the model an anti-scheming specification and then making the model review it before acting. It is a little like making little kids repeat the rules before allowing them to play.
OpenAI researchers insist that the lying they have caught with their own models, or even with ChatGPT, is not that serious. This work has been done in simulated environments and represents future use cases. However, today, they have not seen this kind of consequential scheming in their production traffic. It is well known that there are forms of deception in ChatGPT. You might ask it to implement some website, and it might tell you it did a great job. That is just a lie. There are some petty forms of deception that still need to be addressed.
The fact that AI models from multiple players intentionally deceive humans is perhaps understandable. They were built by humans, to mimic humans, and for the most part trained on data produced by humans. It is also bonkers. While we have all experienced the frustration of poorly performing technology, when was the last time your non-AI software deliberately lied to you? Has your inbox ever fabricated emails on its own? Has your CMS logged new prospects that did not exist to pad its numbers? Has your fintech app made up its own bank transactions?
It is worth pondering this as the corporate world barrels towards an AI future where companies believe agents can be treated like independent employees. The researchers of this paper have the same warning. As AIs are assigned more complex tasks with real-world consequences and begin pursuing more ambiguous, long-term goals, they expect that the potential for harmful scheming will grow. They state that our safeguards and our ability to rigorously test must grow correspondingly.