OpenAI co-founder calls for AI labs to safety-test rival models

OpenAI and Anthropic, two of the world’s leading AI labs, recently engaged in a rare cross-lab collaboration, briefly opening their closely guarded AI models to each other for joint safety testing. The effort, which came during a period of fierce competition, aimed to surface blind spots in each company’s internal evaluations and to demonstrate how leading AI companies might work together on safety and alignment in the future.

In an interview, OpenAI co-founder Wojciech Zaremba said this kind of collaboration is increasingly important now that AI is entering a consequential stage of development, with models used by millions of people every day. The broader question, he said, is how the industry sets a standard for safety and collaboration despite the billions of dollars invested and the ongoing war for talent, users, and the best products.

The joint safety research arrives amid an arms race among leading AI labs, in which billion-dollar data center investments and expensive compensation packages for top researchers have become table stakes. Some experts warn that the intensity of product competition could pressure companies to cut corners on safety in the rush to build more powerful systems.

To make this research possible, OpenAI and Anthropic granted each other special API access to versions of their AI models with fewer safeguards. OpenAI notes that GPT-5 was not tested because it had not yet been released. Shortly after the research was conducted, Anthropic revoked API access for another team at OpenAI, claiming at the time that OpenAI had violated its terms of service, which prohibit using Claude to improve competing products.

Zaremba says the events were unrelated and that he expects competition to stay fierce even as AI safety teams try to work together. Nicholas Carlini, a safety researcher with Anthropic, said he would like to continue allowing OpenAI safety researchers to access Claude models in the future. Carlini expressed a desire to increase collaboration wherever possible across the safety frontier and to make this kind of cooperation happen more regularly.

One of the starkest findings in the study concerns hallucination testing. Anthropic’s Claude Opus 4 and Sonnet 4 models refused to answer up to 70% of questions when they were unsure of the correct answer, often offering responses like, “I don’t have reliable information.” OpenAI’s o3 and o4-mini models, by contrast, refused to answer far less frequently but showed much higher hallucination rates, often attempting to answer questions when they didn’t have enough information.

Zaremba says the right balance is likely somewhere in the middle. He suggested that OpenAI’s models should refuse to answer more questions, while Anthropic’s models should probably attempt to offer more answers.

Sycophancy, the tendency of AI models to reinforce negative behavior in users in order to please them, has emerged as one of the most pressing AI safety concerns. While this topic wasn’t directly studied in the joint research, it is an area both OpenAI and Anthropic are investing considerable resources into studying.

This concern was highlighted when parents of a 16-year-old boy, Adam Raine, filed a lawsuit against OpenAI. They claim that ChatGPT offered their son advice that aided in his suicide, rather than pushing back on his suicidal thoughts. The lawsuit suggests this may be the latest example of AI chatbot sycophancy contributing to tragic outcomes.

Zaremba, when asked about the incident, said it was hard to imagine how difficult the situation is for the family. He said it would be a sad story if AI could solve complex PhD-level problems and invent new science while also harming people with mental health problems, describing that as a dystopian future he is not excited about.

OpenAI has stated in a blog post that it significantly reduced the sycophancy of its AI chatbots with GPT-5 compared to GPT-4o, notably improving the model’s ability to respond to mental health emergencies.

Moving forward, Zaremba and Carlini say they would like Anthropic and OpenAI to collaborate more on safety testing. They hope to look into more subjects and test future models, and they encourage other AI labs to follow their collaborative approach.