OpenAI launches GPT-5.4 with Pro and Thinking versions

On Thursday, OpenAI released GPT-5.4, a new foundation model described as its most capable and efficient frontier model for professional work. The model ships in a standard version alongside two variants: a specialized reasoning model called GPT-5.4 Thinking and a high-performance version called GPT-5.4 Pro.

The API for GPT-5.4 will support context windows as large as one million tokens, by far the largest window OpenAI has offered. The company also highlighted the model’s improved token efficiency, noting it can solve the same problems using significantly fewer tokens than its predecessor.

GPT-5.4 posts significantly improved benchmark results, setting record scores on the computer-use benchmarks OSWorld-Verified and WebArena Verified. It also scored a record 83 percent on OpenAI’s internal GDPval test for knowledge work tasks.

According to a statement from Mercor CEO Brendan Foody, GPT-5.4 also leads on the Mercor APEX-Agents benchmark, which is designed to test professional skills in law and finance. Foody stated that the model excels at creating long-horizon deliverables like slide decks, financial models, and legal analysis, delivering top performance while running faster and at a lower cost than other frontier models.

GPT-5.4 continues OpenAI’s work to limit hallucinations and factual errors. The company says the new model is 33 percent less likely to make errors in individual claims compared to GPT-5.2, and overall responses are 18 percent less likely to contain errors.

As part of the launch, OpenAI has reworked how the API version manages tool calling with a new system called Tool Search. Previously, system prompts had to define all available tools at once, consuming many tokens. The new system allows models to look up tool definitions only as needed, making requests faster and cheaper in systems with many tools.
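The pattern the article describes can be illustrated with a minimal sketch. Everything here is hypothetical: the `Tool`, `ToolRegistry`, and `search` names are illustrative stand-ins, not OpenAI's actual Tool Search API, and the keyword match stands in for whatever retrieval the real system uses. The point is that only the matching tool definitions are serialized into the request, rather than all of them up front.

```python
# Hypothetical sketch of lazy tool lookup; names are illustrative,
# not part of any real OpenAI API.
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str
    schema: dict  # JSON schema for the tool's arguments


class ToolRegistry:
    """Holds many tool definitions; only matches get sent to the model."""

    def __init__(self, tools):
        self._tools = {t.name: t for t in tools}

    def search(self, query: str, limit: int = 3):
        # Naive keyword match as a placeholder for real retrieval.
        q = query.lower()
        hits = [
            t for t in self._tools.values()
            if q in t.name.lower() or q in t.description.lower()
        ]
        return hits[:limit]


registry = ToolRegistry([
    Tool("get_weather", "Fetch the current weather for a city", {"type": "object"}),
    Tool("send_email", "Send an email to a recipient", {"type": "object"}),
    Tool("query_db", "Run a read-only database query", {"type": "object"}),
])

# Instead of placing all three definitions in the system prompt, the
# model asks for tools matching "email" and receives one definition.
matches = registry.search("email")
print([t.name for t in matches])  # ['send_email']
```

With hundreds of tools, sending only the handful of matching definitions per request is what makes calls faster and cheaper, as the article notes.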

OpenAI has also introduced a new safety evaluation for its models’ chain-of-thought, the running commentary models produce during multi-step tasks. AI safety researchers have expressed concern that reasoning models could misrepresent their chain-of-thought. OpenAI’s new evaluation indicates that deception is less likely in the Thinking version of GPT-5.4, suggesting the model does not hide its reasoning and that chain-of-thought monitoring remains an effective safety tool.