OpenAI released a new benchmark on Thursday that tests how its AI models perform compared to human professionals across a wide range of industries and jobs. The test, called GDPval, is an early attempt at understanding how close OpenAI’s systems are to outperforming humans at economically valuable work. This is a key part of the company’s founding mission to develop artificial general intelligence, or AGI.
OpenAI found that its GPT-5 model and Anthropic's Claude Opus 4.1 are already approaching the quality of work produced by industry experts. That does not mean OpenAI's models will start replacing humans in their jobs immediately. Despite some predictions that AI will take jobs within just a few years, OpenAI acknowledges that GDPval currently covers only a narrow slice of the tasks people do in their real jobs. Still, it is one of the latest ways the company is measuring AI's progress toward that goal.
GDPval is based on the nine industries that contribute the most to America's gross domestic product, including healthcare, finance, manufacturing, and government. The benchmark tests an AI model's performance in 44 occupations across those industries, ranging from software engineers to nurses to journalists.
For the first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with those produced by other professionals and choose the better one. For example, one prompt asked investment bankers to create a competitor landscape for the last-mile delivery industry and compare it with an AI-generated report. OpenAI then averages an AI model's win rate against the human reports across all 44 occupations.
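To make the scoring concrete, here is a minimal sketch of how such a pairwise win rate might be averaged across occupations. This is an illustration under assumptions, not OpenAI's published grading pipeline; the record format, the decision to count ties as wins, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sketch only: GDPval's actual grading code is not public.
# Each record is (occupation, verdict), where verdict is an expert grader's
# blind judgment of the AI deliverable versus the human one.
comparisons = [
    ("investment_banker", "win"),   # AI report judged better
    ("investment_banker", "tie"),   # judged on par with the expert's
    ("nurse", "loss"),              # human deliverable judged better
    ("journalist", "win"),
]

def win_rate(verdicts):
    """Fraction of comparisons where the AI output was judged better
    than or on par with the industry expert's (assumption: ties count)."""
    favorable = sum(1 for v in verdicts if v in ("win", "tie"))
    return favorable / len(verdicts)

# Compute a per-occupation win rate first, then average across the
# occupations so heavily sampled jobs don't dominate the overall score.
by_occupation = defaultdict(list)
for occupation, verdict in comparisons:
    by_occupation[occupation].append(verdict)

per_occupation = {occ: win_rate(v) for occ, v in by_occupation.items()}
overall = sum(per_occupation.values()) / len(per_occupation)
print(f"Per-occupation win rates: {per_occupation}")
print(f"Averaged win rate: {overall:.1%}")
```

Averaging per occupation rather than over the raw pool of comparisons is one plausible reading of "averages across all 44 occupations"; a straight pooled average would weight occupations by how many tasks each contributed.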
GPT-5-high, a version of GPT-5 that uses extra computational power, was ranked as better than or on par with industry experts 40.6 percent of the time. OpenAI also tested Anthropic's Claude Opus 4.1 model, which was ranked as better than or equal to the experts in 49 percent of tasks. OpenAI suggests Claude scored so high partly because of its tendency to produce pleasing graphics, rather than sheer performance alone.
It is important to note that most professionals do much more than submit research reports, which is all that GDPval-v0 tests for. OpenAI acknowledges this and says it plans to create more robust tests in the future that can account for more industries and interactive workflows. Nonetheless, the company sees the progress on GDPval as notable.
In an interview, OpenAI's chief economist, Dr. Aaron Chatterji, said the results suggest that because the models are getting good at some of these tasks, people in these jobs can now offload parts of their work to AI and focus on potentially higher-value activities.
OpenAI’s evaluations lead, Tejal Patwardhan, said she is encouraged by the rate of progress on GDPval. The GPT-4o model, released roughly 15 months ago, scored just 13.7 percent. Now GPT-5 scores nearly triple that, a trend Patwardhan expects to continue.
Silicon Valley uses a wide range of benchmarks to measure the progress of AI models and assess whether a model is state-of-the-art. Popular tests include AIME 2025 for competitive math problems and GPQA Diamond for PhD-level science questions. However, several AI models are nearing saturation on some of these benchmarks, and many researchers have cited the need for better tests that measure proficiency on real-world tasks.
Benchmarks like GDPval could become increasingly important in that conversation as OpenAI makes the case that its AI models are valuable for a wide range of industries. But OpenAI may need a more comprehensive version of the test to definitively say its AI models can outperform humans.