Nearly two years ago, Microsoft CEO Satya Nadella predicted that AI would soon replace knowledge work—the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT professionals, and others. Despite the significant progress made by foundation models, this transformation in knowledge work has been slow to arrive. Models have mastered in-depth research and agentic planning, yet most white-collar work remains relatively unaffected.
This is one of the biggest mysteries in AI today. Thanks to new research from the training-data company Mercor, we are finally getting some answers. The research examines how leading AI models perform actual white-collar work tasks drawn from consulting, investment banking, and law. The result is a new benchmark called APEX-Agents. So far, every AI lab is receiving a failing grade. When faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the models returned a wrong answer or no answer at all.
According to Mercor CEO Brendan Foody, who worked on the paper, the models’ biggest stumbling block was tracking down information across multiple domains. This skill is integral to most knowledge work performed by humans. Foody explained that the benchmark built out an entire environment modeled after real professional services. In real life, professionals operate across tools like Slack and Google Drive, not with all context provided in one place. For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.
The benchmark scenarios were drawn from actual professionals on Mercor’s expert marketplace. These professionals both designed the queries and set the standard for a successful response. Reviewing the questions gives a sense of the tasks’ complexity. One example from the law section asks whether, under a company’s own policies and relevant EU privacy laws, certain data exports during an outage can be treated as consistent with a specific article. The correct answer is yes, but arriving at it requires an in-depth assessment of both corporate policy and legal regulations.
While such a question might stump even a well-informed human, the researchers aimed to model the real work done by professionals. An LLM that could reliably answer these questions could effectively replace many lawyers working today. Foody said this is probably the most important topic in the economy, and that the benchmark closely reflects the real work these professionals do.
OpenAI previously attempted to measure professional skills with its GDPval benchmark. However, the APEX-Agents test differs in important ways. Where GDPval tests general knowledge across a wide range of professions, APEX-Agents measures a system’s ability to perform sustained tasks within a narrow set of high-value professions. This makes it more difficult for models but also more closely tied to the potential for automating these jobs.
While none of the models proved ready to take over as investment bankers, some performed better than others. Gemini 3 Flash performed best with 24% one-shot accuracy, followed closely by GPT-5.2 with 23%. Below that, Opus 4.5, Gemini 3 Pro, and GPT-5 all scored roughly 18%.
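The scores above are one-shot accuracy: each task gets a single attempt, graded against the expert-defined standard, with no retries. As a minimal sketch of how such a metric reduces to arithmetic (the data structure here is illustrative, not the actual APEX-Agents harness):

```python
def one_shot_accuracy(results):
    """Fraction of tasks whose single attempt was graded correct.

    `results` is a list of booleans: True if the model's one attempt
    met the grading standard, False for a wrong answer or no answer.
    (Illustrative structure only; the real harness is more involved.)
    """
    if not results:
        return 0.0
    return sum(results) / len(results)

# Hypothetical numbers for scale: 48 correct out of 200 tasks
print(one_shot_accuracy([True] * 48 + [False] * 152))  # 0.24, i.e. 24%
```

Under this framing, the gap between the leaders and the rest is a handful of additional tasks solved on the first try, which is why small percentage differences on a hard benchmark can still be meaningful.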
Although the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the APEX-Agents test is public, it stands as an open challenge for AI labs that believe they can do better, something Foody fully expects to see in the coming months. Performance, he noted, is improving rapidly: today's AI is like an intern that gets it right a quarter of the time, but last year's was an intern that got it right only five or ten percent of the time. That kind of year-over-year improvement can have an impact very quickly.

