The reinforcement gap — or why some AI skills improve faster than others

AI coding tools are improving at a remarkable pace. For those not working directly with code, the scale of the change can be hard to appreciate: recent advances in models like GPT-5 and Gemini 2.5 have unlocked a new set of possibilities for automating developer tasks, and the latest Sonnet 4.5 continues that rapid progress.

In contrast, progress in other AI applications is moving more slowly. If you use AI to write emails, you are probably getting about the same value from it today as you did a year ago. Even when the underlying model improves, the end product does not always show a clear benefit, especially for multipurpose chatbots juggling a dozen different jobs at once. AI is still advancing, but the progress is no longer as evenly distributed as it once was.

The reason for this difference in progress is simpler than it appears. Coding applications can draw on billions of easily measurable tests, which can be used to train a model to produce functional code. The training method is reinforcement learning, arguably the biggest driver of AI progress over the past six months, and its techniques are growing more sophisticated all the time. Reinforcement learning can be done with human graders, but it works best when there is a clear pass-fail metric, because that lets the process be repeated billions of times without constant human input.
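To make the distinction concrete, here is a toy sketch in Python. The function names are illustrative rather than drawn from any real training stack; the point is that one task has a cheap, objective pass-fail check while the other does not.

```python
# Toy contrast between a task that can be graded automatically and one that
# cannot. Illustrative only; not taken from any real RL training system.

def grade_math_answer(answer: str, expected: str) -> float:
    """Pass-fail reward: exact match against a known solution."""
    return 1.0 if answer.strip() == expected else 0.0

def grade_email(draft: str) -> float:
    """No ground truth exists to compare against; a human (or a model
    acting as a judge) has to score tone, clarity, and intent."""
    raise NotImplementedError("no automatic pass-fail metric")

# The first grader is cheap enough to run billions of times inside a
# training loop; the second is the bottleneck described above.
print(grade_math_answer("42", expected="42"))  # 1.0
```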

As the industry relies more on reinforcement learning to improve products, a real divergence is emerging between capabilities that can be automatically graded and those that cannot. Skills that are friendly to reinforcement learning, such as bug-fixing and competitive math, are improving quickly. Meanwhile, skills like writing are making only incremental progress. This creates a reinforcement gap, and it is becoming one of the most important factors determining what AI systems can and cannot accomplish.

Software development is the perfect subject for reinforcement learning. Even before modern AI, an entire sub-discipline was devoted to testing how software holds up under pressure. Developers have always needed to ensure their code would not break before deployment, so even the most elegant code must pass through unit testing, integration testing, and security testing. Human developers use these tests routinely to validate their work, and the same tests are just as useful for validating AI-generated code. More importantly, they are ideal for reinforcement learning because they are already systematized and repeatable at massive scale.
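As a rough illustration, here is a minimal Python sketch under the simplest possible assumptions: the source string stands in for code a model has generated, and an ordinary unit-test suite becomes the grader, with the pass rate serving as the reward.

```python
# Minimal sketch: a unit-test suite doubling as an automatic grader for
# model-generated code. The candidate source below stands in for a model's
# output; everything else is ordinary Python.

candidate_source = """
def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2
"""

unit_tests = [
    (([1, 3, 2],), 2),       # odd-length list
    (([4, 1, 3, 2],), 2.5),  # even-length list
    (([7],), 7),             # single element
]

def grade(source: str) -> float:
    """Run the candidate against the suite; return the pass rate as the reward."""
    namespace = {}
    try:
        exec(source, namespace)       # load the generated function
        func = namespace["median"]
    except Exception:
        return 0.0                    # code that fails to load earns nothing
    passed = 0
    for args, expected in unit_tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                      # a crash counts as a failed test
    return passed / len(unit_tests)

print(grade(candidate_source))        # 1.0: a clean pass, ready to reinforce
```

Production systems would sandbox the execution and run far larger suites, but the principle is the same: because the grading is automatic, it can be repeated at whatever scale the training run demands.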

There is no easy way to validate a well-written email or a good chatbot response. These skills are inherently subjective and much harder to measure at scale. However, not every task falls neatly into an easy-to-test or hard-to-test category. We do not have a ready-made testing kit for quarterly financial reports or actuarial science, but a well-funded accounting startup could likely build one from scratch. Some testing methods will work better than others, and some companies will be smarter in their approach. Ultimately, the testability of the underlying process will be the deciding factor in whether it becomes a functional product or remains just an exciting demo.

Some processes turn out to be more testable than you might think. A week ago, many would have placed AI-generated video in the hard-to-test category. Yet the immense progress of OpenAI’s new Sora 2 model shows it may not be as difficult as it seemed. In Sora 2, objects no longer appear and disappear randomly. Faces maintain their shape, resembling a specific person rather than just a collection of features. The generated footage respects the laws of physics in both obvious and subtle ways. It is likely that behind the scenes, a robust reinforcement learning system is responsible for each of these qualities. Together, these systems create the difference between photorealism and an entertaining hallucination.

This is not a hard-and-fast rule of artificial intelligence. It is a result of the central role reinforcement learning currently plays in AI development, a role that could easily change as models evolve. But as long as reinforcement learning remains the primary tool for bringing AI products to market, the reinforcement gap will only grow wider. This has serious implications for both startups and the broader economy. If a process ends up on the right side of the reinforcement gap, startups will likely succeed in automating it, and anyone doing that work today may need to find a new career. The question of which healthcare services are amenable to reinforcement learning, for instance, has enormous implications for the shape of our economy over the next twenty years. If surprises like Sora 2 are any indication, we may not have to wait long for the answers.