AI coding tools are improving at a remarkable pace. For those not working directly with code, the scale of this change can be difficult to appreciate. However, recent advancements like GPT-5 and Gemini 2.5 have unlocked a new range of automated developer capabilities, and the latest Sonnet 4.5 release continues this trend. In contrast, progress in other areas is much slower. If you use AI to write emails, you are likely getting the same utility from it today as you did a year ago. Even when the underlying model improves, the end product does not always get better, especially when that product is a general-purpose chatbot performing many different tasks simultaneously. Artificial intelligence is still advancing, but the progress is no longer evenly distributed across all applications.
The reason for this uneven progress is quite straightforward. Coding output can be graded automatically: the code either compiles and passes its tests or it does not, which gives coding applications billions of easily measurable signals for training an AI to produce functional code. This training method is called reinforcement learning, which has arguably been the most significant driver of AI progress over the last six months and is becoming more sophisticated all the time. You can conduct reinforcement learning with human graders, but it is most effective when there is a clear pass-or-fail metric, because the process can then be repeated billions of times without requiring constant human intervention.
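To make that pass-or-fail idea concrete, here is a minimal sketch of what reinforcement learning with an automatic grader can look like. The generator, grader, and task below are toy stand-ins invented for illustration, not any real training system, and the loop is deliberately simplified to show only the grading signal.

```python
import random

def run_tests(candidate_code: str) -> bool:
    """Hypothetical automated grader. A real system would compile the code
    and run a test suite; here a string check stands in for that."""
    return "return a + b" in candidate_code

def generate_candidate(prompt: str) -> str:
    """Stand-in for a model sampling a solution to the prompt."""
    options = [
        "def add(a, b): return a + b",   # correct solution
        "def add(a, b): return a - b",   # buggy solution
    ]
    return random.choice(options)

def reinforcement_step(prompt: str) -> float:
    """One simplified RL step: sample a solution, grade it automatically,
    and turn the pass/fail outcome into a reward. A real trainer would use
    this reward to update the model's weights."""
    candidate = generate_candidate(prompt)
    return 1.0 if run_tests(candidate) else 0.0

# Because grading needs no human, the loop can be repeated at massive scale.
total = sum(reinforcement_step("write an add function") for _ in range(1000))
print(f"pass rate: {total / 1000:.2f}")
```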
As the industry relies more heavily on reinforcement learning to enhance its products, a clear division is emerging between capabilities that can be automatically graded and those that cannot. Skills that are friendly to reinforcement learning, such as fixing software bugs or solving competitive math problems, are improving rapidly. Meanwhile, skills like writing emails or holding a chatbot conversation show only incremental progress. This creates a reinforcement gap, and it is becoming one of the most important factors determining what AI systems can and cannot accomplish.
Software development is ideally suited for reinforcement learning. Long before modern AI, an entire sub-discipline existed to test how software would perform under pressure, as developers needed to ensure their code was stable before deployment. Even the most elegant code must pass through unit testing, integration testing, and security testing. Human developers use these tests routinely to validate their work, and they are just as useful for validating code generated by an AI. More importantly, these tests are perfect for reinforcement learning because they are already systematized and can be repeated on a massive scale.
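To see why existing tests translate so directly into training signal, consider how an ordinary unit test behaves: it passes or fails with no human in the loop, regardless of who or what wrote the code under test. The add function below is a hypothetical example, not code from any particular system.

```python
# A conventional unit test does not care whether the code it checks was
# written by a person or generated by a model; it simply passes or fails,
# which is exactly the signal reinforcement learning needs.

def add(a: int, b: int) -> int:
    return a + b

def test_add_handles_positive_numbers():
    assert add(2, 3) == 5

def test_add_handles_negatives():
    assert add(-2, -3) == -5

if __name__ == "__main__":
    # Normally these would run under a test runner such as pytest; they are
    # called directly here so the example is self-contained.
    test_add_handles_positive_numbers()
    test_add_handles_negatives()
    print("all tests passed")
```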
There is no simple way to validate a well-written email or a good chatbot response. These skills are inherently subjective and much harder to measure at scale. However, not every task fits neatly into an easy-to-test or hard-to-test category. We do not have a ready-made testing kit for quarterly financial reports or actuarial science, but a well-funded accounting startup could likely build one from the ground up. Some testing systems will naturally be more effective than others, and some companies will be smarter in their approach. Ultimately, the testability of the underlying process will be the deciding factor in whether it can become a functional product or remain merely an exciting demonstration.
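As a rough illustration of what a home-grown testing kit might look like, here is a hedged sketch of an automatic consistency check for an AI-drafted quarterly report. The report format and the rules being checked are invented for this example rather than taken from any accounting standard; the point is only that a pass-or-fail verdict can be produced without a human reviewer.

```python
# Hypothetical grader for a structured domain without a ready-made test kit:
# it returns True only if the draft report's own numbers agree with each other.

def grade_report(report: dict) -> bool:
    line_items_sum = sum(report["line_items"].values())
    totals_match = line_items_sum == report["stated_total"]
    balance_holds = report["assets"] == report["liabilities"] + report["equity"]
    return totals_match and balance_holds

draft = {
    "line_items": {"product": 1200, "services": 300},
    "stated_total": 1500,
    "assets": 5000,
    "liabilities": 3000,
    "equity": 2000,
}

# A True/False verdict like this could serve as a pass/fail reward signal.
print(grade_report(draft))
```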
Some processes turn out to be more testable than one might assume. If asked last week, many would have placed AI-generated video in the hard-to-test category. Yet the immense progress shown by OpenAI’s new Sora 2 model suggests it may not be as difficult as it appeared. In Sora 2, objects no longer appear and disappear at random. Faces hold their shape, resembling a specific person rather than just a collection of features. The footage respects the laws of physics in both obvious and subtle ways. It is likely that behind the scenes, a robust reinforcement learning system is responsible for each of these qualities, and together those systems make the difference between photorealism and an entertaining hallucination.
It is important to note that this is not a rigid law of artificial intelligence. It is a consequence of the central role reinforcement learning currently plays in AI development, a role that could easily change as models evolve. However, as long as reinforcement learning remains the primary tool for bringing AI products to market, the reinforcement gap will only widen. This has serious implications for both startups and the broader economy. If a particular process falls on the right side of the reinforcement gap, startups will likely succeed in automating it, and people performing that work today may need to find new careers. The question of which healthcare services can be trained with reinforcement learning, for example, has enormous implications for the structure of our economy over the next two decades. If surprises like Sora 2 are any indication, we may not have to wait long for the answers.

