RSS co-creator launches new protocol for AI data licensing

Following Anthropic’s recent $1.5 billion copyright settlement, the AI industry is facing a reckoning over its training data. As many as forty other pending cases seek damages for the use of unlicensed data, including a notable lawsuit against Midjourney over generated images of Superman. Without a functional licensing framework, AI companies could face an avalanche of copyright lawsuits that some analysts fear might permanently set the industry back.

In response, a coalition of technologists and web publishers has launched a new system designed to enable data licensing on a massive scale. This initiative, called Really Simple Licensing (RSL), is already supported by major publishers including Reddit, Quora, and Yahoo. The critical question is whether this momentum will be sufficient to bring major AI labs to the bargaining table.

According to RSL co-founder Eckart Walther, who also co-created the RSS standard, the objective was to build a scalable training-data licensing system for the entire internet. He stated that the internet requires machine-readable licensing agreements, which is the core problem RSL aims to solve.

For years, industry groups have advocated for clearer data collection practices, but RSL represents the first comprehensive attempt to create both the technical and legal infrastructure to make it work. Technically, the RSL Protocol specifies the licensing terms a publisher can set for their content. These terms are included in a website’s “robots.txt” file in a predefined format, making it simple to identify the licensing status of any data.
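
To make the robots.txt mechanism concrete, here is a minimal sketch of how a crawler might look for licensing terms before collecting a page. It assumes the terms are advertised through a `License:` line that points to a separate terms document. That directive name, the example URL, and the function name `find_license_urls` are illustrative assumptions rather than details taken from this article, so the RSL specification should be consulted for the actual format.

```python
def find_license_urls(robots_txt: str) -> list[str]:
    """Return the value of every License line in a robots.txt body.

    Assumption: licensing terms are advertised with a "License:" line,
    in the same "field: value" shape as other robots.txt directives.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        # (Simplification: a "#" inside a URL would also be cut off here.)
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split only on the first colon so the URL's "https://" stays intact.
        field, value = line.split(":", 1)
        if field.strip().lower() == "license" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content used only for illustration.
SAMPLE = """\
User-agent: *
Allow: /
# Hypothetical licensing pointer
License: https://example.com/license.xml
"""

if __name__ == "__main__":
    print(find_license_urls(SAMPLE))
```

A crawler following this pattern would fetch the referenced terms document and decide whether to collect the page. A site with no such line would simply yield an empty list.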

On the legal front, the team has established a collective licensing organization named the RSL Collective. This entity will negotiate terms and collect royalties, functioning similarly to ASCAP for musicians or the MPLC for films. The goal is to give licensees a single point of contact for royalty payments and to offer rightsholders a way to set terms with numerous potential licensees simultaneously.

A host of prominent web publishers have already joined the collective, such as Yahoo, Reddit, Medium, O’Reilly Media, Ziff Davis, Internet Brands, People Inc., and The Daily Beast. Other companies, including Fastly, Quora, and Adweek, are supporting the standard without formally joining the collective.

Notably, the RSL Collective includes publishers like Reddit, which already has separate licensing deals, such as an estimated $60 million annual agreement with Google. The system does not prevent companies from negotiating their own individual deals, much like a musician can set special terms while still collecting standard royalties through a performing rights organization. For smaller publishers without the leverage to secure their own agreements, the RSL Collective’s terms will likely be the only viable option.

However, implementing royalties for AI training data presents unique challenges. Determining when a specific piece of data was used is far more complex than tracking when a song is played. The issue is most straightforward for products like AI search abstracts that pull data in real time with clear attribution. But if training events are not logged, it becomes nearly impossible to confirm whether a specific document was ingested by a large language model. This is especially challenging if publishers request payment per inference, an option available in one of the standard RSL licenses.

Despite these hurdles, RSL’s creators are confident that AI companies can develop the necessary tracking. They point out that some existing licensing agreements already require such reporting capabilities. A co-founder of RSL stated that the system does not need to be perfect, but merely good enough to ensure people get paid.

The larger uncertainty is whether AI companies will adopt the system. While frontier labs have shown a willingness to pay for high-quality data through vendors, the web has historically been treated as a source of cheap, low-quality data. With free resources like Common Crawl available, convincing labs to pay royalties for something they are accustomed to getting for free may be difficult. Furthermore, recent incidents show it can be challenging to distinguish between web scraping and legitimate machine-enhanced browsing.

When questioned on this point, RSL leadership pointed to recent public comments from AI leaders calling for a system like RSL. They plan to hold these companies to their public statements, asserting that the industry has outwardly agreed that such a protocol and system are necessary. Now, they may finally have one.