Adobe hit with proposed class-action, accused of misusing authors’ work in AI training

Like pretty much every other tech company, Adobe has leaned heavily into AI over the past several years. The software firm has launched a number of AI services since 2023, including Firefly, its AI-powered media-generation suite. Now, however, the company’s full embrace of the technology may have led to trouble. A new lawsuit claims Adobe used pirated books to train one of its AI models.

A proposed class-action lawsuit filed on behalf of Elizabeth Lyon, an author from Oregon, claims that Adobe used pirated versions of numerous books, including her own, to train the company’s SlimLM program. Adobe describes SlimLM as a series of small language models optimized for document assistance tasks on mobile devices. The company states SlimLM was pre-trained on SlimPajama-627B, a deduplicated, multi-corpora, open-source dataset released by Cerebras in June 2023.

Lyon, who has written several guidebooks on non-fiction writing, says some of her works were included in a pretraining dataset Adobe used. Her lawsuit says her writing appeared in a processed subset of a larger dataset that formed the basis of Adobe’s model. The complaint states that SlimPajama was created by copying and manipulating the RedPajama dataset, which included a collection known as Books3, and that SlimPajama therefore contains the Books3 dataset, including the copyrighted works of the plaintiff and other class members.

Books3 is a collection of roughly 191,000 books that has been used to train generative AI systems and has been an ongoing source of legal trouble for the tech industry. The RedPajama dataset has also been cited in numerous lawsuits. In September, a suit against Apple claimed the company used copyrighted material to train its Apple Intelligence model, naming the dataset and accusing Apple of copying protected works without consent, credit, or compensation. In October, a similar lawsuit against Salesforce claimed the company had used RedPajama for training purposes.

Unfortunately for the tech industry, such lawsuits have become commonplace. AI models are trained on massive datasets, and in some cases, those datasets have allegedly included pirated materials. In September, Anthropic agreed to pay $1.5 billion to authors who had sued the company, accusing it of using pirated versions of their work to train its chatbot, Claude. That case was considered a potential turning point in the ongoing legal battles over copyrighted material in AI training data.