Out of the Shadow Library: Fair Use and AI Training Data
Since the launch of the first Large Language Models (LLMs), a wave of copyright litigation has been initiated by authors, musicians, and news organizations alleging that their works were misappropriated to build today’s most powerful generative AI tools. In response, AI companies have asserted that such use is non-infringing fair use.
These lawsuits target a spectrum of alleged copyright infringement. Some lawsuits focus solely on the unauthorized use of works as training inputs, while others focus on the model’s ability to generate allegedly infringing outputs. Many premise liability on both behaviors.
Judicial Perspectives on Fair Use
In 2025, two rulings offered the first judicial perspectives on one end of this spectrum: whether unauthorized use of copyrighted works as training "inputs" constitutes fair use. While the outcomes of these cases have been widely reported — both rulings suggest training generative AI models on copyrighted works may be fair use — the legal landscape is far from definitive. Fair use remains a famously fact-specific inquiry, and these holdings do not broadly shield users of third-party content from liability for unlawful data acquisition or retention.
Understanding Fair Use
To prove infringement, a plaintiff must show that a defendant, without authorization, exercised one of the exclusive rights granted to a copyright owner: the rights to reproduce, distribute, perform, display, or adapt an original work. However, not every unauthorized use results in liability. Copyright protection incorporates certain limitations, including fair use, designed to balance creative incentives with the public interest.
Courts analyze fair use by assessing and balancing four statutory factors:
- the purpose and character of the use,
- the nature of the original work,
- the amount and substantiality of the portion used, and
- the effect of the use on the potential market for or value of the original work.
No single factor is determinative, but courts often focus their analysis on two factors: the purpose and character of the use (including whether the use is “transformative”) and its effect on the potential market.
Case Studies: Bartz v. Anthropic and Kadrey v. Meta
In Bartz v. Anthropic and Kadrey v. Meta, the plaintiffs in both cases were groups of authors who alleged that the unauthorized use of their literary works as "inputs" to train AI models constituted copyright infringement. The question was not whether the AI companies’ use of the plaintiffs’ works for AI training was authorized, but whether that unauthorized use was a fair use under the Copyright Act.
Bartz v. Anthropic
In Bartz v. Anthropic, the court found Anthropic’s use of copyrighted works to train an AI model to be fair use, at least when the works were lawfully sourced. The court separated its fair use analysis between the act of training and Anthropic’s retention of data.
Regarding the act of training the model, the court found the first and fourth factors favored fair use because creating the model was "quintessentially transformative" and did not produce infringing substitutes. The third factor also favored fair use because the amount copied was “especially reasonable” for the purpose of using high-quality writing to train an LLM. While the second factor favored the plaintiffs, the court determined it was not dispositive considering the model’s transformative purpose.
However, regarding the act of retaining unlawfully obtained copies, the court found “every factor pointed against fair use.” The court explained that the piracy of otherwise available copies is “inherently, irredeemably infringing,” even if the copies are immediately used or discarded. The court cautioned that while the training process is transformative and may be protected, the unlawful acquisition of materials used to facilitate AI training is not.
Kadrey v. Meta
In Kadrey v. Meta, the court similarly found the unauthorized reproduction of copyrighted works to train an AI model to be fair use. Although Meta, like Anthropic, downloaded and reproduced works from pirated sources for portions of its training data, the court reached a different conclusion about the relevance of data acquisition in its fair use analysis.
Even with the use of training data from so-called "shadow libraries," the court found the first factor weighed in favor of fair use. The court acknowledged that Meta’s use of shadow libraries was relevant to the first factor, but ultimately found this factor favored Meta because the downloading was considered an integral step toward the ultimate transformative goal. The second factor favored the plaintiffs, but the court found the third factor favored Meta because copying the entirety of the works was reasonably necessary to achieve the transformative purpose of training the model.
Critically, the court leaned heavily on the fourth factor, finding that it favored Meta due to the plaintiffs’ perceived failure to provide evidence that Meta’s model harmed the market for the plaintiffs’ works.
Conclusion: Implications of Recent Rulings
These two decisions suggest a trend toward finding the use of copyrighted works as training inputs to a generative AI model to be fair use. However, both rulings make clear that this protection is not absolute. The specter of copyright liability remains for the acquisition of data used to train AI.
As the legal landscape continues to shift, myriad avenues for liability remain, particularly depending on how AI training data is obtained, retained, and used in generating outputs.