House Bill Aims to Enhance AI Training Data Transparency

New House Bill on AI Transparency Aims to Pull Back the Curtain on AI Training Data

On January 22, 2026, Representatives Madeleine Dean (D-PA) and Nathaniel Moran (R-TX) introduced H.R. 7209, a bipartisan bill that could significantly reshape the relationship between copyright law and artificial intelligence. Known as the Transparency and Responsibility for Artificial Intelligence Networks (TRAIN) Act, the proposal seeks to give copyright owners a clearer path to understanding whether, and how, their works are being used to train generative AI models.

The Core of the TRAIN Act

At the heart of the bill is a new administrative subpoena process added to the Copyright Act. Under the TRAIN Act, a copyright owner who has a good-faith belief that their work was used to train a generative AI model could request a subpoena, issued by the clerk of a U.S. district court, compelling an AI developer to disclose copies of training materials or records sufficient to identify them with certainty. The bill applies not just to original models, but also to substantially modified versions, including those retrained or fine-tuned after initial release.

Importantly, rights holders may only seek information about their own copyrighted works, not the broader training datasets used by a developer. To initiate the process, the requester must submit a sworn declaration stating that the subpoena is sought solely to determine whether their copyrighted material was used and that any disclosed records will be used only to protect their rights.

Developer Obligations and Consequences

For developers, the obligations are clear: comply expeditiously or face consequences. Failure to comply with a valid subpoena would create a rebuttable presumption that the developer copied the copyrighted work, a notable shift that could affect future infringement litigation. At the same time, the bill includes safeguards against abuse, allowing courts to sanction rights holders who request subpoenas in bad faith, under the existing standards of Rule 11 of the Federal Rules of Civil Procedure.

Support and Criticism of the TRAIN Act

Supporters of the TRAIN Act frame it as a transparency measure, arguing that copyright owners currently lack practical tools to determine whether their works have been ingested by opaque AI training pipelines. Critics, however, may raise concerns about administrative burden, confidentiality (including exposure of potential trade secrets regarding how a model is trained), and the potential chilling effect on AI development.

State Laws on AI Training Data Disclosure

As debates over AI, data rights, and creative ownership intensify, the TRAIN Act represents one of the most concrete legislative efforts yet to address the “black box” of AI training. Until now, only a handful of states have enacted laws requiring some form of disclosure about AI training data, and they do so with differing scopes and mechanisms:

California – AB 2013 (Artificial Intelligence Training Data Transparency Act), effective January 1, 2026, requires developers of generative AI systems offered for public use in California to post a high-level summary of their training data on a public website.
Connecticut – An amendment to the Connecticut Data Privacy Act (Public Act No. 25-113), effective July 1, 2026, requires covered “controllers” to disclose in their consumer privacy notices whether they collect, use, or sell personal data for training large language models.
Colorado – The Artificial Intelligence Act requires certain developers of high-risk AI systems to provide deployers with documentation about those systems, including general information about the categories of data used for training.

Unlike these state laws—which rely on generalized disclosures, privacy notices, or risk documentation—the TRAIN Act would create a targeted, rights-holder-driven mechanism to obtain specific information about whether particular copyrighted works were used in AI training.

Potential Impact of the TRAIN Act

If enacted, the TRAIN Act could reduce the need for a fragmented, state-by-state approach and provide a broader, more effective path for content owners to determine whether their materials are being used to train AI systems.