CNIL Clarifies GDPR Basis for AI Training
Recent guidance from the Commission nationale de l'informatique et des libertés (CNIL), the French data protection authority, provides essential clarity on the use of legitimate interest as a legal basis for processing personal data when training artificial intelligence (AI) models. The guidance specifically addresses personal data scraped from public sources, a topic that has generated considerable debate within the tech and regulatory communities.
Key Points of the Guidance
The CNIL’s insights are significant, yet they represent only one layer in a complex regulatory landscape. While this guidance helps in reducing uncertainty surrounding GDPR compliance at the training stage, it does not resolve other pertinent issues such as copyright, database rights, and post-training litigation risks.
Organisations are urged to apply structured judgment at critical moments and to maintain a well-documented position on GDPR compliance. This remains crucial for managing AI-related compliance at scale.
What the CNIL’s Guidance Clarifies
The CNIL affirms that training AI models on personal data sourced from public content can be lawful under the GDPR’s legitimate interest basis, provided that certain conditions are met. These conditions include a credible balancing of interests, demonstrable safeguards, and clear documentation.
- Web scraping may be permissible, provided contextual privacy expectations are respected. Scraping should not target sites that actively prohibit it, for example via robots.txt directives, or platforms aimed at minors.
- Training-scale data use is not inherently unlawful. Large datasets may be necessary for effective AI development, provided that the principles of proportionality and minimisation are observed.
- End-user benefit may favour the controller in the legitimate interest assessment. Improvements in accuracy and functionality could justify processing under legitimate interest, subject to a well-documented assessment.
- Regurgitation risk must be addressed. The CNIL expects evidence of mitigation measures, such as prompt filtering and internal testing for memorisation.
- Data subject rights may be respected indirectly. The CNIL allows for alternatives in situations where individual erasure or objection is difficult to implement.
- Documentation must be prepared at the time of training. The legitimate interest assessment and mitigation planning should be complete before training begins.
- Data Protection Impact Assessments (DPIAs) may still be expected when model training involves large-scale data scraping or special category data.
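For engineering teams, the contextual-expectation point above has a practical counterpart: before fetching a page for a training corpus, a crawler can consult the site's robots.txt and treat a disallow rule as an opt-out. A minimal sketch using Python's standard-library parser (the user-agent names and rules below are illustrative, not drawn from the CNIL guidance):

```python
from urllib import robotparser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    In a real crawler, robots.txt would be fetched from the target host;
    here it is passed in as text so the check stays self-contained.
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative policy: the site opts out of a hypothetical AI crawler
# entirely while allowing all other agents.
ROBOTS = """\
User-agent: example-ai-bot
Disallow: /

User-agent: *
Allow: /
"""
```

A crawler honouring this check would skip the site for the opted-out agent while remaining free to fetch it under other identities that the site permits.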
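The regurgitation point also lends itself to a simple internal test: compare model outputs against known training snippets and flag long verbatim overlaps. The sketch below is a naive word-level n-gram check under assumed inputs; production memorisation audits typically use more robust matching (normalised text, suffix structures, or approximate matching) than this:

```python
def has_verbatim_overlap(output: str, source: str, n: int = 8) -> bool:
    """Flag potential memorisation: True if any n consecutive words of the
    model output also appear verbatim in a training source document."""
    out_words = output.lower().split()
    src = " ".join(source.lower().split())  # normalise whitespace and case
    for i in range(len(out_words) - n + 1):
        if " ".join(out_words[i:i + n]) in src:
            return True
    return False
```

Running such a check over a sample of prompts and training documents is one way to produce the kind of documented mitigation evidence the guidance anticipates.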
Regulatory Landscape Comparison
While the CNIL’s guidance is the most structured to date, other data protection authorities have varying levels of clarity:
- The UK Information Commissioner’s Office (ICO) has acknowledged that existing GDPR rules, including legitimate interest, may justify AI training in some contexts but has not provided detailed implementation guidance.
- The Irish Data Protection Commission (DPC) and Italian Garante have focused primarily on deployment-phase enforcement, particularly concerning DPIAs and transparency around profiling.
- A consistent, pan-EU approach remains absent, creating challenges for companies navigating multiple expectations across different jurisdictions.
Legal Uncertainty Beyond GDPR
The CNIL’s guidance offers a defensible position for GDPR compliance in model training, but several legal restrictions continue to limit AI system viability, particularly in commercial settings:
- Copyright and database law remain binding. Publicly accessible content may still be protected under copyright, and the commercial text and data mining exception can be overridden by opt-out mechanisms.
- Contractual terms restrict access and reuse. Many platforms prohibit scraping or commercial reuse through their terms of service, which are enforceable separately from data protection laws.
- Downstream deployment introduces new compliance layers, encompassing obligations under various regulations like the AI Act and the Digital Services Act.
Operational Priorities and Legal Positioning
For legal, privacy, and product teams, the focus should not be on reinventing governance but rather on applying structured judgment at critical points:
- Utilise the CNIL’s guidance to reinforce existing privacy governance, integrating it into company workflows.
- Understand that training-stage GDPR compliance does not by itself authorise commercial use; copyright and platform terms may still restrict model training and deployment.
- Recognise that deployment remains a separate compliance layer, requiring adherence to GDPR and other applicable regulations.
- Encourage cross-functional collaboration among privacy, legal, product, and engineering teams to facilitate rapid, practical decisions.
- Assign internal accountability to ensure clear connections between model training decisions and privacy documentation.
- Prepare for regulatory inconsistencies and maintain thorough documentation to defend compliance narratives.
Despite the clarity offered by the CNIL’s guidance, organisations should not consider GDPR compliance for AI training as a resolved issue. Interpretation may vary across member states, and enforcement will likely focus on end-to-end outcomes, especially in sensitive use cases.
As regulatory frameworks evolve, a well-documented position grounded in the CNIL’s guidance remains a vital tool for managing AI compliance effectively.