Understanding Data and Data Governance in the EU AI Act
The European Union’s Artificial Intelligence Act (EU AI Act) establishes a framework to regulate AI, particularly for “high-risk” systems—those that could impact health, safety, or fundamental rights. A crucial element of this framework is Article 10, which focuses on data and data governance. This article mandates strict standards for the datasets used in training, validating, and testing high-risk AI systems to prevent issues like bias, errors, or discrimination.
Understanding Article 10 is vital for AI providers and anyone with a stake in how AI regulation treats data. This article walks through the data and data governance requirements as set out in the Act. We’ll explore what data governance means, its key elements, and its significance for compliance.
What is Data Governance in the Context of AI?
Data governance refers to the set of practices, policies, and processes that ensure data is handled accurately, responsibly, and in line with ethical and legal standards. For high-risk AI systems, poor data practices can lead to amplified biases or unreliable outcomes, which is why the AI Act emphasizes governance to mitigate risks and ensure systems perform as intended.
Think of data governance as a conceptual framework:
- It covers everything from how data is collected and prepared to how biases are detected and corrected.
- The goal is to make AI systems not just functional, but also fair and compliant with regulations like the General Data Protection Regulation (GDPR) and others.
- In Article 10, this governance applies specifically to training, validation, and testing datasets, ensuring they’re suitable for the AI’s purpose and free from flaws that could harm users.
The Five Pillars of Data Governance
Article 10 is structured around five main paragraphs (Article 10(2) through 10(6)), each building on the last to create a robust data management ecosystem. These pillars apply to datasets for high-risk AI systems, with some exceptions for non-training-based systems. Let’s explore each one.
1. Data Governance and Management Practices (Article 10(2))
Datasets must undergo appropriate governance and management practices tailored to the AI system’s intended purpose. It’s not a one-size-fits-all approach; practices should reflect the system’s design and real-world application. Key elements include:
- Design Choices: Strategic decisions during development align the AI with its goals.
- Data Collection Processes: Document the origins of data and how it was gathered to build trust.
- Data Preparation Operations: Maintain high quality through tasks like annotation, cleaning, and updating.
- Formulation of Assumptions: Clearly define what the data represents to avoid errors.
- Assessment of Data Suitability: Evaluate if datasets are available and fit for purpose.
- Bias Examination: Scrutinize data for biases that could affect fundamental rights.
- Bias Mitigation: Implement measures to detect and correct biases.
- Addressing Data Gaps and Shortcomings: Identify deficiencies that could hinder compliance.
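The bias-examination step above can be made concrete in code. The following is a minimal sketch, not anything prescribed by the Act: it assumes a toy dataset of records with a hypothetical grouping attribute (`region`) and a binary label, and computes each group’s share of the data and its rate of positive labels—large gaps in either can flag representation or outcome bias worth investigating.

```python
from collections import Counter

# Hypothetical toy records; the fields "region" and "label" are
# illustrative only, not terms from the AI Act.
records = [
    {"region": "north", "label": 1},
    {"region": "north", "label": 0},
    {"region": "north", "label": 1},
    {"region": "south", "label": 0},
    {"region": "south", "label": 0},
    {"region": "east",  "label": 1},
]

def group_shares(rows, attribute):
    """Share of records per value of a (potentially sensitive) attribute."""
    counts = Counter(r[attribute] for r in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def positive_rates(rows, attribute):
    """Rate of positive labels per group; large gaps may signal bias."""
    totals, positives = Counter(), Counter()
    for r in rows:
        totals[r[attribute]] += 1
        positives[r[attribute]] += r["label"]
    return {group: positives[group] / totals[group] for group in totals}

print(group_shares(records, "region"))    # e.g. {'north': 0.5, ...}
print(positive_rates(records, "region"))  # e.g. {'south': 0.0, 'east': 1.0, ...}
```

In practice this kind of check would run against the real protected attributes relevant to the system’s context, and the thresholds for “too large a gap” would be a documented design choice.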
2. Dataset Characteristics (Article 10(3))
Once governance practices are in place, the datasets themselves must meet quality benchmarks. They need to be:
- Relevant and Sufficiently Representative: Mirror real-world scenarios to avoid skewed results.
- Free of Errors and Complete: To the best extent possible, minimize inaccuracies and missing values to ensure reliability.
- Statistically Appropriate: Ensure the data’s statistical properties align with the target population.
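Two of these benchmarks—completeness and representativeness—lend themselves to simple automated checks. The sketch below is illustrative only: the field names, the reference (target-population) distribution, and any acceptable tolerance are assumptions a provider would define and document, not values taken from the Act.

```python
# Illustrative tabular records; None marks a missing value.
dataset = [
    {"age_band": "18-34", "income": 32000},
    {"age_band": "35-54", "income": None},
    {"age_band": "35-54", "income": 51000},
    {"age_band": "55+",   "income": 47000},
]

def completeness(rows, field):
    """Fraction of rows where `field` is present and non-null."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)

def representation_gap(rows, field, reference):
    """Largest absolute gap between dataset shares and a reference
    (target-population) distribution for `field`."""
    counts = {}
    for r in rows:
        counts[r[field]] = counts.get(r[field], 0) + 1
    total = len(rows)
    return max(abs(counts.get(k, 0) / total - p) for k, p in reference.items())

# Assumed population shares (e.g. from census data) for comparison.
population = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}

print(completeness(dataset, "income"))                      # 0.75
print(representation_gap(dataset, "age_band", population))  # 0.1
```

A low completeness score or a large representation gap would then feed back into the governance practices of the first pillar—documenting the shortcoming and deciding how to address it.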
3. Contextual Considerations (Article 10(4))
Data doesn’t exist in a vacuum. This paragraph requires datasets to be customized to the AI’s specific geographical, behavioral, functional, or contextual settings. The benefits include:
- Promotes Fairness and Non-Discrimination: Representative data reduces biases that could disadvantage certain groups.
- Enhances Accuracy and Integrity: Tailored data improves completeness and reliability.
- Aligns with Legal Standards: Complies with GDPR principles.
- Reduces Risks: Matches data to operational contexts, avoiding mismatches that could lead to failures.
- Compliance Workflow: Assess the AI’s purpose, curate relevant data, and document decisions for ongoing bias mitigation.
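The compliance workflow above hinges on documenting decisions. One lightweight way to do that is a structured provenance record per dataset; the sketch below is a minimal illustration, and every field name and example value is an assumption, not a schema prescribed by the Act.

```python
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class DatasetRecord:
    """Minimal provenance record for one training/validation/testing
    dataset. All fields are illustrative, not prescribed by the Act."""
    name: str
    intended_purpose: str
    collection_process: str
    geographic_setting: str
    known_gaps: list = field(default_factory=list)
    bias_mitigations: list = field(default_factory=list)
    last_reviewed: str = ""

# Hypothetical entry for a credit-scoring dataset.
record = DatasetRecord(
    name="loan_applications_v3",
    intended_purpose="credit scoring for retail loans in the EU",
    collection_process="exported from internal CRM, 2020-2024",
    geographic_setting="EU member states",
    known_gaps=["under-representation of applicants aged 18-24"],
    bias_mitigations=["reweighting by age band"],
    last_reviewed=str(date(2025, 1, 15)),
)

print(asdict(record))  # serializable for audit trails or technical documentation
```

Keeping records like this per dataset makes it straightforward to show, for each design choice, why the data was judged suitable for the system’s setting.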
4. Processing Special Categories of Personal Data (Article 10(5))
Special categories of personal data—such as health records, biometric data, or data revealing racial or ethnic origin—are highly sensitive. Providers may process them only exceptionally, solely for bias detection and correction, and only when strictly necessary. Strict conditions must be met, including:
- No viable alternative data exists for the task.
- Technical limitations on reuse with privacy-preserving measures.
- Effective access controls and full documentation.
- Data must not be transferred or accessed by third parties.
- Delete the data once the bias is fixed or the retention period ends.
- Processing records must explain why special data was essential.
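Pseudonymisation is one commonly cited privacy-preserving measure for conditions like these. The sketch below shows two illustrative building blocks—a keyed hash so records can be linked for bias analysis without storing raw identifiers, and a retention check—assuming a secret key and retention period that a real deployment would manage through proper key-management and policy processes.

```python
import hashlib
import hmac
import os
from datetime import date, timedelta

# Illustrative only: a real key would live in a managed key store,
# not be generated ad hoc at startup.
SECRET_KEY = os.urandom(32)

def pseudonymise(identifier: str) -> str:
    """Keyed hash: lets records be linked for bias analysis
    without exposing the raw identifier."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

def past_retention(created: date, retention_days: int, today: date) -> bool:
    """True when a special-category record has exceeded its retention
    period and should be deleted."""
    return today > created + timedelta(days=retention_days)

token = pseudonymise("patient-0042")  # hypothetical identifier
print(token)  # 64-char hex digest; stable within one key's lifetime
```

These are sketches of individual safeguards, not a compliance solution: access controls, documentation, and the necessity assessment the Act requires sit around any such code.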
These safeguards protect fundamental rights while allowing limited use for critical improvements.
5. Testing Datasets for Non-Training Systems (Article 10(6))
Not all high-risk AI systems rely on techniques that involve training AI models. For those that don’t, the requirements of the preceding paragraphs apply only to testing datasets. This streamlines compliance without compromising quality in the evaluation phase.
Why Does This Matter? The Bigger Picture
Article 10 isn’t just regulatory fine print; it’s a blueprint for compliance. By enforcing rigorous data governance, the EU AI Act helps prevent AI from perpetuating inequalities or causing unintended harm. For providers, compliance means investing in robust processes—resulting in AI that is more innovative, trustworthy, and market-ready.
If you’re building AI, start auditing your data practices against these pillars. As AI integrates deeper into society, remember: Great AI starts with great data governance.