Consent-Centric Data Challenges for AI Development in India

AI at a Crossroads: Navigating Consent-Centric Data in India

In the realm of Artificial Intelligence (AI), data is the key driving force for training advanced models. Advanced AI systems such as Large Language Models (LLMs) thrive on large volumes of high-quality data. However, India’s Digital Personal Data Protection (DPDP) Act and its rules, grounded in express, informed, and continuing consent, raise ethical and practical concerns. This article examines the implications of the DPDP Act’s consent-centric design for AI development, especially in sectors that depend on curated, proprietary data.

Consent-Centric Data Governance

India’s DPDP Act represents a significant milestone in the country’s approach to data protection. The DPDP Rules require that personal data be collected on the basis of the data principal’s consent, with only limited carve-outs, such as certain publicly available data. Unlike the European Union’s (EU) General Data Protection Regulation (GDPR) and Brazil’s Lei Geral de Proteção de Dados (LGPD), this framework recognises consent as effectively the sole processing basis, omitting alternative legal grounds such as contractual necessity and legitimate interests that give leading international data protection regimes their processing flexibility.

With AI development unfolding rapidly, consent as the basis for data protection works at cross purposes with the prevailing mode of data collection for training large AI models. While the DPDP Act aims to protect individual rights by making data collection practices more transparent and accountable, it arrives just as AI developers increasingly need data that is not easily accessible to the public. Ernst & Young’s (EY) reports on sectoral AI development highlight that high-quality, carefully curated datasets are essential for effective LLM training, a finding corroborated by analyses of the specific challenges of developing generative models. The insistence on explicit, granular consent poses a significant conundrum in this context: how can a consent-centric data protection framework be reconciled with the data requirements of AI innovation?

The Conundrum of Curated Data for Sector-Specific AI

The foundation of AI systems like LLMs rests on their training data. In critical sectors such as healthcare, banking, and online advertising, data collection follows regulated protocols, often drawing from exclusive sources inaccessible to the general public. Within the DPDP framework, a Consent Manager is an entity registered with the Data Protection Board of India that provides a transparent, accessible, and interoperable platform through which data principals can grant, manage, review, and revoke consent, serving as the primary intermediary between individuals and businesses.
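To make the Consent Manager’s role concrete, the following Python sketch models its core operations under simplifying assumptions: an in-memory store, and illustrative names such as ConsentRecord and fiduciary_id that are not drawn from the DPDP Rules or from any registered platform’s actual API.

```python
# Hypothetical sketch of a Consent Manager's core operations; an in-memory
# store stands in for whatever a DPDP-registered platform actually uses.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ConsentRecord:
    principal_id: str          # the data principal granting consent
    fiduciary_id: str          # the business requesting the data
    purpose: str               # specific, informed purpose (e.g. "model training")
    granted_at: datetime
    revoked_at: datetime | None = None

    @property
    def active(self) -> bool:
        return self.revoked_at is None


class ConsentManager:
    """Intermediary letting principals grant, review, and revoke consent."""

    def __init__(self) -> None:
        self._records: list[ConsentRecord] = []

    def grant(self, principal_id: str, fiduciary_id: str, purpose: str) -> ConsentRecord:
        record = ConsentRecord(principal_id, fiduciary_id, purpose,
                               granted_at=datetime.now(timezone.utc))
        self._records.append(record)
        return record

    def review(self, principal_id: str) -> list[ConsentRecord]:
        # The record list doubles as an audit trail of grants and revocations.
        return [r for r in self._records if r.principal_id == principal_id]

    def revoke(self, principal_id: str, purpose: str) -> None:
        for r in self._records:
            if r.principal_id == principal_id and r.purpose == purpose and r.active:
                r.revoked_at = datetime.now(timezone.utc)
```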

However, this consent-based approach creates a fundamental tension in AI development. Requiring case-by-case consent significantly reduces the volume of available training data, a challenge with multiple dimensions. While consent-centric frameworks aim to build trust and keep data principals in control, they also introduce new problems for AI innovation. An additional layer of complexity arises at the intersection of data protection and copyright law: recent cases highlight the legal issues that emerge when curated, copyright-protected data is used to train AI models.

Given that LLMs need vast datasets to function, whether developers can realistically negotiate consent for each data element, including copyrighted content, remains an open question. The tension between the need for comprehensive datasets and the stringency of consent requirements illustrates the challenges ahead for AI developers in India.
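As a hedged illustration of how per-record consent shrinks the usable corpus, the sketch below filters a hypothetical training set against the ConsentManager from the earlier example; the record layout and the "model training" purpose string are assumptions, not a prescribed DPDP mechanism.

```python
# Hypothetical illustration: only records whose principal holds an active,
# purpose-specific consent survive into the training corpus.

def consented_corpus(records: list[dict], manager: "ConsentManager",
                     fiduciary_id: str, purpose: str = "model training") -> list[dict]:
    """Keep records backed by an active consent for this fiduciary and purpose."""
    usable = []
    for rec in records:
        grants = manager.review(rec["principal_id"])
        if any(g.fiduciary_id == fiduciary_id and g.purpose == purpose and g.active
               for g in grants):
            usable.append(rec)
    return usable

# Each revocation immediately removes that principal's records from future
# training runs -- the volume-versus-consent trade-off described above.
```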

Global Perspectives on Privacy and Innovation

Outside the Indian context, similar findings offer a different perspective on the balance between privacy and innovation. Reports indicate that applying consent models in today’s data-intensive AI environment may be difficult, and scholars have argued that privacy protection cannot rest on individual consent alone, as earlier frameworks presupposed. The conflict between data protection and data utility has been highlighted in various studies, which suggest that although the DPDP Act’s consent-centric approach is ethically sound, insufficient flexibility in its implementation may hinder technological advancement.

A flexible framework, as envisioned in best practices such as those embodied in the EU AI Act, combines responsible data governance and management practices with anonymisation techniques that account for context in addition to identifiers. These approaches situate privacy protection within comprehensive risk assessment frameworks that address vulnerabilities arising from data linkages and the re-identification risks those linkages create.
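One way to account for context in addition to identifiers is to test combinations of quasi-identifiers for re-identification risk. The minimal k-anonymity check below sketches that idea; the field names and threshold are illustrative, not requirements of the EU AI Act or the DPDP Rules.

```python
# Minimal k-anonymity check: re-identification risk comes from combinations of
# quasi-identifiers (age band, pincode, ...), not only from direct identifiers.
from collections import Counter


def is_k_anonymous(rows: list[dict], quasi_identifiers: list[str], k: int = 5) -> bool:
    """True if every quasi-identifier combination appears in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


rows = [
    {"age_band": "30-39", "pincode": "1100xx", "diagnosis": "A"},
    {"age_band": "30-39", "pincode": "1100xx", "diagnosis": "B"},
]
print(is_k_anonymous(rows, ["age_band", "pincode"], k=2))  # True: the pair repeats
```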

Balancing Innovation with Ethical Imperatives

The challenge, therefore, is to balance two opposing and equally important goals. On one hand, ethical and legal positions seek to protect individual privacy by ensuring that people consent knowingly and can withdraw that consent at any time. On the other, AI development demands large, well-organised datasets. The further development of AI in India depends on the availability of structured data through specific access mechanisms, and those mechanisms must conform to the DPDP Act.

Technological solutions such as Consent Managers make consent management more efficient while maintaining proper records and audit trails, although they add a layer of compliance. Blockchain technology can make consent records tamper-evident and transparent, and, combined with contextual anonymisation methods, allows data analysis to proceed while protecting individual identities. Together, these tools create a data environment that respects data principals’ rights while supporting the development of AI technologies.
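The property blockchain contributes here is tamper evidence: once a grant or revocation is logged, it cannot be silently edited. The hash-chained ledger below is a minimal sketch of that property, assuming a single-node, unsigned log; a production deployment would add signatures and distribution across parties.

```python
# Sketch of a tamper-evident, hash-chained consent log. Each entry commits to
# the previous entry's hash, so any retroactive edit breaks verification.
import hashlib
import json


class ConsentLedger:
    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, event: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        payload = json.dumps({"event": event, "prev": prev}, sort_keys=True)
        self.entries.append({"event": event, "prev": prev,
                             "hash": hashlib.sha256(payload.encode()).hexdigest()})

    def verify(self) -> bool:
        """Recompute the whole chain; a single edited entry invalidates it."""
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps({"event": e["event"], "prev": prev}, sort_keys=True)
            if e["prev"] != prev or e["hash"] != hashlib.sha256(payload.encode()).hexdigest():
                return False
            prev = e["hash"]
        return True


# Illustrative usage: log a grant and a revocation, then audit the chain.
ledger = ConsentLedger()
ledger.append({"principal": "p1", "action": "grant", "purpose": "model training"})
ledger.append({"principal": "p1", "action": "revoke", "purpose": "model training"})
print(ledger.verify())  # True until any past entry is altered
```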

However, policy adaptations are also crucial. Standardised consent templates could reduce consent fatigue among data principals and researchers alike. Sector-specific exemptions and regulatory sandboxes may be needed for industries whose business depends on curated data. Regulations could permit limited data sharing under conditions that protect individual privacy and consent while giving LLMs and other AI systems the quality data they need.

Conclusion

India’s consent-based data protection regime, while safeguarding individual rights through informed consent mechanisms, may create operational challenges for AI innovation. Striking the balance between privacy protection and technological innovation will depend on effective solutions: responsive, risk-based regulatory frameworks, including sandboxes and exemptions, alongside industry-led approaches. These will help policymakers and industry leaders collaboratively design an ethical framework conducive to AI-driven progress, keeping India at the forefront of responsible technological evolution.
