Dr. Deborshi Barat*

While artificial intelligence (AI) relies on the availability of vast datasets, the legality of using copyrighted material and/or protected personal information is unclear. Accordingly, this note explores the possibility of reconciling social needs in terms of training AI models with protectionist regimes (involving intellectual property or privacy laws) by using ‘artificially’ generated information. This is Part II of a two-part article.
IV. Navigating Legal and Practical Challenges
While the introduction of synthetic data into the realm of AI model training does bring forth various advantages, it is not without its challenges – both legal and practical.
Data Privacy and Regulatory Compliance
A primary concern surrounding synthetic data in AI training revolves around data privacy and regulatory compliance. The legal landscape, especially with regulations like the GDPR and the DPDP Act, mandates strict guidelines for the use and handling of personal data. While synthetic data, by design, eliminates the risk of exposing real and identifiable personal information, ensuring compliance with data protection laws on an ongoing basis nevertheless remains a critical consideration. Organizations must navigate a complex web of laws, and at the same time, need to keep abreast with rapidly evolving de-identification techniques and technologies, in order to ensure that the synthetic data which they generate and utilize continues to adhere to applicable legal standards of personal data protection.
Synthetic data may mitigate the risk of re-identification with respect to natural persons by temporarily rendering them anonymous. According to the GDPR, anonymity translates to a lack of identifiability such that an individual cannot be identified through direct or indirect links and/or features, whether alone or in combination with additional pieces of information. However, data synthesis always involves a fine balance between utility and anonymity, two elements which share an inverse relationship. In other words, a closely fitted synthetic dataset may maximize utility by remaining faithful to the original information and its underlying characteristics – thereby compromising anonymity. On the other hand, overzealous purges of identifiable information may render the resulting dataset significantly dissimilar to the original in terms of its statistical properties – thus eroding usefulness. Accordingly, a privacy assurance assessment could be performed periodically in order to confirm, from time to time, that the final synthetic output does not constitute personal data – as such term is defined in contemporaneous data protection legislation. Such an assessment may also determine the extent to which specific individuals remain identifiable from the synthetic dataset, as well as the amount of new information about them that might be revealed pursuant to re-identification.
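The utility–anonymity trade-off described above can be made concrete with a toy measurement. The sketch below is illustrative only – the records, metrics and thresholds are invented assumptions, not legal tests under the GDPR or the DPDP Act. It compares a synthetic dataset to its source along two crude proxies: drift in the column means (utility) and the distance from each synthetic record to its nearest real record (a rough re-identification signal).

```python
import math
import statistics

# Hypothetical records: (age, income). All values are invented.
original = [(34, 52000.0), (29, 48000.0), (41, 61000.0), (55, 75000.0)]
synthetic = [(33, 51500.0), (30, 47000.0), (43, 62500.0), (52, 73000.0)]

def utility_gap(real, synth):
    """Proxy for utility: how far the synthetic column means drift from
    the original's statistical properties (smaller = more faithful)."""
    gaps = []
    for col in range(len(real[0])):
        r = statistics.mean(row[col] for row in real)
        s = statistics.mean(row[col] for row in synth)
        gaps.append(abs(r - s) / abs(r))
    return max(gaps)

def min_nn_distance(real, synth):
    """Proxy for re-identification risk: distance from each synthetic
    record to its nearest real record (smaller = a closer fit, and hence
    a higher risk that a record maps back to a real person)."""
    return min(min(math.dist(s, r) for r in real) for s in synth)

print(utility_gap(original, synthetic))      # closeness of statistical properties
print(min_nn_distance(original, synthetic))  # closeness to any single real record
```

A closely fitted synthetic dataset would score well on the first metric and poorly on the second, which is the inverse relationship described above.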
Transparency and Informed Consent
The ethical use of synthetic data necessitates transparency and informed consent. Stakeholders, including individuals from whom the original data was sourced, must be aware of the synthetic data generation process and its subsequent use. Transparent communication becomes pivotal, and obtaining informed consent ensures that individuals are aware of how their data is being used – even in its synthetic form. This not only aligns with ethical standards but also contributes to building trust between organizations, customers, partners, regulators and the broader community.
Liability and Accountability
The dynamic nature of AI systems introduces complexities concerning liability and accountability. If an AI model trained on synthetic data produces unintended consequences or malfunctions altogether, important questions surrounding accountability may inevitably arise. Thus, establishing clear lines of responsibility and guidelines for the development and deployment of AI models becomes essential. This involves a collaborative effort between data scientists, legal experts and policymakers to define frameworks that ensure accountability without stifling innovation.
Bias Mitigation
Addressing bias in AI models has been a persistent challenge, and the use of synthetic data introduces its own set of considerations. While synthetic data can be tailored to avoid the perpetuation of existing biases, it does require – in the first place – a nuanced understanding of potential biases in the original dataset. Striking a balance between customization and avoiding unintended biases is crucial to ensure that AI models trained on synthetic data contribute to fair and equitable decision-making.
Generalization to Real-World Scenarios
Ensuring that AI models trained on synthetic data generalize well to real-world scenarios is a practical concern. While synthetic data aims to mimic real-world conditions, there is a risk of overfitting to the characteristics of the generated data. Careful validation and testing are essential to verify that the AI model’s performance extends beyond the synthetic training environment, providing reliable and accurate results in diverse, real-world situations.
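The validation step this paragraph calls for can be sketched in a few lines: train on synthetic data, but measure performance on held-out real-world samples rather than on more synthetic data. The datasets and the one-feature threshold ‘model’ below are invented for illustration.

```python
import statistics

# Hypothetical (feature, label) pairs; values are invented.
synthetic_train = [(0.2, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
real_holdout = [(0.25, 0), (0.4, 0), (0.65, 1), (0.85, 1)]

def fit_threshold(data):
    """'Train': midpoint between the two class means of the feature."""
    lo = statistics.mean(x for x, y in data if y == 0)
    hi = statistics.mean(x for x, y in data if y == 1)
    return (lo + hi) / 2

def accuracy(threshold, data):
    """Validate against data the model never saw during training."""
    correct = sum(1 for x, y in data if (x > threshold) == bool(y))
    return correct / len(data)

t = fit_threshold(synthetic_train)
print(accuracy(t, synthetic_train))  # performance inside the synthetic environment
print(accuracy(t, real_holdout))     # the number that actually matters
```

A large gap between the two accuracy figures would signal overfitting to the characteristics of the generated data.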
Certain prominent use-cases of synthetic data across important sectors are discussed below.
V. Critical Use Cases in Key Sectors
As AI continues to permeate critical sectors and activities, the demand for diverse and high-quality training data becomes increasingly vital. Two fundamental elements related to data quality are completeness and accuracy. Completeness ensures that certain data features are not unrepresented in a dataset, while accuracy ensures that they are not misrepresented. Synthetic data offers the potential to strengthen both such dimensions by plugging gaps in datasets that result from difficulties in gaining access to collected data.
By acting as a replacement for missing data, synthetic data can potentially create more representative datasets. Furthermore, synthetic data can increase accuracy by verifying the correctness of the analysis performed on collected data, as exemplified by the use of synthetic data to create counterfactuals to fix AI models.
Healthcare
In the healthcare sector, the integration of AI holds immense promise for improving diagnostics, personalized medicine and patient care. In this regard, synthetic data proves invaluable – especially in scenarios where access to large, diverse and privacy-sensitive medical datasets is limited. Generating synthetic patient data enables the training of AI models to recognize patterns, predict diseases and optimize treatment plans without compromising patient privacy. This use case not only accelerates medical research but also facilitates the development of robust AI tools for enhanced clinical decision-making.
Banking and Financial Services
The banking and financial services sector relies heavily on data-driven decision-making, risk management and fraud detection. However, the use of real financial data poses significant challenges due to privacy regulations and the sensitive nature of financial information. In this respect, synthetic data allows financial institutions to create realistic datasets for training AI models without using actual transaction details. This aids in developing more accurate credit scoring models, improving fraud detection algorithms, and enhancing overall financial forecasting without exposing sensitive customer information.
Customer Confidentiality
Since synthetic data serves as a privacy-preserving alternative, it allows institutions to train robust AI models without compromising customer confidentiality. By creating synthetic datasets that mimic the statistical properties of real data, financial institutions can navigate privacy concerns even while advancing AI capabilities.
Fraud Detection
Fraud detection is a critical component of risk management in the financial sector, and synthetic data plays a vital role in training AI models to recognize patterns indicative of fraudulent activities. By generating diverse and realistic synthetic transaction data, financial institutions can create sophisticated fraud detection algorithms. These algorithms, trained on synthetic datasets, can then be applied to real-world scenarios, bolstering the industry’s ability to safeguard against financial crimes.
Credit Scoring
Credit scoring models form the backbone of lending decisions in the financial sector. Traditional approaches to training these models often rely on historical credit data, which may be limited or biased. Synthetic data allows for the creation of expansive and balanced datasets that better represent diverse customer profiles and creditworthiness. This results in more accurate and fair credit scoring models, enabling financial institutions to make informed lending decisions and reduce the risk of default.
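One way to obtain the ‘expansive and balanced datasets’ mentioned above is to interpolate new records between real ones from an under-represented group – a SMOTE-like approach. The feature names and values below are invented for illustration, not drawn from any real credit dataset.

```python
import random

random.seed(7)

# Hypothetical (income, credit_utilization) records for a group that is
# under-represented in the historical credit data.
minority = [(30000, 0.8), (32000, 0.7), (28000, 0.9)]

def synthesize(records, n):
    """Create n synthetic records on the line between random real pairs,
    so each new record stays within the range of observed values."""
    out = []
    for _ in range(n):
        a, b = random.sample(records, 2)
        t = random.random()
        out.append(tuple(a[i] + t * (b[i] - a[i]) for i in range(len(a))))
    return out

augmented = minority + synthesize(minority, 3)
print(len(augmented))  # 6 records: 3 real + 3 synthetic
```

The augmented group can then be combined with the majority class so that the scoring model sees a more balanced picture of creditworthiness.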
Customer Service
Synthetic data provides a shortcut to overcome the challenges associated with obtaining and managing large volumes of real-world financial data. Financial institutions can use synthetic datasets to expedite the development and testing of AI models for various applications, including risk management, customer service and investment strategies.
Stress Testing
Stress testing and scenario analysis are essential for risk assessment, and synthetic data enables financial institutions to create dynamic and customized scenarios for such exercises. This allows AI models to predict the impact of various economic conditions and market fluctuations on portfolios, providing valuable insights for risk mitigation and strategic decision-making.
Compliance and AML
Compliance with regulatory requirements is a fundamental aspect of the banking and financial services sector. Synthetic data facilitates the training of AI models for compliance monitoring, anti-money laundering (AML) programs and regulatory reporting. By simulating various scenarios involving suspicious activities and transactions, financial institutions can enhance the effectiveness of their compliance measures without exposing real-world sensitive data.
Autonomous Vehicles
In the realm of autonomous vehicles, training AI models to navigate real-world scenarios demands extensive and diverse datasets. However, obtaining such data can be logistically challenging and potentially risky. Synthetic data proves instrumental in creating realistic virtual environments for training autonomous vehicle AI. Simulated scenarios – generated through synthetic data – enable comprehensive testing and training of self-driving algorithms, thus ensuring robust performance in various conditions without the need for large-scale real-world testing.
Manufacturing and Industry
In the manufacturing sector, the adoption of AI and ‘Industry 4.0’ technologies into industrial processes proves increasingly useful, given the unprecedented scale of digitization in the sector. As part of the so-called fourth industrial revolution, intelligent digital technologies – such as networks associated with the Industrial Internet-of-Things (“IoT”), automation, ‘Big Data’ analytics, cloud computing, robotics, 3D printing, ML, and virtual and augmented reality – are changing the way machines communicate with each other. Synthetic data finds critical use in training AI models for predictive maintenance, quality control and optimization of production processes. Simulating various manufacturing scenarios allows for the creation of highly tailored datasets, enabling AI algorithms to identify potential issues, optimize efficiency, and enhance overall operational performance.
Defence and National Security
The defence and security sectors leverage AI for tasks such as image recognition, threat detection and predictive analysis. However, acquiring real-world data for training these models can be restricted due to confidentiality and security concerns. Synthetic data addresses this challenge by enabling the generation of diverse and realistic datasets for training AI tools in a controlled environment. This use case aids in developing advanced surveillance systems, improving threat detection capabilities, and enhancing overall national security.
Education
In the education sector, AI plays a significant role in personalized learning, adaptive assessments and educational content development. Synthetic data facilitates the creation of diverse student profiles, learning scenarios and educational datasets. This enables the training of AI models to provide tailored learning experiences, assess student performance, and develop adaptive teaching methodologies without compromising the privacy and/or personal information of real students.
VI. Use in M&A Transactions
AI models trained on synthetic data can analyze and process large datasets quickly, expediting the due diligence process. Thus, synthetic data could facilitate comprehensive due diligence by providing simulated datasets that mimic the target company’s data environment without exposing sensitive information. Further, synthetic data aids in pre-merger integration planning by enabling organizations to simulate the integration of disparate datasets – thus identifying potential challenges before an actual merger. Moreover, by using synthetic data, companies can identify and address data integration issues early in the M&A process, reducing post-merger risks.
To illustrate, IT systems and architectures across data-driven companies are increasingly complex, such that valuable information remains ‘hidden’ or undiscovered because it is housed in discrete silos, or is distributed across diverse geographies, or gets generated in real-time. However, an alternative technological solution for data integration in M&A deals can be achieved through ‘data virtualization’ processes – which may also reduce time and cost, thereby enhancing operational efficiency.
In essence, virtualization is an agile data integration approach that presents information as a virtual layer through the use of ‘logical’ data warehouses – and thus remains independent of underlying databases or physical storage.
VII. Conclusion
It is important to remember that synthetic data may also create harms and risks, including in respect of ‘deepfakes’, as recently witnessed in India. Significantly, synthetic data models can be inaccurate. In some situations, adding synthetic data increases the risk of duplicating bias or errors. Another concern is that the use of synthetic data may produce a false sense of security about the original data. While de-identification offers some protection, the continuing development of efficient algorithms increases the likelihood that such data could be re-identified – especially given advances in quantum computing, which may eventually break current encryption methods.
Further, synthetic data can help change market dynamics. For example, given that synthetic data can be used to augment collected information (the aggregate of which may be too small in certain situations to be useful), businesses with even small datasets may be able to compete with organizations that collect and/or possess relatively large volumes of data.
Also, synthetic data may indirectly alter the competitive dynamics among entities through its effects on data sharing. For example, if the data collected by an entity does not produce a comparative advantage for the entity concerned, there is a higher probability that such entity will share that data. This dynamic may be especially relevant in the context of the IDP.
Accordingly, the market price of actual, collected data may be subject to the costs of generating comparable synthetic datasets. Further, the ability to share data is likely to increase. For instance, if synthetic data generated by transforming collected personal information does not fall within the scope of privacy laws, it need not then comply with such legal requirements that relate to data protection (e.g., under India’s DPDP Act). This, in turn, could make it easier to share such data. Additionally, the use of synthetic data may reduce the collection of unnecessary personal information and promote data minimization – as contemplated both in the GDPR and the DPDP Act.
Furthermore, synthetic data can promote other data protection principles as well – such as in respect of lawful processing and data security, integrity and quality. In fact, by replacing collected information with artificially generated data, or by adding synthetic data to a dataset that was originally collected from the physical world so that it screens outlier data points while retaining the statistical properties of the original dataset, synthetic data offers an additional layer of security.
For the purpose of preventing a personal data breach, existing anonymization techniques seek to introduce noise into a dataset, thus disturbing the relationship between attributes or the distribution of values for such attributes, or by stripping the dataset of some meaningful data points. Such techniques thus reduce the accuracy, and therefore the utility, of data. Synthetic data may be able to recalibrate this situation.
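The noise-based anonymization described above can be sketched as follows. This is a toy illustration, not a differential-privacy implementation: the Laplace-like noise (built from the difference of two exponential draws) and its scale are arbitrary assumptions, and the gap between the true and noisy means is the utility cost the paragraph refers to.

```python
import random
import statistics

random.seed(42)

# Hypothetical salary records; values are invented.
salaries = [48000, 52000, 61000, 75000, 58000]

def add_noise(values, scale):
    """Perturb each record with symmetric, Laplace-like noise
    (the difference of two exponential draws with the same scale)."""
    return [v + random.expovariate(1 / scale) - random.expovariate(1 / scale)
            for v in values]

noisy = add_noise(salaries, scale=5000)

# Individual records are blurred, but aggregate accuracy suffers too:
true_mean = statistics.mean(salaries)
noisy_mean = statistics.mean(noisy)
print(true_mean, noisy_mean)  # the gap is the utility cost of the noise
```

A well-constructed synthetic dataset aims to avoid this accuracy loss by reproducing the original’s statistical properties rather than degrading them record by record.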
Anonymized personal data is typically not covered by legal frameworks related to privacy. However, it is unclear whether synthetic data should be considered anonymous. For instance, where synthetic data is generated using real data as an input or comparator, there remains the residual risk that such data can be linked back to an individual through inference or by linking the synthetic data with other datasets. If the technical and organizational safeguards required to prevent a personal data breach change over time (as they inevitably will), de-identification processes could be reversed more easily. In such situations, it is possible that synthetic datasets generated using an individual’s personal data may fall within the ambit of a data protection legislation.
On the other hand, ML provides pathways to use data and make inferences and/or probabilistic predictions. Synthetic data may increase these risks. Even if a synthetic dataset anonymizes information about each individual but enables the algorithm to learn about groups, once the algorithm can connect an individual to a group, it can make informed inferences about their preferences. In this regard, it is debatable whether the concept of identifiability is sufficient to prevent harm to individuals, and whether it can capture the linkages or inferences on which synthetic data might be based.
Synthetic data may also increase collective harms – which could arise when data analysis leads to decisions that affect a group of individuals whose data may or may not constitute part of the dataset. To be sure, if the synthetic generation process is successful, then the generated dataset will constitute a convincing imitation of a dataset about real individuals. However, if this ‘fake’ dataset can be used to impact individuals, then irrespective of the underlying data that is used to draw this inference, the threat to individuals’ rights may persist.
Nevertheless, synthetic data may reduce the collective action problem in data markets. After all, data collection involves transaction costs (such as privacy risks) which can affect the incentives of data principals to provide consent – even when they know that such personal information will be put to good use, both for their own benefit and that of others. If a critical mass of data principals is reluctant to consent to data processing, the intended collective good cannot be created. However, synthetic data can indirectly overcome this problem by reducing the number of real data entries required to make an informed decision.
Synthetic data can also reduce the need and justification for data sharing arrangements designed to realize data synergies. If similar synergies can be realized by internally generating synthetic data, then such arrangements, especially between competitors, may decline.
According to a recent report by Gartner, by 2024, 60% of all data used for the development of AI will be artificially generated, and by 2030, it may completely replace the use of real data in AI models. The synthetic data market is growing quickly, with several start-ups offering synthetic data generation tools, platforms and services. For instance, certain platforms allow customers to (i) perform data transformation, (ii) check the quality of input data, (iii) generate their own synthetic datasets, as well as (iv) train AI/ML models. Ultimately, such platforms should be able to integrate with other platforms in the customer’s data stack. In summary, synthetic data may reshape the dynamics of global data economics and unlock unprecedented business prospects.
Given India’s AI-related ambitions, there is the additional question of overcoming the comparative advantages of other jurisdictions. For instance, a foreign government may skew the playing field in favor of local entities by allowing them to test algorithms in ways that are far more favorable (including in respect of privacy concerns) relative to those that are legally available to companies in other countries. This problem could be overcome through the use of synthetic datasets.
*Deborshi Barat is a Counsel at S&R Associates, New Delhi. His areas of practice include regulatory and policy matters. Previously, he was an Associate Professor at the Jindal Global Law School. He holds a Ph.D. from The Fletcher School of Law and Diplomacy, Tufts University.
Read Part I here: https://lawschoolpolicyreview.com/2023/12/17/building-artificial-intelligence-with-artificial-data-fake-it-until-you-make-it-part-i/