Synthetic Data and GDPR: A Credible Alternative to Personal Data or a New Re-identification Risk?

May 13, 2026

Between Innovation and Vigilance, a New Balance for Organisations and DPOs

Driven by the rise of artificial intelligence and increasing GDPR constraints, synthetic data is progressively emerging as a solution often presented as “privacy by design.” Artificially generated from statistical models or AI algorithms, it aims to reproduce the overall behaviour of a real dataset without directly copying the individuals within it.

In practice, a model may learn that certain diseases are more common in specific age groups, that some purchasing behaviours correlate with geographic areas, or that certain income levels are statistically linked to education levels. Artificially generated data therefore retains the key characteristics of the original datasets, such as proportions, trends, and relationships between variables, while producing new profiles that are, in theory, entirely fictitious.
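The idea of preserving joint statistics without copying individuals can be illustrated with a deliberately simple parametric approach: fit the mean and covariance of a toy dataset, then sample fresh records from that fitted distribution. This is only a minimal sketch, not a production-grade generator; the column names and data are entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" dataset: age and income with a built-in positive correlation
# (both the columns and the values are hypothetical).
age = rng.normal(45, 12, 1000)
income = 800 * age + rng.normal(0, 5000, 1000)
real = np.column_stack([age, income])

# Fit a simple parametric model: the empirical mean and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample new, fictitious records that preserve the joint statistics
# (means, variances, and the age/income correlation) without copying
# any individual row of the original dataset.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real corr: {real_corr:.2f}, synthetic corr: {synth_corr:.2f}")
```

Real generators (GANs, diffusion models, copulas) are far more sophisticated, but the compliance question is the same: the output mirrors the statistical structure of the source data, which is precisely why re-identification risk cannot be ruled out by construction.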

This approach is particularly attractive for sectors subject to strong regulatory requirements such as healthcare, finance, insurance, research, and cybersecurity. Organisations see it as a way to train AI models, conduct application testing, share datasets, or build simulation environments without directly exposing real personal data.

Synthetic Data and Anonymisation: European Authorities Urge Caution

However, European authorities call for caution. The European Data Protection Board (EDPB) reminds organisations that data can only be considered “anonymous” if re-identification remains reasonably impossible considering current and future technical means. A case-by-case assessment is therefore essential, particularly when source data is sensitive, extensive, or highly granular.

The CNIL has taken a similar position in its work on anonymisation and artificial intelligence. The authority highlights that some generation methods may preserve correlations strong enough to enable indirect re-identification, especially when combined with external databases or publicly available information.

Generative AI and Synthetic Data: A New Re-identification Risk

Risks increase with modern generative AI models such as GANs (Generative Adversarial Networks) or diffusion models. These systems are trained on very large volumes of data to learn complex statistical patterns. In principle, they are not intended to reproduce original data, but to generate new “similar” data.

However, when models are poorly configured, insufficiently regularised, or trained on datasets that are too small, they may “memorise” real examples contained in the training data.
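One simple, non-exhaustive check for this kind of memorisation is to measure how close each synthetic record lies to its nearest training record: exact or near-exact copies are a warning sign. The sketch below uses hypothetical data and an illustrative threshold; a real audit would calibrate the threshold to the data's own scale and use more robust privacy metrics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data and "synthetic" output, with one
# training record leaked verbatim to simulate memorisation.
train = rng.normal(size=(500, 4))
synthetic = rng.normal(size=(200, 4))
synthetic[0] = train[10]  # simulate a memorised training record

# Euclidean distance from each synthetic record to its closest
# training record.
diffs = synthetic[:, None, :] - train[None, :, :]
nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

# Flag suspiciously close copies (threshold is illustrative only).
suspect = np.flatnonzero(nearest < 1e-6)
print("possible memorised records:", suspect)
```

A low nearest-neighbour distance does not prove re-identification on its own, but records flagged this way should never be released without further analysis.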

Concrete Examples of Risks

A model may then reproduce almost identically:

  • a patient record,
  • a financial transaction,
  • or a combination of characteristics linked to a real person.

This risk, identified notably by ENISA and NIST, may facilitate re-identification mechanisms or indirect recovery of sensitive information. A malicious actor could, for example, attempt to determine whether a person’s data was used during model training (“membership inference”) or reconstruct confidential elements from the content generated by the algorithm.
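The membership inference idea mentioned above can be made concrete with a toy threshold attack: an overfit model tends to score its own training records better than unseen records, and an attacker can exploit that gap. Everything below is an illustrative simulation, not an attack on any real system; the model, data, and threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Deliberately overfit setting: few samples, many features, so that
# training members have visibly lower error than non-members.
n, d = 40, 30
X_in = rng.normal(size=(n, d))    # members (used for training)
X_out = rng.normal(size=(n, d))   # non-members (never seen)
w_true = rng.normal(size=d)
y_in = X_in @ w_true + rng.normal(0, 1.0, n)
y_out = X_out @ w_true + rng.normal(0, 1.0, n)

# "Victim" model: least-squares fit on the members only.
w_hat, *_ = np.linalg.lstsq(X_in, y_in, rcond=None)

# Attacker's signal: per-record squared error under the victim model.
err_in = (X_in @ w_hat - y_in) ** 2
err_out = (X_out @ w_hat - y_out) ** 2

# Threshold attack: records with below-median error are guessed
# to have been part of the training set.
errors = np.concatenate([err_in, err_out])
guessed_member = errors < np.median(errors)
truth = np.concatenate([np.ones(n, bool), np.zeros(n, bool)])
accuracy = (guessed_member == truth).mean()
print(f"membership inference accuracy: {accuracy:.2f}")
```

An accuracy well above 50% means the model leaks information about who was in its training data, which is exactly why authorities treat the generating model itself, and not only its output, as a source of risk.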

Unlike traditional anonymisation techniques, the risk no longer comes solely from the dataset itself, but also from the behaviour of the algorithmic model generating the synthetic data.

What Role for the DPO Regarding Synthetic Data?

For Data Protection Officers (DPOs), the challenge is now methodological as much as legal. An organisation cannot assume that a dataset automatically falls outside GDPR simply because it is labelled “synthetic.” It is essential to document re-identification risks, test the robustness of the models used, control the origin of training data, and carefully assess the technical safeguards offered by AI vendors or providers.

Synthetic Data: A GDPR Compliance Lever Under Conditions

Despite these limitations, synthetic data remains a highly valuable compliance lever. It can reduce access to real data, secure development phases, facilitate application testing, and enable controlled data sharing in highly regulated environments.

Its adoption nevertheless requires strong governance, aligned with the GDPR principles of data minimisation, privacy by design, and accountability.

Conclusion

Synthetic data represents a major opportunity to reconcile innovation, AI, and personal data protection. However, it should not be viewed as a silver bullet that automatically removes GDPR obligations.

For organisations, the real challenge is to implement a structured approach combining risk assessment, model controls, and robust governance. When properly managed, synthetic data can become a genuine strategic asset for developing innovative projects while maintaining regulatory compliance.

Do you need support with your data compliance projects? Discover our services: https://www.dpo-consulting.com/
