Q&A Series: Solving Privacy Problems with Synthetic Data
Below are answers to several questions I received during my presentation at The Rise of Privacy Tech’s Data Privacy Week 2022 conference in January. For anyone interested in discussing these or other issues related to synthetic data, drop me a note in Gretel’s Slack Community!
Is converting sensitive data to synthetic data like adding noise to a dataset using differential privacy?
I would draw a distinction between the process of creating synthetic data and the mechanisms of differential privacy.
Creating synthetic data from a sensitive dataset involves using algorithms that can learn the distribution of the dataset and generate records from that distribution. For example, if a dataset shows that there is a strong correlation between smoking and developing lung cancer, then the algorithm should learn that, and the resultant synthetic data should also have that property.
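To make the idea concrete, here is a minimal sketch of "learn the distribution, then generate records from it". It is an illustration only: the data is a toy table with made-up rates, and the "model" is just the empirical joint distribution of two binary columns rather than a generative model such as a GAN. The names `fit_joint` and `sample_synthetic` are hypothetical helpers, and this sketch provides no privacy protection on its own.

```python
import numpy as np

def fit_joint(records):
    """Estimate the empirical joint distribution over the distinct rows."""
    values, counts = np.unique(records, axis=0, return_counts=True)
    return values, counts / counts.sum()

def sample_synthetic(values, probs, n, rng):
    """Draw n synthetic rows from the learned joint distribution."""
    idx = rng.choice(len(values), size=n, p=probs)
    return values[idx]

rng = np.random.default_rng(0)

# Toy "sensitive" data with made-up rates (not real epidemiology):
# 30% smokers; cancer rate 20% for smokers and 2% for non-smokers.
n = 5000
smoker = rng.random(n) < 0.3
cancer = rng.random(n) < np.where(smoker, 0.20, 0.02)
real = np.column_stack([smoker, cancer]).astype(int)

# Learn the distribution, then generate new records from it.
values, probs = fit_joint(real)
synthetic = sample_synthetic(values, probs, n, rng)

# The smoking/cancer correlation should carry over to the synthetic rows.
real_corr = np.corrcoef(real[:, 0], real[:, 1])[0, 1]
synth_corr = np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1]
print(f"real correlation: {real_corr:.2f}, synthetic correlation: {synth_corr:.2f}")
```

The two printed correlations should be close to each other, which is the property described above: the relationship between the columns survives even though the synthetic rows are freshly sampled.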
In contrast, differentially private mechanisms, such as those that add noise to an aggregate, aim to introduce uncertainty about whether any particular individual is present in the dataset. This is done by calibrating the noise to the sensitivity of the aggregate (i.e., how much can the result of the function change if one record is changed?).
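As an illustration of noise calibrated to sensitivity, the sketch below applies the Laplace mechanism to a counting query. It assumes the aggregate is a count, so changing one record moves the result by at most 1. The function name `laplace_count`, the toy `ages` list, and the choice of `epsilon` are assumptions made for this example.

```python
import numpy as np

def laplace_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy predicate."""
    true_count = sum(1 for v in values if predicate(v))
    # Changing one record changes a count by at most 1.
    sensitivity = 1.0
    # Laplace mechanism: the noise scale is sensitivity / epsilon.
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
ages = [23, 35, 41, 52, 67, 70, 29, 58]  # toy data, made up for the example

# Smaller epsilon means more noise and a stronger privacy guarantee.
noisy = laplace_count(ages, lambda a: a >= 50, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 50+: {noisy:.2f} (true count is 4)")
```

An aggregate with higher sensitivity, such as a sum over an unbounded column, would need proportionally more noise for the same `epsilon`, which is why values are usually clipped to a known range first.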
The goal of differentially private synthetic data is to create a dataset that looks and feels like the original dataset without compromising the privacy of any individual data record in the original dataset. Often, we accomplish this by using differentially private methods to train the models that generate synthetic data. For example, calibrated noise can be added during optimization, as the gradients are accumulated.
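One common way to add calibrated noise during optimization is differentially private SGD: clip each example's gradient to a fixed norm, sum the clipped gradients, and add Gaussian noise scaled to that clipping bound. The sketch below shows a single update step under those assumptions. The function `dp_sgd_step` and its parameters are hypothetical names for this example, and the sketch does not track the cumulative privacy budget, which a real training loop would need to do.

```python
import numpy as np

def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """Apply one noisy gradient step to weights and return the new weights."""
    # Clip each example's gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors

    # After clipping, one example can shift the sum by at most clip_norm,
    # so the Gaussian noise is scaled relative to that bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)

    return weights - lr * noisy_mean

rng = np.random.default_rng(0)
weights = np.zeros(2)
grads = np.array([[3.0, 4.0], [0.1, -0.2], [10.0, 0.0]])  # toy per-example gradients
weights = dp_sgd_step(weights, grads, lr=0.1, clip_norm=1.0,
                      noise_multiplier=1.1, rng=rng)
print(weights)
```

Clipping is what makes the noise meaningful: without a bound on each example's contribution, no finite amount of noise could hide a single record with a very large gradient.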
If you’re using ML to create synthetic data, are you not open to adversarial attacks?
As long as your approach to generating synthetic data with ML includes appropriate privacy-protection mechanisms, the probability that an attack against it succeeds can be kept very low.
The best method currently known for limiting the success of attacks on ML models is differential privacy. Differential privacy has the property of being future-proof: its guarantees do not depend on any assumptions about the attacker, their methods, motives, or computational resources. Commonly used methods for producing synthetic data, such as generative models and marginal-based methods, can be made differentially private.
Are there problems that are either too complex (or where the "important parameters" are unknown or contested), or too important for society as a whole to use synthetic data which may eliminate important data points or outliers? How do you draw that line of when to use synthetics and when real-world data should be used?
If the goal of an analysis is to study outliers or small, rare populations, privacy-preserving approaches like differential privacy are unlikely to be useful, because they aim to limit the impact of any single individual on the analysis. For such analyses, controlling who has access to the data and mandating training on acceptable use are more suitable. I touch on this topic in my blog on differential privacy.
But let’s say we are talking about synthetic data produced without formal privacy guarantees. Most synthetic data algorithms require a large number of examples to learn from (how large depends on the algorithm; for example, training a GAN on fewer than 1,000 examples will likely be a struggle). So the quality of synthetic samples for rare population subsets may be very poor, and there may be no option but to use the actual, real-world data for the analysis.
End-users of synthetic data are often excited by the potential of using it for a wide variety of analysis or ML tasks. But one major challenge is communicating to users the error that they will experience when using the data. Synthetic data can never be useful for all end-user tasks while also protecting privacy. How do you approach communicating the error or accuracy of synthetic data?
I’ll take a step back from this question to emphasize that it is important to define intended use when creating synthetic data. It may change the algorithm used for data generation. It may change the validators used to ensure the synthetic records follow proper logic. It may change the way the original data is pre-processed. So intended use can have a tremendous impact on the process.
The quality of the synthetic data should be measured depending on intended use. What are the various methods of measuring synthetic data quality? Well, there is a plethora of research out there on this topic and users typically choose the methods that are appropriate for the use case.
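As one small example of such a method, the sketch below compares the distribution of a single categorical column in the real and synthetic data using total variation distance (0 means identical distributions, 1 means no overlap). This is only one of many possible measures, chosen here for illustration. The function name and the toy columns are assumptions, and the right metric depends on the intended use.

```python
from collections import Counter

def total_variation_distance(real_values, synthetic_values):
    """Return half the summed absolute difference in category frequencies."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    # Include categories that appear in only one of the two datasets.
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )

# Toy columns, made up for the example.
real_col = ["a", "a", "b", "b"]
synth_col = ["a", "b", "b", "b"]
print(total_variation_distance(real_col, synth_col))  # 0.25
```

A per-column measure like this says nothing about relationships between columns, so a use case that depends on correlations would also need a check that compares pairs of columns.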
At Gretel, we recognize the importance of reporting data quality. That’s why for every synthetic model that’s built using our platform, we provide a synthetic data quality report, which can be used by the custodian of the original data to determine whether the synthetic datasets produced by the model will broadly maintain statistical fidelity.
However, it is possible that the person using the synthetic data needs further information about the original, sensitive dataset and does not have direct access to it. In such cases, a differentially private querying system can help them learn relevant information about the original dataset in a privacy-preserving manner, after which they can compare the results to the synthetic dataset.
Thanks again to everyone who participated in the discussion. If you’d like to discuss my answers further or have other questions, send me a note in our Slack community or email us at firstname.lastname@example.org to continue the conversation!