Introducing Gretel Tabular DP: A fast, graph-based synthetic data model with strong differential privacy guarantees
We are excited to announce the public preview of Gretel Tabular DP, a differentially private graph-based generative model, which creates synthetic versions of sensitive data with provable mathematical guarantees of privacy. In this blog, we’ll show that when working with a dataset containing largely categorical values, Gretel Tabular DP can produce synthetic records that maintain high statistical symmetry with the original data, even with very conservative differential privacy budgets (ε < 1, δ < 10^-7), in just 10 minutes.
Why is the guarantee of differential privacy important for synthetic data?
While generative models can produce synthetic datasets that preserve the statistical qualities of the training dataset without identifying any particular record in the training dataset, most generative models to date do not offer mathematical guarantees of privacy that can be used to facilitate information sharing or publishing. Without such mathematical guarantees, each adversarial attack on these models and the synthetic data they generate needs to be thwarted reactively.
In the last few years, there's been a steady stream of research demonstrating that generative models, ranging from large language models and GANs to diffusion models, have the potential to memorize and expose their inputs. In fact, this body of research has grown so much that there are even competitions to promote innovation around privacy attacks that can extract training data from large models.
We can never be sure of what attacks might become feasible in the future. This is exactly the problem that differential privacy (DP) solves by bounding the probability that a compromising event occurs. By introducing calibrated noise into an algorithm, DP defends against all future privacy attacks with high probability.[1] DP is commonly parametrized by two privacy parameters, epsilon (ε) and delta (δ). Epsilon puts a ceiling on how much the probability of a particular output can increase if you were to add or remove a single training example from a dataset. Delta bounds the residual risk that is not restricted by epsilon: informally, the small probability that the epsilon guarantee fails to hold.
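Stated formally, using the standard textbook definition of (ε, δ)-differential privacy rather than anything specific to Gretel's implementation: a randomized algorithm M satisfies (ε, δ)-DP if, for every pair of datasets D and D' that differ in a single record and for every set of possible outputs S,

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \, \Pr[\, M(D') \in S \,] + \delta
```

For example, ignoring δ, a budget of ε = 0.5 means that no output becomes more than e^0.5 ≈ 1.65 times more likely when one record is added or removed.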
How can we generate differentially private synthetic data?
Tabular synthetic data can be produced using many different algorithms. For example, at Gretel we use generative adversarial networks (GANs), language models, and statistical models.
One approach to making tabular synthetic data differentially private is to modify existing models to meet the standard of DP, typically through the addition of noise. However, this method isn't always capable of producing data that's both private and of high quality. Large amounts of training data are typically required to achieve reasonable DP guarantees when training deep learning models from scratch. Further, the noise added for DP during the optimization process used to train deep learning models typically causes a delay in convergence and comes at the cost of degraded quality of synthetic samples.
Another approach is to use models that are designed with privacy at the forefront. Alternatives to deep learning approaches became popular during the 2018 DP synthetic data challenge hosted by NIST. They rely on measuring low dimensional distributions in a dataset combined with learning a graphical model representation of the dataset, all in a differentially private manner. Gretel Tabular DP takes this approach.
How does Gretel Tabular DP work?
Gretel Tabular DP follows the select-measure-generate paradigm developed by McKenna et al., which consists of three steps:
- Automatically select a subset of correlated pairs of variables using a differentially private algorithm.
- Measure distributions of the selected pairs with differential privacy. These distributions, called marginals, are effectively large contingency tables of noisy counts.
- Estimate a probabilistic graphical model (PGM) that captures the relationship as described by the noisy marginals, and sample synthetic data from the PGM.
The differential privacy guarantee applies to each row of the tabular dataset. Adding or deleting one row of the dataset is guaranteed not to change the outputs by much; Tabular DP provides record level DP.
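The measure step above can be sketched in a few lines of Python. This is a minimal illustration, not Gretel's implementation: the function name, the toy records, and the use of Gaussian noise with a free `sigma` parameter are all assumptions. In a real system `sigma` would be calibrated to the chosen ε and δ, which this sketch does not do.

```python
import random
from collections import Counter
from itertools import product

def noisy_two_way_marginal(rows, col_a, col_b, domain_a, domain_b, sigma, rng):
    """Return a contingency table of noisy counts for one pair of columns.

    `sigma` is a hypothetical noise scale; it is NOT calibrated to a
    privacy budget here.
    """
    counts = Counter((row[col_a], row[col_b]) for row in rows)
    # Noise is added to every cell of the full domain, including empty
    # cells, so the table does not reveal which combinations are absent.
    return {
        cell: counts.get(cell, 0) + rng.gauss(0.0, sigma)
        for cell in product(domain_a, domain_b)
    }

# Toy records for illustration only (not real patient data).
rows = [
    {"gender": "F", "readmitted": "NO"},
    {"gender": "F", "readmitted": "NO"},
    {"gender": "F", "readmitted": "<30"},
    {"gender": "M", "readmitted": ">30"},
    {"gender": "M", "readmitted": "NO"},
]

marginal = noisy_two_way_marginal(
    rows, "gender", "readmitted",
    domain_a=["F", "M"], domain_b=["NO", "<30", ">30"],
    sigma=1.0, rng=random.Random(0),
)
for cell, value in marginal.items():
    print(cell, round(value, 2))
```

Because one record contributes to exactly one cell of each marginal, adding or removing a row changes each table by at most one count, which is what makes record-level noise calibration possible.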
How accurate is synthetic data with differential privacy guarantees? An example with hospital data.
Let’s use Gretel Tabular DP with a dataset of over 100,000 diabetic patient records describing factors affecting patient readmission.[2] There are 43 attributes including basic demographic information, specifics of their hospital stay, medications, and diagnoses, and an indicator for whether they were readmitted to the hospital. The dataset primarily contains categorical data types, with a handful of numeric type variables.
We trained three Gretel models to generate synthetic patient encounter data: Gretel LSTM, Gretel ACTGAN, and Gretel Tabular DP with two different levels of privacy.[3] Here’s a notebook where you can follow along.
Table 1. Performance of Gretel models. Gretel LSTM with medium privacy filters, Gretel ACTGAN with medium privacy filters, Gretel Tabular DP with high privacy settings (ε = 0.5, δ = 3 × 10^-8) and Gretel Tabular DP with medium privacy settings (ε = 1, δ = 3 × 10^-8).
We are interested in comparing the tradeoff between the quality of data produced by these models and the privacy protections afforded by them. We measure synthetic data quality using Gretel’s Synthetic Data Quality Score (SQS), which is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. SQS can be viewed as a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. On the other hand, we consider privacy to be a property of the data generation process. Gretel’s privacy filters provide heuristic protection against common adversarial attacks, while differential privacy provides quantifiable protection against all current and future adversarial attacks.
Table 1 shows a comparison of the models. While the model with the best quality synthetic data (SQS = 93) is Gretel ACTGAN with privacy filters turned on, Gretel Tabular DP performed comparably even with very conservative privacy parameters. For a differential privacy guarantee[5] of ε = 0.5, δ = 3 × 10^-8, we can generate a dataset of good quality (SQS = 91). Increasing the privacy budget to ε = 1, δ = 3 × 10^-8 helps close the gap further (SQS = 92).
SQS is a composite of the three different scores shown in Table 1. Let’s dive into each component.
- Why is Field Distribution Stability consistently high with Gretel Tabular DP?
Gretel Tabular DP measures one-way distributions, and takes tremendous care to ensure that these distributions are preserved, even when there's a heavy class imbalance.
- Why is Deep Structure Stability lower than other Gretel models?
The model considers one-way and two-way marginals. As a result, more complex structure in three-way and higher marginals might not be best encoded by the model. We will soon add support for users to specify a limited number of three-way marginals that are important to preserve during modeling.
- Why is Field Correlation Stability low for this dataset?
Given that Gretel Tabular DP explicitly measures two-way correlations in a dataset, we would expect them to be well preserved. However, not all two-way correlations are measured. So, if each variable has moderate to high correlation with only one other variable, then Tabular DP models should do very well on field correlation. If some variables are strongly correlated with multiple other variables, however, the model will not capture all of those relationships equally well.
Why is Tabular DP so fast?
Particularly noteworthy is the low training time for Gretel Tabular DP, which used only a single CPU and generated synthetic data within 10 minutes. Gretel LSTM and Gretel ACTGAN, both of which use GPUs, took much longer in comparison. This is because Gretel Tabular DP calculates large tables of counts and estimates a graphical model, work whose complexity grows with the number of variables rather than the number of records. So a dataset with over 100,000 rows and 43 columns is a cinch to process! In comparison, Gretel LSTM and Gretel ACTGAN use deep neural networks, which learn from each record over multiple iterations. As a result, datasets with more records take longer to train.
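A rough way to see why the row count matters so little: the size of the tables the model works with depends on how many categories each variable has, not on how many records were counted. The sketch below uses hypothetical column names and cardinalities (they are not taken from the hospital dataset) and a hypothetical helper, `total_marginal_cells`.

```python
def total_marginal_cells(cardinalities, pairs):
    """Total number of noisy counts stored for the selected two-way marginals."""
    return sum(cardinalities[a] * cardinalities[b] for a, b in pairs)

# Hypothetical cardinalities for three columns (illustrative values only).
cardinalities = {"age_bucket": 10, "gender": 3, "readmitted": 3}

# Hypothetical set of selected pairs.
pairs = [("age_bucket", "readmitted"), ("gender", "readmitted")]

cells = total_marginal_cells(cardinalities, pairs)
print(cells)  # 10*3 + 3*3 = 39
```

Note that the number of rows never appears in the calculation: the tables have 39 cells whether they summarize 1,000 records or 100,000.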
What kind of datasets will Gretel Tabular DP work well for?
Gretel Tabular DP works well on datasets with primarily categorical variables, relatively low cardinality (<100 unique categories per variable) and under 100 variables. As with all differentially private methods, increasing the number of records in the training dataset will typically result in improved synthetic data quality. Gretel Tabular DP isn't appropriate for time series data where maintaining correlations across sequential records is important, as the underlying graphical model has an assumption of independence between records.
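These rules of thumb are easy to check before training. The helper below is a hypothetical pre-flight check written for this post, not part of any Gretel SDK; its name and default thresholds simply mirror the guidance above (fewer than 100 variables, fewer than 100 unique categories per variable).

```python
def check_tabular_dp_fit(rows, max_columns=100, max_cardinality=100):
    """Return a list of warnings; an empty list means no rule of thumb was violated.

    `rows` is a list of dicts that all share the same keys.
    """
    issues = []
    columns = list(rows[0].keys()) if rows else []
    if len(columns) >= max_columns:
        issues.append(f"dataset has {len(columns)} columns (guidance: under {max_columns})")
    for col in columns:
        n_unique = len({row[col] for row in rows})
        if n_unique >= max_cardinality:
            issues.append(
                f"column {col!r} has {n_unique} unique values "
                f"(guidance: under {max_cardinality})"
            )
    return issues

# Toy example: an ID-like column violates the cardinality guidance.
rows = [{"patient_id": i, "gender": "F" if i % 2 else "M"} for i in range(150)]
for issue in check_tabular_dp_fit(rows):
    print(issue)
```

High-cardinality columns such as identifiers are usually better dropped or bucketed before modeling, since each extra category adds another noisy count to measure.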
How do we choose differential privacy parameters?
Gretel Tabular DP accepts privacy parameters ε and δ. The strength of the desired privacy guarantee determines ε: values of ε below 1 provide the strongest guarantees. We recommend starting below 1 and increasing ε only if higher quality synthetic data is needed. Experts recommend setting δ to a value much smaller than 1/n, where n is the number of records in the training set. By default, we initialize δ to a value less than or equal to 1/n^1.5, depending on the characteristics of your dataset. Choose a lower value for stronger privacy guarantees.
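As a quick sanity check on the δ guidance, the upper bound 1/n^1.5 can be computed directly. The function name below is hypothetical, and the value Gretel actually picks may be lower than this bound depending on the dataset.

```python
def delta_upper_bound(n_records):
    """Upper bound on the default delta described above: 1 / n^1.5."""
    return 1.0 / n_records ** 1.5

n = 101766  # number of records in the hospital dataset used in this post
delta = delta_upper_bound(n)

print(f"1/n       = {1.0 / n:.1e}")
print(f"1/n^1.5   = {delta:.1e}")
print(f"ratio     = {delta * n:.4f}")  # how much smaller delta is than 1/n
```

For this dataset the bound works out to roughly 3 × 10^-8, several hundred times smaller than 1/n, which is consistent with the "much smaller than 1/n" recommendation.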
Gretel Tabular DP is a fast and powerful new Gretel model to generate high quality tabular synthetic data with mathematical guarantees of privacy. Try it out now by selecting the “Create provably private versions of sensitive data” card in Gretel Console!
1. For a deeper dive into the importance of differential privacy in generative models, take a look at Jonathan Ullman’s technical blog post.
2. https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008. The original dataset contains 50 columns. We removed columns with only one category, columns with high missingness, and columns with low correlation to readmission, such as payer_code.
3. We also attempted to train Gretel LSTM with a differentially private optimizer, but the resultant model was not able to produce sufficient valid records after having spent a privacy budget (ε) larger than a million. As of May 8, 2023, we have deprecated DP for Gretel LSTM in favor of using Gretel Tabular DP.
4. Gretel LSTM and Gretel ACTGAN use default parameters. Manual hyperparameter tuning was not performed.
5. δ, the probability of accidentally leaking information, is set to a value much smaller than 1/n (δ << 1/n), where n is the number of training records. For this dataset, Gretel automatically sets δ = 1/101766^1.5 ≈ 3 × 10^-8.