How To Create Differentially Private Synthetic Data
A practical guide to creating differentially private, synthetic data with Python and TensorFlow.
In this post, we’ll train a synthetic data model on the popular Netflix Prize dataset, using a mathematical technique called differential privacy to protect the identities of anonymized users in the dataset from being discovered via known privacy attacks such as re-identification. With differentially private synthetic data, our goal is to create a neural network model that can generate new data in the same format as the source data, with stronger privacy guarantees, while retaining the source data’s statistical insights. Today, we will walk through a generalized approach to finding optimal privacy parameters for training models with differential privacy.
To dive a level deeper into differential privacy, see our previous post on differentially private synthetic data, or an excellent overview of private learning methods here⁵.
The Netflix dataset
According to Wikipedia¹, “the Netflix Prize dataset consists of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each training rating is a quadruplet of the form <user, movie, grade, date-of-grade>. The user and movie fields are integer IDs, while grades are from 1 to 5 (integral) stars”.
Netflix originally published this dataset in 2006 as part of a contest with a $1 million USD prize for the team that could beat their movie recommendation algorithm. However, researchers from the University of Texas at Austin demonstrated how an adversary with only basic knowledge of a subscriber could identify their records, even “uncovering their apparent political preferences and other potentially sensitive information”².
The goal of our synthetic model will be to create a new, artificial dataset of the same format as the Netflix Prize dataset and with the same statistical insights, but without memorizing or repeating any individual record. To keep training time manageable for lots of tests, we’ll work with 100k rows at a time.
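The "same format" requirement can be made concrete with a small validator. The sketch below is only an illustration, not the validator that gretel-synthetics uses: the comma delimiter, the column order, and the YYYY-MM-DD date format are assumptions based on the quadruplet described above, and `is_valid_record` and `failure_rate` are hypothetical helper names.

```python
from datetime import datetime


def is_valid_record(line: str) -> bool:
    """Return True if a line looks like <user, movie, grade, date-of-grade>.

    Assumes comma-separated fields and YYYY-MM-DD dates (both assumptions).
    """
    fields = line.strip().split(",")
    if len(fields) != 4:
        return False
    user, movie, grade, date = fields
    # User and movie fields are integer IDs.
    if not (user.isdigit() and movie.isdigit()):
        return False
    # Grades are integral stars from 1 to 5.
    if grade not in {"1", "2", "3", "4", "5"}:
        return False
    try:
        datetime.strptime(date, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def failure_rate(lines) -> float:
    """Fraction of generated lines that fail format validation."""
    lines = list(lines)
    if not lines:
        return 0.0
    failed = sum(not is_valid_record(line) for line in lines)
    return failed / len(lines)
```

Running `failure_rate` over a batch of generated lines gives the "percentage of records failing validation" metric that the experiments in this post compare.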
Parameter tuning approach
In this example, we’ll use the gretel-synthetics library to create the synthetic data model and artificial dataset. Gretel-synthetics uses TensorFlow Privacy, an open source library created by Google researchers, to train machine learning models with differential privacy guarantees. Finding optimal parameters for differential privacy is often model-specific, and in the remainder of this post we’ll dig into finding the optimal privacy parameters for our use case.
Here’s the approach we’ll use:
- Find the ideal optimizer and learning_rate for our DP model.
- Then gradually decrease l2_gradient_clip.
- Finally, gradually increase noise_multiplier (while maintaining an acceptable percentage of new synthetic records that pass format validation).
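The last step, raising a setting until too many records fail validation, can be sketched as a simple search loop. This is a minimal sketch under stated assumptions, not the code used for the experiments: `largest_passing_value` is a hypothetical helper, `evaluate` is a hypothetical callback that trains a model with the given setting and returns the fraction of generated records failing validation, and the 30% default mirrors the success threshold used later in this post.

```python
from typing import Callable, Iterable, Optional


def largest_passing_value(
    candidates: Iterable[float],
    evaluate: Callable[[float], float],
    max_failure_rate: float = 0.30,
) -> Optional[float]:
    """Walk candidates in increasing order; return the last one that passes.

    `evaluate` is a hypothetical callback: it should train a model with the
    given setting and return the fraction of records failing validation.
    """
    best = None
    for value in sorted(candidates):
        if evaluate(value) >= max_failure_rate:
            break  # stop at the first setting that fails validation too often
        best = value
    return best
```

Stopping at the first failing value is a design choice that saves training runs; it assumes the failure rate grows with the setting, which may not hold for every dataset.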
Finding the ideal optimizer
For our first experiment, we train 200 models on the first 100,000 records of Netflix data and bake off two popular optimizers supported by TensorFlow and Keras: ‘RMSprop’ and ‘Adam’. Our goal is to find out which optimizer best minimizes the number of generated records containing invalid structure or data.
In the results above, the size of the circles corresponds to the number of epochs in training [1–100]. As each epoch requires an additional pass over the data, higher epoch counts during training reduce the privacy guarantees for the model. In our experiment, we can see that the tf.keras.optimizers.RMSprop optimizer with learning rates between 0.001 and 0.01 handily outperforms tf.keras.optimizers.Adam for this task.
# Results: Winner is RMSprop
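To see why more epochs cost privacy, it helps to count the noisy gradient updates a training run performs, since each update spends some of the privacy budget. The sketch below is plain arithmetic and not a privacy accountant; the helper names are hypothetical, and the batch size of 64 in the usage note is an assumption that does not come from the experiments.

```python
import math


def dp_sgd_steps(n_records: int, batch_size: int, epochs: int) -> int:
    """Number of gradient updates performed over `epochs` passes of the data."""
    steps_per_epoch = math.ceil(n_records / batch_size)
    return steps_per_epoch * epochs


def sampling_rate(n_records: int, batch_size: int) -> float:
    """Fraction of the dataset used in each update."""
    return batch_size / n_records
```

With 100,000 records and an assumed batch size of 64, `dp_sgd_steps(100_000, 64, 1)` is 1,563 updates, and 100 epochs means 156,300 updates over the same records.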
In our second test, we try different learning rates. The higher the learning rate, the more impact individual training steps have on the model. According to TensorFlow Privacy’s documentation³, “if the updates are noisy (such as when the additive noise is large compared to the clipping threshold), the learning rate must be kept low for the training procedure to converge”. For this example, we will try a variety of learning rates to find the right balance between learning rate and model accuracy.
In the test above, we can see that learning rates in the range of 0.000799 (slowest, most accurate with only 1% of records failing and averaging 99 epochs) to 0.050241 (fastest, averaging 19 epochs, least accurate) generated models that passed our success threshold, with fewer than 30% of generated records failing validation. However, below a certain learning rate (0.00006 in our tests), the model hits our test maximum of 100 epochs and never converges, resulting in a high number of failed records.
# Results: Default learning_rate of 0.0015
The l2_norm_clip setting is the threshold used for gradient clipping: the cumulative gradient across all neural network parameters is clipped so that its L2 norm is at most the value of l2_norm_clip⁴.
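As a rough illustration of what clipping does, the sketch below rescales a flattened gradient so that its L2 norm never exceeds the threshold. It is a plain-Python stand-in for the idea, not TensorFlow Privacy’s implementation, and `clip_by_global_norm` is a hypothetical helper name.

```python
import math
from typing import List


def clip_by_global_norm(gradient: List[float], l2_norm_clip: float) -> List[float]:
    """Scale a flattened gradient so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= l2_norm_clip:
        return list(gradient)  # already within the bound; leave unchanged
    scale = l2_norm_clip / norm
    return [g * scale for g in gradient]
```

For example, the gradient `[3.0, 4.0]` has norm 5, so clipping at 0.5 scales every component by 0.1, while a gradient whose norm is already below 0.5 passes through unchanged.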
On our test dataset, lower L2 gradient clipping settings do not strongly affect model accuracy, so we’ll go with the recommended TensorFlow defaults.
# Results: Default l2_gradient_clip of 0.5
The noise_multiplier governs the amount of noise added during training. According to the TensorFlow Privacy documentation⁴, “Generally, more noise results in better privacy and lower utility. The noise_multiplier generally has to be at least 0.3 to obtain rigorous privacy guarantees, but smaller values may still be acceptable for practical purposes”. The amount of noise that can be injected into a dataset often depends on the quantity of training data available and the variation within that training data. With the Netflix Prize dataset, you generally want at least 100k rows of training data to be able to use a high level of noise. Try experimenting with different datasets and noise multiplier values using the notebook below.
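To make the role of the noise multiplier concrete, here is a minimal sketch of the noisy averaging step in differentially private SGD: per-example gradients that have already been clipped are summed, Gaussian noise with standard deviation noise_multiplier × l2_norm_clip is added to each coordinate, and the result is divided by the batch size. The function name and the plain-list representation are assumptions made for illustration; TensorFlow Privacy operates on tensors inside its optimizers.

```python
import random
from typing import List, Optional


def noisy_mean_gradient(
    clipped_gradients: List[List[float]],
    l2_norm_clip: float,
    noise_multiplier: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Sum clipped per-example gradients, add Gaussian noise, then average.

    Assumes a non-empty batch of equally sized, already-clipped gradients.
    """
    rng = rng or random.Random()
    batch_size = len(clipped_gradients)
    dim = len(clipped_gradients[0])
    # The noise scale is tied to the clipping threshold.
    stddev = noise_multiplier * l2_norm_clip
    summed = [sum(g[i] for g in clipped_gradients) for i in range(dim)]
    return [(s + rng.gauss(0.0, stddev)) / batch_size for s in summed]
```

Because the noise is divided by the batch size along with the gradient sum, larger batches dilute it, which is one reason more training data makes higher noise levels tolerable.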
Putting it all together
We just ran several thousand experiments to optimize our differential privacy settings for our synthetic data model. Woohoo! Try generating a synthetic version of the Netflix challenge dataset yourself with our notebook, or switch out the CSV and try running on your own dataset.
Check out an example Jupyter notebook to generate a synthetic dataset with differential privacy guarantees using gretel-synthetics.
At Gretel.ai we are super excited about the possibility of using synthetic data to augment training sets to create ML and AI models that generalize better against unknown data and with reduced algorithmic biases. We’d love to hear about your use cases. Feel free to reach out to us for a more in-depth discussion in the comments, on Twitter, or at email@example.com. Follow us to keep up on the latest trends with synthetic data!
Interested in training on your own dataset? Gretel-synthetics is free and open source, and you can start experimenting in seconds via Colaboratory. If you like gretel-synthetics, give us a ⭐ on GitHub!
¹ Wikipedia, https://en.wikipedia.org/wiki/Netflix_Prize
² Arvind Narayanan, Vitaly Shmatikov, How To Break Anonymity of the Netflix Prize Dataset
³ TensorFlow Privacy Classification Tutorial
⁴ TensorFlow Privacy README
⁵ Laurens van der Maaten, Awni Hannun, The Trade-Offs of Private Prediction