How to create differentially private synthetic data
In this video, Alex walks you through creating a differentially-private synthetic dataset and how to determine the quality of that data.
Hey, this is Alex from Gretel. Today, we're going to walk through an example of creating differentially private synthetic data. We'll then compare that example against both the real-world data it was based on and another set of synthetic data that's not using differential privacy. We'll look at some of the privacy benefits and the effect on accuracy for the downstream models.
To follow along, go ahead and connect to docs.gretel.ai. From there, I'm going to click on SDK notebooks, which walks through a few different examples of using Gretel for different use cases. Look for one called synthetic data with differential privacy, and we'll go ahead and open that in Colab. I'm going to click run all here, and then we'll walk through some of what's happening in the notebook as it starts training the model. There's no need to configure a GPU here. We're going to use the Gretel cloud service to do the model training for us.
So we're going to start off by installing the Gretel client and then download a dataset that we're going to use for synthesis today. Here we can see it contains the hour of the day, an e-scooter (so a bike) ID, and latitude and longitude coordinates for source and destination. The exercise we're going to do today is that we're going to insert secrets, so very precise latitudes and longitudes, into this dataset. And we're going to see how many times we have to repeat them before the model starts to memorize that data.
So I'll head over to Gretel, grab my API key, paste it in here and click go. Here, we get a preview of the dataset, so we can see different source latitudes and longitudes. These are very precise location coordinates. What we're simulating here is this idea: how many times would an individual user's address, for example, need to be repeated here before the model memorizes that address and replays it?
So the goal of synthetic data is really to learn about locations and patterns and things like that. We're not trying to learn about individual users when we think about things from a privacy perspective. So we're trying to make sure that the model here creates a new dataset that is representative of what it was trained on, but doesn't actually point to any real human's data.
To measure this, we have a function here called create canaries. What that does is it takes a set of secrets. Here we've created some totally fake latitudes and longitudes, and these are our secrets. We want to see how many times we need to repeat these in the input data for them to exist in, or be repeated by, the model when we ask it to create a new dataset. So for each row here, we can see we're sampling. Let's go ahead and see how many rows we're running. It's about 270 rows, I think.
For each row, what it's going to do is insert one of these secrets. I can see 63 being repeated right here, and again right here. So you see it in these different rows where we're inserting secrets, and it's doing it at different frequencies. Here we have this 85 number being repeated 5% of the time and the 63 number in 50% of records, so we should see that one quite often. We're combining that with our training data, and we're going to go ahead and start training our model. Here we see the Gretel configuration. We've included all the different config options we're using here. Not all of these are necessary, but it helps just to walk through them. So we're loading with our default configuration.
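The canary insertion described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual `create_canaries` function: the secret values, the weights, the column names in `LOCATION_FIELDS`, and the `fraction` parameter are all assumptions made for the example.

```python
import random

# Hypothetical secrets and relative frequencies (not the notebook's values).
SECRETS = ["85.31243", "80.71705", "84.98992", "63.20242"]
WEIGHTS = [0.05, 0.15, 0.30, 0.50]
# Assumed column names for the source and destination coordinates.
LOCATION_FIELDS = ["src_lat", "src_lon", "dst_lat", "dst_lon"]


def create_canaries(rows, secrets, weights, fraction=0.01, seed=0):
    """Sample a fraction of rows and overwrite one location field with a secret.

    Returns new rows (the canaries); the input rows are left unchanged.
    """
    rng = random.Random(seed)
    n_canaries = max(1, int(len(rows) * fraction))
    canaries = []
    for row in rng.sample(rows, n_canaries):
        canary = dict(row)  # copy so the original record is not modified
        # Pick a secret according to its relative frequency.
        secret = rng.choices(secrets, weights=weights, k=1)[0]
        canary[rng.choice(LOCATION_FIELDS)] = secret
        canaries.append(canary)
    return canaries
```

The canaries would then be appended to the training rows before the model is trained, so that each secret appears in the input at a known frequency.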
Let's talk about some of the things we're doing that are different here for this initial example. Here we are training a model with differential privacy equals true. We're also using what I would consider to be a pretty minimal amount of noise. For differential privacy, we use a technique called DP-SGD. Essentially, what we're doing is inserting noise into the optimizer as it's training a neural network to learn to recreate this dataset. That has the effect of preventing the model from learning secrets in the data. We want the model to learn patterns. We don't want the model to learn or memorize specific records in the training data. So we're inserting, I would say, a minimal amount of noise here. You can turn this up based on your privacy concerns, but we'll see what kind of effect this has on the overall dataset.
The second thing we're doing is called L2 norm clipping. We are clipping the gradients in the optimizer, and what that does is prevent the optimizer from learning too much at each step. So both of these really intentionally build privacy directly into our model.

To make sure we understand some of the privacy benefits here, we're turning off privacy filters. Privacy filters are technology that we've built at Gretel that is meant to identify and remove records that could create risk from a privacy perspective for known attacks: membership inference, record linkage attacks, things like that. We're going to turn those off at first, and we're going to look at the effect that DP has running all by itself on this input dataset.
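To make the two DP mechanisms mentioned above (gradient clipping and added noise) concrete, here is a minimal pure-Python sketch of a single DP-SGD update. It is not Gretel's implementation; the function names and parameters (`clip_l2`, `dp_sgd_step`, `l2_norm_clip`, `noise_multiplier`) are assumptions chosen for illustration.

```python
import math
import random


def clip_l2(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, l2_norm_clip, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    # 1. Clip per-example gradients so no single record dominates the update.
    clipped = [clip_l2(g, l2_norm_clip) for g in per_example_grads]
    # 2. Sum the clipped gradients coordinate by coordinate.
    summed = [sum(column) for column in zip(*clipped)]
    # 3. Add Gaussian noise whose scale is tied to the clipping bound.
    sigma = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # 4. Average over the batch and take an ordinary gradient step.
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


# Usage: two examples, two parameters, a small amount of noise.
rng = random.Random(0)
new_weights = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1,
    l2_norm_clip=1.0,
    noise_multiplier=0.01,
    rng=rng,
)
```

Because the noise scale is proportional to the clipping bound, raising the noise multiplier strengthens privacy at the cost of noisier updates, which is the privacy versus accuracy trade-off discussed in this walkthrough.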
You can see the configuration that we're training with here. Let's go ahead and start training. DP does require a little bit of extra compute versus what we need when we're just running regular synthetics on a GPU, and you'll notice some of that as the model is training here. I won't make everyone wait through this entire thing. We'll go ahead and jump to the results of models that have already been trained. If you want to experiment with your own settings, I would encourage you to try changing some of the different settings: try turning DP on and off, try adding outlier and similarity filters. And at the end here, we're going to see how often those secrets are repeated in a generated dataset.
Here we can see the project that we created. Let's go ahead and check in. So here we have a new project called rideshare DP model. We'll go into the Gretel console here and search on rideshare. There's our project, and you can see that we just started training a new model here: "hilarious awesome troll." We'll take a look at that. It's in the process of training right now. I would expect this to take probably about 20 minutes to train. So we're going to go ahead and jump over to a trained model and look at the results. We're going to start here first.
The first thing I did for this was run a version of the exact same notebook we had, but with differential privacy disabled. As you can see here, DP is turned off, and I also disabled all of the filters. So the question is: just creating synthetic data by training a model, what is the chance that it's going to memorize and replay these secrets? Let's go ahead and take a look at the results here.
So here are our four secret values that we created. We can see that the 85 value was inserted 14 times, and the value for the 63.20242 latitude coordinate was inserted 145 times into the dataset. So it is natural that at some point a model will learn and start repeating things that it sees that frequently in the data. With privacy filters off and differential privacy off, so just running a language model on this data, we can see that the secret that existed 14 times was not recreated. The secret that was inserted 145 times was repeated six times by the model. In some cases that's not acceptable. In some cases it's perfectly acceptable. So it's really about finding the right balance between accuracy and privacy that you're looking for with your model.
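The check described above, counting how often each inserted secret shows up in the training data and in the generated data, can be sketched as below. This is an illustrative helper, not the notebook's code; `count_secret_hits` and the record layout (a list of dicts) are assumptions.

```python
def count_secret_hits(records, secrets):
    """Count, for each secret, how many records contain it in any field."""
    counts = {secret: 0 for secret in secrets}
    for record in records:
        # Compare as strings so numeric and text columns are treated alike.
        values = {str(value) for value in record.values()}
        for secret in secrets:
            if secret in values:
                counts[secret] += 1
    return counts


# Usage with made-up records: compare training insertions to synthetic repeats.
secrets = ["85.31243", "63.20242"]
training = [{"src_lat": "63.20242"}, {"src_lat": "63.20242"}, {"src_lat": "85.31243"}]
synthetic = [{"src_lat": "63.20242"}, {"src_lat": "34.05112"}]
inserted = count_secret_hits(training, secrets)
repeated = count_secret_hits(synthetic, secrets)
```

A secret with a non-zero count in the synthetic data is one the model memorized and replayed.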
Let's look at the next example here. A completely default run with Gretel has privacy filters enabled by default. So let's see what happens running default Gretel. Here we can see that none of these initial secrets were repeated with the privacy filters turned on, all the way up until the end here, where the 63.20242 value, which was inserted in this case 133 times, was repeated four times in the final dataset.
So now, what we came here to do: let's go ahead and look at the results of differential privacy and privacy filters being run on this dataset. We'll go down here and look at the results. As we were hoping and expecting, none of these secrets were repeated from the input data. So anywhere from the secret that was repeated 20 times to the one that was repeated 129 times, none of these were repeated in the output of the model, which is really encouraging.
If you are concerned about addresses, credit cards, social security numbers, types of data like that, which users maybe put into a free-form text field or something like that where they shouldn't exist, we're giving you some guarantees and some levers you can turn to make sure that those things don't get repeated downstream by your machine learning models.
So, really encouraging results here. Let's take a look at what effect differential privacy and these settings have on the accuracy of the model. One of the things that happens every time a synthetic model is created with Gretel is that we run what's called an accuracy report. Here's the case where differential privacy is not enabled, and we're running a standard vanilla Gretel synthetics model. You can see it trained pretty quickly here: it trained in 18 minutes on a GPU in the Gretel cloud. The quality score is pretty good. Here, we're looking at an 88 out of 100.
Let's go ahead and take a look at what that means. I'd like to look at the report, which gives us a nice feel for why we gave it this 88 out of 100 score. So I'll go ahead and download that, open it up, and take a look here. None of the training lines are duplicated in the output data. We can see what was turned on: our privacy filters were on, and differential privacy was off. What you see here is that the correlations that exist in the input data were learned very well. So there's very little separation in the correlations, with similar patterns here. And this is where you really see the difference with differential privacy: when we look at the PCA graph, the outliers and things like that which existed in the input data were largely recreated in the synthetic data as well. So this default run gives you really good accuracy and good privacy protections, but not as good as you'll see when we run differential privacy. We also see that the distribution of each individual field here is very close to what it was in the training data. So this is something that gives you a warm and fuzzy feeling that your synthetic model has learned the insights and patterns in the original input data very well and is able to recreate them.
Okay, let's go ahead and take a look at what impact it has when we are running differential privacy. Let's back out here and go to our rideshare datasets. Here's one we created which we were running differential privacy on. We'll take a look at the config in a moment, but you can see the quality score has dropped pretty significantly. Privacy protections: excellent. Training took a little bit longer because the methods that we use to insert noise and clip gradients slow down the model a little bit. Let's go ahead and take a look at the configuration that was used. It's the same configuration we just previewed: differential privacy is on, with a minimal amount of noise being inserted. I'd note that this dataset is only about 20,000 lines long. Typically with differential privacy, you really start to see the privacy benefits and accuracy when you have a much larger dataset, let's say 50,000, a hundred thousand, or even a million rows of data. But for this example, we are starting with something a little bit smaller.
Now let's go ahead and take a look at the report and see whether we can find any differences, other than this synthetic quality score being lower, that would help us understand why this model is performing differently than our vanilla model trained on the standard Gretel config. Privacy protections: excellent. Once again, here we've got a 79 out of 100 score. No data was repeated. Let's take a look here and start seeing what's different. The correlations, especially when you look at this difference graph, aren't quite as strong as what we had previously. Before, you saw very little difference at all. Here, you can see the differences get a little bit stronger as we go down. This is where you really start to see the difference: the outliers that exist in the input data, which you can see in the PCA distribution graph, are no longer there in the synthetic data. So it did learn the patterns, and it's recreating the overall insights of the original data pretty well. We see that even when we look at the field-to-field distribution graphs. They're not as close as they were before, but they're still close. So the model is still learning and recreating those insights, but you're getting much stronger privacy protections.
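To get a rough feel for what "close" means when comparing a field's distribution in the real and synthetic data, one common measure is the Jensen-Shannon divergence between the two sets of value frequencies. The sketch below is only an illustration of that idea; the transcript does not say which metric Gretel's report uses, and `js_divergence` is a hypothetical helper.

```python
import math
from collections import Counter


def js_divergence(real_values, synth_values):
    """Jensen-Shannon divergence (log base 2) between two samples of one field.

    0.0 means identical value frequencies; 1.0 means no values in common.
    """
    if not real_values or not synth_values:
        raise ValueError("both samples must be non-empty")
    p_counts, q_counts = Counter(real_values), Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    total = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts[key] / n_p  # frequency in the real data
        q = q_counts[key] / n_q  # frequency in the synthetic data
        m = (p + q) / 2          # frequency in the mixture of the two
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


# Usage: compare a categorical field such as hour of the day.
real_hours = [8, 8, 9, 17, 17, 18]
synth_hours = [8, 9, 9, 17, 18, 18]
score = js_divergence(real_hours, synth_hours)
```

A lower score for each field would correspond to the tighter field distributions seen in the non-DP run, and a somewhat higher score to the looser fit seen with differential privacy enabled.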
I hope this has been useful. Please try going through it yourself: run this notebook and try changing any of these different settings. We've laid them all out, so it's really easy to change things. Try changing the number of reports, try changing the noise multiplier or the clipping, or even the number of secrets that are inserted into your data. Take a look at how that compares against the settings where you're not using differential privacy, and try to find that right balance between privacy and accuracy that you're looking for. Cheers.