Synthesize 2023: Leveraging Privacy-Enhancing Technologies with Large Foundation Models
A Deep Dive on PETs and Foundation Models
Peter Kairouz (00:57):
I'm Peter. I'm a researcher at Google and I'm very excited to be at Synthesize with all of you. Today I'll be presenting on privacy-enhancing technologies with large foundation models. Before we jump into the privacy part, let's take a look at foundation models, which are becoming an increasingly integral part of our lives. These are very large models that can operate on many modalities of data, from text to images to speech to structured data and 3D data, and so on and so forth. And they can really help us with a lot of things.
We've seen it recently with ChatGPT and others for question answering, but they can also be used for sentiment analysis, information extraction, image captioning, object recognition, instruction following, and so on and so forth. They unlock so many opportunities and so much potential. There are a few of them from Google, from OpenAI, and from many other companies, whether startups or big tech companies.
These models do come with some risks, many of which are somewhat understood and many of which are currently being researched. Some of these risks include the factuality or correctness of the answers that they give, the inequity or unfairness that they may sometimes create, and the possibility of being abused or misused. There are also safety questions, and security and privacy concerns. And last but not least, copyright and intellectual property.
Obviously in the talk we cannot tackle all of them, but in this one we're going to focus on privacy. So let's talk a bit about privacy, especially in the context of these large foundation models and let's see if there is a concern to begin with. In fact, one can show, and this is from researchers from Google and academia, where they showed that you can extract personal information from large language models by cleverly querying them. So you can see on this figure extracted from a paper from late 2020, we can retrieve very personal information, an individual's name, email address, phone number, fax number, physical address, so on and so forth.
Now, keep in mind that GPT-2 was actually trained on web data and that's probably okay, but nevertheless, we could extract information about very specific individuals. These are not even celebrities. You may wonder, is this also the case for diffusion models? Can we now extract information from these types of image models? And there's very recent research from just last week that shows that you can indeed, by cleverly prompting these models, search and retrieve certain images that were used during training.
Again, this is a setting where we're retrieving images of popular things including celebrities, icons and figures. And so we're not really making claims about some privacy violations, but nevertheless, the fact that these models are overfitting and memorizing certain training examples is something that could raise privacy questions or that we need to be a little bit aware of.
We talked a lot about these image models and language models. Does this happen with structured data? In fact, it does. The US Census Bureau releases statistics about 300 million plus people and a lot of variables; we're talking 1.8 billion variables. This is all structured data in the form of contingency tables, with information such as location, sex, age, race, ethnicity, and so on and so forth. In total, we have about 100 billion statistics released publicly in every census.
What we could show is that because we have many more equations than variables, we can solve to find some of the private variables. So for instance, we can show that about 46% of the records in the original data set can be reconstructed exactly, and about 70% up to ±1 in terms of estimating the age of individuals. Now, this is all to say that every little value we release, whether it's structured or unstructured, when it gets to scale, every little value leaks some information, and this whole thing adds up.
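To make the "more equations than variables" point concrete, here is a toy sketch in Python. It is not the Census Bureau analysis the talk refers to; the function name, the block of three people, and the released statistics (count, mean, median, and age range) are all hypothetical values chosen for illustration. It simply enumerates every set of ages consistent with the published numbers.

```python
from itertools import combinations_with_replacement

def consistent_age_sets(count, mean, median, age_range, max_age=100):
    """Enumerate all sorted age tuples that match the released statistics.

    Hypothetical toy setting: a block with an odd number of people, for which
    the count, mean age, median age, and (oldest - youngest) are published.
    """
    solutions = []
    # combinations_with_replacement yields non-decreasing tuples, so each
    # candidate is already sorted by age.
    for ages in combinations_with_replacement(range(max_age + 1), count):
        if sum(ages) != mean * count:
            continue
        if ages[count // 2] != median:
            continue
        if ages[-1] - ages[0] != age_range:
            continue
        solutions.append(ages)
    return solutions

# Four innocuous-looking aggregates about a three-person block.
solutions = consistent_age_sets(count=3, mean=30, median=30, age_range=44)
print(solutions)
```

In this toy case only one set of ages satisfies all four aggregates, so the "private" ages are fully determined by statistics that each look harmless on their own.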
So these are the risks of very large models, whether we are talking about synthetic data in terms of contingency tables for structured data, all the way to image and language data. Now, how do we go about reasoning about privacy in this case? The first and foremost thing that I want to make very clear is that privacy is a very multifaceted object. It's very complex, it has so many different axes, and it means so many different things to different people, but we're currently converging towards some guiding principles that address separate concerns. In this talk, I'll focus on a few of those principles, but just to give you an overview, there are, at a high level, about four privacy principles.
Some are for the user, others are for the platform, and yet others are for the verifiers. You can learn more about this in our recent Communications of the ACM paper titled Federated Learning and Privacy.
Let's begin with the first one, which is that the user has transparency and some user-centric control over what data is used and for what purpose it is used. For example, since we're talking about foundation models, a lot of these foundation models are actually trained on public or web data. You've scraped the web, Twitter, Wikipedia, et cetera, and you've trained these models, big language models or image models. Is that okay? Just because you know the data is on the web, does it mean that these models are allowed to be trained on it? Maybe in some cases yes, because we're training on data that's benign such as Wikipedia articles.
But maybe in other cases, if somebody makes a tweet and says something that's a little controversial, they don't want the information of that tweet to be captured by ChatGPT and then be broadcast to hundreds of millions if not billions of people at scale. Finally, the user should have the right to say, I want to be forgotten, and this is captured in GDPR. So we should be able to ask the model trainer to remove the data that is personal from it at some point in time. This is known as the right to be forgotten.
So moving on, we can talk about the second privacy principle, which is data minimization. And this is about only collecting the data that is needed for a specific computation, limiting access to the data at all stages and processing an individual's data as early as possible and discarding any collected and processed data as soon as possible. So there's early aggregation and minimal retention with focused collection. And on this principle, a technology called federated learning, which is getting a lot of attention, both in academia and in industry, across the different areas of industry, is very helpful for minimizing the data that the server and service providers actually have access to.
So in FL, you have a number of client devices. Those can be computers or phones, or they can be institutions, hospitals, banks, and so on and so forth. Under the orchestration of a service provider, they collaborate to train a joint model on the collection of data that's actually found on each client device. And they do so without ever transferring or exchanging the data amongst each other. So the server is orchestrating the model training, and the learning is actually happening on the edge. So instead of pushing the data to the cloud and centralizing it all in one place, leading to potential risks, now you actually push the model to the edge, you train the model where the data resides, and you only send back updates that have been privatized.
So you can see on this figure that federated learning really gives minimized data exposure, because we don't give the data to the service provider. We only send those focused updates, which are updated models that we've trained on the edge, and then we aggregate those updates as soon as they arrive at the server. This is known as immediate aggregation, and we discard any metadata related to the identities of the client devices that have participated in the training. These are anonymous, ephemeral updates.
And finally, once we obtain a model that we like, we do only a single deployment at scale at the end of the day. So there's data minimization at all stages of the pipeline. Now we could strengthen data minimization using cryptographic and secure hardware technologies. Encryption at rest or on the wire ensures that anybody spying on the communication link between the server and those client devices will not be able to inspect the messages or see their contents, that is, the model parameters that are being sent from the cloud to the devices and back from the devices to the cloud.
Further, if we use multi-party computation and cryptographic primitives, we could ensure that the server only sees those updates in aggregate, as opposed to the individual messages that are coming from single devices. And this whole thing can be done in secure enclaves or trusted execution environments to give remote attestation power to the clients that indeed the protocol is being followed. Let's talk about one such specific technology, which is known as secure aggregation, which Google researchers helped develop, along with many academic researchers and other researchers.
So secure aggregation is a cryptographic protocol. It's a multi-party computation protocol that allows the server to obtain the sum of high dimensional vectors. These are representations of the model updates that the clients are sending back to the server, and it does so in a way that ensures, and this is cryptographic, that the server only learns the sum and no other information whatsoever about the individual vector that's being sent back from the client to the server.
There are a lot of details about secure aggregation in terms of how it is implemented. It's a complex protocol. I'm going to leave it out of the picture for now, and we could assume that we have this interesting magic box that we can use in order to reduce the information that the server can see and only allow the server to see aggregated information across a large cohort of clients in every training round of the model in a federated learning system. This is one example.
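For readers who want a peek inside the magic box, the core cancellation idea can be sketched with pairwise masks. This is a heavily simplified illustration, not the actual secure aggregation protocol: the real protocol derives the shared masks through key agreement and uses secret sharing to survive client dropouts, none of which appears here. The modulus, function names, and integer-valued updates below are all assumptions made for the sketch.

```python
import random

MODULUS = 2 ** 32  # assumed: all arithmetic is done modulo a fixed value

def pairwise_masks(num_clients, dim, seed=0):
    """Build one mask vector per client such that all masks sum to zero.

    For every pair (i, j) with i < j, a shared random vector is added to
    client i's mask and subtracted from client j's mask. In a real system the
    pair would derive this vector from a shared secret; here one seeded
    generator stands in for that step.
    """
    rng = random.Random(seed)
    masks = [[0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            shared = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS
    return masks

def mask_update(update, mask):
    """What a client actually sends: its update plus its mask, mod MODULUS."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]

def server_sum(masked_updates):
    """The server adds up the masked vectors; the pairwise masks cancel."""
    dim = len(masked_updates[0])
    return [sum(v[k] for v in masked_updates) % MODULUS for k in range(dim)]

updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # toy integer updates
masks = pairwise_masks(len(updates), dim=3)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
total = server_sum(masked)
print(total)
```

Each masked vector on its own looks like random noise to the server, yet the sum of the masked vectors equals the sum of the true updates.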
Now moving on, we can talk a little about the third privacy principle, which is data anonymization. This is about making sure that the released output of any computation, especially when we train these foundation models, does not reveal anything unique to an individual. So we care about aggregate statistics, including the model parameters, and we want to be able to release them to engineers and scientists or beyond, but we want to make sure that they do not vary significantly based on whether or not any particular user's data was included during training in the aggregation process. That's the principle.
And in this realm, differential privacy is the gold standard. It says that if you compute a query on a dataset to get a model output, where the query is the whole training algorithm that we use in order to train a certain machine learning model, and if you do the same on an adjacent dataset D-prime, where these two datasets differ by one unit X that I'm going to get to in a bit, then these two outputs, the models that were computed on datasets D and D-prime, should in theory look a lot alike, statistically. If they don't look a lot alike, if they are actually very different from one another, then the presence or absence of that unit X is going to allow us to distinguish between them. But we want them to be very close, so that if we change one unit X, we don't cause drastic differences in the model that's being trained.
And that's why we need randomized training. And differential privacy is captured via a parameter called epsilon. And the smaller this epsilon, the better the privacy guarantee is because we're ensuring that these two models, the one that's computed on D and the one that's computed on D-prime, are actually closer in statistical measure to one another.
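Written out, the standard definition behind this description is as follows. Here M denotes the randomized training algorithm, D and D' are adjacent datasets that differ in one unit X, and S is any set of possible outputs. The symbols M and S are notation introduced for this note, and δ is a small slack probability; setting δ = 0 recovers the pure epsilon guarantee.

```latex
\Pr\big[\,M(D) \in S\,\big] \;\le\; e^{\varepsilon}\,\Pr\big[\,M(D') \in S\,\big] + \delta
\qquad \text{for all adjacent } D, D' \text{ and all output sets } S .
```

When ε is close to zero, e^ε is close to one, so the two output distributions are forced to be nearly identical, which is why a smaller epsilon means a stronger guarantee.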
Now let's talk a little bit about this unit X, and what changing one X from D to D-prime really means. It could mean a lot of things. If we're training a language model, it could mean that we change one token, and this is the only difference between two neighboring or adjacent datasets, D and D-prime. It could mean that we change one example, which is a sequence of tokens that form a row in a training batch. That's the standard thing that's used in many current ML infrastructures and algorithms that implement differential privacy.
But it could mean other things and it could take other units. For instance, it could mean changing a full paragraph, going from one data set to the other, changing many different sequences of tokens in a training batch. It could mean changing the full document or it could mean changing the entire user data or even the entire organizational data. As you can see, as we go down this list of units, we're actually strengthening exponentially the privacy guarantee that we're giving while training.
Now in the context of federated learning, we typically use the notion of user level DP because the unit that we believe should be protected is the client or the user. So one thing that we want to emphasize, is that nowadays with all the experience of training with differential privacy, the question of whether or not training with DP is actually viable, is becoming less and less exciting. Because now that we have very solid understanding of DP, we know that rather than asking the question of whether or not DP will work, we should just ask instead, how large should the population of users or how large should the dataset be or how much compute am I willing to spend in order for me to get the right accuracy under strong differential privacy guarantees?
And this is mostly because the more computations you can afford to perform and the larger the dataset you have, the better the accuracy becomes under a certain fixed privacy constraint that you apply. And we'll see an example later in this talk. So let's see how differential privacy is applied to federated learning. First, there's this very popular algorithm that's called federated averaging, in which, when the model is sent to the device, the training happens on the device.
You can see here that I'm running multiple steps of stochastic gradient descent in order to update the model locally. All these operations happen locally. And then the only thing we send at the end of the day is the delta between the model that's received and the final updated model on device. And the server aggregates these deltas across a number of client devices that participate in the cohort, and it uses this aggregate in order to update its own model, and then it goes in rounds until the loss converges.
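The round structure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production system: the model is a tiny linear regressor, every client participates in every round, and the function names, learning rate, and toy data are all hypothetical.

```python
def local_update(weights, data, lr=0.1, steps=5):
    """Run a few passes of SGD on one client's data and return only the delta.

    Assumed toy model: linear regression y ~ w . x with squared error loss.
    """
    w = list(weights)
    for _ in range(steps):
        for x, y in data:
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    # The client sends the difference between its final and received model.
    return [new - old for new, old in zip(w, weights)]

def federated_averaging(weights, clients, rounds=50):
    """Server loop: broadcast the model, average the deltas, apply, repeat."""
    for _ in range(rounds):
        deltas = [local_update(weights, data) for data in clients]
        average = [sum(d[k] for d in deltas) / len(deltas)
                   for k in range(len(weights))]
        weights = [w + a for w, a in zip(weights, average)]
    return weights

# Three clients whose private data all follow y = 2 * x.
clients = [
    [((1.0,), 2.0), ((2.0,), 4.0)],
    [((3.0,), 6.0)],
    [((0.5,), 1.0), ((1.5,), 3.0)],
]
final = federated_averaging([0.0], clients)
print(final)
```

The raw `(x, y)` pairs never leave `local_update`; only the deltas reach the server loop, which is the data minimization property discussed earlier in the talk.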
So you can do thousands of rounds during the course of training. Now if you want to add DP into the picture, all you have to do here is, after you are done training locally on the device, you take the updated model and you clip it in the L2 sense. You clip the L2 norm of the model update. This ensures that we're limiting a user's contribution; it bounds the sensitivity of the model to any individual user. And then once we've aggregated the updates across these clients, we're going to add noise to make sure that the model on the server does not overfit to any user's update. Usually we use Gaussian noise in this process.
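The clip-then-add-noise step can be sketched as follows. This is a simplified illustration and not the production implementation: the function names and the `noise_multiplier` parameter are assumptions, and the sketch says nothing about how the resulting epsilon would be accounted for.

```python
import math
import random

def clip_l2(delta, clip_norm):
    """Scale a client's delta down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in delta))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in delta]

def dp_aggregate(deltas, clip_norm, noise_multiplier, rng):
    """Clip each delta, sum them, add Gaussian noise to the sum, then average.

    Because every clipped delta has norm at most clip_norm, one client can
    move the sum by at most clip_norm, so the noise standard deviation is
    set relative to that bound (assumed here: noise_multiplier * clip_norm).
    """
    clipped = [clip_l2(d, clip_norm) for d in deltas]
    dim = len(deltas[0])
    total = [sum(c[k] for c in clipped) for k in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(deltas) for v in noisy]

rng = random.Random(0)
deltas = [[3.0, 4.0], [0.1, -0.2], [-0.5, 0.5]]  # toy client deltas
update = dp_aggregate(deltas, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(update)
```

Clipping and noise pull in opposite directions: a smaller `clip_norm` means less noise is needed but distorts large updates more, which is one reason larger cohorts help accuracy at a fixed privacy level.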
Crucially, in this step, it says that in order to give good privacy guarantees, each device or client should participate with a probability q. And obviously the smaller this probability q, the better the privacy is. It gives a boost in the privacy level that we can give. And this is natural, because if I don't participate very often, and statistically speaking, if my probability of participation is very low, then the privacy that I should be getting is actually quite high.
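The size of this boost can be illustrated with the classical amplification-by-subsampling bound for pure epsilon-DP: if each unit is included independently with probability q (written q in this note), an epsilon-DP mechanism run on the subsample satisfies ln(1 + q(e^epsilon - 1))-DP. This is a textbook bound offered as a sketch; it is not necessarily the accounting method used in the work described in this talk, and the function name is hypothetical.

```python
import math

def amplified_epsilon(epsilon, q):
    """Classical subsampling bound: ln(1 + q * (exp(epsilon) - 1)).

    epsilon: guarantee of the mechanism when run on the sampled data.
    q: probability with which each unit is included in the sample.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability in [0, 1]")
    # log1p and expm1 keep the computation accurate for small values.
    return math.log1p(q * math.expm1(epsilon))

for q in (1.0, 0.1, 0.01):
    print(q, amplified_epsilon(1.0, q))
```

For small q the amplified epsilon is roughly q times the original one, which matches the intuition that rarely participating clients get a much stronger guarantee.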
However, there's a problem here, because when we train these models in a federated setting, we don't have tight control over the probability with which we can sample clients. This is because the clients can be online or offline at random points in time. The server does not know the whole population. The population may be dynamic, new clients could come in and others go offline, and it's hard to track and sample truly uniformly across the training. And this is not only a problem in federated settings. In fact, it's a big problem even in centralized settings, especially for foundation models that are trained on the web.
How can you ensure that you're sampling records or client data uniformly at random so that you could give this guarantee? You can see here on this plot that I'm showing on the right-hand side, that if we do not assume this type of a constraint, the test accuracy under the exact same epsilon constraint of privacy will become significantly smaller. You can see the dashed line compared to the solid line, how the solid line is significantly lower, especially at levels of privacy that are below eight or five.
In a recent work, a few researchers from Google, myself included, presented the first algorithm that gives state-of-the-art provable differential privacy guarantees without assuming any type of sampling or shuffling of records or client devices in the federated or centralized setting. And in fact, this was used to train a production-grade LSTM. That's a language model for next-word prediction tasks, and it was used for Spanish-language Gboard users. We trained for over 2,000 rounds that spanned six days, and we had 6,500 devices participating in every round, and we made sure that no device could participate more than once in every 24 hours.
We've shown also that the model quality improved over previously trained models. To the best of our knowledge, this was the first production neural network that's trained directly on user data and announced with a formal and strong differential privacy guarantee. And I'll tell you a bit about this differential privacy guarantee. But instead of just giving you the DP guarantee, I want to emphasize that sometimes it's even more important to give you information about the training process itself. For instance, for this DP guarantee, what is the scope of the guarantee itself? Does it apply to all users, to some users, at the record level? What's the unit of privacy?
So in this case, this guarantee applies to all users of on-device data for training that one single model that was deployed in production. And the unit of privacy that we have used is one with respect to zeroing out all of the data of any one device across all of its participations under the scope above. We've assumed that there's a trusted aggregator that implements the DP mechanism faithfully and that the DP guarantee holds for the final model released, as well as for all the intermediate models across all the training rounds of the process. And finally, with respect to the DP guarantee, we achieved an epsilon of 8.9 for a delta of 10 to the negative 10. And the code is online for inspection, for vetting, so on and so forth.
This is a big milestone and we're hoping to build on it to train with even stronger guarantees, meaning smaller epsilons, and even bigger models. Now, one way in which we could quickly strengthen this guarantee is to say that in addition to training with differential privacy, we're going to bring in that secure aggregation box that we used and talked about at the beginning. This means that the server would now only be able to see the aggregate updates in every training round and would have to add the noise to the aggregated update.
In fact, we can go one step further and say that, well, why don't we add the noise on device itself instead of having to rely on the server in terms of adding the noise and implementing the DP mechanism correctly? And this is indeed exactly what we have done in a line of research. It took a lot of work, because we wanted to make sure that the noise that's added locally combines with cryptographic primitives, and we can give a provable strong correct privacy guarantee.
These things need a lot of theoretical analysis for us to be able to give proper strong guarantees and to train these systems at state-of-the-art privacy accuracy trade-offs, and I invite you to check the papers on this slide if you would like to learn more.
Finally, before I close, I want to go back to the main topic of this talk, which is foundation models. Everything we've seen so far is progress towards being able to train these large foundation models with DP, with data minimization and with user-centric controls, but there's a lot more work to be done. For instance, we believe that we will be in a setting where we're not going to train the full big foundation model with hundreds of billions of parameters on a user's device. Even if that user is a hospital or a bank, it's just going to be impossible in terms of compute.
However, we can train a slice of it or an adaptation of it, or try to fine-tune some parts of it for a certain application or task, and that needs to be done with DP. That means combining different modalities and training across different privacy levels and requirements, from what we require for training on the web, versus what we require for training on institution data, versus user data, device data, and so on and so forth. This is currently being studied under the rubric of mixed federated learning, or mixed training, or private-public user data training. And there's a lot of research and interesting work that needs to be done.
Second, using techniques to measure privacy empirically and test these algorithms is extremely important. So we want scalable empirical privacy tests that could add value for measuring both memorization and potential copyright infringement along with any measures of membership inference. And there's a lot of beautiful research that's happening in this space.
And last but not least, there's a lot of work that needs to be done in order to inform us how we could enforce this right to be forgotten. Suppose the user at one point says, hey, you've trained this model, I was okay with it, and you trained it under a certain privacy constraint with certain data minimization technologies and cryptographic technologies, but now I want you to completely forget my data. How can we go back and remove that user's data from the model without having to retrain the model from scratch and spend a lot of energy and time on it?
All right, with that, I'll stop here. I'll be happy to take questions from you. Thanks again for having me, and if you want anything, you can interact with me using my Twitter handle. Thanks a lot.