Teaching large language models to zip their lips

Gretel introduces Reinforcement Learning from Privacy Feedback (RLPF), a novel approach to reduce the likelihood of a language model leaking private information.


Gretel introduces Reinforcement Learning from Privacy Feedback (RLPF), a method that can be used to align large language models (LLMs) to improve generative quality while also making them more privacy-preserving. Language models leaking proprietary data or custom prompts is a problem that's currently plaguing many generative AI applications. We propose RLPF to mitigate some of these issues. We also suggest future directions to reduce bias, discrimination, and other harmful characteristics that might exist in today’s language models.

The problem: privacy leaks in large language models

While language models are rapidly becoming more capable, they can still leak personally identifiable information, proprietary training data, or custom prompts when interacting with users. This undesirable behavior has many causes, but the primary one is the way these models are trained and their tendency to memorize training data.

Some of this leaked sensitive information may have come from internet data used to train LLMs, some or much of which wasn’t intended for training AI models when it was published. Leakage from an LLM is an unforeseen side effect, even though the data was publicly available. The problem of leakage compounds when the data was private or proprietary to begin with, and used to fine-tune a model. Leakage of data that had strict sharing requirements due to regulation, policy, or a desire for privacy can have profoundly detrimental impacts on organizations and individuals. 

One approach for mitigating privacy leaks that has a strong theoretical grounding is to use differential privacy (DP). We have several posts about this technology and are strong believers in its applicability. Applying these techniques to LLMs is an active area of research with somewhat mixed results. Training instability and potential degradation of downstream performance make DP a tricky method to apply directly to LLMs. However, looking past these potential issues, DP training can provide a mathematical guarantee that your model preserves privacy, which is a strong plus.

Considering other approaches to address the problem of leakage, we were inspired by recent work on alignment that uses reinforcement learning (RL) and human preference data to encourage models to be more helpful and less harmful, and asked: 

Can we combine the best privacy measures with the immense alignment success of RL-based methods? 

Reinforcement Learning from Privacy Feedback (RLPF)

We discovered that you can replace human feedback data with concrete measures of privacy and use RL to improve the language model’s capabilities in a multi-task fashion. We call this RLPF since it uses privacy feedback to reduce a model’s tendency to leak sensitive information.

Our initial experiments suggest that reinforcement learning can be used to make language models less likely to reveal private information while still maintaining generation quality. We show this on a text summarization task and present a few benchmark models as a comparison (see results below). 

In traditional reinforcement learning from human feedback (RLHF), human preferences regarding generated text are collected and ranked. A separate model is trained to predict these preferences. This model then provides the reward signal for the original language model. The language model is rewarded when it generates text that the reward model predicts humans would prefer, and penalized when it generates text they would not.

We test our method on a text summarization task using the CNN/Daily Mail dataset. This dataset consists of pairs of articles and highlights. Each highlight is a summary of the main points of the associated article. The goal of this task is for a language model to generate accurate highlights from a given article.

There are a number of ways to reward a model when it generates high quality highlights. METEOR is a common metric used to determine how well an article was summarized. Without diving too deep into specifics, METEOR can be thought of as a measure of word overlap between a generated summary and a reference text. A score near 1 indicates close agreement with the reference, and a score near 0 indicates very little overlap. In practice, good scores are usually around 0.2. It isn’t a perfect measure of summarization quality, but is powerful enough to be useful.
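To make the overlap idea concrete, here is a minimal sketch of an overlap score in the spirit of METEOR. It is a simplification and not the metric used in our experiments: it only matches exact lowercase words, while real METEOR also matches stems and synonyms and applies a penalty for scrambled word order. The function name `unigram_overlap_score` is our own illustrative choice.

```python
from collections import Counter


def unigram_overlap_score(candidate: str, reference: str) -> float:
    """Recall-weighted harmonic mean of unigram precision and recall.

    Simplified stand-in for METEOR: exact word matches only, with no
    stemming, no synonym matching, and no word-order penalty.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each word at most as often as it
    # appears in both texts.
    matches = sum((cand & ref).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    # Recall is weighted 9:1 over precision, as in the original
    # METEOR F-mean.
    return 10 * precision * recall / (recall + 9 * precision)


identical = unigram_overlap_score("the cat sat", "the cat sat")
partial = unigram_overlap_score("the cat sat", "the cat sat on the mat")
disjoint = unigram_overlap_score("hello world", "the cat sat")
```

Because the score rises with every word shared with the reference, a summary that copies a name from the reference is scored higher than one that leaves it out.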

We can use METEOR as a proxy for human preferences (they're highly correlated) in a reinforcement learning loop. At the end of training, our base model will have improved its ability to summarize text. 

However, one issue here is that summaries will often contain private information from the original article. We might want a system that can generate anonymous summaries that don’t reveal sensitive information. METEOR will actually encourage the opposite since it relies on overlap. Therefore, we need to introduce a second reward for privacy. 

We found that you can use a plethora of metrics from natural language processing (NLP) as a reward. You could use the Flesch-Kincaid readability index, a toxicity score, or in our case, a measure of privacy. In the example below, we count the number of names that appear in a summary and penalize the model (negative reward) when the summaries leak names. This is done using a named entity recognition (NER) system. When combined with METEOR to form a reward for reinforcement learning, language models can improve at summarization while simultaneously improving at preserving privacy.

# Penalize any summary that leaks a name; otherwise reward summary quality.
reward = [-1 if ner_score > 0
          else meteor_score
          for meteor_score, ner_score in zip(summary_scores, privacy_scores)]
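For readers who want to run the reward end to end, here is a self-contained sketch under loud assumptions. The `count_names` helper is a hypothetical stand-in for a real NER system (it only checks words against a tiny fixed list), and the quality scores are made-up numbers rather than real METEOR values.

```python
from typing import List

# Hypothetical stand-in for an NER model: a tiny fixed list of names.
KNOWN_NAMES = {"alice", "bob"}


def count_names(summary: str) -> int:
    """Count tokens that match the toy name list (not real NER)."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in summary.split()]
    return sum(t in KNOWN_NAMES for t in tokens)


def combined_rewards(summary_scores: List[float],
                     privacy_scores: List[int]) -> List[float]:
    """Return -1 for any summary that leaks a name, else its quality score."""
    return [-1.0 if ner_score > 0 else meteor_score
            for meteor_score, ner_score in zip(summary_scores, privacy_scores)]


summaries = ["Alice met Bob in Paris.", "Two officials met in Paris."]
summary_scores = [0.31, 0.22]  # made-up quality scores for illustration
privacy_scores = [count_names(s) for s in summaries]
rewards = combined_rewards(summary_scores, privacy_scores)
```

Note that the first summary has the higher quality score but still receives the penalty, because in this formulation any leaked name overrides the quality reward entirely.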

This usage of existing NLP techniques side-steps the expense of collecting human feedback and makes the intended behavior of the system more interpretable. Also, the ability to incorporate disparate feedback from various metrics allows us to potentially train models to mitigate biased, discriminatory, or other harmful language. 

Preliminary results

These results are preliminary and meant to illustrate a path forward, but we’re excited about the promising signs of life we see with this method.

Figure 1: NER scores showing the likelihood of generating highlights with names
Figure 2: METEOR scores showing the quality of summarization for three models


In Figure 1, we see three models with their associated average NER score. A lower score means the model is less likely to generate highlights that contain a person’s name. In Figure 2 we show each model with its METEOR score. A higher value correlates with higher quality summaries of the original article. In both cases, our RLPF-tuned FlanT5-XL model has improved over the baseline model in privacy preservation and summarization quality, and even slightly outperforms the widely used and powerful ChatGPT language model.

These two figures show the performance of each model on each task separately. Our RLPF model was trained on both tasks jointly and improved on both of them over the baseline.

What’s next for RLPF?

We have shown that we can use a measure of privacy and summarization quality to improve the output of a language model using Reinforcement Learning from Privacy Feedback. This is a promising result that could be applied to a variety of problems. One extension of this approach that we’re particularly excited about is reducing biased language in model generations. If this interests you, please do reach out! We’d love to hear from you.



Thanks to Joshua Greaves from Google Brain, who helped craft the reward function used in this research.