Generate Differentially Private Synthetic Text with Gretel GPT

Safely leverage sensitive or proprietary text data for advanced language model training and fine-tuning

Introduction

Organizations possess troves of text data, such as customer call logs, patient interactions with digital health chatbots, legal correspondences, survey responses, and internal company emails. While this data is incredibly valuable for developing external tools or enhancing internal systems, it is often underutilized due to privacy concerns. 

Imagine you’re building a chatbot application using customer call logs. A common approach to using such data is de-identification. 

Original Message:

Hi, this is Pugsley. I need help with order number 1042. It should have been delivered to 123 Maple Street, but I haven’t received anything yet. Might be due to earthquake-related delays.

De-identified Message:

Hi, this is [NAME]. I need help with order number [ORDER_NUMBER]. It should have been delivered to [ADDRESS], but I haven’t received anything yet. Might be due to earthquake-related delays.

While de-identification removes explicit identifiers such as names, order numbers, and addresses, residual phrases and context can still lead to re-identification. For instance, the mention of an earthquake provides a unique context. Further, if Pugsley is active on local forums and shares similar issues about package deliveries, this external information combined with the de-identified chat logs might suffice to re-identify them. If the chatbot accidentally reveals this de-identified message, it would pose a significant privacy problem.

An alternative is to generate ✨differentially private synthetic text✨ that captures the essence of the information contained in call logs without referencing any particular customer like Pugsley. This can be achieved by fine-tuning a large language model (LLM) with differential privacy (DP). DP plays a crucial role in safeguarding the private information of individuals or entities by adding calibrated noise during the learning process. This significantly reduces the risk of exposing unique linguistic patterns or specific contextual details, which can be exploited by adversarial attacks.

An LLM fine-tuned with DP is likely to produce a much less sensitive synthetic message, while still maintaining the insights and distribution of the original data. A differentially private synthetic example might look like this:

Hi, I’m John Smith. Reaching out because I haven’t received order 001 yet due to shipping delays. Can you help me?

Keep reading to learn how to safeguard Pugsley’s privacy with DP! We’ll dive into using Gretel GPT to generate DP synthetic data for privacy-sensitive use cases involving medical notes and private conversations.

New to differential privacy? Check out these resources!

  • Our blog about the concept of DP, which points to several introductory resources and clarifies misconceptions.
  • Our prior research on DP for text generation, where we discuss our algorithmic approach to applying DP to fine-tuning LLMs.

Start creating differentially private synthetic texts now 👇 

TL;DR - Use Gretel GPT to fine-tune your favorite language model, such as a Mistral model, with DP. Simply add the `privacy_params` section to your model config.

models:
  - gpt_x:
      ...
      privacy_params:
        dp: true
        epsilon: 8
        delta: auto
        entity_column_name: user_id

The following parameters enable DP fine-tuning.

  • dp: true activates DP fine-tuning for privacy-sensitive data sources.
  • epsilon ε - Privacy loss parameter for DP. Specify the maximum privacy budget, epsilon, available for model fine-tuning. We use a default of 8.
  • delta δ - Probability of accidentally leaking information. By default, delta is automatically set based on the characteristics of your dataset to be less than or equal to 1/n^1.2, where n is the number of training records. 
  • entity_column_name is an optional parameter. The default value of `null` maintains record-level DP. This ensures that adding or removing any single record in the dataset has minimal impact on the final model, thereby protecting the privacy of individual records. The default setting is suitable for datasets where each record is independent.  This parameter must be specified when the unit of privacy extends beyond a record. For example, if the dataset contains several records linked to an individual, we would specify the column name with individual identifiers here. This maintains user-level DP, which protects all data associated with a unique user or entity, ensuring that even if multiple records pertain to the same user, their collective information is afforded the DP guarantee. 
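
To make the two defaults above concrete, here is a minimal Python sketch. It is not Gretel's implementation: `auto_delta_upper_bound` and `count_privacy_units` are hypothetical helper names, and the toy records are made up. The sketch only evaluates the 1/n^1.2 bound stated above and counts how many units (records or entities) the DP guarantee would be defined over.

```python
def auto_delta_upper_bound(n_records: int) -> float:
    """Upper bound 1 / n^1.2 that the automatic delta is described as respecting."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return 1.0 / (n_records ** 1.2)


def count_privacy_units(records, entity_column_name=None):
    """Number of units protected: records by default, unique entities otherwise."""
    if entity_column_name is None:
        return len(records)  # record-level DP: every row is its own unit
    return len({row[entity_column_name] for row in records})  # user-level DP


# Hypothetical toy data: three messages written by two users.
records = [
    {"user_id": "u1", "text": "Where is my order?"},
    {"user_id": "u1", "text": "Still waiting on the delivery."},
    {"user_id": "u2", "text": "Please update my address."},
]
record_level_units = count_privacy_units(records)
user_level_units = count_privacy_units(records, "user_id")
```

In this toy example, leaving `entity_column_name` unset would treat the two messages from `u1` as independent records, which is why datasets with several records per person call for the user-level setting.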

The effectiveness of DP fine-tuning

Datasets
In this post, we compare DP and non-private fine-tuning with Gretel GPT using two datasets from domains where sensitive information is commonplace. Historically, achieving strong privacy guarantees (e.g., ε=5) with differential privacy while preserving the accuracy of real-world data has required large amounts of training data and compute. In the examples below, we show how powerful LLMs pretrained on large amounts of public data, combined with accelerated training techniques such as QLoRA, enable the generation of high-quality synthetic text from much smaller data volumes, with quality comparable to real-world data.

  1. AGBonnet/augmented-clinical-notes on 🤗 contains clinical notes describing the symptoms, diagnosis and treatment of 30k patients.
  2. alexa/Commonsense-Dialogues on GitHub consists of 9k snippets of everyday conversations between people.

Table 1. Example records from each of our training datasets.

Models

For each dataset, we trained two models: one without DP and one with a DP budget of ε=5 and δ=auto. We used machines with 24GB of GPU RAM. See the appendix for configuration details.

Evaluation Metrics

To assess the quality of synthetic text generated by the models, we use Gretel's Synthetic Text Quality Score (Text SQS). Text SQS is composed of two scores - Semantic Similarity, which measures how closely the synthetic texts resemble the original texts in meaning, and Structure Similarity, which compares the average characters per word, words per sentence, and sentence counts between the datasets. 
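
As a rough illustration of the structural statistics named above, the following Python sketch computes sentence count, words per sentence, and characters per word for a piece of text. It is not Gretel's Text SQS implementation: the function name `structure_stats` is hypothetical, and the sentence and word splitting rules are simplifying assumptions.

```python
import re


def structure_stats(text: str) -> dict:
    """Return sentence count, words per sentence, and characters per word."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive word tokenization: runs of letters, digits, and apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    n_sentences = len(sentences)
    n_words = len(words)
    return {
        "sentence_count": n_sentences,
        "words_per_sentence": n_words / n_sentences if n_sentences else 0.0,
        "chars_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


stats = structure_stats("Hello world. How are you?")
```

Comparing such statistics between the training texts and the generated texts gives an intuition for what a structure similarity score is sensitive to, for example generated records that are systematically shorter than the originals.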

Results for DP with ε=5 vs. no-DP

Models trained with DP can produce synthetic text whose Text SQS is comparable to that of models trained without DP. Our results show that DP can be enabled with little to no loss in data quality.

  • For `augmented-clinical-notes`, the overall Text SQS was exactly the same – the DP model fell slightly short on semantic similarity but produced text that matched the original structure better. [1]
  • For `commonsense-dialogs`, the only shortcoming of the DP model was its structure similarity score; semantic similarity was maintained just as well.

Figure 1. Comparing the results of applying differential privacy on both datasets.

We also reviewed two generated samples from each model for a quick visual check. They closely matched the style and context of the original training samples. This consistency held true for the models trained with and without DP.

Table 2a. Examples of generated records with the epsilon value set to 5.
Table 2b. Examples of generated records without differential privacy.

The privacy-utility trade-off

An advantage of Gretel’s approach versus other DP approaches is that DP is injected during the model's fine-tuning phase when updating weights. This results in a differentially private model that can generate multiple large datasets of synthetic texts without additional privacy costs, allowing unrestricted use for training other models.

The DP guarantee is achieved via the introduction of calibrated noise within the optimizer while fine-tuning the model. [2] However, the noise required to guarantee privacy can deteriorate the model’s ability to consistently generate context-specific text. Below we demonstrate the trade-off between privacy, measured by epsilon, and utility, measured by Text SQS.

Figure 2. The trade-off between privacy, measured by epsilon, and utility, measured by Text SQS.

For both datasets, we fine-tuned models with DP using values of epsilon ranging from 0.1 (extremely high privacy) to 8 (considered an industry standard by companies like Apple) to observe changes in SQS. [3] The plots show how increasing the privacy budget leads to higher quality synthetic data. In fact, epsilons even as low as 1 deliver reasonably high SQS. For your dataset and use case, you can explore different points on this curve with a quick change of your Gretel GPT config.
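
One way to prepare such an exploration is to derive one config per epsilon from a shared base. The Python sketch below only builds config dictionaries that mirror the structure of the appendix configs; it does not submit anything to Gretel. The helper name `configs_for_epsilons`, the base config name `dp_text`, and the intermediate grid values are assumptions for illustration (only the endpoints 0.1 and 8 are taken from the text above).

```python
import copy

# Base config mirroring the structure of the appendix configs; values are illustrative.
base_config = {
    "schema_version": "1.0",
    "name": "dp_text",
    "models": [
        {
            "gpt_x": {
                "pretrained_model": "mistralai/Mistral-7B-Instruct-v0.2",
                "column_name": "text",
                "privacy_params": {"dp": True, "epsilon": 8, "delta": "auto"},
            }
        }
    ],
}


def configs_for_epsilons(base: dict, epsilons) -> list:
    """Return one independent copy of the config per epsilon value."""
    configs = []
    for eps in epsilons:
        cfg = copy.deepcopy(base)  # avoid mutating the shared base config
        cfg["name"] = f"{base['name']}_epsilon_{eps}"
        cfg["models"][0]["gpt_x"]["privacy_params"]["epsilon"] = eps
        configs.append(cfg)
    return configs


# Hypothetical grid: only the endpoints 0.1 and 8 come from the text above.
sweep = configs_for_epsilons(base_config, [0.1, 1, 5, 8])
```

Keeping every other hyperparameter fixed across the sweep makes the resulting quality scores comparable, so differences can be attributed to the privacy budget alone.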

Keep in mind that DP will affect different models and datasets differently, so it is important to experiment with similar public datasets beforehand. Below are some guidelines on how to get the best out of such experiments.

Tips for DP fine-tuning

Fine-tuning language models necessitates careful selection of model training hyperparameters, especially when incorporating noise to ensure DP. Below are some practical guidelines for selecting optimal hyperparameters for DP fine-tuning with Gretel GPT.

Learning rate

Our experience has been that the learning rate is the most important hyperparameter, significantly influencing the rate at which training loss decreases.  Typically, larger batch sizes warrant larger learning rates. Further, we observed that certain model architectures such as Mistral require lower learning rates. When using an effective batch size of 128 with `mistralai/Mistral-7B-Instruct-v0.2`, learning rates between 0.001 and 0.005 work well.

Batch size

We can control effective batch size in Gretel GPT using the `batch_size` and `gradient_accumulation_steps` parameters. In our experiments, setting the product of these two parameters to values greater than 128 helped the model achieve a lower training loss within fewer steps. Typically on GPUs with 24GB of RAM, when using a medium sized base model such as `mistralai/Mistral-7B-Instruct-v0.2` and 512 tokens of the model’s context for training, `batch_size = 32` is the maximum that fits into memory.  Increasing the gradient accumulation steps allows for a larger effective batch size. 

This is in line with guidance in literature, where the benefit of increasing the batch size is attributed to the corresponding almost linear decrease in the standard deviation of the noise added to the average batch gradient. 
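
The arithmetic behind this guidance can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Gretel's code: the function names are hypothetical, and the noise multiplier is held fixed, whereas in practice a privacy accountant may adjust it with the batch size (which is why the decrease is described as almost linear rather than exactly linear).

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of examples contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps


def noise_std_on_mean_gradient(noise_multiplier: float,
                               max_grad_norm: float,
                               effective_batch: int) -> float:
    """Std of Gaussian noise on the averaged gradient in DP-SGD.

    Noise with std noise_multiplier * max_grad_norm is added to the SUM of
    clipped per-example gradients, so dividing by the batch size shrinks it.
    """
    return noise_multiplier * max_grad_norm / effective_batch


# Both appendix configs reach the same effective batch size.
clinical_notes_batch = effective_batch_size(16, 8)
dialogues_batch = effective_batch_size(8, 16)

# Hypothetical noise multiplier and clipping norm, for illustration only.
std_at_128 = noise_std_on_mean_gradient(1.0, 1.0, 128)
std_at_256 = noise_std_on_mean_gradient(1.0, 1.0, 256)
```

With the noise multiplier held fixed, doubling the effective batch size halves the noise on the averaged gradient.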

Epochs

Generally, the longer a model is trained, the higher the likelihood of model convergence. However, with DP fine-tuning, the privacy budget must be divided across training steps. More steps mean a reduced budget per step, which increases the standard deviation of added noise and can hinder convergence. Our findings suggest that for datasets with 10,000 records, training with DP for 3 to 5 epochs is effective to generate high-quality synthetic data.
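
To see how epochs translate into the number of noisy optimizer steps that share the privacy budget, here is a small Python sketch. The helper `optimizer_steps` is hypothetical, and counting steps per epoch by rounding up the ratio of records to effective batch size is an assumption about the bookkeeping, not a statement of how Gretel GPT counts steps.

```python
import math


def optimizer_steps(n_records: int, epochs: int, effective_batch: int) -> int:
    """Approximate number of optimizer steps for a given number of epochs."""
    steps_per_epoch = math.ceil(n_records / effective_batch)
    return steps_per_epoch * epochs


# 10,000 records with an effective batch size of 128, as discussed above.
steps_3_epochs = optimizer_steps(10_000, 3, 128)
steps_5_epochs = optimizer_steps(10_000, 5, 128)
```

Every additional epoch adds more noisy steps that must share the same epsilon, which is the trade-off described above.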

Dataset size

Figure 3. The impact of dataset size on Text SQS scores.

Dataset size impacts the quality of synthetic data produced for a fixed number of training epochs. We experimented with the augmented-clinical-notes dataset, varying the size of the dataset by sampling and fine-tuning Gretel GPT for 3 epochs with ε = 5, while maintaining consistent hyperparameters across all models. We found that when training with fewer than 10k records, the generated synthetic text sometimes does not maintain contextual and structural similarity to the original text.

If your dataset is smaller than 10k records, consider increasing the number of epochs to achieve higher quality. But do keep in mind the caveat described above about dividing your privacy budget across more training steps. Alternatively, consider collecting more samples to increase the size and diversity of your dataset (try Gretel Navigator!). 

Compute considerations

Gretel provides two deployment options: Gretel Cloud, a hassle-free software-as-a-service (SaaS) solution requiring no deployment effort, and Gretel Hybrid, which integrates into your cloud environment. Gretel Hybrid combines the benefits of using your infrastructure for training synthetic data models with Gretel’s advanced tools, offering a balance of control and convenience. Within Gretel Cloud, your Gretel GPT models will run on machines with 24 GB of GPU RAM by default. Contact us to scale up your DP fine-tuning with larger computational resources.

Details of our implementation

Gretel GPT incorporates DP using the Differentially Private Stochastic Gradient Descent (DP-SGD) algorithm (Song et al., 2013; Abadi et al., 2016). DP-SGD adds noise to the optimization process and clips gradients to prevent memorization of any single data example. We have tailored DP-SGD for LLM training following the work of Li et al., 2021 and Yu et al., 2021. More details about our algorithmic approach, and about measuring the downstream utility of differentially private synthetic texts against real-world data, can be found in this blog.
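
The clip-then-add-noise idea can be illustrated with a textbook-style DP-SGD update in plain Python. This is a minimal sketch for intuition, not Gretel GPT's or Opacus's implementation: the function name `dp_sgd_step`, the toy gradients, and the hyperparameter values are all hypothetical, and privacy accounting is left out entirely.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, max_grad_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, add Gaussian noise, average."""
    n = len(per_example_grads)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds max_grad_norm.
        scale = min(1.0, max_grad_norm / max(norm, 1e-12))
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Gaussian noise calibrated to the clipping bound is added to the summed gradient.
    sigma = noise_multiplier * max_grad_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / n for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


rng = random.Random(0)
toy_grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5

# With the noise switched off, only the effect of clipping is visible.
clipped_only = dp_sgd_step([0.0, 0.0], toy_grads, lr=0.1,
                           max_grad_norm=1.0, noise_multiplier=0.0, rng=rng)

# With noise enabled, the update is perturbed by the Gaussian draw.
noisy = dp_sgd_step([0.0, 0.0], toy_grads, lr=0.1,
                    max_grad_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping bounds how much any single example can move the parameters, and the noise is scaled to that bound, so no individual example's contribution can be reliably distinguished in the update.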

In our implementation, we leverage open-source tools, including `opacus` for the DP optimizer and `prv_accountant` for privacy accounting. Additionally, we enable DP fine-tuning exclusively with parameter-efficient fine-tuning (PEFT) methods. See the Gretel GPT docs for more information on leveraging PEFT.

Conclusion

Generating synthetic text with differential privacy guarantees is now incredibly easy with Gretel GPT, requiring just two additional lines of code! If you're ready to get started, check out the short YouTube tutorial below. If you're interested in discussing use cases for securely leveraging your sensitive text data, reach out to us on Discord.

Appendix

augmented-clinical-notes

Training time: 6h

Dataset: https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/dp-synthetic-text/clinical_notes.csv

Config: 

schema_version: "1.0"
name: clinicalnotes_epsilon_5
models:
  - gpt_x:
      pretrained_model: mistralai/Mistral-7B-Instruct-v0.2
      column_name: text
      params:
        batch_size: 16
        steps: 2500
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: linear
        learning_rate: 0.00096
        max_tokens: 512
        gradient_accumulation_steps: 8
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1
      privacy_params:
        dp: true
        epsilon: 5
        delta: auto
      generate:
        num_records: 80
        maximum_text_length: 512

commonsense-dialogs

Training time: 2h 25m

Dataset: https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/dp-synthetic-text/commonsense_dialogues.csv

Config:

schema_version: "1.0"
name: commonsense_epsilon_5
models:
  - gpt_x:
      pretrained_model: mistralai/Mistral-7B-Instruct-v0.2
      column_name: text
      params:
        batch_size: 8
        epochs: 3
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: linear
        learning_rate: 0.001
        max_tokens: 512
        gradient_accumulation_steps: 16
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1
      privacy_params:
        dp: true
        epsilon: 5
        delta: auto
      generate:
        num_records: 80
        maximum_text_length: 512

References

1 For `augmented-clinical-notes`, the text structure score is approximately 40. This score is influenced by the number of output tokens used during generation. Although we set the configuration to use 512 tokens for both training and generation, our score comparison is against the untruncated training set. As a result, the generated text is shorter than the original, resulting in mid-range scores. To achieve higher scores, increasing the maximum number of tokens during record generation is recommended.

2 FAQ - Opacus

3 Each model was fine-tuned with the same hyperparameters. For 10 different synthetic datasets we generated, we computed SQS (shown on the plots).