Generating differentially private synthetic text using Gretel GPT
Video description
How to generate differentially private synthetic text data using Gretel GPT
Summary
In this video, I will walk you through the process of generating differentially private synthetic text using Gretel GPT. This topic has been popular among our users who need privacy guarantees when creating training sets for language models. I will explain how differential privacy works and provide a practical example using clinical notes. No action is required from viewers, but you can try the workflow yourself with the provided datasets.
Transcript
(lightly edited for clarity)
Hey, this is Alex from Gretel. Today, I'm going to walk you through a topic that has been very popular with our users: generating differentially private synthetic text using Gretel GPT.
When you create a training set for an LLM or a RAG-based system and need privacy guarantees, you have to go beyond traditional de-identification methods such as removing or replacing names and addresses. You need guarantees that no other attributes inside a text sequence (say, a reference to an earthquake-related delay combined with a username) can be used to identify a user.
Differential privacy provides a guarantee that the model cannot memorize or replay data at the individual level. Let’s dive into how that works with a practical example.
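For reference, the standard definition behind these guarantees (general to differential privacy, not specific to Gretel's implementation) says that a randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D′ that differ in a single record, and any set of outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ

The smaller the epsilon, the less any one person's record can change the model's behavior, which is exactly what rules out individual-level memorization.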
First, let’s look at some datasets: one on clinical notes and another on a dialogue-based system, similar to what many people train on. These datasets and configurations are linked in the video, so you can try running them yourself. We'll run through the process for augmented clinical notes. In the Gretel console, I’ll start with the pre-trained models and walk you through the fine-tuning process using differential privacy (DP). Select the blueprint for generating natural language. I've already uploaded the datasets to a project called DP Synthetic Text. We'll use that, select the clinical notes, and start the training.
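If you prefer to drive these console steps from code, here is a rough sketch using the Gretel Python client. The class and argument names below follow the high-level SDK as I recall it at the time of this video, and the project name is the slug form of the one in the video, so treat this as an approximation and check the current gretel-client documentation.

```python
# Sketch: connect to the project from the Gretel Python client instead of the console.
# Names are approximations of the high-level SDK; verify against the current docs.
from gretel_client import Gretel

# "prompt" asks for your API key interactively; the project name mirrors the
# "DP Synthetic Text" project from the video.
gretel = Gretel(project_name="dp-synthetic-text", api_key="prompt")
```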
Next, I’ll edit the configuration. I copied the config from a previous tab and replaced the old config. We’re fine-tuning the Mistral 7B Instruct model. I changed the column name for fine-tuning to "text." You’ll see other parameters here, including the DP settings, where you set epsilon; delta is calculated automatically. We're generating 80 records and comparing them to real-world data to compute the synthetic quality score.
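In SDK form, those same configuration changes would look roughly like the overrides below. The key names (pretrained_model, column_name, privacy_params, generate) are my reading of the console YAML and may differ in your SDK version, and the epsilon value shown is only a placeholder for "pick a value that matches your privacy budget."

```python
# Sketch: the config edits from the video expressed as overrides when submitting
# the training job. Key names are assumptions based on the console config;
# confirm them in Gretel's model documentation.
trained = gretel.submit_train(
    "natural-language",                                      # Gretel GPT blueprint
    data_source="augmented_clinical_notes.csv",              # hypothetical file name
    pretrained_model="mistralai/Mistral-7B-Instruct-v0.2",   # Mistral 7B Instruct
    column_name="text",                                      # column to fine-tune on
    privacy_params={"dp": True, "epsilon": 8},               # example epsilon; delta is derived automatically
    generate={"num_records": 80},                            # records generated for the quality report
)
```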
When I click run, the background process involves fine-tuning the Mistral model using Gretel's DP implementation and creating a synthetic quality score to assess the model’s performance. The training involves about 30,000 records, and it takes around 9 hours. The quality score was 76 out of 100. We’re using a high level of privacy, which impacts quality, especially with only 30,000 records.
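Once the job finishes, you can pull the synthetic quality score programmatically rather than reading it off the console. The attribute names here reflect my understanding of the SDK's training-results object, so take them as a sketch and confirm the exact names in the docs.

```python
# Sketch: inspect the results of the DP fine-tuning job after it completes.
# Attribute names are approximations of the SDK's training-results object.
print(trained.report.quality_scores)                      # overall synthetic quality score and sub-scores
synthetic_sample = trained.fetch_report_synthetic_data()  # the 80 records generated for the report
print(synthetic_sample.head())
```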
Now, let’s look at the real-world data to compare it with the synthetic text. Here’s a set of clinical notes with specific, domain-oriented language. The synthetic text should resemble this.
Next, we’ll look at the synthetic text generated. It should closely match the original data’s semantics. Here, you see records that match the original data’s look and feel, indicating the model learned the domain well without pointing to real individuals from the dataset.
Finally, let’s review the quality report. We generated synthetic text with differential privacy guarantees. The report compares the synthetic text to real-world data on text semantics and structure. The synthetic text has fewer words per sentence, fewer characters per word, and fewer sentences overall, but it still closely matches the original data. We can adjust inference parameters to improve this.
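To make "words per sentence, characters per word, and sentence counts" concrete, here is a small, Gretel-independent sketch of how you could compute those statistics yourself and compare the real and synthetic notes. This is purely illustrative; it is not the implementation behind Gretel's quality report.

```python
# Sketch: simple text-structure statistics like the ones compared in the quality report.
import re
from statistics import mean

def text_stats(texts):
    """Average sentences per record, words per sentence, and characters per word."""
    sentences_per_record, words_per_sentence, chars_per_word = [], [], []
    for text in texts:
        sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
        sentences_per_record.append(len(sentences))
        for sentence in sentences:
            words = sentence.split()
            if words:
                words_per_sentence.append(len(words))
                chars_per_word.extend(len(w) for w in words)
    return {
        "sentences_per_record": mean(sentences_per_record),
        "words_per_sentence": mean(words_per_sentence),
        "chars_per_word": mean(chars_per_word),
    }

# Usage (with your own lists of real and synthetic note strings):
# print(text_stats(real_notes))
# print(text_stats(synthetic_notes))
```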
This model allows you to sample unlimited synthetic text data for training models that resemble real-world data while ensuring privacy. It’s ideal for training chatbots or fine-tuning LLMs on domain-specific data where you want insights without compromising individual privacy. For example, you might want to study and learn about a disease without learning anything about an individual patient.
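Sampling more data from the fine-tuned model can also be done from the SDK. The method and return shape below are approximations of the high-level client (submit_generate returning a results object with a synthetic_data frame), so verify them before relying on this.

```python
# Sketch: generate additional synthetic clinical notes from the DP fine-tuned model.
# Method and attribute names are approximations of the high-level Gretel SDK.
generated = gretel.submit_generate(model_id=trained.model_id, num_records=1000)
print(generated.synthetic_data.head())  # DataFrame of newly generated synthetic notes
```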
Follow along, let us know if you have any questions, and we look forward to hearing from you. Thanks!