Synthesizing Private Patient Data with Gretel: A Step-by-Step Guide

Create privacy-safe synthetic patient data with Gretel, ensuring compliance, secure sharing, and actionable insights for AI and machine learning in healthcare.

In today's data-driven healthcare landscape, the ability to leverage patient data for AI and machine learning applications is crucial. However, the sensitive nature of this information poses significant privacy and compliance challenges. This blog post will walk you through a powerful workflow for generating high-quality, privacy-safe synthetic patient data using Gretel's suite of tools.

Introduction

Electronic health records (EHRs) often contain complex, multi-modal data including numeric values, categorical information, free text, embedded JSON, and time-series data. Our goal is to create synthetic versions of this data that maintain its utility while preserving patient privacy. We'll achieve this using two key components from Gretel:

  1. Gretel Transform for removing PII
  2. Gretel Navigator Fine-Tuning for generating synthetic records that resist privacy attacks

This approach goes beyond simple PII removal, addressing the limitations of traditional anonymization techniques. By generating synthetic data, we create new records not based on any single individual, providing robust protection against various privacy attacks and re-identification risks.

Dataset Overview

For this demonstration, we'll be working with a synthetic patient dataset created using Gretel Navigator. This dataset is designed to emulate realistic patient records while adhering to privacy and data protection standards. Here's a quick overview of the dataset:

  • Size: 7,348 rows
  • Features: 17 columns including patient demographics, medical events, and detailed notes
  • Data Types: Numeric, categorical, text, embedded JSON, and null values
  • License: Apache 2.0

The dataset includes the following features:

  • patient_id: Unique identifier for each patient
  • first_name, last_name: Patient names
  • date_of_birth: Patient's date of birth
  • sex: Sex of the patient
  • race: Racial background of the patient
  • weight: Weight in pounds
  • height: Height in inches
  • event_id: Unique identifier for each medical event, per patient
  • event_type: Type of medical event (e.g., Symptom, Diagnosis, Test)
  • event_date: Date of the medical event
  • event_name: Name of the medical event
  • provider_name: Name of the healthcare provider
  • reason: Reason for the medical event
  • result: Result of the medical event
  • details: Additional details about the event (in JSON format)
  • notes: Additional notes about the patient's condition or treatment

This rich, multi-modal dataset provides an excellent foundation for demonstrating the power of synthetic data generation in a healthcare context.
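Once the data is loaded (see the sections below), it's worth confirming the frame actually matches this schema before running any transforms. A minimal sketch, where the `check_schema` helper and `EXPECTED_COLUMNS` list are our own additions, not part of the Gretel client:

```python
import pandas as pd

# The 17 columns described above, in the order listed.
EXPECTED_COLUMNS = [
    "patient_id", "first_name", "last_name", "date_of_birth", "sex",
    "race", "weight", "height", "event_id", "event_type", "event_date",
    "event_name", "provider_name", "reason", "result", "details", "notes",
]

def check_schema(df: pd.DataFrame) -> list:
    """Return any expected columns that are missing from the frame."""
    return [c for c in EXPECTED_COLUMNS if c not in df.columns]
```

An empty return value means the frame carries all 17 expected columns.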

Setting Up the Environment

First, let's set up our environment and import the necessary libraries. You can run this code yourself using this Google Colab notebook.

!pip install -Uqq gretel-client datasets

from gretel_client import Gretel
from datasets import load_dataset
import yaml
import pandas as pd
from gretel_client.helpers import poll

# Get your free Gretel API key from: https://console.gretel.ai
gretel = Gretel(api_key="your_api_key", validate=True, project_name="patient-data")

Loading the Dataset

We'll use the synthetic patient events dataset provided by Gretel:

ds = load_dataset("gretelai/synthetic_patient_events")
df = ds['train'].to_pandas()

# Let's take a look at the first few rows of our dataset
print(df.head())

This will give you a sense of the data we're working with, including the various data types and the complexity of the information contained in each record.
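Because the `details` column is described as embedded JSON, you may want to expand it for inspection. A sketch assuming each non-null entry is a JSON string; the `dosage` key in the example row is hypothetical, and the real dataset's keys may differ:

```python
import json
import pandas as pd

def parse_details(df: pd.DataFrame, column: str = "details") -> pd.DataFrame:
    """Parse a column of JSON strings into dicts, leaving nulls untouched."""
    parsed = df[column].apply(
        lambda v: json.loads(v) if isinstance(v, str) and v.strip() else None
    )
    return df.assign(**{column: parsed})

# Hypothetical example row, not taken from the actual dataset.
sample = pd.DataFrame({"details": ['{"dosage": "10mg"}', None]})
print(parse_details(sample)["details"][0]["dosage"])  # → 10mg
```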

De-identifying Data with Gretel Transform

Before generating synthetic data, we need to de-identify personally identifiable information (PII). Gretel's Transform v2 (TV2) provides a powerful framework for this task:

config = """
schema_version: "1.0"
models:
  - transform_v2:
      steps:
        - rows:
            update:
              - name: patient_id
                value: this | hash | truncate(10, end='')
              - name: first_name
                value: >
                  fake.first_name_female() if row.sex == 'Female' else
                  fake.first_name_male() if row.sex == 'Male' else
                  fake.first_name()
              - name: last_name
                value: fake.last_name()
"""

parsed_config = yaml.safe_load(config)

model = gretel.get_project().create_model_obj(
    model_config=parsed_config,
    data_source=df
)

model.submit_cloud()
poll(model, verbose=False)

# Download the de-identified preview artifact as a DataFrame
transformed_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")

This configuration hashes and truncates the patient_id, and replaces first_name and last_name with realistic fake names while preserving the gender association.
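To get a feel for what the `this | hash | truncate(10, end='')` template produces, it can be approximated with the standard library. This is a rough sketch of the idea only; Transform v2's actual hash function and any salting may differ:

```python
import hashlib

def hash_truncate(value: str, length: int = 10) -> str:
    """Hash a value with SHA-256 and keep the first `length` hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]

# "PT-000123" is a made-up patient_id for illustration.
pid = hash_truncate("PT-000123")
print(pid, len(pid))  # deterministic 10-character pseudonym
```

The result is deterministic, so the same original `patient_id` always maps to the same pseudonym, preserving the ability to group a patient's events downstream.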

Generating Synthetic Data with Navigator Fine-Tuning

Now that our data is de-identified, we can use Gretel Navigator Fine-Tuning to generate high-quality synthetic data:

base_config = "navigator-ft"
data_source = transformed_df

generate_params = {
    "num_records": len(df),
    "temperature": 0.7
}

trained = gretel.submit_train(
    base_config=base_config,
    data_source=data_source,
    job_label="synthetic_patient_data",
    generate=generate_params,
    group_training_examples_by="patient_id",
    order_training_examples_by="event_date"
)

Learning Time-Series Sequences in Patient Data

One of the most powerful features of Navigator Fine-Tuning is its ability to learn and recreate time-series sequences in data. This is particularly useful for patient data, where understanding the progression of events and predicting the next logical action are crucial.

In our configuration, we've used two important parameters:

  1. group_training_examples_by="patient_id": This tells the model to group all events related to a single patient together.
  2. order_training_examples_by="event_date": This ensures that the events for each patient are ordered chronologically.

By setting these parameters, we help the synthetic model learn the typical progression of a patient through the hospital system, from initial check-in to check-out and follow-ups. This means our synthetic data will not only preserve the statistical properties of the original data but also maintain realistic patient journeys.
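Conceptually, the two parameters arrange the training examples much like a pandas group-and-sort; this sketch with made-up rows is only an analogy for what Navigator Fine-Tuning handles internally:

```python
import pandas as pd

# Toy event log: two patients with out-of-order events.
events = pd.DataFrame({
    "patient_id": ["p2", "p1", "p1", "p2"],
    "event_date": ["2023-03-01", "2023-01-15", "2023-02-01", "2023-01-05"],
    "event_type": ["Discharge", "Admission", "Test", "Admission"],
})

# Group events by patient, then order each patient's events chronologically,
# so each training example is a complete, time-ordered patient journey.
ordered = events.sort_values(["patient_id", "event_date"]).reset_index(drop=True)
print(ordered["event_type"].tolist())
# → ['Admission', 'Test', 'Admission', 'Discharge']
```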

For example, in the synthetic data, you might see patterns like:

  1. Initial admission event
  2. Followed by diagnostic tests
  3. Then treatment events
  4. Finally, discharge and follow-up events

This sequential learning is crucial for many healthcare analytics tasks, such as predicting patient outcomes, optimizing care pathways, or studying the effectiveness of treatment sequences.

Evaluating the Synthetic Data

One of the great features of Gretel is that it automatically generates a comprehensive report evaluating the quality of the synthetic data (i.e., how accurately it mimics the statistical properties of the original data). This report provides detailed insights into the synthetic quality score, data distributions, and other important metrics. You can easily view this report within your notebook:

# Display the full report within this notebook
trained.report.display_in_notebook()

Figure 1. A preview of the synthetic data quality and privacy protection report.

This command will render an interactive HTML report directly in your notebook. The report includes visualizations comparing the distributions of the original and synthetic data, correlation analyses, and various other metrics that help you assess the quality and utility of your synthetic dataset.

(Disclaimer: The privacy metrics currently use random holdouts, which may affect NavFT's pattern learning for event data. This limitation can impact the accuracy of results in some scenarios. We're working on an update to address this issue soon. Stay tuned.)

Loading and Examining the Synthetic Data

After generating the synthetic data, you can easily load it into a DataFrame for further analysis or use:

# Fetch the synthetically generated data
df_synth = trained.fetch_report_synthetic_data()

# Display the first few rows of the synthetic data
print(df_synth.head())

This will give you a quick look at the structure and content of your newly generated synthetic data. You can now use this DataFrame (df_synth) for your downstream tasks, such as model training, data analysis, or application testing.
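Beyond the built-in report, a quick sanity check on utility is to compare per-column distributions between the original and synthetic frames. A minimal sketch; the `compare_distribution` helper is our own, and `event_type` is one of the dataset columns described earlier:

```python
import pandas as pd

def compare_distribution(real: pd.DataFrame, synth: pd.DataFrame,
                         column: str) -> pd.DataFrame:
    """Compare normalized value counts of one column across two frames."""
    return pd.DataFrame({
        "real": real[column].value_counts(normalize=True),
        "synthetic": synth[column].value_counts(normalize=True),
    }).fillna(0.0)

# Hypothetical usage with the frames from this post:
# print(compare_distribution(df, df_synth, "event_type"))
```

Large gaps between the two columns for frequent categories would suggest the synthetic model has drifted from the source distribution.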

Conclusion

By combining Gretel Transform v2 for de-identification and Navigator Fine-Tuning for synthetic data generation, we've created a powerful workflow for working with sensitive patient data. This approach allows healthcare organizations to:

  1. Safely leverage sensitive data for AI and ML use cases
  2. Break down data silos and promote broader data accessibility
  3. Unlock the potential of restricted datasets
  4. Accelerate innovation while maintaining privacy and compliance
  5. Scale data operations seamlessly across large datasets

The synthetic data generated through this process maintains the complex relationships and time-series nature of the original data, making it suitable for a wide range of healthcare analytics and machine learning tasks. At the same time, it provides strong privacy protection, as the synthetic records are not based on any single individual.

As the healthcare industry continues to evolve, techniques like these will be crucial in balancing the need for data-driven insights with the imperative of patient privacy.

Further Resources

By leveraging synthetic data and tools like Gretel, we can unlock the potential of healthcare data while maintaining the highest standards of privacy and data protection. Start exploring the possibilities of synthetic data in your projects today!