Generate Complex Synthetic Tabular Data with Navigator Fine Tuning + Differential Privacy

Learn how to generate secure, high-quality synthetic tabular data using Navigator Fine Tuning and Differential Privacy tools. 🛡️ 🚀

Navigator Fine Tuning (NavFT) is the model of choice at Gretel for generating synthetic tabular datasets containing numerical, categorical, free text, and event-driven fields. Today, we enable its fine-tuning with differential privacy (DP). That means strong, formal guarantees that an attacker in possession of the differentially private synthetic dataset is highly unlikely to infer any sensitive information about the original dataset.

Gretel now supports three models with differential privacy: NavFT, Gretel GPT, and Tabular DP. The choice of model depends on your dataset:

  • NavFT: Ideal for tabular datasets with mixed column types (e.g., numerical, categorical, free-text).
  • Gretel GPT: Best for free-text-only datasets.
  • Tabular DP: Suited for numerical and categorical-only datasets.

In this blog post, we use NavFT with DP and run experiments that demonstrate the consequences of enabling DP, both positive and negative.

Create differentially private data now 👇

You can generate your own differentially private synthetic data using the configuration below. This example uses Gretel Console, but you can also try it in our SDK with this end-to-end notebook.

schema_version: 1.0
name: "navft-dp"
models:
- navigator_ft:
    group_training_examples_by: null
    order_training_examples_by: null

    params:
        num_input_records_to_sample: auto
        batch_size: 8

    privacy_params:
        dp: true
        epsilon: 8

    generate:
        num_records: 1000
        use_structured_generation: true
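
If you prefer the SDK route mentioned above, here is a minimal sketch of submitting this same configuration with the gretel_client project API. The config path, data file, and project name are placeholders; the end-to-end notebook linked above shows the full workflow.

# Minimal sketch: submit the navft-dp config from the Gretel Python SDK.
# Assumes the YAML above is saved as navft-dp.yml and you have a CSV to train on;
# file names and the project name below are placeholders.
from gretel_client import configure_session
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

configure_session(api_key="prompt", cache="yes", validate=True)

project = create_or_get_unique_project(name="navft-dp-demo")

# Create a model from the config above and attach the training data.
model = project.create_model_obj(
    model_config="navft-dp.yml",
    data_source="clothing_reviews.csv",
)

model.submit_cloud()   # train in Gretel Cloud
poll(model)            # block until fine-tuning with DP completes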

To enable DP, we specify privacy_params and set dp: true, which activates fine-tuning with differential privacy. The epsilon parameter controls the privacy loss, with smaller values offering stronger guarantees against leakage of training data. Meanwhile, delta represents the probability of accidental data leakage. By default, delta is automatically calculated from the dataset’s characteristics to be less than or equal to 1/n^1.2, where n is the number of training records.
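
As a quick sanity check of that default, here is how the delta ceiling from the formula above scales with dataset size:

# Default delta is chosen to be <= 1 / n**1.2, where n is the number of training records.
def max_delta(n_records: int) -> float:
    return 1.0 / n_records**1.2

for n in (1_000, 23_486, 100_000):
    print(f"n = {n:>7}: delta <= {max_delta(n):.2e}")
# n =    1000: delta <= 2.51e-04
# n =   23486: delta <= 5.69e-06
# n =  100000: delta <= 1.00e-06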

Additionally, we adjust the standard parameters slightly to account for differential privacy, setting batch_size to 8. When training with DP, we recommend increasing batch_size from the default of 1 to a larger value such as 8 or 16. A larger batch size reduces the noise added to the gradients, improving overall results while maintaining the DP guarantees. However, higher values may lead to memory constraints, so it’s best to increase the value incrementally to find the optimal configuration.
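
To build intuition for why larger batches help, here is a toy NumPy sketch of the DP-SGD mechanism (an illustration only, not Gretel internals): per-example gradients are clipped to a fixed norm and a single Gaussian noise draw is added per batch, so the noise contributes relatively less to the update as the batch grows.

# Toy illustration (not Gretel internals): relative noise in one DP-SGD step
# for different batch sizes, with a fixed clipping norm and noise multiplier.
import numpy as np

rng = np.random.default_rng(0)
clip_norm = 1.0        # per-example gradient clipping bound C
noise_multiplier = 1.0 # sigma, set by the DP accountant for a target epsilon

def dp_step_noise_ratio(batch_size: int, dim: int = 1_000) -> float:
    # Per-example gradients, clipped so each has norm <= C
    grads = rng.normal(size=(batch_size, dim))
    grads *= np.minimum(1.0, clip_norm / np.linalg.norm(grads, axis=1, keepdims=True))
    signal = grads.sum(axis=0)
    # One Gaussian noise draw per batch, scaled to the clipping bound
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=dim)
    # Signal and noise would both be divided by batch_size, so their ratio is unchanged;
    # a smaller ratio means a less noisy (more useful) update.
    return float(np.linalg.norm(noise) / np.linalg.norm(signal))

for bs in (1, 8, 16):
    print(f"batch_size={bs:>2}: noise/signal ≈ {dp_step_noise_ratio(bs):.2f}")

In practice the effect is even stronger than this toy example suggests, because real per-example gradients tend to point in similar directions, so the signal grows faster with batch size than random vectors do.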

The use_structured_generation parameter enables the use of the dataset schema to enforce structure in the generated outputs by manipulating output logits. This helps ensure that the generated data adheres to the original schema, including numerical ranges and categories. However, if the schema itself is considered private, you should set this parameter to false to avoid exposing private information.
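
If you manage the config programmatically (for instance via the SDK sketch above), flipping this flag is a one-line change. A minimal sketch, with placeholder file names:

# Sketch: load the config above and disable structured generation when the
# schema itself (column names, categories, numerical ranges) is considered private.
import yaml

with open("navft-dp.yml") as f:          # placeholder path to the config above
    config = yaml.safe_load(f)

navft = config["models"][0]["navigator_ft"]
navft["generate"]["use_structured_generation"] = False

with open("navft-dp-private-schema.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)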

Understanding the Effect of DP 🔍

Privacy-Utility Tradeoff ⚖️

Differential privacy enhances resilience to attacks, making it nearly impossible for sensitive sequences in the training data to appear in synthetic data. However, this often reduces synthetic data quality, requiring a balance between privacy and utility.

Experimental Setup 🧪

To explore this tradeoff for NavFT with DP, we utilized two datasets, each with unique characteristics. 

Datasets

The Women’s E-Commerce Clothing Reviews dataset consists of 23,486 rows and 10 columns. It primarily contains numerical and categorical data, along with one free-text column, review_text. For example, a sample record might look like this:

{
	"review_text": "I loved this shirt until the first time i washed it. it shrunk so much it became unwearable. when i returned it the salesperson ",
	"age": 39,
	"rating": 1,
	"positive_feedback_count": 0,
	"division_name": "General", 
	"department_name": "Tops",
	"class_name": "Knits",
	"recommended_ind": 0
}

The gretelai/gretel-patient-events-v1 dataset is a synthetic patient events dataset that exemplifies many desirable characteristics of multimodal tabular data. It includes multiple free-text and date columns, making it a valuable resource for testing and modeling complex data scenarios.

{
    "patient_id": "pmc-6431471-1",
    "first_name": "Aisha",
    "last_name:" "Liang",
    "date_of_birth": "04/17/1960",
    "sex": "Female",
    "race": "Asian",
    "weight": 135,
    "height": 61,
    "event_id": 1,
    "event_type": "Admission",
    "event_date": "04/17/2023",
    "event_name": "Initial admission",
    "provider_name": "Dr. Rosa Fernandez",
    "reason": "Generalized malaise, dyspnea, cough",
    "result": null,
    "details": {"intensity": "N/A", "location": "N/A"},
    "notes": "Patient admitted with symptoms including malaise, dyspnea on exertion, and cough, exhibiting hypotension and fever on arrival. Initial laboratory tests indicated possible infection, broad-spectrum antibiotics and other treatments were administered, significantly stabilizing patient overnight."
}
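
If you would like to poke at this dataset yourself, here is a minimal sketch, assuming it is published on Hugging Face under the same gretelai/gretel-patient-events-v1 identifier with a standard train split:

# Sketch: load the patient events dataset (assuming it is hosted on Hugging Face
# under the gretelai/gretel-patient-events-v1 ID) and peek at one patient's events.
from datasets import load_dataset

ds = load_dataset("gretelai/gretel-patient-events-v1", split="train")
df = ds.to_pandas()

# Events are grouped per patient and ordered by date (see the DP discussion below).
one_patient = df[df["patient_id"] == df["patient_id"].iloc[0]]
print(one_patient[["event_date", "event_type", "event_name"]])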

Metrics

We use five metrics to evaluate our experiments. 

  • Synthetic Quality Score (SQS): an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset.
  • Data Privacy Score (DPS): a measure of how resilient the synthetic dataset is to membership inference attacks and attribute inference attacks.
  • Percentage of Valid Records: percentage of generated samples that follow the schema of the original dataset.
  • Training Runtime: total time (in seconds) taken to fine-tune the model.
  • Generation Runtime: total time (in seconds) taken to generate all synthetic samples.

In general, we anticipate that employing DP will boost DPS while decreasing SQS. There may also be a decline in the percentage of valid records. For that reason, we employ structured generation. Typically, both training and generation runtime will increase, especially when the percentage of valid records is low, as many generated records may need to be discarded.

Results 📈

You can reproduce these results using the included notebook.

E-commerce Dataset

The results of our experiments with the e-commerce dataset, shown in Figure 1(a), highlight the impact of varying the privacy parameter ε on key metrics. As expected, introducing differential privacy (DP) caused the median Synthetic Quality Score (SQS) to decrease from 87 to 78 with ε = 16. When ε was further reduced to 8 and 4, the scores dropped to 76 and 72, respectively, reflecting the tradeoff between stronger privacy guarantees and data quality.

Simultaneously, the Data Privacy Score (DPS) showed a progressive, though not substantial, increase with lower ε values. While DPS serves as a useful proxy, it’s crucial to remember that DP provides formal guarantees, which are better demonstrated through metrics like the absence of personally identifiable information (PII) replay. Thanks to structured generation, the percentage of valid records remained at 100% for all DP configurations. However, both training and generation runtimes increased significantly, approximately 2.5 to 3 times higher than runs without DP.

We also assessed how the performance of downstream classification tasks is affected by using DP to generate the synthetic data. In Figure 2, we look at the histogram of ratings for a given category (“Tops”) and show that the distributions are very similar.
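
A comparison like Figure 2 is straightforward to reproduce. Here is a minimal sketch, assuming the original and synthetic data are available as CSV files (placeholder paths) with the rating and department_name columns shown in the record above:

# Sketch: reproduce a Figure 2-style comparison of rating distributions for the
# "Tops" department between the original and a DP synthetic dataset.
import pandas as pd
import matplotlib.pyplot as plt

original = pd.read_csv("womens_clothing_reviews.csv")   # placeholder paths
synthetic = pd.read_csv("navft_dp_synthetic.csv")

def rating_share(df: pd.DataFrame) -> pd.Series:
    tops = df[df["department_name"] == "Tops"]
    return tops["rating"].value_counts(normalize=True).sort_index()

compare = pd.DataFrame({
    "original": rating_share(original),
    "synthetic (DP)": rating_share(synthetic),
})
compare.plot(kind="bar", xlabel="rating", ylabel="share of reviews")
plt.tight_layout()
plt.show()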

Moreover, we trained a simple logistic regression model to predict the rating from the review text. As training data, we used 1,000 samples from either the synthetic data generated without DP or the synthetic data generated with DP. An embedding model from sentence-transformers was used to encode the review text, and the PyCaret library was used to train the model; models were evaluated on another 1,000 samples from the original data. For more details, please check the accompanying notebook.

Table 1: Classification accuracy and AUC for predicting rating from review text.

Results are shown in Table 1. We can conclude that, even though the DP synthetic data is not as faithful to the original as data generated without DP, it is still very useful for downstream tasks.
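
If you want to approximate this downstream experiment outside the notebook, here is a minimal sketch. It uses sentence-transformers for the embeddings and, for brevity, swaps PyCaret for plain scikit-learn logistic regression; the embedding model name, sample sizes, and file paths are illustrative.

# Sketch of the downstream task: predict `rating` from `review_text`.
# sentence-transformers provides embeddings; scikit-learn stands in for PyCaret here.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_df = pd.read_csv("navft_dp_synthetic.csv").dropna(subset=["review_text"]).sample(1000, random_state=0)
test_df = pd.read_csv("womens_clothing_reviews.csv").dropna(subset=["review_text"]).sample(1000, random_state=0)

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model
X_train = encoder.encode(train_df["review_text"].tolist())
X_test = encoder.encode(test_df["review_text"].tolist())

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_df["rating"])
print("accuracy:", accuracy_score(test_df["rating"], clf.predict(X_test)))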

Patient Events Dataset

For this dataset, we specified the group_training_examples_by and order_training_examples_by parameters, effectively feeding the model not single records but the ordered events of a given patient. This gives us user-level DP, i.e., each patient’s entire contribution to the dataset is protected by the DP guarantee.
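
Concretely, that means pointing those two parameters at the patient identifier and the event date. A minimal sketch, reusing the earlier config as a starting point; the column names follow the record shown above and reflect one plausible choice:

# Sketch: adapt the earlier config so each patient's ordered event history is
# treated as a single training example, giving user-level DP.
import yaml

with open("navft-dp.yml") as f:                      # placeholder path
    config = yaml.safe_load(f)

navft = config["models"][0]["navigator_ft"]
navft["group_training_examples_by"] = "patient_id"   # one group per patient
navft["order_training_examples_by"] = "event_date"   # events in chronological order
navft["params"]["batch_size"] = 2                    # per the experiment below

with open("navft-dp-patient-events.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)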

Here, we fix ε = 8 and experiment with two batch sizes, 1 and 2. We observe, for this dataset and in general, that larger batches lead to better metrics.

We observe a slight decrease in the SQS (a median drop of 6 and 3 points for batch sizes 1 and 2, respectively) and an accompanying increase in the DPS (Figure 1(b)). Here, however, structured generation was not as effective, leading to a lower percentage of valid records and, in turn, much higher generation times.

Despite the higher runtimes, it is interesting to see how little DP degraded performance in this case, given the complexity of the dataset.

Figure 1: Metrics for the two datasets.
Figure 2: Histogram of ratings for a particular category (“Tops”) across the original and synthetic datasets, showing how downstream tasks such as analytics are affected.

Tricks to boost performance of DP + NavFT 🚀

  • Larger datasets will yield better results with DP. We recommend at least a few thousand samples. 
  • When working with fairly complex or small datasets, we recommend increasing your privacy budget (ε >= 8). Higher ε values are known to provide good empirical protection, even if the formal guarantee is weaker.
  • Increase the batch size as much as possible; if your dataset has few columns and you are not grouping records, try values like 8 or 16.

Give it a try!

Incorporating Differential Privacy into Navigator Fine Tuning presents an exciting opportunity to unlock access to sensitive datasets. With just a few lines of code, you too can explore the privacy-utility tradeoff for your own sensitive datasets. Take it for a spin!

If you have questions or comments about NavFT or the Gretel Platform, join us and 1,600 other developers, engineers, data scientists, and privacy fans in the Synthetic Data Community on Discord.