Quantifying PII Exposure in Synthetic Data
Introduction
Synthetic data is revolutionizing privacy-first data sharing, but ensuring privacy isn't just a box to check—it's a continuous process. Enter PII Replay, Gretel's newest privacy metric to identify and reduce sensitive information exposure. This tool works alongside Membership Inference Protection and Attribute Inference Protection, ensuring your synthetic data remains private by design. Let’s explore how it works, what the numbers mean, and how you can use it effectively.
PII Replay identifies instances of sensitive information found in your original training data and counts how often those values show up in the synthetic output. With this metric, you can now easily see whether any of the original PII is showing up in your synthetic data. For columns where replaying original PII would be a large privacy risk, you can now quickly verify whether the synthetic data meets the mark as-is or if there are additional pre- or post-processing steps that you should consider taking.
Want to give it a try? Check out this notebook! Read on to learn more about how PII Replay is calculated, why you might expect to see PII replayed, and steps you can take to minimize PII replay.
How it works
PII Replay works by leveraging Gretel Transform to identify and classify instances of PII in the original training data. Next, it scans corresponding columns in the synthetic output, counting each time a value from the training data appears in the synthetic dataset.
Results appear in a table, with one row per dataset column analyzed for PII Replay. For each dataset column, the table shows total instances of PII in the training data, how many were unique, and how many were replayed in the synthetic output—making it easy to spot privacy risks at a glance.
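Conceptually, the replay count for a column boils down to a set membership check. Below is a minimal, illustrative sketch of that logic in pandas; the real metric is computed for you inside the Evaluate report, and the function name here is purely hypothetical.
import pandas as pd

def pii_replay_counts(train_values: pd.Series, synth_values: pd.Series) -> dict:
    # Illustrative only: count how many synthetic values also appear in the training data
    train_unique = set(train_values.dropna().unique())
    replayed_mask = synth_values.isin(train_unique)
    return {
        "training_instances": int(train_values.notna().sum()),
        "training_unique": len(train_unique),
        "synthetic_rows_replayed": int(replayed_mask.sum()),
        "unique_values_replayed": synth_values[replayed_mask].nunique(),
    }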
In this example, you can see that many types of PII were included in the search, such as address, administrative unit, city, company, country, credit card number, and so on. You can see from the header that the reference data had 5,000 rows while the output data for which we are measuring PII Replay had 4,500 rows.
To interpret the results, consider the first row of the table: the dataset column fname is categorized as PII Type: first_name. In the reference data, all 5,000 rows were labeled with this type, covering 56 unique values. In the output data, 4,464 of the 4,500 rows had a value that matched one of those 56 originals, and 25 of the 56 unique values (45%) were replayed.
The rest of the rows can be read in the same way.
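Note that the replay percentage here is the share of unique training values that reappear, so the 45% above is simply:
unique_in_training = 56   # unique first names in the reference data
unique_replayed = 25      # of those, how many appear in the synthetic output
print(f"{unique_replayed / unique_in_training:.0%}")  # ~45%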
Measuring PII Replay
In order to minimize the chance of sensitive PII from the original data showing up in the synthetic output, we recommend using Gretel Transform to anonymize any columns containing PII prior to running a synthetics model, such as Navigator Fine Tuning, to generate synthetic data. After generating the synthetic data, you can pass in your original training data and final synthetic output to Gretel Evaluate to generate an SQS report with the PII Replay metric.
Currently, PII Replay is only supported in Evaluate reports because it requires comparing original data to the final synthetic output, not intermediary results.
Below, we walk through an example. If you want to follow along, you can copy the code directly or use this Colab Notebook.
To get started, you will need a free Gretel account to install Gretel’s Python client and configure your session using your API key:
pip install -U gretel-client
from gretel_client import Gretel
gretel = Gretel(api_key="prompt", validate=True, project_name="pii-replay-project")
Next, load in the dataset, separating into train and test to be able to measure privacy metrics:
import pandas as pd
datasource = "https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/kaggle/employee_data.csv"
df = pd.read_csv(datasource)
test_df = None
# Drop columns to simplify example
df = df.drop(columns=["Supervisor", "BusinessUnit", "EmployeeType", "PayZone", "EmployeeClassificationType", "TerminationType", "TerminationDescription", "DepartmentType", "JobFunctionDescription", "DOB", "LocationCode", "RaceDesc", "MaritalDesc"])
df.head()
# Shuffle the dataset randomly to ensure a random test set
# Set random_state to ensure reproducibility
shuffled_df = df.sample(frac=1, random_state=42).reset_index(drop=True)
# Split into test (5% holdout) and train
split_index = int(len(shuffled_df) * 0.05)
test_df = shuffled_df.iloc[:split_index]
train_df = shuffled_df.iloc[split_index:]
The dataset we analyzed contains detailed employee information across three key categories:
- Personal Information: This includes sensitive identifiers such as names, email addresses, gender, and employee IDs.
- Employment Details: Data points such as job titles, divisions, performance scores, and employment start and end dates.
- Location Data: Geographic details, such as U.S. states where employees are based.
These categories provide a well-rounded view of employee data, allowing us to assess privacy risks across multiple dimensions.
Example 1: Synthetics only
First, we will synthesize the dataset using Navigator Fine Tuning, our flagship synthetics model. Navigator Fine Tuning is an LLM under the hood, and supports numeric, categorical, free text, and event-driven tabular data.
nav_ft_trained = gretel.submit_train("navigator-ft", data_source=train_df, evaluate={"skip": True}, generate={"num_records": 1000})
nav_ft_result = nav_ft_trained.fetch_report_synthetic_data()
To measure PII replay, we run Evaluate:
EVALUATE_CONFIG = """
schema_version: "1.0"
name: "evaluate-config"
models:
- evaluate:
data_source: "__tmp__"
pii_replay:
skip: false
entities: ["first_name","last_name","email","state"]
"""
evaluate_report = gretel.submit_evaluate(EVALUATE_CONFIG, data_source=nav_ft_result, ref_data=train_df, test_data=test_df).evaluate_report
evaluate_report.display_in_notebook()
At first glance, seeing replayed PII might seem concerning—but context is key.
For example, consider the first row: first names. The U.S. Census lists only about 5,000 common first names. In a dataset with 3,000 training rows and 1,000 synthetic rows, some overlap is not just expected, it's reassuring. A complete lack of overlap could suggest that first names from the training data were explicitly excluded from the synthetic output, which ironically creates its own privacy risk: an adversary may realize that the *absence* of a name from the synthetic dataset makes it more likely to have appeared in the training data. Normally, it is the opposite situation (overfitting) that we are concerned about.
The next row, last names, has a similar story.
Now consider email. In this dataset, all of the emails were formatted as firstname.lastname@bilearner.org. This means that replayed emails will precisely be a measure of replayed full names. Here, we see that while 30-40% of individual first and last names were repeated, only 1% of full names were.
When we think about privacy, a common first name like Alex or John appearing in the synthetic output generally does not make an adversary any more confident that someone named Alex or John was in the training set. However, if a known person's full name, like Alex Langley or John Ellis, showed up, it could raise more suspicion. It is both expected and reassuring that very few full names were replayed compared with individual first and last names.
Last, let’s look at the column for U.S. state. Given there are roughly only 50 options from which to choose (give or take due to various territories), we should expect nearly all to be repeated if the dataset included statistics across the country. In this example, we saw the original dataset contained 27 states, and the final output replayed 22 of those. Since the model is seeing the same values for state hundreds of times, we should intuitively expect that it learns and replays the majority.
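As a rough back-of-the-envelope illustration (not part of the metric itself), if a state makes up even a small fraction of the training rows and the model learns that distribution, the chance it never shows up in 1,000 generated rows is tiny:
# If the model samples a given state with probability p per row,
# the chance it never appears across n independently generated rows is (1 - p) ** n.
p = 0.02   # a state seen in roughly 2% of training rows (illustrative)
n = 1000   # number of synthetic rows generated
print(f"P(state never appears in {n} rows) is approx {(1 - p) ** n:.1e}")  # ~1.7e-09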
To summarize, you should not necessarily expect (or often even want) PII Replay of 0 across the board. You should consider each column in context, both of the data and of the real world. In general, you should expect entities that are rarer, like full address or full name, to have lower amounts of PII replay than entities that are more common, like first name or U.S. state.
Example 2: Transform + Synthetics
As the information icon in the PII replay graph indicates, we recommend running Transform prior to running Synthetics as the easiest and most reliable way to ensure the model cannot learn any sensitive information.
Below, we run Transform first to fake names, emails, and locations, followed by Navigator Fine Tuning.
# Run Transform
TRANSFORM_CONFIG = """
schema_version: "1.0"
name: transform-config
models:
- transform_v2:
globals:
locales:
- en_US
classify:
enable: true
entities:
- first_name
- last_name
- email
- state
auto_add_entities: true
num_samples: 3
steps:
- rows:
update:
- name: FirstName
value: fake.first_name_male() if row['GenderCode'] == 'Male' else
fake.first_name_female()
- name: LastName
value: fake.last_name()
- name: ADEmail
value: row["FirstName"] + "." + row["LastName"] + "@bilearner.com"
- name: State
value: fake.state_abbr()
"""
transform_result = gretel.submit_transform(TRANSFORM_CONFIG, data_source=train_df).transformed_df
# Run Navigator Fine Tuning
tr_nav_ft_trained = gretel.submit_train("navigator-ft", data_source=transform_result, evaluate={"skip": True}, generate={"num_records": 1000})
tr_nav_ft_result = tr_nav_ft_trained.fetch_report_synthetic_data()
As before, we run Evaluate to generate the report with PII Replay comparing the original training data to the final synthetic output:
evaluate_report = gretel.submit_evaluate(EVALUATE_CONFIG, data_source=tr_nav_ft_result, ref_data=train_df, test_data=test_df).evaluate_report
evaluate_report.display_in_notebook()
Even after running Transform first, PII replay isn't always 0. How can the model generate data it hasn’t seen? The answer lies in the details.
Let’s again consider first names. As mentioned before, there are approximately 5,000 common first names in the U.S. Similarly, Faker (called when running Transform to generate the new values) only uses around 7,000 first names. When faking nearly 3,000 names to transform the training data, we should expect a high number of collisions. Likewise, despite faking the names, we should expect Navigator Fine Tuning to generate many of these same common names in its final output.
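As a rough sanity check on that intuition, assume Faker draws uniformly from a pool of about 7,000 first names (a simplification; the real draws are weighted). Replacing roughly 3,000 names then gives any particular name a substantial chance of being drawn at least once:
# Probability a specific name from a pool of size N is drawn at least once
# across n uniform random draws: 1 - (1 - 1/N) ** n
N = 7000   # approximate Faker en_US first-name pool size (assumption)
n = 3000   # number of training rows whose first names are replaced
print(f"Chance a given name is drawn at least once: {1 - (1 - 1 / N) ** n:.0%}")  # ~35%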
What we should also see is that transforming the names first reduced the overall amount of PII replay compared to running synthetics only. Indeed, this was the case. PII replay dropped from 32% to 11% for first names and 40% to 25% for last names.
Remember, email was an exact proxy for a full name. While first name and last name still saw some replay, it is great to see that the truly sensitive information, full name, did not.
For many entity types, you need to consider in context whether they are actually sensitive, or whether they are something you want the model to learn. Here, we chose to transform the column for U.S. state. As a result, the model replayed far fewer values than it did when we ran synthetics alone (10 states replayed rather than 22, out of the 27 that existed in the training data).
Depending on your use case, U.S. state may be an important factor in your analysis. You might not consider it sensitive, and you may want the model to learn about the original values. This is why there cannot be a foolproof, universal approach when it comes to Transform. You need to analyze your data in context to decide which columns you want to transform.
The key takeaway: PII replay for common entities should be a sign that the model is working, not an indicator that something is broken. Running Transform first ensures that any replay that occurred is due to common chance rather than the model learning sensitive information.
Strategies to Minimize PII Replay
Ensuring privacy in synthetic data requires thoughtful preparation and careful analysis. Here are several strategies to help you minimize Personally Identifiable Information (PII) Replay while maintaining high-quality synthetic datasets:
1. Use Transform Before Generating Synthetics
The best way to prevent sensitive data from appearing in your synthetic output is by running Gretel’s Transform tool first. Transform anonymizes sensitive data, guaranteeing the model cannot learn private information. This step ensures that any sensitive values appearing in the synthetic data are purely coincidental, not learned.
However, double-checking the Transform output is essential. A small typo in your configuration can lead to unintended results, so review the output carefully to confirm your transformations were applied as expected.
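A quick spot check like the one below, reusing the dataframes from this walkthrough, can surface such typos before you spend time training a synthetics model on data that was never actually transformed:
# Sanity check: how many transformed values still exactly match an original value?
# Some collisions are expected for common names, but a column that was skipped
# entirely will show nearly 100% matches.
for col in ["FirstName", "LastName", "ADEmail"]:
    matches = transform_result[col].isin(set(train_df[col])).sum()
    print(f"{col}: {matches} of {len(transform_result)} rows still match an original value")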
2. Choose a Model Designed to Minimize PII Replay
Not all models handle PII in the same way. Models like ACTGAN generate only values found in the training data, which increases the likelihood of PII Replay. In contrast, LLM-based models, like Gretel’s Navigator Fine Tuning, can generate novel values, significantly reducing the risk of PII exposure. Whenever minimizing PII Replay is a priority, opt for a model like Navigator Fine Tuning.
3. Leverage Differential Privacy
For an additional layer of privacy, consider using differential privacy. Gretel supports differential privacy across several models, including:
- Navigator Fine Tuning (numeric, categorical, and free text data),
- GPT (free text data only),
- TabularDP (numeric and categorical data).
Differential privacy provides mathematical guarantees that individual records in the training dataset cannot be reconstructed. While this method enhances privacy, it often results in a slight trade-off in data quality, so it’s best used when privacy requirements outweigh other considerations.
4. Pre-Process to Remove Unnecessary Columns
If your downstream task doesn’t require specific sensitive columns—such as names or personal identifiers—you can remove them before generating synthetic data. For example, if your analysis focuses on job roles or locations, you may safely exclude a “name” column entirely. This simple step can eliminate unnecessary privacy risks.
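For example, using the name and email columns from this walkthrough, that is a one-line pandas step before training:
# Drop identifier columns the downstream analysis does not need
train_df_reduced = train_df.drop(columns=["FirstName", "LastName", "ADEmail"])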
5. Use Pre- and Post-Processing Strategically
In some cases, sensitive columns like “name” are critical for downstream tasks but aren’t strongly related to other dataset variables. Here, a hybrid approach works best:
- Remove sensitive columns during the initial synthetic data generation process.
- Use tools like Navigator in a post-processing step to regenerate the removed data.
This approach ensures sensitive information isn’t learned by the model but can still be included in your final dataset where necessary.
6. Filter Out Specific Values with Care
You may be tempted to filter out highly sensitive values in pre- or post-processing. While this approach can immediately reduce privacy concerns, it comes with risks. Filtering out data can skew results or lead to what’s known as the Privacy Onion Effect: removing obvious outliers exposes a new “layer” of previously safe data to similar risks. Use this technique sparingly and in conjunction with other privacy-preserving measures.
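If you do choose to filter, a simple post-processing pass like the sketch below (assuming nav_ft_result is the synthetic pandas DataFrame from the earlier example) drops synthetic rows that replay exact values from a sensitive column; just keep the Privacy Onion Effect caveat in mind.
# Illustrative post-processing filter: remove synthetic rows whose email
# exactly matches an email from the training data.
sensitive_values = set(train_df["ADEmail"])
filtered_synth = nav_ft_result[~nav_ft_result["ADEmail"].isin(sensitive_values)]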
Final Thoughts
Privacy solutions are not one-size-fits-all. The best combination of tools and techniques depends on your dataset, use case, and requirements. Gretel’s robust suite of metrics, models, and features empowers you to create synthetic data that balances privacy with usability.
Try PII Replay for yourself with this hands-on notebook, so you can start designing safer data for better models.
If you have questions or comments about PII Replay or the Gretel Platform, join us and 1,600 other developers, engineers, data scientists, and privacy fans in the Synthetic Data Community on Discord.