Deep dive on generating synthetic data for Healthcare
Take a deep dive on training Gretel’s open-source synthetic data library to generate electronic health records that protect individual privacy and personally identifiable information (PII).
Sharing data safely is one of the biggest challenges in the healthcare industry today. For hospitals and health organizations, being able to compare and contrast new patient data with other medical organizations in their area and across the world can help doctors quickly diagnose patients and provide the best treatment possible.
Synthetic data, or data that is artificially manufactured rather than generated by real-world events, is a promising technology for helping healthcare organizations to share knowledge while protecting individual privacy.
In this post, we will deep dive on training Gretel’s open-source, synthetic data library to generate Electronic Health Records (EHR) that protect individual privacy, while capturing the key statistical insights from the original source data. For our example dataset we’ll be working with information exported from Epic, a top EHR system used by hospitals.
Generating synthetic EHR data
Our dataset contains 27,963 de-identified emergency room discharge summaries from an ER over the course of six months. There are 22 columns of data in CSV format, including date, arrival time, demographics, chief complaint and primary diagnosis (free text). Check out our previous post to try generating synthetic EHR records with your own dataset. Below is the configuration we’re using for this post.
From the training results above, we can see that even with differential privacy enabled, our model was able to quickly learn the semantics of the training set, getting to 70%+ accuracy within 10 epochs of training. For the remainder of this post, we’ll focus on the question, “How useful is my synthetic dataset?” We’ll do this by exploring the similarities and differences between our artificially generated synthetic dataset (with guarantees that no actual health records were memorized or replayed) and the original source dataset.
Easy: Similarity of value distributions per column
The simplest, most naive way to compare our original and synthetic data is to look at the distribution of each column. Ideally, the distributions for both categorical and numerical datatypes should match. Let’s look at this for an example field, Patient.Class.
At first glance, the distributions for our example field Patient.Class look pretty close, with the synthetic data model having a slightly higher propensity to generate “Inpatient” records than “Emergency” records compared with the source data. Let’s try something harder, like recreating the times that patients check in to the ER.
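As a minimal sketch of this kind of per-column check, the snippet below computes the share of each category in an original and a synthetic column and prints them side by side. The two lists are toy stand-ins invented for illustration, not values from the dataset in this post, and `value_shares` is a hypothetical helper name.

```python
from collections import Counter

def value_shares(values):
    """Return each distinct value's share of the column (shares sum to 1.0)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Toy stand-ins for the Patient.Class column (assumed data, for illustration only).
original = ["Emergency"] * 6 + ["Inpatient"] * 4
synthetic = ["Emergency"] * 5 + ["Inpatient"] * 5

orig_shares = value_shares(original)
synth_shares = value_shares(synthetic)

# Print the two distributions side by side, one category per line.
for category in sorted(set(orig_shares) | set(synth_shares)):
    o = orig_shares.get(category, 0.0)
    s = synth_shares.get(category, 0.0)
    print(f"{category:<10} original={o:.2f} synthetic={s:.2f} diff={s - o:+.2f}")
```

The same comparison works for numerical columns if the values are first bucketed into bins.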
How do we look at this quantitatively? In statistics, we can leverage the Kullback-Leibler divergence (also known as KL divergence or relative entropy), which measures how one probability distribution differs from a second. Fields with a high KL divergence might not be ideal for synthetic data, as they might (for example) be highly random to begin with.
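For discrete fields, the KL divergence of the original distribution P from the synthetic distribution Q is the sum over categories of P(c) · log(P(c) / Q(c)). Here is a rough sketch of that calculation. The distributions are made-up numbers, `kl_divergence` is a hypothetical helper, and flooring missing synthetic categories at a small `epsilon` is just one simple way (an assumption here) to avoid dividing by zero.

```python
import math

def kl_divergence(p, q, epsilon=1e-9):
    """D_KL(P || Q) in nats for two discrete distributions given as dicts."""
    total = 0.0
    for category in set(p) | set(q):
        p_c = p.get(category, 0.0)
        # Floor q at epsilon so categories missing from Q do not divide by zero.
        q_c = max(q.get(category, 0.0), epsilon)
        if p_c > 0:  # terms with P(c) = 0 contribute nothing
            total += p_c * math.log(p_c / q_c)
    return total

# Assumed toy distributions for one categorical field (not the post's real data).
original_dist = {"Emergency": 0.6, "Inpatient": 0.4}
synthetic_dist = {"Emergency": 0.5, "Inpatient": 0.5}

print(f"KL(original || synthetic) = {kl_divergence(original_dist, synthetic_dist):.4f}")
print(f"KL(original || original)  = {kl_divergence(original_dist, original_dist):.4f}")
```

A value of zero means the two distributions are identical, and larger values mean a worse match. Note that KL divergence is not symmetric, so swapping the arguments generally gives a different number.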
Hard: Maintaining insights across fields
With synthetic data what we really want to capture are the statistical insights, or relationships between fields. For a lot of synthetic data use cases, such as data sharing or training machine learning models, maintaining the “insights” in a dataset is even more important than the per-column distribution above.
We can express these relationships and insights more quantitatively as the correlations between field values. In plain language, correlation is a measure of the relationship between two values, and it can be positive or negative. For example:
- As the weather gets colder, heating costs increase.
- Or, as the weather gets colder, air conditioning costs decrease.
Formalizing this mathematically, we can use correlations such as Pearson’s correlation coefficient, which results in a value in the range [-1, 1].
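To make the two weather examples concrete, here is a small sketch that computes Pearson’s correlation coefficient by hand: the covariance of the two series divided by the product of their standard deviations. All numbers are invented for illustration, and "coldness" is expressed as degrees below a 20°C baseline so that "colder" means a larger value.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

# Made-up observations: larger "degrees below 20C" means colder weather.
degrees_below_20c = [0, 5, 10, 15, 20]
heating_cost = [20, 35, 50, 65, 80]        # rises as it gets colder
air_conditioning_cost = [60, 40, 25, 10, 0]  # falls as it gets colder

print(f"coldness vs. heating: {pearson(degrees_below_20c, heating_cost):+.3f}")
print(f"coldness vs. A/C:     {pearson(degrees_below_20c, air_conditioning_cost):+.3f}")
```

The first pair lands near +1 (a positive correlation) and the second near -1 (a negative correlation), matching the two bullets above.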
Below is a correlation matrix across the fields in our original training set. In building a synthetic dataset, we seek to learn and replay these correlations as much as possible.
Running correlations against our original training dataset, the Gretel-generated synthetic dataset, and a “Naive” implementation of synthetic data, we can quantify what insights from the original data have been retained by our model.
Naive (simple) synthetic data. A naive approach to creating a synthetic dataset would be to model the value distributions per column and then shuffle the values within each column, with the goal of preserving per-column statistics while adding some privacy, such as preventing re-identification attacks. This approach has limited utility, as the relationships between fields get lost.
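One simple way to picture the naive approach is to shuffle every column independently. The sketch below is an assumption about what such a baseline could look like, not the exact implementation used for this post. The field names and records are hypothetical toy data.

```python
import random
from collections import Counter

def shuffle_columns_independently(rows, seed=0):
    """Shuffle every column on its own, breaking the link between fields."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    names = list(rows[0])
    columns = {name: [row[name] for row in rows] for name in names}
    for values in columns.values():
        rng.shuffle(values)
    return [{name: columns[name][i] for name in names} for i in range(len(rows))]

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"patient_class": "Emergency", "arrival_hour": 2},
    {"patient_class": "Emergency", "arrival_hour": 3},
    {"patient_class": "Emergency", "arrival_hour": 23},
    {"patient_class": "Inpatient", "arrival_hour": 10},
    {"patient_class": "Inpatient", "arrival_hour": 11},
    {"patient_class": "Inpatient", "arrival_hour": 14},
]

naive = shuffle_columns_independently(records)

# Each column keeps exactly the same value counts as before the shuffle.
for name in records[0]:
    before = Counter(row[name] for row in records)
    after = Counter(row[name] for row in naive)
    print(name, "distribution preserved:", before == after)
```

Every per-column count survives the shuffle, but the pairing between a patient's class and arrival hour in any given row is now arbitrary, which is why cross-field insights disappear.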
Gretel’s ML-generated synthetic data. Utilizing differential privacy during training, Gretel synthetics can re-create many of the same distributions and insights from the source data, while guaranteeing that individual health records are not memorized or replayed.
Gretel’s approach trains a machine learning-based model to recreate both the distributions and field correlations in the training set as closely as possible.
In the matrix below, we can visually see the similarities between the correlation matrices for the original and synthetic datasets. To characterize overall model quality, we can calculate the Root Mean Squared Error (RMSE) between the two matrices. For our EHR dataset, the RMSE is only 0.093, even while using aggressive differential privacy guarantees (ε = 0.606 and δ = 1.710e-07) to protect user privacy!
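As a rough sketch of how two correlation matrices can be scored against each other, the snippet below takes the element-wise differences, squares them, averages, and takes the square root. The matrices are invented 3×3 examples, `correlation_rmse` is a hypothetical helper, and averaging over every cell (including the diagonal) is an assumption, since the post does not spell out which cells were included.

```python
import math

def correlation_rmse(corr_a, corr_b):
    """Root mean squared error between two same-shaped correlation matrices."""
    squared_errors = [
        (a - b) ** 2
        for row_a, row_b in zip(corr_a, corr_b)
        for a, b in zip(row_a, row_b)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented correlation matrices for three fields (illustrative only).
original_corr = [
    [1.0, 0.8, -0.3],
    [0.8, 1.0, 0.1],
    [-0.3, 0.1, 1.0],
]
synthetic_corr = [
    [1.0, 0.7, -0.2],
    [0.7, 1.0, 0.2],
    [-0.2, 0.2, 1.0],
]

print(f"RMSE = {correlation_rmse(original_corr, synthetic_corr):.4f}")
```

An RMSE of zero would mean the synthetic data reproduces every pairwise correlation exactly, so lower is better.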
We’re super excited about the possibility of synthetic datasets to enable health organizations to safely share data with differential privacy guarantees. We’d love to hear about your use cases. Feel free to reach out to us for a more in-depth discussion in the comments, on Twitter, or at [email protected]. Follow us to keep up on the latest trends with synthetic data!