How accurate is my synthetic data?
Gretel’s new synthetic report is here, featuring a high-level score and metrics to help you assess the quality of your synthetic data.
Today we are thrilled to launch Gretel's new synthetic data report. At the heart of the report is a new, overall Synthetic Data Quality Score metric that helps you assess the quality of your synthetic data versus the original source data. To demonstrate the new report, we trained a synthetic model on the “Stroke Prediction Dataset”, a popular healthcare dataset from Kaggle, and then used the model to generate 5000 synthetic records. In this blog, we’ll step through the synthetic report generated for this dataset and demonstrate how simple it is to grasp both a quick, high-level quality summary and the more in-depth scoring methodology underneath.
Introducing the Synthetic Data Quality Score
Our Synthetic Data Quality Score, as shown below, is 94, which is classified as Excellent. This score is an estimate of how well the generated synthetic data maintains the same statistical properties as the original dataset. In this sense, the quality score can be viewed as a utility score, or a confidence score that scientific conclusions drawn from the synthetic dataset would match those drawn from the original dataset. Understanding the synthetic data model’s quality and accuracy is vital when the data will be used to answer critical questions such as “Is this patient at risk of a stroke?”.
The Synthetic Data Quality Score appears right at the top of the report, so it serves as a quick summary of how things went. Each circled question mark expands into additional relevant information. For example, if you click on the question mark to the left, you’ll see the table below showing how to interpret your score. When your score is Excellent or Good, any of the listed use cases are viable for your synthetic data. When your score is Moderate, the viable use cases are more limited. With any score other than Excellent, you can always try to improve your model with the tips and advice in our deep dives.
Also included at the top of the report are summary statistics about the shape of the training and synthetic data, as well as scores for each of the individual quality metrics. There are 12 fields in our Stroke Prediction dataset and 5110 training records. All records and fields were used to train our synthetic model (with all default settings), and from the model we generated 5000 synthetic records. Our overall Synthetic Data Quality Score (94) was computed by taking a weighted combination of the individual quality metrics: Field Distribution Stability (96), Field Correlation Stability (92), and Deep Structure Stability (95).
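The weighted combination can be sketched in a few lines. Note that Gretel's actual weights are not published, so the equal weights below are purely an illustrative assumption (they happen to reproduce the 94 for this dataset):

```python
# Hypothetical sketch: Gretel's real metric weights are not public, so equal
# weights are assumed here purely for illustration.
def overall_quality_score(field_distribution, field_correlation, deep_structure,
                          weights=(1 / 3, 1 / 3, 1 / 3)):
    scores = (field_distribution, field_correlation, deep_structure)
    # Weighted average of the three metric scores, rounded to a whole number.
    return round(sum(w * s for w, s in zip(weights, scores)))

score = overall_quality_score(96, 92, 95)  # 94 under the equal-weight assumption
```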
Training Field Details
For those interested in the details behind the quality scores, the report continues with an overview of the training field details.
The high-level Field Distribution Stability score is computed by averaging the individual per-field Field Distribution Stability scores, shown in the rightmost column above. To better understand a field's Distribution Stability score, you can click on the field name to be taken to a graph comparing the training and synthetic distributions.
Data Quality Metrics
Field Correlation Stability
To measure Field Correlation Stability, the correlation between every pair of fields is computed first in the training data, and then in the synthetic data. The absolute difference between these values is then computed and averaged across all field pairs. The lower this average difference, the higher the Field Correlation Stability quality score will be.
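The idea behind this comparison can be sketched with pandas. This is an illustration of the concept, not Gretel's exact implementation:

```python
import numpy as np
import pandas as pd

# Sketch of the Field Correlation Stability idea (not Gretel's exact code):
# compare pairwise correlations in the training and synthetic data, then
# average the absolute differences.
def correlation_gap(train: pd.DataFrame, synth: pd.DataFrame) -> float:
    diff = (train.corr(numeric_only=True) - synth.corr(numeric_only=True)).abs()
    # Average over the unique off-diagonal field pairs only.
    mask = np.triu(np.ones(diff.shape, dtype=bool), k=1)
    return float(diff.values[mask].mean())

# Toy example: "b" is perfectly correlated with "a" in training but perfectly
# anti-correlated in the synthetic data, the worst possible gap.
train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 3, 4]})
synth = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
gap = correlation_gap(train, synth)  # 2.0, the maximum possible difference
```

A lower gap corresponds to a higher Field Correlation Stability score.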
To aid in the comparison of field correlations, a heatmap is shown for both the training data and the synthetic data, as well as a heatmap for the computed difference of correlation values. To view the details of what each square in the heatmap refers to, simply hover over the square with your cursor. The hover text will show you the two fields involved, as well as the correlation in the training data, the correlation in the synthetic data and the difference between the two.
Below are the correlation matrices for our Stroke Prediction dataset. You can see at a glance how well the training correlations are maintained in the synthetic data.
Deep Structure Stability
To verify the statistical integrity of deeper, multi-field distributions and correlations, Gretel compares a Principal Component Analysis (PCA) computed first on the original data, then again on the synthetic data. The idea behind PCA is to capture the essential shape of all the features in just a few new features. These new features are referred to as the principal components.
Below are the Stroke Prediction dataset principal components computed first using the training data and second using the synthetic data. Their similarity is immediately apparent.
Gretel computes a synthetic quality score by comparing the distributional distance between the principal components in the original data and those in the synthetic data. The closer the principal components are, the higher the synthetic quality score will be. An example principal component comparison is shown below. As PCA is a very common approach used in machine learning for both dimensionality reduction and visualization, this metric gives immediate feedback as to the utility of the synthetic data for machine learning purposes.
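A minimal sketch of this comparison follows. Gretel's exact distance measure between principal components is not published, so the proxy below (projecting both datasets into the training data's PCA space and comparing per-component summary statistics) is an assumption for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative stand-in data: "synth" is a slightly perturbed copy of "train",
# mimicking synthetic data that preserves the training data's deep structure.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 5))
synth = train + rng.normal(scale=0.05, size=train.shape)

# Fit PCA on the training data and project both datasets into that space.
pca = PCA(n_components=2).fit(train)
proj_train = pca.transform(train)
proj_synth = pca.transform(synth)

# Rough proxy for distributional distance: mean absolute gap between the
# per-component means. Near zero when the structure is well preserved.
gap = np.abs(proj_train.mean(axis=0) - proj_synth.mean(axis=0)).mean()
```

The closer the projected distributions, the higher the resulting quality score would be.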
Field Distribution Stability
Field Distribution Stability is a measure of how closely the field distributions in the synthetic data mirror those in the original data. For each field we use a common approach for comparing two distributions, referred to as the Jensen-Shannon (JS) Distance. The lower the JS Distance is on average across all fields, the higher the Field Distribution Stability quality score will be. To aid in the comparison of original versus synthetic field distributions, a bar chart or histogram is shown for each field. Below are example graphs from two of the fields in our Stroke Prediction dataset.
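The per-field comparison can be sketched with SciPy's built-in Jensen-Shannon distance. The binning details below (bin count, shared bin edges) are illustrative assumptions, not Gretel's exact choices:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Sketch: compare one field's distribution in the training vs. synthetic data.
# Both samples are binned on a shared grid, then the JS distance is computed
# (scipy normalizes the histograms internally).
def field_js_distance(train_col, synth_col, bins=10):
    edges = np.histogram_bin_edges(
        np.concatenate([train_col, synth_col]), bins=bins
    )
    p, _ = np.histogram(train_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    return jensenshannon(p, q, base=2)  # 0 = identical, 1 = maximally different

# Hypothetical "age" field: comparing a sample with itself yields distance 0.
ages = np.array([30.0, 42.0, 55.0, 61.0, 47.0, 38.0])
identical = field_js_distance(ages, ages)
```

Averaging this distance over all fields, and inverting it onto a 0-100 scale, would yield a stability score of the kind shown in the report.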
Synthetic data is a viable privacy choice for many different use cases, and those use cases often require close statistical symmetry with the original data. In this blog, we’ve shown how the Synthetic Data Quality Score provides a quick answer as to whether your synthetic data has the quality needed for your use case. The details behind the score can also help you home in on problematic fields. You can find advice in our documentation for handling tough fields, as well as a multitude of other tips for improving your Synthetic Data Quality Score.