Gretel's New Synthetic Performance Report
Gretel's Premium SDK now includes detailed reporting that shows you how accurate your synthetic data's statistical distributions and correlations are.
One of the first and most important questions you may have after generating a synthetic dataset is “How accurate is my synthetic data?” Gretel Synthetics includes a performance report that shows how well the distributions in your training data were maintained in your new synthetic data. In this post, we'll walk you through our newly released version of the performance report, which includes interactive Plotly graphs and stylish HTML formatting.
See one of our latest blogs or video walkthroughs on how to generate synthetic data. With the latest version of our premium SDKs, the report is now written to a file of your choosing. The line in the notebook that generates the performance report can be utilized as shown below:
We have also made a public example of our report available here. This report is from synthesizing medical records into a safe, shareable version of the data.
Let's break down the sections of the report!
First is the report header, which gives three high-level pieces of useful information.
- Duplicated lines between training and synthetic data: A good synthetic model would never memorize any of the training lines. If lines are repeated here, you might be overfitting your model. Try decreasing the vocabulary size and/or number of training epochs to improve your model performance.
- Mean Absolute Error (MAE) of correlations: We compute the MAE between correlation values in the training and synthetic data as an estimate of how well correlations between fields are maintained. For example, if the correlation between field1 and field2 is .2 in the training set, but .3 in the synthetic set, then the absolute error here is .1. We take the average of this error across all field combinations, so the lower your MAE score is, the better your field correlations were maintained.
- Jensen-Shannon (JS) Distance: This score, on the other hand, looks at how well the distribution is maintained within a single field. Its values range from 0 to 1. A value of 0 means the training and synthetic distributions are exactly alike. A value of 1 means the two distributions are maximally different.
Just how low your MAE and JS Distance scores should be depends on how you intend to use the synthetic data. For example, if the intent is to use the synthetic data for development testing, then maintaining distributions and correlations may not be important, and higher MAE and JS Distance scores may have no consequence. However, if the intent is to use the synthetic data for machine learning or statistical analysis, then maintaining your data’s statistical integrity can be critical, and very low MAE and JS Distance scores are desirable.
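To make the first two header values concrete, here is a minimal sketch in plain Python. It is not the code the report itself runs: the function names `duplicated_lines` and `correlation_mae`, the row tuples, and the correlation values are all made up for illustration, with the field1/field2 pair taken from the example above.

```python
def duplicated_lines(training_rows, synthetic_rows):
    """Count synthetic rows that exactly match a training row."""
    training_set = set(training_rows)
    return sum(1 for row in synthetic_rows if row in training_set)


def correlation_mae(train_corr, synth_corr):
    """Average absolute difference between correlations, over all field pairs."""
    errors = [abs(train_corr[pair] - synth_corr[pair]) for pair in train_corr]
    return sum(errors) / len(errors)


# Hypothetical rows, represented as tuples so they can be hashed and compared.
training_rows = [("F", 34, "CA"), ("M", 52, "NY")]
synthetic_rows = [("F", 34, "CA"), ("M", 47, "TX")]

# Hypothetical pairwise correlations; field1/field2 matches the example in the text.
train_corr = {("field1", "field2"): 0.2, ("field1", "field3"): 0.5, ("field2", "field3"): -0.1}
synth_corr = {("field1", "field2"): 0.3, ("field1", "field3"): 0.5, ("field2", "field3"): -0.3}

duplicates = duplicated_lines(training_rows, synthetic_rows)
mae = correlation_mae(train_corr, synth_corr)
print(f"duplicated lines: {duplicates}, correlation MAE: {mae:.3f}")
```

In this toy case one synthetic row repeats a training row, and the absolute errors of .1, 0 and .2 average to an MAE of about .1.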
Per Field Jensen-Shannon Score
The next section in the report shows the JS Distance score for each individual field (as shown below). For each field there is a Category Count (the number of unique values) and a Distance Score. The lower your Distance Score, the better the within-field distribution was maintained in the synthetic data.
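If you'd like to see how a per-field score of this kind can be computed, the sketch below implements the textbook Jensen-Shannon distance (the square root of the JS divergence, using base-2 logarithms so the result falls between 0 and 1). The field names and values are invented, and taking the Category Count from the training data is our assumption here, not a statement of how the report counts it.

```python
import math
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category in values."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical categorical fields from a training set and a synthetic set.
training = {"gender": ["F", "M", "F", "M"], "state": ["CA", "CA", "NY", "TX"]}
synthetic = {"gender": ["F", "M", "M", "F"], "state": ["CA", "NY", "NY", "NY"]}

scores = {}
for field in training:
    categories = sorted(set(training[field]) | set(synthetic[field]))
    p = distribution(training[field], categories)
    q = distribution(synthetic[field], categories)
    scores[field] = js_distance(p, q)
    category_count = len(set(training[field]))  # assumption: counted on training data
    print(f"{field}: category count={category_count}, distance={scores[field]:.3f}")
```

The gender field has identical proportions in both sets, so its distance is 0, while the state field drifts and scores somewhere between 0 and 1.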
Gretel computes correlations between all pairs of fields in the training data and then again in the synthetic data. The specific correlation algorithm used depends on the data type of the two fields being compared. We use Pearson's r for correlation between numeric fields, Theil's U for correlation between categorical fields and Correlation Ratio for correlation between numeric and categorical fields.
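For readers who want to see what these three measures look like, here is a small plain-Python sketch using their standard textbook definitions. It illustrates the idea of choosing a measure by data type and is not Gretel's implementation. The `correlate` dispatcher, the `is_numeric` check, and the sample columns are assumptions made for the example. Note that Theil's U is asymmetric: U(x|y) generally differs from U(y|x).

```python
import math
from collections import Counter, defaultdict


def pearson_r(x, y):
    """Pearson's r between two numeric columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def entropy(values):
    """Shannon entropy of a categorical column."""
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in Counter(values).values())


def theils_u(x, y):
    """Theil's U(x|y): share of the uncertainty in x that is explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    groups = defaultdict(list)
    for x_value, y_value in zip(x, y):
        groups[y_value].append(x_value)
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in groups.values())
    return (h_x - h_x_given_y) / h_x


def correlation_ratio(categories, measurements):
    """Correlation ratio (eta) between a categorical and a numeric column."""
    groups = defaultdict(list)
    for category, value in zip(categories, measurements):
        groups[category].append(value)
    grand_mean = sum(measurements) / len(measurements)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in measurements)
    return math.sqrt(between / total) if total else 0.0


def is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)


def correlate(a, b):
    """Pick a measure based on the data types of the two columns."""
    if is_numeric(a) and is_numeric(b):
        return pearson_r(a, b)
    if is_numeric(a):
        return correlation_ratio(b, a)
    if is_numeric(b):
        return correlation_ratio(a, b)
    return theils_u(a, b)


# Hypothetical columns.
age = [20, 30, 40, 50]
income = [25, 35, 45, 55]
smoker = ["y", "y", "n", "n"]
risk = ["high", "high", "low", "low"]

print(correlate(age, income))   # numeric vs numeric: Pearson's r
print(correlate(smoker, risk))  # categorical vs categorical: Theil's U
print(correlate(smoker, age))   # categorical vs numeric: Correlation Ratio
```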
We use a heatmap to show these correlations:
As you mouse over the individual squares of the heatmap, Plotly will show you the corresponding two fields and the specific correlation value between those fields. If you're new to Plotly, play around with the modebar that appears as you mouse over the upper right area of the graph. One menu option that's particularly nice for heatmaps is "Toggle Spike Lines". When you turn this on, dotted lines to both the x and y axes will appear. The "Camera" option is also handy for downloading a PNG of the graph.
Following the training and synthetic correlation matrices, you'll see a heatmap that is essentially the difference between the two. We literally take the training correlation value for a field-field combination and subtract the corresponding synthetic correlation value. This makes it easy to narrow in quickly on any differences.
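The subtraction described above is simple enough to sketch in a few lines. The field names and matrix values below are invented for illustration, and the helper `correlation_difference` is our own name, not part of any SDK.

```python
fields = ["age", "income", "risk"]

# Hypothetical correlation matrices; rows and columns follow the order of `fields`.
train_corr = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]
synth_corr = [
    [1.0, 0.7, 0.35],
    [0.7, 1.0, 0.5],
    [0.35, 0.5, 1.0],
]


def correlation_difference(train, synth):
    """Training correlation minus synthetic correlation, cell by cell."""
    return [[t - s for t, s in zip(t_row, s_row)] for t_row, s_row in zip(train, synth)]


diff = correlation_difference(train_corr, synth_corr)

# Find the field pair whose correlation drifted the most.
largest = max(
    (abs(diff[i][j]), fields[i], fields[j])
    for i in range(len(fields))
    for j in range(i + 1, len(fields))
)
print(f"largest difference: {largest[1]} vs {largest[2]} ({largest[0]:.2f})")
```

A matrix like `diff` can be handed to a Plotly heatmap trace as its `z` values, with `fields` as the `x` and `y` labels, to get a view similar to the one in the report.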
After the correlation difference heatmap, you'll see a series of graphs, one per field that we're able to graph. Categorical fields with a very high number of unique values are a bit too messy to graph. We use bar charts for categorical fields and histograms for numeric fields. The Plotly modebar option we like to use on these charts is the "Compare data on hover" option. When this is enabled, instead of the mouse-over showing just the value for the closest bar, it will show both values for the closest pair of bars so you can quickly see both the training and synthetic values for this category or numeric bin. Below we show both an example bar chart as well as a histogram.
Graphs are so much fun, aren't they? Graphs say in one glimpse the equivalent of what many hard-to-follow English sentences could say (not to mention the aesthetic benefit).
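Side-by-side charts like these rely on the training and synthetic values being counted over the same categories or the same numeric bins. The sketch below shows one way to prepare such paired counts in plain Python. The helper names, the sample values, and the choice of four equal-width bins are assumptions for illustration and do not describe how the report bins its data.

```python
from collections import Counter


def paired_category_counts(training, synthetic):
    """Counts per category for both datasets, over the union of categories."""
    train_counts, synth_counts = Counter(training), Counter(synthetic)
    categories = sorted(set(train_counts) | set(synth_counts))
    return [(c, train_counts[c], synth_counts[c]) for c in categories]


def paired_histogram(training, synthetic, bins=5):
    """Counts per numeric bin for both datasets, using shared equal-width bins."""
    low = min(min(training), min(synthetic))
    high = max(max(training), max(synthetic))
    width = (high - low) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_counts(values):
        counts = [0] * bins
        for v in values:
            # The maximum value falls on the last edge, so clamp it into the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
        return counts

    edges = [low + i * width for i in range(bins + 1)]
    return edges, bin_counts(training), bin_counts(synthetic)


# Hypothetical categorical and numeric fields.
blood_train = ["A", "B", "A", "O"]
blood_synth = ["A", "A", "B", "AB"]
age_train = [21, 25, 34, 47, 52, 60]
age_synth = [22, 29, 35, 41, 58, 60]

bars = paired_category_counts(blood_train, blood_synth)
edges, train_hist, synth_hist = paired_histogram(age_train, age_synth, bins=4)
print(bars)
print(edges, train_hist, synth_hist)
```

Each tuple in `bars` and each position in the two histogram lists corresponds to one pair of bars, which is the pairing that "Compare data on hover" surfaces in the chart.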
How Can I Improve My Model?
Gretel Synthetics generally does very well at maintaining within-field distributions as well as cross-field correlations. However, if your JS Distance scores and/or your MAE score aren’t as low as you’d like them to be, here are a few suggestions for fine-tuning your model:
- Try retraining your model a few more times. The training process is stochastic and it can often take several tries to obtain the best model.
- If you can, increase the number of training examples. If you have no more data, try duplicating your existing set two or three times.
- Make sure you’re training with enough epochs. Too few can result in an underfit model.
- Try increasing the rnn_units parameter from 256 units to 512 or 1024. This will enable the model to learn more complex patterns.
- Experiment with the dropout_rate. A lower value may improve model accuracy.
- Gretel.ai’s premium SDKs (free during public beta) improve field-to-field correlations by clustering similar fields at training time, and our automated validators improve model output by ensuring that the semantics of the synthetic data closely match those of the training dataset. Check out our example walkthrough.
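If you want to work through the suggestions above systematically, one option is to list the combinations you plan to try and keep the run with the lowest score. The sketch below is only an outline of that bookkeeping. The parameter names `rnn_units` and `dropout_rate` come from the list above, but the specific epoch counts and dropout values, the `best_run` helper, and the scores in `recorded` are hypothetical. Training and scoring each candidate is left to your own notebook.

```python
from itertools import product

# Candidate settings to try; epoch counts and dropout values are hypothetical.
search_space = {
    "epochs": [30, 50],
    "rnn_units": [256, 512, 1024],
    "dropout_rate": [0.2, 0.1],
}

# Every combination of the values above, as one dict per training run.
candidates = [
    dict(zip(search_space, values)) for values in product(*search_space.values())
]


def best_run(results):
    """Pick the (settings, score) pair with the lowest score; lower is better."""
    return min(results, key=lambda item: item[1])


# Made-up scores standing in for the MAE or JS Distance you record per run.
recorded = [(candidates[0], 0.12), (candidates[5], 0.07), (candidates[9], 0.09)]
settings, score = best_run(recorded)
print(len(candidates), settings, score)
```

Because training is stochastic, it can be worth recording more than one run per combination before deciding which settings to keep.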
At Gretel.ai we are super excited about the possibility of using synthetic data to augment training sets to create ML and AI models that generalize better against unknown data and with reduced algorithmic biases. We’d love to hear about your use cases. Feel free to reach out to us for a more in-depth discussion in the comments, on Twitter, or at firstname.lastname@example.org. Like gretel-synthetics? Give us a ⭐ on GitHub!