Compare Synthetic and Real Data on ML Models with the new Gretel Synthetic Data Utility Report
Most organizations are eager to use synthetic data to augment their real-world training datasets and improve the performance of machine learning models in production. But how do models trained with synthetic data compare to those trained on real data alone? Establishing this comparison is difficult and slows the AI development lifecycle. Today, we’re excited to announce that you can now assess the performance of synthetic data on ML models with a few simple clicks using our new Gretel Evaluate classification and regression tasks, which produce a new Synthetic Data Utility Report featuring a novel ML Quality Score (MQS).
The ability to validate synthetic data fidelity using Gretel Evaluate, our API for measuring synthetic data quality and utility, is one of the reasons why thousands of developers trust data generated with Gretel. Our new Synthetic Data Utility Report expands upon the statistical measures of quality you have come to know and love through the Synthetic Data Quality Report, Synthetic Data Quality Score (SQS), and Benchmark—all part of the broader Gretel Evaluate offering.
Analyzing synthetic data performance is easier than ever before
In a previous post, we showed you how to generate synthetic grocery store purchase data using the Gretel ACTGAN model and then used the open-source AutoML PyCaret classification library to evaluate the performance of prediction models trained on synthetic data versus those trained on real-world data.
Now you can perform this evaluation in the Gretel environment, including directly in the low-code Gretel Console.
In this blog, we’ll walk through an example showing how to: (1) add a Gretel Evaluate classification or regression task to any Gretel Synthetics model, (2) use the Evaluate SDK, and (3) understand the Synthetic Data Utility Report.
Synthesize & analyze data using Gretel Evaluate
You’re probably already familiar with this step: synthesizing data! In this example, we’ll use the publicly available bank marketing dataset which predicts whether a customer will subscribe to a term deposit (prediction: yes/no in target column “y”).
It’s easier than ever to synthesize + evaluate! Head over to the Gretel Console to start with the sample `bank_marketing` dataset and use the existing configuration.
The Evaluate tasks call the open-source AutoML PyCaret library to train and evaluate classification and regression models. The classification and regression tasks are pre-set with defaults for the `holdout`, `metric`, and `models` values. All you have to do is make sure the `target` column matches your data.
Here’s what the `bank_marketing` dataset looks like:
Since we’re interested in training classifiers to predict the categorical “yes/no” outcome, we’ll set the Evaluate task to `classification` and the `target` parameter to “y”. Then all we have to do is click “Begin training” to start the process.
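In YAML, a classification setup looks roughly like the sketch below. This is an illustrative config, not a copy of the Console sample: the exact field layout may differ, so start from the sample `bank_marketing` configuration in the Console and adjust the `evaluate` section to match.

```yaml
# Sketch of a synthetics config with an Evaluate classification task.
# Field names follow the parameters discussed in this post; check the
# Console's sample configuration for the authoritative layout.
schema_version: "1.0"
name: bank-marketing-classification
models:
  - actgan:
      data_source: __tmp__
      evaluate:
        task: classification
        target: y   # the yes/no term-deposit column in bank_marketing
```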
While model training is in progress, we can check the logs to verify that training is progressing without error. The logs will also show that reports are created at the end of model training.
What about regression?
Let’s say we want to predict the numeric `thalach` value (which refers to maximum heart rate) in the publicly available UCI heart disease dataset. You can download the processed_cleveland_heart_disease dataset using this public Gretel S3 bucket link.
Here’s what the model configuration looks like:
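As a hedged sketch (again, the exact field layout may differ from the Console sample, so treat this as illustrative), the regression version changes only the `evaluate` section:

```yaml
# Sketch of a synthetics config with an Evaluate regression task,
# targeting the numeric maximum-heart-rate column in the UCI dataset.
schema_version: "1.0"
name: heart-disease-regression
models:
  - actgan:
      data_source: __tmp__
      evaluate:
        task: regression
        target: thalach   # maximum heart rate achieved
```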
Notice that `task` and `target` changed to match the use case and dataset. Now, click “Begin training” to start this model, and follow along in the logs to make sure training is progressing without error.
Where’s the report?
The classification and regression Synthetic Data Utility Reports are available in the Records & downloads section of the model page. We’ll go through each part of the report later on in this blog.
In the previous examples, we didn’t change the default values for:
- `holdout` - the train-test split
- `metric` - what the AutoML models optimize on
- `models` - the classification and regression models to train
If you need more customization, you can change these optional parameters, like so:
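For instance, an `evaluate` section with all three optional parameters set explicitly might look like the sketch below. The model IDs shown are standard PyCaret classification identifiers; the exact set Gretel supports is listed in the documentation, so verify there before using this.

```yaml
evaluate:
  task: classification
  target: y
  holdout: 0.2        # reserve 20% of records as the test split
  metric: accuracy    # what the AutoML models optimize on
  models:             # restrict training to specific model types
    - lr              # logistic regression
    - rf              # random forest
    - xgboost         # gradient-boosted trees
```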
The supported models and metrics are detailed in the documentation.
Using the Gretel Evaluate SDK
In the notebooks, you’ll also see an option to call the Evaluate SDK directly on two datasets (one synthetic, one real-world) to create a Synthetic Data Utility Report.
Indicate whether you want to evaluate classification or regression by importing `DownstreamClassificationReport` or `DownstreamRegressionReport` from the Gretel Client, and set your `data_source` to point to the synthetic data and `ref_data` to point to the real data.
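Putting that together, an SDK call looks roughly like the following. This is a sketch rather than a verbatim recipe: the file names are placeholders for your own datasets, and you should confirm the import path and method names against the current Gretel Client SDK reference. Running it requires a configured Gretel session (API key).

```python
from gretel_client.evaluation import DownstreamClassificationReport

# data_source: the synthetic dataset; ref_data: the real-world dataset.
# Both file names here are hypothetical placeholders.
report = DownstreamClassificationReport(
    data_source="synthetic_bank_marketing.csv",
    ref_data="bank_marketing.csv",
    target="y",
)
report.run()
```

For regression, swap in `DownstreamRegressionReport` and point `target` at your numeric column.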
After you start the job, head over to your Projects list in the Gretel Console to see more training logs and easily download the Utility Report after the job is completed.
For more detailed information about the SDK, see the documentation.
The Synthetic Data Utility Report
The new Gretel Evaluate classification and regression tasks generate a Synthetic Data Utility Report showing you the performance of the synthetic data compared to the performance of real data.
First, you’ll see a high-level ML Quality Score (MQS), which gives you an at-a-glance signal about the utility of your synthetic data for downstream models in your ML pipeline.
The MQS is calculated as the ratio of the average score from the top-performing models trained on synthetic data to the average score from the top-performing models trained on real data. In the examples above, the classification MQS is 96%, which is a promising result for the synthetic data. You may even see an MQS above 100%, meaning the synthetic data outperformed the real data, an exciting result!
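The ratio is straightforward to sketch in a few lines of Python. The accuracy values below are made-up numbers chosen to land near the 96% example, not output from an actual report:

```python
def ml_quality_score(synthetic_scores, real_scores):
    """MQS: average top-model score on synthetic data divided by
    average top-model score on real data, as a rounded percentage."""
    synthetic_avg = sum(synthetic_scores) / len(synthetic_scores)
    real_avg = sum(real_scores) / len(real_scores)
    return round(100 * synthetic_avg / real_avg)

# Hypothetical accuracy scores for the top three models on each dataset
synthetic_top = [0.912, 0.900, 0.888]   # average 0.9000
real_top = [0.945, 0.9375, 0.930]       # average 0.9375

print(ml_quality_score(synthetic_top, real_top))  # → 96
```

A score above 100 simply means the synthetic-trained models averaged a higher metric than the real-trained ones.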
The top-performing models are highlighted in the next section of the report:
In the example below, during model training, “accuracy” was set as the classification metric in the Gretel Evaluate task configuration. (Note: “accuracy” is the default metric for classification and “R2” (i.e., R-squared) is the default for regression, so you don’t need to change those options unless you want to optimize training on a different metric.)
In the classification report, you’ll find a results table with metrics for each model:
In the regression report, you’ll find a table of results based on regression metrics:
You can also dive into plots to understand the top models. The classification report plots confusion matrices for the top models trained on both synthetic and real data.
In the regression report, review the error distribution for more information on the downstream models.
You can find more information about the Synthetic Data Utility Report in the documentation.
You’ve trained and evaluated synthetic data on machine learning models against real-world data. Now what? By reviewing the Data Utility Report and MQS, you can get immediate feedback on whether your synthetic data is ready for your ML training or experimentation workflows.
If the MQS is not as high as you’d like, you can try a few things, including reviewing the Synthetic Data Quality Report to dive into synthetic data quality and fidelity metrics. If your data quality results aren’t where you want them to be, you can also follow our tips for improving your synthetic data.
With these new Evaluate offerings and Synthetic Data Utility Report, we hope you’ll find it quicker and easier than ever to validate the performance of synthetic data, and safely incorporate it into your business.
How will you use this feature? Let us know by joining our Discord Community to share your feedback. We’re here to help. Go forth and synthesize!