How does Gretel define synthetic data quality?

Gretel treats synthetic data quality as context-specific: whether a synthetic dataset is "good" depends on the intended use case. For that reason, Gretel is continuously developing additional metrics to better assess the suitability of synthetic data for various downstream applications.

Currently, Gretel defines synthetic data quality using the Synthetic Data Quality Report, which evaluates the generated synthetic data based on two key factors: the Synthetic Data Quality Score (SQS) and the Privacy Protection Level (PPL).

The SQS represents how well the synthetic data maintains the statistical properties of the original dataset; the score indicates which use cases the synthetic data is suitable for. The PPL is determined by the privacy mechanisms enabled during the synthetic data generation process, such as outlier and similarity filters, overfitting prevention, and differential privacy.

To provide a comprehensive understanding of the synthetic data quality, Gretel's report includes three key metrics:

Field Correlation Stability: Measures the similarity of correlations between field pairs in the training and synthetic data.
Deep Structure Stability: Compares the principal components of the original and synthetic data using Principal Component Analysis (PCA).
Field Distribution Stability: Evaluates how closely the field distributions in the synthetic data mirror those in the original data using the Jensen-Shannon Distance.
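
To make two of these metrics more concrete, here is a minimal Python sketch of the underlying ideas. It is not Gretel's implementation: the function names, the toy data, and the choice to average absolute correlation differences are assumptions made for illustration. It covers per-field Jensen-Shannon distance (the idea behind Field Distribution Stability) and a pairwise correlation comparison (the idea behind Field Correlation Stability). The PCA comparison is left out because it is better done with a numerical library.

```python
import math
from collections import Counter
from itertools import combinations


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base-2 log) between two discrete distributions.

    0.0 means the distributions are identical; 1.0 means they do not overlap.
    """
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


def field_distribution_distance(train_values, synth_values):
    """JS distance between the value frequencies of one categorical field."""
    categories = sorted(set(train_values) | set(synth_values))
    train_counts, synth_counts = Counter(train_values), Counter(synth_values)
    p = [train_counts[c] / len(train_values) for c in categories]
    q = [synth_counts[c] / len(synth_values) for c in categories]
    return jensen_shannon_distance(p, q)


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (non-constant fields)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_correlation_gap(train, synth):
    """Mean absolute difference in pairwise correlations (illustrative aggregate).

    `train` and `synth` map field names to lists of numeric values.
    """
    pairs = list(combinations(sorted(train), 2))
    gaps = [
        abs(pearson(train[a], train[b]) - pearson(synth[a], synth[b]))
        for a, b in pairs
    ]
    return sum(gaps) / len(gaps)


# Toy data, invented for this example.
train_plan = ["basic", "basic", "pro", "pro", "pro", "team"]
synth_plan = ["basic", "pro", "pro", "pro", "team", "team"]
train_numeric = {"age": [23, 35, 47, 52, 61], "income": [30, 48, 62, 70, 85]}
synth_numeric = {"age": [25, 33, 45, 55, 60], "income": [32, 45, 65, 68, 88]}

print("JS distance for 'plan':", field_distribution_distance(train_plan, synth_plan))
print("Mean correlation gap:", mean_correlation_gap(train_numeric, synth_numeric))
```

In both cases a value near 0 means the synthetic data closely tracks the training data. How such raw distances are scaled into the scores shown in the report is not covered here.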

These metrics, along with summary statistics and a training field overview, help users understand the quality and utility of the generated synthetic data for their specific use cases. As Gretel expands its metrics to better align with downstream applications, users will have an even clearer picture of how well-suited the synthetic data is for their particular needs.
