Measure the utility and quality of GPT-generated text using Gretel’s new text report

Use Gretel’s text metrics to evaluate the semantic and structural similarity between AI-generated text and training text

Introduction

With recent innovations in various language models, generating synthetic text data has never been easier, but how can we evaluate that synthetic data against the real data? We’re excited to announce our Synthetic Text Data Quality Report, which measures text semantics and structure in 50 languages!

The report highlights Gretel’s Synthetic Text Data Quality Score (or Text SQS), an estimate of how well the generated synthetic data maintains the same semantic and structural properties as the original dataset. In this sense, the score can be viewed as a utility score or a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead. If you don't require semantic or structural symmetry, as might be the case in a testing or demo environment, a lower score may be just as acceptable.

Let’s break down the different sections of the report! In this report, we are comparing text data from Amazon product reviews with the generated synthetic text from our GPT-x model.

How to understand the Text Synthetic Data Quality Score (Text SQS):

The Text Synthetic Data Quality Score (Text SQS) is computed as the weighted combination of the text semantic similarity and text structure similarity scores. We’ll discuss each score more specifically in the next section.
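As a rough illustration of a weighted combination, here is a minimal Python sketch. The function name text_sqs and the semantic_weight parameter are hypothetical, and the equal default weighting is an assumption made for the example; this post does not state the weights the report actually uses.

```python
def text_sqs(semantic_score: float, structure_score: float,
             semantic_weight: float = 0.5) -> float:
    """Combine two 0-100 component scores into a single 0-100 score.

    semantic_weight is a hypothetical parameter: the weights used by the
    actual report are not given in this post.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    structure_weight = 1.0 - semantic_weight
    return semantic_weight * semantic_score + structure_weight * structure_score


# Equal weights, then a variant that favors semantic similarity.
print(text_sqs(90.0, 70.0))
print(text_sqs(90.0, 70.0, semantic_weight=0.75))
```

Because the weights sum to 1, the combined score stays in the same 0–100 range as its two components.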

Figure 1: The Synthetic Text Data Quality Score

One way to interpret the Text SQS is to look at which use cases the generated synthetic data would be suitable for. Below, we break down recommendations based on the score. If the score is “Poor” or “Very Poor”, read our tips for ideas on how to improve it.

Figure 2: Matrix of use cases based on Text SQS

Measuring semantic and structural similarity:

Figure 3: The two component parts of the Text SQS

Text Semantic Similarity Score:

This section of the report shows how semantically similar the real and synthetic texts are, with a score in the range 0–100. Semantic similarity refers to how close two texts are in meaning. An embedding model encodes each text as a vector of size 512. We then use cosine similarity to compare the average embedded vectors of the real and synthetic text.
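To make the computation concrete, here is a minimal sketch in Python. The embedding model itself is not included, so toy three-dimensional vectors stand in for the 512-dimensional embeddings. The function names and the mapping of cosine similarity onto 0–100 (clipping negative values to zero and multiplying by 100) are assumptions made for illustration, not the report's exact implementation.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_similarity_score(real_embeddings, synthetic_embeddings):
    """Compare the average real embedding with the average synthetic embedding."""
    cos = cosine_similarity(mean_vector(real_embeddings),
                            mean_vector(synthetic_embeddings))
    # Assumed scaling: clip negatives to 0, then map [0, 1] onto [0, 100].
    return 100.0 * max(0.0, cos)


# Toy embeddings standing in for the output of a real embedding model.
real = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
synthetic = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]
print(semantic_similarity_score(real, synthetic))
```

Note that averaging happens before the comparison, so the score reflects how close the two datasets are on the whole rather than how close any individual pair of records is.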

A higher score gives the user more confidence that the synthetic text can be used in place of the original text samples in downstream semantic text classification tasks.

Text Structure Similarity Score:

Structural similarity indicates how the average characters per word, words per sentence, and sentence counts compare between the datasets. We use Jensen-Shannon divergence to calculate the distance between the real and synthetic distributions of each of these statistics across the entire text dataset. You can see each distribution in the following plot. Similar real and synthetic distributions result in a higher text structure similarity score, which is the scaled average of the three distance values.
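The idea can be sketched in a few lines of Python. Everything here is a simplified assumption for illustration: the naive sentence and word splitting, the rounding of each statistic to an integer bin, the use of the base-2 Jensen-Shannon distance (the square root of the divergence, which lies between 0 and 1), and the final scaling of 100 × (1 − average distance). The report's own tokenization, binning, and scaling may differ.

```python
import math
import re
from collections import Counter


def text_stats(text):
    """Return (avg characters per word, avg words per sentence, sentence count)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not sentences or not words:
        return (0.0, 0.0, 0)
    chars_per_word = sum(len(w) for w in words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    return (chars_per_word, words_per_sentence, len(sentences))


def distribution(values, support):
    """Relative frequency of each value in `support`."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in support]


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def structure_similarity_score(real_texts, synthetic_texts):
    real_stats = [text_stats(t) for t in real_texts]
    synth_stats = [text_stats(t) for t in synthetic_texts]
    distances = []
    for i in range(3):  # one distance per statistic
        real_vals = [round(s[i]) for s in real_stats]
        synth_vals = [round(s[i]) for s in synth_stats]
        support = sorted(set(real_vals) | set(synth_vals))
        distances.append(js_distance(distribution(real_vals, support),
                                     distribution(synth_vals, support)))
    # Assumed scaling: average distance of 0 maps to 100, 1 maps to 0.
    return 100.0 * (1.0 - sum(distances) / len(distances))


real = ["Great product. Works well.", "Arrived late. Box was damaged. Not happy."]
synthetic = ["Nice item. Works fine.", "Very disappointing purchase overall."]
print(structure_similarity_score(real, synthetic))
```

Both datasets are binned over the same set of values so that the two distributions are directly comparable before the distance is taken.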

Figure 4: Graphs showing comparative distributions of real (green) and synthetic (purple) data

Summarize the real and synthetic datasets:

This table gives an overall view of the training and synthetic text data features. The comparison always uses the smaller of the training and synthetic record counts, or the default of 80 generated records. Fewer duplicated training lines in the synthetic text indicate better privacy, since less of the training data is repeated verbatim. Missing values refer to rows containing an empty string. Similar average character, word, and sentence counts across the training and synthetic data correspond to higher text structure similarity.
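For a sense of how rows like these could be computed, here is a small Python sketch. The summarize function and its dictionary keys are hypothetical names chosen for this example, and treating a "duplicated training line" as a non-empty synthetic record that exactly matches a training record is an assumption; the report may define it differently.

```python
def summarize(training_texts, synthetic_texts):
    """Compute a few summary rows for a training/synthetic comparison."""
    training_set = {t for t in training_texts if t.strip()}
    return {
        "training_records": len(training_texts),
        "synthetic_records": len(synthetic_texts),
        # The smaller of the two record counts.
        "compared_records": min(len(training_texts), len(synthetic_texts)),
        # Missing values are rows that hold an empty string.
        "missing_training": sum(1 for t in training_texts if not t.strip()),
        "missing_synthetic": sum(1 for t in synthetic_texts if not t.strip()),
        # Assumed definition: synthetic rows copied verbatim from training data.
        "duplicated_training_lines": sum(
            1 for t in synthetic_texts if t in training_set
        ),
    }


training = ["Great product.", "Broke in a week.", "Great product."]
synthetic = ["Great product.", "", "Works as described."]
for name, value in summarize(training, synthetic).items():
    print(f"{name}: {value}")
```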

Figure 5: Table with fine-grained summary of real and synthetic data

Visualize semantic similarity with principal component analysis (PCA):

In the following plot, the off-diagonal panels show the relationships among the first four principal components of the average embedded vectors in the real and synthetic text, along with the variance ratio explained by each component. The diagonal panels show the distribution of each principal component for the real and synthetic texts plotted on top of each other. Similar real and synthetic scatter matrices and distribution plots indicate a higher semantic similarity score, which gives a user more confidence in replacing the original text with the synthetic text for semantic text classification tasks.
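If you want to reproduce this kind of projection on your own embeddings, the sketch below shows one way to do it with NumPy. It rests on several assumptions: random vectors stand in for real 512-dimensional embeddings, the PCA is fit on the real and synthetic vectors stacked together, and fit_pca and project are hypothetical helper names. The report may fit and plot its components differently.

```python
import numpy as np


def fit_pca(X, n_components=4):
    """Fit PCA via SVD; return the mean, top components, and variance ratios."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2 / (len(X) - 1)
    ratio = variance / variance.sum()
    return mean, vt[:n_components], ratio[:n_components]


def project(X, mean, components):
    """Project rows of X onto the fitted principal components."""
    return (X - mean) @ components.T


# Toy data: synthetic embeddings are noisy copies of the real ones.
rng = np.random.default_rng(0)
real = rng.normal(size=(80, 512))
synthetic = real + rng.normal(scale=0.1, size=(80, 512))

# Assumption: one shared PCA is fit on both datasets together.
mean, components, ratio = fit_pca(np.vstack([real, synthetic]))
real_pc = project(real, mean, components)
synth_pc = project(synthetic, mean, components)

print("explained variance ratio:", ratio)
print("real projection shape:", real_pc.shape)
print("synthetic projection shape:", synth_pc.shape)
```

Plotting each pair of columns of real_pc and synth_pc against each other gives the off-diagonal scatter panels, and a histogram of each single column gives the diagonal panels.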

Figure 6: PCA scatter matrix showing semantic similarity of real (green) and synthetic (purple) data

Try it yourself

You can generate and dig into a report yourself by selecting the “Generate natural language text using GPT” card on the Gretel Console. 

To learn more about using Gretel GPT, check out our post on fine-tuning Gretel’s GPT models to generate Q&A-style tweets from your favorite Twitter personalities and the accompanying notebook. If you have any questions, drop us a line in the Synthetic Data Discord Community.

Video: How to use Gretel's new Synthetic Text Data Quality Report

Watch Nicole Pang, Senior Product Manager at Gretel, walk through the Synthetic Text Data Quality Report and explain how it helps users evaluate generated text data.