How to use Gretel's new Synthetic Text Data Quality Report
Transcription
Speaker 1 (00:00):
Hey folks. I'm Nicole Pang. I'm a product lead here at Gretel, and today we'll walk through how you can use Gretel's new synthetic text data quality report. If you've ever used a GPT model, you know that it outputs natural language text, which is unstructured data. But have you ever wondered how that output compares to the training data? How do I understand the quality or utility of this natural language synthetic data? If I want to use it for my machine learning models, how can I understand the metrics and scores? Well, we've created a report with those metrics and scores, and I'm excited to show you all about it. So I'll kick off a GPT model on the Gretel dashboard here at console.gretel.ai. Feel free to follow along by logging in yourself. I'll select "Generate natural language text using GPT." What's so easy about this is that I can get started with very few changes. I've already chosen a project. If I don't have a data source to upload, I can also use our sample dataset to get started. Here I've uploaded a dataset of Amazon dress reviews. Next up is the model configuration. Like I just mentioned, I don't actually need to change anything in the default configuration if I don't want to. You can see here I'm using the pre-trained Gretel AI MPT 7 billion parameter model. So I'll just click "Begin training."
(01:25)
This kicks off the model training, and you can see we're training the model on the dataset that I just uploaded. This is previewing the first 20 records of the data, and once the model starts, I'll be able to check the logs here, follow the progress of training, and see any other messages that show up as training and generation happen. Oh, here we go. That's great. It's downloaded the model from the remote source. So we'll check back in a little bit when generation happens and take a look at the report. Okay, great. So our training is completed, our synthetic data records were generated, and the synthetic text quality report was run as well. You can see here there's a preview of the report for you in the console. You can see the overall score is excellent.
(02:19)
Semantic similarity score is also excellent. Structure score is good, and we'll get into what that means. I'm going to download a preview of my results, which gives me the generated data. I'm also going to download the report. In case you're not familiar with our console, you can find all of these downloads available here. And if you want to generate more synthetic data, you can always enter the number of rows you want here and click "Generate." That'll kick off a model run, and you'll see the results here and be able to download them.
(02:58)
So the results I just downloaded come as a zip file. Once I unzip it, I can take a quick look through here. Yes, it looks very similar to the training data, as in dress reviews. But what's really cool about this newly launched quality report is that you don't have to rely on looking through a sample of the generated data to understand the quality. We give you metrics to understand and measure that quality. So the synthetic text data quality report is easy to digest. You can see here we give you a high-level score, and we show you more about what the score means. It's a combination of the text semantic similarity score, which shows how close in meaning the synthetic generated text is to the real training data, and the text structure similarity score, which shows how close in structure the synthetic data is to the training data.
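Editor's sketch: to make "how close in meaning" concrete, the example below compares two small corpora with cosine similarity over pooled word counts. This is a deliberately crude stand-in and an assumption on our part. The transcript does not say how the report computes its semantic similarity score, and a real implementation would more likely compare text embeddings. The function names and the sample reviews are hypothetical.

```python
import math
import re
from collections import Counter


def token_counts(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def corpus_similarity(train: list[str], synthetic: list[str]) -> float:
    """Compare two corpora by the cosine of their pooled token counts.

    Stand-in only: this measures word overlap, not the report's actual score.
    """
    train_vec = sum((token_counts(t) for t in train), Counter())
    synth_vec = sum((token_counts(t) for t in synthetic), Counter())
    return cosine_similarity(train_vec, synth_vec)


# Hypothetical records, not taken from the dataset shown in the video.
train_reviews = ["This dress fits well and the fabric is soft."]
synthetic_reviews = ["The dress is soft and fits really well."]
print(round(corpus_similarity(train_reviews, synthetic_reviews), 3))
```

A value near 1 means the two corpora use words in similar proportions, and a value of 0 means they share no words. Word overlap misses paraphrases, which is the main reason embedding-based comparisons are usually preferred for judging meaning.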
(03:58)
So structure means average characters per word, words per sentence, and average sentence count. Further in the report, we have the principal component analysis. This gives you a bit more detail on how the semantic similarity score was calculated. In general, the closer the principal components are between the train and synthetic data, the higher the semantic similarity score will be. You can see here we have the synthetic in purple and the train in green. On the diagonals, the synthetic and train are plotted on top of one another, so you can also see the similarity in shapes here. For the text structure score, we also plot the distributions. So you can take a look at the sentence count distribution, the words per sentence distribution, and the characters per word distribution. You can see here they're also pretty similar. So that's it for the report. Feel free to take a look, try it out on console.gretel.ai, and let us know if you have any questions. Happy synthesizing.
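Editor's sketch: the three structure statistics named above (sentence count, words per sentence, characters per word) can be approximated per record with a few lines of Python. The sentence splitting and word tokenization here are naive assumptions, not the rules the report itself uses, and `structure_stats` is a hypothetical helper name.

```python
import re
from statistics import mean


def structure_stats(text: str) -> dict[str, float]:
    """Return sentence count, words per sentence, and characters per word for one record."""
    # Naive sentence split on '.', '!' and '?'; an assumption, not the report's tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, digits, or apostrophes within each sentence.
    words_by_sentence = [re.findall(r"[A-Za-z0-9']+", s) for s in sentences]
    words = [w for sentence_words in words_by_sentence for w in sentence_words]
    return {
        "sentence_count": float(len(sentences)),
        "words_per_sentence": (
            float(mean(len(ws) for ws in words_by_sentence)) if sentences else 0.0
        ),
        "chars_per_word": float(mean(len(w) for w in words)) if words else 0.0,
    }


# Hypothetical review text, used only to show the output shape.
print(structure_stats("I love it. Fits well!"))
```

Running this over every training record and every synthetic record gives two sets of values per statistic. Plotting those as histograms is one way to get a side-by-side comparison in the spirit of the distribution plots described in the report.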