Generating Synthetic Data with Only Three Lines of Code
Video description
The simplest way to create high quality synthetic data
Transcription
Speaker 1 (00:09):
And hello everyone. Welcome to another Gretel workshop today on this wonderful Wednesday afternoon. My name is Mason. I'm the lead developer advocate here at Gretel. And joining me today is a real treat. We have our chief product officer and co-founder Alex Watson.
Speaker 2 (00:24):
Hey Mason, how are you doing?
Speaker 1 (00:26):
Doing well. Yeah. So today we're going to talk about a project that I'm really excited about. It's one of my favorite parts of the Gretel SDK right now, and I keep a really close eye on its development: Gretel Trainer. If you've ever used Gretel before, you may have seen our SDK, which is a good way to interact with our API. But we thought, we can make this simpler. And I know Alex always takes simplicity as a challenge. I've heard you use the quote before: if you can do it in four lines, do it in three; if you can do it in three, do it in two. So Alex and his team took the approach of simplifying how we generate synthetic data, and now we have Gretel Trainer. Anything you want to add about that?
Speaker 2 (01:10):
No, I think that's a really good summary. A lot of our users are pretty familiar with the Gretel client SDK. That's what we have in most of our blueprint notebooks, so if you're looking at the open source examples from Gretel, that's the production, highly scalable SDK. One of the things that we've noticed, and always strive for, is adding simplicity: every time we make it one step simpler, we really open up the aperture for different use cases that our users and developers can go after. And one of the things I think is so exciting about what we've done with Trainer so far, and maybe I'll start with just giving a little bit of context on Trainer, is an analogy for the machine learning folks out there: I think of Trainer as being a lot like what Keras is for TensorFlow.
(01:56)
Under the hood you have this incredibly powerful machine learning library. It can do anything, but sometimes there's a lot of complexity to get started. So when I think about Trainer, and when I use Trainer in my own projects, it's when I'm trying to do something really quickly. There are two goals we have with the Trainer SDK we're going to walk through today. One is what Mason called out: simplicity. How do we go from 15 lines of code to generate synthetic data, which was incredible a year ago, down to five lines of code, down to three lines of code? So really I would view what you see in Trainer as a preview and experimentation ground for what's coming to our production SDKs over time. It enables fast experimentation and feedback, and lets us learn how users want to build things. And the second thing that's pretty cool is we also use Trainer to preview some functionality that's coming to our service.
(02:48)
We can talk about this as we go through the examples, but Trainer has some incredibly powerful concepts built into it that are coming to our production service. For example, the ability to process datasets that have, let's say, hundreds of thousands of columns. That sounds insanely big, but really it came from the learnings we've had working on genomic data with companies like Illumina, where you have these massive, highly dimensional datasets and you're asking, how do we train a synthetic model on this? Trainer has that built in. So what was, in our original case study with the Illumina folks, three or four notebooks full of steps that you would use to process the data is now four lines of code with Trainer. So very exciting.
Speaker 1 (03:38):
Yeah, that's really cool. I'm super excited. I definitely like the quickness of it. I feel like I can get a lot done really quickly, and I don't get bogged down trying to figure out synthetic data. I just make synthetic data and keep on my merry way, which is definitely the accelerating piece that we're always aiming for. So the first thing we're going to do is show you how basic Trainer is and how simply you can get something going. What we're first going to do is import trainer from our library. We've already pre-installed the library; nobody wants to watch me pip install on screen. We have this dataset that we use, the US Adult Income 5K dataset. It's a pretty standard dataset in the machine learning and data science space.
(04:22)
It's a good one, and it really highlights some of the things we can do here with Trainer. So what we're going to do is create a model. We do trainer.Trainer(), we're just creating an instance of it, and as you can see, we have all of these options here. But what we're going to start with is the project name. We could call it trainer, but that's the default and a bad name for it; let's call it workshop, since this is our Gretel workshop. If you don't give it a name, it defaults to calling it trainer, so if I had set the project name to trainer it would've been completely pointless. The next thing we do is basically tell it, hey, train a synthetic data model on this dataset, and the fact that it's just that, hey, train on the dataset, I love it. Then we say model.generate and we can say the number of records here is equal to a hundred. And then we just go ahead and hit play. Oh, I made a booboo; this is what happens sometimes when the pip install is mean to me.
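For anyone following along at home, the cell Mason is building looks roughly like this. The dataset path is a placeholder for the US Adult Income 5K CSV, and exact keyword names may vary slightly between Trainer versions:

```python
from gretel_trainer import trainer

# Placeholder path: point this at your copy of the US Adult Income 5K CSV
# (or any tabular CSV you want to synthesize).
dataset = "us-adult-income-5k.csv"

# Create a Trainer object; if you skip project_name it defaults to "trainer".
model = trainer.Trainer(project_name="workshop")

# Train a synthetic model on the dataset; Trainer picks the model type for you.
model.train(dataset)

# Generate 100 synthetic records as a pandas DataFrame.
synthetic_df = model.generate(num_records=100)
print(synthetic_df.head())
```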
(05:33)
There we go. Okay, I might've had something in my cache, who knows? Google Colab. Okay. The next thing it's going to do is ask us for our Gretel API key. Whenever you're using Google Colab or some sort of web-based notebook where you don't have access to a command line interface, you can just input your key here. We'll go over later how to run gretel configure on your command line if you're doing this locally, say with a conda environment. So we're just going to go here to the API key. It's also on the front page, but I like going to the API key page just to be certain. I grab it, we hit enter, and as you can see, it logged in as me and now we are training a model. Now the cool thing about this is that we can either watch it from here, or we can come back to the dashboard and watch the progress of our synthetic data model. Come on, there we go, have it catch up with me, and we can watch all of our logs here from the dashboard. So if you're also using the dashboard, or you want a nicer UI for watching as your model trains, you can totally come here and watch it.
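A rough sketch of the authentication step, using the gretel-client helper that the Gretel notebooks typically use (running `gretel configure` once from a terminal is the equivalent for local setups):

```python
from gretel_client import configure_session

# api_key="prompt" asks for the key interactively (handy in Colab); cached
# credentials from a previous `gretel configure` run are reused automatically.
configure_session(api_key="prompt", cache="yes", validate=True)
```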
Speaker 2 (06:49):
Mason, maybe while it's picking up a worker here, we'll walk through exactly what's happening under the hood with Trainer. Fantastic. If you don't mind going back to the code, I'll talk about some of the steps happening behind the scenes that you absolutely have the ability to influence should you want to, but Trainer makes a lot of these decisions for you. In that top line, what we're doing is instantiating a Trainer object. This Trainer object has the ability to interact with any of the different models that Gretel supports. So the kind of cool thing here is there's not one model to rule them all. In the Gretel world, there are some models that work really well for text data, some that work really well for tabular data, ones that work when you're trying to generate billions of records.
(07:28)
What the Trainer object is doing, on the model.train call, is taking a look at the dataset. It looks at the dimensions of this US adult income dataset and says, what do I know about this dataset? It's 5,000 rows and 20 columns. So it takes a look at the data and then makes a decision automatically for you: what is the optimal model to run here, trying to find that perfect balance of accuracy and, for some customers as well, privacy. So that's happening. And really the task during training is that the deep learning generative model we have under the hood examines this dataset and learns to recreate it. Model completion means that the model has learned how to create another dataset that has the same insights; if you were to query it, it would give you very similar responses, but without actually replicating any of the real data. So the synthetic data that's generated has the same look and feel as the original data, but it's not based on any real-world people, objects, things like that.
Speaker 1 (08:34):
Awesome. Yeah. So what are some of the things it looks at whenever it's trying to figure out what model to use? You said something like the number of columns; is it text versus numerical data? What are some of the default or most important things it looks for when Trainer decides what the best model to use is?
Speaker 2 (08:55):
Yeah, it's always a complicated question. It has something to do with your use case and also the types of data you're sending through. I believe here, in this model selection, you can see what model it picked. Here, synthetics is the default label we have for a language model; it's based on an LSTM running under the hood. What we've found with language models, and a lot of people are familiar with sequence-based models, transformers, things like that, is that these models that are really great at learning and recreating human language turn out to also be really great at learning and recreating structured or tabular data. So that's what's happening under the hood. Language models aren't as fast as some of the alternative deep learning generative models like diffusion models or GANs, but they have the ability to learn complex patterns, such as natural language text, and create datasets that have really great accuracy.
(09:54)
So to describe what happened here: it took a look at the dataset and said the dataset's not too big and there aren't too many columns in it, which makes it an ideal candidate for training a language model, which takes a little bit longer but is excellent at learning the different patterns in the data. We'll get a look at the quality score when it's done, but I would expect that the quality score for the configuration it's picked here is really nice. So the benefit is it's abstracting some of that decision process where you'd have to say, hey, what model? Go to the Gretel docs on the gretel.ai site, look at the models and figure out which is the best one to run. We believe most of our users don't need to worry about that, and we can abstract away a lot of that complexity.
Speaker 1 (10:39):
That's really useful. Yeah, I think whenever you want to get started, especially with a new tool, it's fun to read the docs, but it's more fun when the tool just works out of the box and you don't have to spend a lot of time figuring it out. So I would say if you're new to using Gretel and new to using Trainer, let Trainer figure it out first. If it doesn't work, you can always go and fine-tune it, but let Trainer give it its best shot first and figure it out from there. And I've trained this dataset enough times to know that it does a very good job of selecting, and it usually comes out really well, as we can see here. Oh yeah, we definitely get a little bit more data here in the
Speaker 2 (11:20):
You get a lot more data when you're looking at the console, right? The Trainer output is very simple. What's happened here, where it says creating synthetic model, is that we are using the Gretel Cloud to train this model. It's not happening in your Colab environment or on your own workstation; in this case it's using the Gretel Cloud. It's spun up a container with access to a GPU and it's using that to train the model. You have the ability to choose to run models locally; we're just taking advantage of the cloud here, so infrastructure setup is not something you have to worry about. And then each step here, each epoch, is essentially a complete pass over the original data, and this gets very complicated very fast. With the things scrolling down here, the goal is to increase accuracy, so you want to see this accuracy number move up and you want to see the loss move down. It looks like it's generating data right now, if I'm able to read that small print. So it's finished training the model, it's generated a sample dataset, and it's generating a quality report for us automatically right now.
Speaker 1 (12:19):
Yes. Looks like it's creating the synthetic data quality report, and model creation is complete. I know it will spit out the score right here eventually. So we have an SQS score of 92, which is excellent. That's great. Yeah, that's really
Speaker 2 (12:36):
Great. While it's finishing up generation here, Mason, maybe you can go back and show everyone a preview of what the report looks like.
Speaker 1 (12:41):
Yeah, I do love the report. Also, hats off to having a copy button on your logs; as someone who's had to debug stuff before, being able to copy the whole log is definitely valuable. So yes, here is the preview of our synthetic data quality report, but if you want the full version, you usually download this and open it up, and we get our synthetic data quality report, which always has such goodies. I love the expandability here. Do you want to talk a little bit about just the high level of this? I know we could spend hours digging deep into this.
Speaker 2 (13:16):
Absolutely, and it's probably another workshop for us to do; I think that would be a good one. But really, the zero through 100 score is so useful for saying, is my model doing well enough? And here we can see it's doing excellent, so we really don't have too much to worry about. It looks like Trainer picked a good set of parameters. The first thing I always go through and look at is the number of training lines duplicated, just to make sure that the synthetic model didn't actually copy over any of the original data. That shouldn't be possible with the default privacy settings that we have, so this is all new data being created. If you want to scroll through there, you can always geek out and look at some of the field distributions and things like that to really give you that sense of comfort. But we're in a good spot. Trainer did its job. We've got a nice synthetic model and sample dataset to work with, so we're ready for whatever's next.
Speaker 1 (14:03):
Yeah, love it. So now we go back here and we have our synthetic dataset. We've created a hundred rows of synthetic data, and I always click on this little magic button; it makes it a lot easier to read. So now we have synthetic data, basically the US Adult Income 5K, and it looks very much like our other data. Let's do something real quick. We can copy this, do a code box, and then we'll pull up and preview the original data so we can do a comparison. Well, no, you can't go above the pip install. Cool. And then I have that code over here somewhere. So yeah, let's just do this, do that, delete that. Import pandas as pd. I didn't pip install pandas, of course I didn't. There we go, run that again. I'm clicking too fast, I'm trying to do it quickly, I'm getting excited, and I double-clicked and then it went away.
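The comparison cell Mason is putting together amounts to something like this (same placeholder dataset path as the training snippet above):

```python
import pandas as pd

# Same placeholder dataset path as the training cell above.
dataset = "us-adult-income-5k.csv"

# Load the original data so it can be eyeballed next to the synthetic output.
original_df = pd.read_csv(dataset)
print(original_df.head())
print(synthetic_df.head())  # synthetic_df came from model.generate() earlier
```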
Speaker 2 (15:25):
I'm sorry. Hey Mason, I think we might have the variable name incorrect. There you go.
Speaker 1 (15:29):
Yeah, I found it. There we go. Okay, so this is the original dataset. I'm also going to take off my Apple Watch because it's going to start talking to me. As you can see, the original dataset: age, private, high school grad, married. This is just the kind of data we can expect to see. And if we look at our generated data, if you put them side by side next to each other, I wouldn't be able to tell you the difference. They look almost identical. Is there anything you want to point out here, Alex?
Speaker 2 (16:00):
Maybe we can talk about the use cases we see for why you would create a synthetic version of this dataset. Often privacy is a really big deal. Here we're looking at something that's obviously very sensitive; you might have demographics of individual employees or things like that that this income data is based on, and I need to create another version that I can use for machine learning model training, or add to my data warehouse, that doesn't actually point back to real people in my business. So that's one big use case. The second big use case we see is boosting the representation of certain classes in the data, and this is really to help you train models that work better on the types of data the model will encounter in the real world. So if you have, for example, a new type of data coming into your model, or a class imbalance where you've got very few examples of high school graduates living in a certain state, Ohio for example, you could use conditional data generation, which I believe we're going to go through next, right, Mason? Essentially telling the model to generate more records that look like this and having it auto-complete the rest is another big piece of functionality.
Speaker 1 (17:19):
Yeah, that's actually something I was playing around with this week. I finally got around to it and I really enjoy conditional data generation; it feels almost magical, to be honest. So let's rerun this one. As we can see, we have a whole bunch of columns here to choose from. When I was doing my playing around this week, I decided, hey, show me high school graduates who were never married and are making over $50K, and then show me people who have their bachelor's degree, are never married, and are making under $50K. That's the kind of data I chose; let's generate some data that looks like that. So what we have to do first is create our seed fields. Alex, can you explain what happens when we apply a seed field to a synthetic model? How does that work? What does the model do with this data?
Speaker 2 (18:08):
The kind of cool thing here is that this works across all the different models we support. The idea of conditional data generation is that we give the model a little bit of information, for example, I want to see a certain education level or marital status, and have the model complete the rest of it. By predefining the seed fields, like we're doing here, we can really optimize the model to do a good job with this task. By saying, hey, these are the fields that we think we might want to use for conditional data generation, we essentially architect the model training data in a way that it's going to be really great at completing the rest of the data. So we've defined three different fields here, and after we've trained the model, when we want to generate more data, we specify exactly what each one of these fields should contain. We can do that with a DataFrame or we can do it with a CSV, and have the model essentially auto-complete the rest of the data for us.
Speaker 1 (19:00):
Cool. Okay. I love that example. Sometimes I just ask questions because I want to know the answers, and I hope other people are getting value out of them too. So what we do next is supply the seed fields as a parameter here. And then, to allow us to not have to retrain the model every time we want to play around with this, we're actually going to do the generation in a different cell. Oh, I did that wrong, my bad; this parameter goes in the train call, not the Trainer constructor. This is why you have notes on your other screen. Always have notes, people, it definitely helps. And you can't have an extra comma there. Okay, so that's all good.
(19:46)
So all we're going to do is set seed fields on education, marital status, and income bracket. Let's go ahead and train the model. As you can see, my credentials were already cached, so we don't have to keep re-entering them. And while Alex was talking, I did change the project to workshop-conditional, so we're going to have a different model here in the dashboard, and we can watch what's going on. It looks like it's already going, so we should be able to see what's happening. Cool. Okay, so now we're just waiting on it to train. What questions do we have? What do we want to talk about?
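In code, the conditional setup Mason just described looks roughly like this. The column names are assumptions based on the adult income dataset and may be spelled differently in your copy:

```python
from gretel_trainer import trainer

# Column names below assume the adult income dataset's headers; check your CSV
# and adjust the spelling to match exactly.
seed_fields = ["education", "marital_status", "income_bracket"]

model = trainer.Trainer(project_name="workshop-conditional")
model.train(dataset, seed_fields=seed_fields)  # dataset: path/URL to the CSV
```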
Speaker 2 (20:27):
I could jump on and talk about maybe some of the differences in conditional data generation between different models.
Speaker 1 (20:32):
Yeah, go right ahead.
Speaker 2 (20:33):
Just for a quick update for anyone looking at the screen: we've started training a model, which means we've kicked off this process in the Gretel Cloud. What it's doing is looking for an available GPU worker, and if a worker doesn't exist, it spins one up. So you have autoscaling essentially spinning up a worker for us, and it's just starting to process the data. Now, with conditional data generation, you start to see some slight differences in models and performance. So if you're sitting there trying to augment a machine learning dataset, this may actually impact the decision of what model you want to run.
(21:10)
To summarize what could be its own session on conditional data generation, here are some simple guidelines I think about when we're doing something like this or working with a customer on it. The GAN models that we have, the ACTGAN and CTGAN models we support, these generative models that use something called a generative adversarial network under the hood, have the advantage that they work really well with really wide data. You can have thousands of columns and generate really good results. However, their conditional data generation isn't as sophisticated as it is with the language-model-based system. With the language-model-based system, I could put something in, to use our example from a second ago, an education level that didn't even exist inside the dataset, like 'never finished college'. You could just make up a new category that didn't exist in the original data.
(22:09)
The language models are built around this use case of thousands, millions of variations of text. They don't even blink at it. They say, great, I saw college in there somewhere, I'm going to go ahead and complete the rest of the record. So they're very adaptable and flexible for the different types of data you're sending in. The GAN models, one of the other model types we support, work by essentially taking every category, say a high school education tag you had, like high school or GED, and encoding it as a number. And what that means is that GANs don't have that level of flexibility to see new types of data and handle them well. So if you're training a model and using it for conditional data generation, GANs will work really well on a couple of columns that you want to conditionally generate.
(23:07)
And the language model is extremely flexible. So while it's not as fast, you can define as many columns as you want, all the way up to the final column, and just have it finish the data for you and be confident it's going to work every time. So there's a little bit of a balance there, and one thing we're working on all the time is trying to close the gap between our different models as we continue to evolve. But at the moment, I like to think GANs work really well on high-dimensional data, meaning lots of columns, lots of rows. The language models work really well when you have really long data, but especially when you want variability inside that data that's rare in your original dataset; if you're trying to boost a few examples of fraud, for example, then the LSTM synthetic models are going to be your friend. And if you care about machine learning accuracy, the language models work really well for that too.
Speaker 1 (24:05):
Awesome. Yeah, there's a lot that goes into all this.
Speaker 2 (24:09):
It is, it is. And trying to make it simple is definitely a goal. So here we can see some kind of cool stuff happened with this model. The default tells it to train, I believe, for about a hundred epochs. But what it found, and this is called early stopping, is that it got really good results and wasn't improving much against the real-world data, so it stopped at epoch 37, feeling like it had a really good model there. The next thing it does is generate, as we saw before, a sample set of data, which we use to evaluate the quality of that data. So it terminated training early because it came to a really good solution. This is really important for people looking at a privacy use case: you want your model to learn the patterns in the data, but you don't want it to memorize the data. In the case above, it discovered that it had learned the patterns as well as it was going to without memorizing data line for line, and said, great, I'm going to terminate training, take the best epoch that I had, the best accuracy score, move on and start generating data. And that's what we see happening here.
Speaker 1 (25:11):
Cool. So, a question, because I've always heard that but never quite understood it: learning and not memorizing. Why is memorizing bad in a synthetic data model?
Speaker 2 (25:21):
With synthetic data, you want to build on the original real-world data you're training on, not memorize it. There's a reason you're not using your real-world data: either A, privacy, you can't use it because of privacy concerns, or B, you're trying to augment, you're trying to take a machine learning dataset that maybe works really well on 80% of use cases but isn't doing very well on that final 10 to 20%, and repeating more data from the real-world dataset won't get you there. So the idea is you want your synthetic data model to introduce new variations, and those new variations will help that machine learning model respond better to changes in real-world data when they come in. Maybe backing up a step here, I think one of the top challenges we see with the users we work with, in keeping the machine learning models they've built deployed and creating value for their business, is dealing with data drift.
(26:19)
So there's this idea of data drift and concept drift. Data drift is when, fundamentally, the inputs to your system are changing. Inflation would be an example, right, where you start seeing increased costs that are just the result of things changing over time. Concept drift is when all of a sudden people start using your system in a much different way than they ever did before, and the simplest example I would use to describe concept drift is the pandemic: think of how people used Zoom and streaming platforms before the pandemic and then after. What this results in is that machine learning models only work on the data they've been trained on, and they don't generalize very well. So synthetic data is really powerful when you start to see these shifts in patterns and usage, whether it's the data drift we talked about or concept drift: take those few examples that you have, generate a ton of new examples that you can use, and then update the training set for your machine learning model. That helps you keep the model deployed and creating value for you longer.
Speaker 1 (27:23):
Okay, that makes sense. Yeah, awesome, I'm learning too. Alright, cool. It looks like we're still going over here; you'd say we're generating data and comparing for accuracy right now, correct? Yeah. So I guess while we're waiting on it: what did it do? Did it see that the accuracy here is all within a thousandth of a decimal place, so it decided, hey, we're not really getting any better? How does it determine when to basically short-circuit and stop?
Speaker 2 (27:53):
There's some logic in there and it really depends on the configuration that you choose, but our default configuration says if you're not improving, or if the accuracy starts to drop, and maybe it did there in the last couple of epochs, then it's going to say, okay, we've got the best version of our model here. Let's look at the accuracy and loss and see if we can figure out what's happening.
Speaker 1 (28:12):
Yeah, it looks like, let me zoom in as much as I can on that one, and it keeps scrolling every time a new one pops up. It looks like it was around 0.89, 0.895, 0.895, 0.889. I'm going to hold it here, that'll work. Yeah, it looks like it might've actually started to drop around 46, 47 by a ten-thousandth of a decimal point, but
Speaker 2 (28:36):
That's enough. So the model will grab that kind of best score that we had around epoch 30 and use that and then kind of move on to the next step.
Speaker 1 (28:44):
Okay,
Speaker 2 (28:46):
Cool. The second thing that we're seeing here is actually a really neat illustration of language models versus the GAN models that we offer. Language models: incredible performance, incredible accuracy, but by virtue of how they work, they can't be parallelized in the same way that GANs can. So maybe a fun example to show next is, using either statistical models or GAN models, how fast we can generate data. A language model essentially predicts the next token in a sequence. It's saying, my high school degree was a GED, so how old am I likely to be in this dataset? It's predicting the next value from what came before. And because of that you have a level of recursion: the outputs from the previous step are required as inputs to the next. Anybody in the computer science world knows that recursion is not good for parallelization. So here we've got this parallelized and running on a lot of separate threads on CPU for inference, but we can't get the same kind of explosive performance gains that we can get when we use a generative model like a GAN, where everything's just matrices and numbers and can be very easily parallelized.
Speaker 1 (30:02):
Cool. So would you say that with the LSTM model that we're using here, if you really care about super, super high accuracy, it may be a little bit slower and that's fine, but if we go to a statistical model, which we're totally going to do as soon as this is done, it's a trade-off; you sacrifice a little bit of accuracy for speed?
Speaker 2 (30:21):
That's fair, and scale in some cases only becomes possible with the other types of models. If you're working with a dataset that's a terabyte in size, billions of records, it wouldn't make sense in many cases to train a language model on it. You'd want to use one of the other models that could process that dataset quickly and give you an answer within an hour instead of waiting for this to finish. So it is certainly a balance that you try to find; each one's got different benefits. These language models, here we're referring to the LSTM, and we also offer a pre-trained language model using a transformer, a GPT-3 type architecture. It's even slower, but both offer really state-of-the-art performance and accuracy.
Speaker 1 (31:06):
Awesome. Oh, looks like we're getting there. Now it's saving the model. Let's go back over here and see. Nope, still going. We're getting there to the end, generating. Yep, model creation complete, but it still says it's creating the SQS report. Oh no, it finished over here while I was waiting. So we have an SQS of 95, that's getting up there, and now what we can do is generate. Oh, we're doing the conditional training, we have to change this. So what we do to conditionally generate is create a seed DataFrame. Now, you said we could do this with a CSV or a DataFrame, correct?
Speaker 2 (31:47):
That's correct.
Speaker 1 (31:48):
Okay. And then all we're going to do is create a DataFrame here, and we're going to go with high school grad, never married, and greater than 50K. How sensitive to capitalization and case is this model? If I had put this as a lowercase k, would that have affected anything?
Speaker 2 (32:18):
You always want to replicate that data as closely as you can, or your mileage may vary. With this particular model, since it's using a language model, it'll be pretty adaptable to even misspellings or new variations that are put in there. But the definite recommendation here is, whenever possible, match it as closely as you can.
Speaker 1 (32:38):
Fantastic. And then we set the columns as the seed fields. So now we have our DataFrame and we just say model.generate(seed_df=seed_df); that's what you get when you name things the same thing. Cool. And now we go ahead and ship it off. Can we view it when it does conditional generation, where does it go? Yeah, we should
Speaker 2 (33:05):
Be able to see that if you click on the model that we've kicked off, and there you go.
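The seed DataFrame and generate call from this step, roughly sketched out. The column names and category values are assumptions matching the adult income dataset and should be spelled exactly as they appear in your training data:

```python
import pandas as pd

# One row per record we want the model to complete; only the seeded columns
# are filled in, everything else is generated. Values are examples and should
# match the spelling used in your dataset.
seed_df = pd.DataFrame(
    data=[
        ["HS-grad", "Never-married", ">50K"],
        ["Bachelors", "Never-married", "<=50K"],
    ],
    columns=["education", "marital_status", "income_bracket"],
)

# Conditionally generate: the model fills in the remaining columns per row.
conditional_df = model.generate(seed_df=seed_df)
print(conditional_df)

# Want many completions per seed? Repeat the seed rows, for example:
# seed_df = pd.concat([seed_df] * 100, ignore_index=True)
```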
Speaker 1 (33:11):
And now we just wait on it here for it to go. This usually happens pretty quickly, if I remember right. Awesome. We've got that. Some
Speaker 2 (33:20):
Impressive live coding there, Mason.
Speaker 1 (33:22):
You know what? It's really easy when you have it on the side, which I've always done every time. Yes, a hundred percent: if you ever want to do live coding, make yourself notes. It's not cheating. No one wants to see you fumble over misnamed parameters for 25 minutes; I've seen that before and it's not fun. And that way, even if it doesn't work, you can copy and paste it. There you go, there's nothing wrong with giving up at the end. So we have some data, and here we go. We have a high school graduate who is never married making greater than $50K, and then we have a person who has a bachelor's degree, who is never married, who is making under $50K.
Speaker 2 (34:05):
So the rest of this data, it looks like the race, the gender, the relationship, all of these different things were built by the model, which is pretty cool. Yes.
Speaker 1 (34:14):
Cool. And if you want more, say a lot more of these, you can just do a little bit of Python multiplication on the seed DataFrame, but I don't think we need to do that. So now that we're done with all of that, let's go ahead and play with the statistical model, which is Amplify. Let me clear this out and figure out how I want to approach this. We're still going to want this model.generate; we'll come back to that and leave it there. Now, as Alex said earlier, Trainer does a really good job of picking your model for you, but if you want to specify your model, you can, and you do have to import it. There are docs for this, which I'll share at the end. But you know what, we're going to be good Python engineers and we're going to do the import up here.
(35:11)
From gretel_trainer.models import GretelAmplify. Let's rerun this. Okay, so we're good. We come down here and specify model_type as GretelAmplify, and we're going to change the project name to amplify here in the Trainer. So now we can specify our model type, tell it exactly what to do, then give it our dataset and train the model. I left the seed fields here; you don't need them anymore, I just didn't delete them. We're not using them anymore. Okay, Alex, do you want to tell us what's going on underneath the hood here?
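Spelled out, the Amplify cell is roughly this; class and keyword names follow the Trainer docs of this era and may differ in newer releases:

```python
from gretel_trainer import trainer
from gretel_trainer.models import GretelAmplify

# Explicitly pick the Amplify statistical model instead of letting Trainer
# auto-select one based on the data.
model = trainer.Trainer(project_name="amplify", model_type=GretelAmplify())
model.train(dataset)  # dataset: path/URL to the same CSV as before
```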
Speaker 2 (35:53):
Yeah, what we're training here is not a deep learning model. This is a statistical model, and I would view it as a state-of-the-art statistical model that has a couple of advantages that make it really great for certain situations. One, it runs on CPU, which is wonderful, right? Not everyone, for example if you're running this inside your own data center, inside your own VPC, or on your workstation, might have access to a GPU, and here you can still get really great performance. I'd say we typically see an accuracy drop of about 10 to 15% versus our deep learning models from the Amplify model, which can run on your laptop or on any kind of CPU-based instance you have. Another thing we've done is spend a lot of time building Amplify to be extremely fast.
(36:48)
So here you can see the SQS finished already. On training, you can have Amplify train on datasets that are in the gigabytes to terabytes, which would take quite a bit of time with a deep learning model; we see about a 10-point drop in our synthetic quality score versus the original data, but it's still very good. And then it's highly optimized for generation, essentially taking advantage of CPUs and threading and instruction sets, so it's a highly optimized ability to generate a lot of data. This came out of some customer discussions we've had where customers are trying to build large variations. Take, for example, 10,000 example data points you saw coming into your pipeline: how do I create a billion or 10 billion records that I can use to stress test my infrastructure? This is where Amplify comes in. It's not just copying the data, it's creating whole new records, different distributions of integers, ages, things like that. And we can do it really fast.
Speaker 1 (37:51):
Awesome. Yes, and we did see that this executed in 57 seconds. The reason Alex and I have been talking so much is that whenever you do live coding with models, you tend to have to fill that space. So we've been filling the space, but we got this one done before Alex was even finished explaining it. 57 seconds is pretty good. And now what we can do is set our num_records option, and let's go ahead and make a million records. Let's see how quickly we can make a million
Speaker 2 (38:17):
Records. Let's see if we can crash Colab. I love it.
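The generation step here, roughly (Colab's memory is the main limit on how far you can push num_records):

```python
# Generate a large batch from the trained Amplify model. Amplify may slightly
# overshoot the requested count because generation is spread across parallel
# CPU workers (more on that in a moment).
big_df = model.generate(num_records=1_000_000)
print(len(big_df))

# If an exact row count matters, trim the extras:
big_df = big_df.head(1_000_000)
```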
Speaker 1 (38:20):
It crashes; I've crashed it at a billion, but I was able to do a million really easily earlier. So yeah, I'm not even going to ask a question; I know how quickly this goes and I won't get through the question before it's done, which is really
Speaker 2 (38:36):
Awesome. I'll go fast and see if I can get through this before it's done. One of the questions is how fast, and we have done some performance testing. If you're running on a pretty powerful workstation, for example in the cloud with 32 CPUs, it'll generate about a hundred megabits per second, so incredibly fast. Using our cloud instances here, they have four cores, so they're not quite as fast, but one of the neat things you can do is just parallelize using the cloud. If you wanted to generate 10 million records, you could kick off 10 jobs at the same time using this model.generate call, have them all execute in parallel, and get your work done really fast.
Speaker 1 (39:19):
You beat it. You barely beat it
Speaker 2 (39:21):
By like a half second
Speaker 1 (39:22):
By like a half second. So yeah, we just generated 1,000,500 records. Did I make a
Speaker 2 (39:29):
Interesting. No, you did not make a bug there. That is an artifact of the speed improvements that we made. With the threading, essentially under the hood the different CPU workers are grabbing jobs off a queue, and sometimes it creates a few more records than you're asking for.
Speaker 1 (39:47):
Oh, what a lovely little race condition. I've never heard of those being spoken about in a positive light. Okay, so a question I have for you about this: are there instances where you would use two models in a row? Would you train data on the LSTM, maybe generate conditionally, and then use Amplify to make a whole lot more of that conditionally generated data? Can you chain these models together? Not necessarily with Trainer, more as a hypothetical.
Speaker 2 (40:20):
Yeah, so one of the advantages we have with language models, especially pre-trained language models, is they can bring in variations that never existed in your input data. That's a really powerful concept: what if synthetic data could actually be better than the original data it was trained on? Things like our transformer models really open the door for you to introduce new variations. Our transformer model is a derivation of the GPT-3 architecture, essentially an open source implementation of GPT-3, and it might be able to introduce new patterns in the data we see here. We see the college degree being bachelor's or some college; our pre-trained model might even introduce something like PhD that didn't exist in the input data. So one pattern we've seen some of our users take is taking a dataset like this, augmenting it with new examples from a pre-trained model or language model, and then using Amplify when it comes down to, okay, now I want to generate billions of these records. It's neat seeing the kinds of pipelines people build chaining these different models together, but it's completely optional, and I would say that's an advanced, Jedi-level use case that people will go after.
Speaker 1 (41:30):
I want to be a Jedi, that sounds like fun. Awesome. So what should we play with next? Are we good? Is there anything else you want to cover? There's a whole lot more to Trainer that we didn't cover, and a lot of documentation, which you can find at trainer.docs.gretel.ai. I guess we can talk about the models. We talked today about LSTM and Amplify, and right now it says CTGAN, but I know we've changed it to ACTGAN. So when would you use the ACTGAN model over the LSTM?
Speaker 2 (42:04):
So ACTGAN is a model that we're just launching right now, and actually, kind of a funny story, we've been running it behind the scenes for a while. It really dramatically improves on the original CTGAN model that we launched with. What we're looking at is a 90% reduction in memory requirements, which is super exciting, and actually an increase in the synthetic quality score; we see a 5%-plus increase over the already really capable GAN-based model that we were using. So it's an evolution of it, but a significant change in the sense that, with this memory reduction, it can run on much larger datasets, much more varied datasets. A lot of users have worked with GANs, and I talked about this a little bit before: all of the different text elements or categorical things with a GAN need to be encoded into an integer variable. What that means is that the more variations of categorical data you might have, dates, people's names, addresses, things like that, the more patterns the model has to memorize, and you end up exploding your memory requirements. So the neat thing about ACTGAN is that you can run on a variety of datasets that were never possible even with our previous model, and we're seeing really exciting results and even better-than-expected accuracy with it.
Speaker 1 (43:26):
Fantastic. That's always better than the alternative, which is it's not working at all, why is it broken? Awesome. Well, I've been playing around a lot with Trainer lately and I've really enjoyed it. It definitely is a breath of fresh air when it comes to simplicity, and I'm a big simplicity person. It reminds me a lot of requests, one of the famous Python libraries, which just gets the simplicity of building a human-usable API. I feel like Trainer is definitely much more human-usable than other APIs I've seen, so I've been really happy with it and really enjoyed playing with it. I kind of want to go for a billion records, but I know that last time it took 20 minutes and crashed Colab, so I don't necessarily want to do that. I don't think I have any other topics. Is there anything else you want to chat about before we wrap up, Alex?
Speaker 2 (44:19):
No. Maybe just as things to leave with: we would love any feedback from folks that are using Trainer. Don't hesitate to pop into our Discord and ask questions; we've actually had a pretty awesome discussion happening in there. Or hit us at support@gretel.ai if you have any questions.
Speaker 1 (44:36):
Yes, yes, definitely. There's a link down below in the description, but gretel.ai/discord will take you into our synthetic data community Discord. There are a lot of really fun things happening there, and we're definitely seeing a lot more people moving in, with a lot of really great questions. If you want to know more about time series, we've been getting a lot of questions about our time series model lately, which has been really interesting. I would be remiss if I did not plug Synthesize 2023. Coming up on February 8th, 2023, we are having the first Synthesize developer conference. We're going to have a lot of really cool speakers there to talk about synthetic data in all different types of fields. We're going to have some really cool presentations, keynotes, panels, maybe some swag giveaways. It's going to be really awesome. We hope that you'll register.
(45:25)
It's a free conference, so if you just go to gretel.ai/synthesize 2023 or synthesize.gretel.ai, one of them is a 301 redirect, or is it a 302 redirect? You can go ahead and register for free, and we'll definitely be having more information about that going forward. One last thing: if you enjoyed this and you want some Gretel swag, a couple of stickers or something, fill out the link at grt.ai, that's our Bitly link, slash workshop trainer, with your name and email address. We will email you a redemption code for getting some stickers sent out to you. And I think that's all we have for today. No questions have come through, but I really enjoyed this. Thank you so much for taking time out of your day, Alex. I know that you're really, really busy, so the fact that you could give an hour to talk with us about this is really awesome, and I know the community's really going to enjoy watching this video. Awesome, thanks Mason. Awesome. Well, you have a great day. I'll see y'all later.