ODSC East 2022 Workshop: Open-source Tools for Generating Synthetic Data on Demand
Video description
Gretel Senior Applied Scientist Lipika Ramaswamy hosted a workshop at this year’s Open Data Science Conference on how to use our open-source tools to democratize access to sensitive datasets.
Transcription
Hi, everyone. Welcome to this session. It's going to be a workshop and we're going to use some open-source tools to generate synthetic data.
So a little bit about me and just a quick intro. I work at a startup called Gretel and we help developers and data scientists build with synthetic data. I specifically focus on the privacy side of the house. So figuring out how techniques like synthetic data, which is one among many, can help solve privacy problems and enable wider and better data sharing.
All right. So today we're going to cover many things, but the broad topics are the what, how and why of synthetic data. So we're all sort of starting on the same page. Then we'll follow with two examples. One is to generate synthetic Census data and I'll motivate the problem and make sure to give you all enough background around why I think it's important and why I've selected it. The second example is around time-series data, which tends to be very popular out there in the wild. So we'll look at generating synthetic time-series data and we'll use web traffic data as an example. And finally, just towards the end of the session, it's going to be 90 minutes, give or take some, we might need a beverage at the end of it. So let's see what we can do with that in the synthetic world.
All right. So I wanted to take some polls before we get started. Okay, so the focus of these polls is I just want to get a sense for sort of what industry you work in, what your level is, are you a student, are you directly a data scientist, whatnot? So trying to understand what you work in, and then also your level of comfort with machine learning, Python, just synthetic data in general and generative modeling. And so that'll give me a good sense for how to sort of communicate with everybody through the rest of this tutorial workshop. So I'm going to make some educated guesses, or maybe not so educated guesses here. I'll try to explain things in a certain amount of detail, but I'll assume that folks have sort of a foundational knowledge of data science and machine learning. But we'll certainly [inaudible 00:02:27] through motivation for why I'm suggesting we do things the way that we do them.
All right. So the big question, what is synthetic data? Maybe you've heard of it before, maybe you haven't, maybe you think of synthetic data as just sort of fake data that sometimes does not make sense. So what is synthetic data? There is a definition from NVIDIA that I really like, which is, "Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data." And so this is not something new, it's been around for many decades. In fact, the first time that somebody built a dataset that didn't exist before in an artificial way to solve some problem, that's synthetic data. What's really awesome about synthetic data now, in the last decade or two, is the advances in machine learning, which allow us to build really convincing synthetic data.
And so how do we make synthetic data? So a very high level is you take a sensitive dataset or you take any dataset that you have and you apply some machine learning models to them. So that could be a language model. It could be a generative adversarial network, which is a type of generative model. It could be a model that says, "Hey, okay, I have some data. I'm going to see what the distribution of each variable is within this dataset. And then based on those distributions, I'm going to try to sample new records." And so implicit here is my assumption of your understanding of generative models. And I'll take a step back and maybe define that a little bit. What do I mean by a generative model? It's a model that takes data as input and also produces data as output. So there's some process in the middle that takes data as an input and produces data as an output. That's in general what a generative model does, but how it does that is by attempting to learn sort of the underlying distribution of the data.
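As a rough illustration of that last idea, the "look at each variable's distribution and sample new records" approach can be sketched in a few lines of Python. This is only a toy, not the workshop notebook's code: it models every column independently, so it throws away correlations between columns, but it shows what "data in, data out" means.

```python
# Minimal sketch of the "learn each column's distribution, then sample" idea.
# Every column is modeled independently, so correlations between columns are lost;
# this is only meant to illustrate what a generative model does at the simplest level.
import numpy as np
import pandas as pd

def sample_from_marginals(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    synthetic = {}
    for col in df.columns:
        # Empirical marginal: observed values weighted by their observed frequencies.
        counts = df[col].value_counts(normalize=True)
        synthetic[col] = rng.choice(counts.index.to_numpy(), size=n_rows, p=counts.to_numpy())
    return pd.DataFrame(synthetic)
```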
So if you assume that there's some generating process, there's some data generating process out there, that's sort of ground truth. So for example, let's say we're talking about some law of physics and you're collecting data to support that law. There's that law that governs what the relationships are between variables. And so that exists. There is a process. It's known. And that's sort of the best case, where you know what the process is. In most cases, we don't know what the data generating process is. We don't know what the ground truth of this data is. We've only just observed instances of it. So after observing instances, we're trying to understand where does this data come from? What is the data generating process? Can we write it in equations? Or can we have, for example, a neural network, learn how this is done?
Maybe that went over your head, maybe you understood it, maybe you know a lot more about this. So hopefully that helped, but it's not super essential. Just knowing that what we're looking for is data as an outcome or as an output from a model. That's what a generative model is and that's what we use to make synthetic data.
So you might be thinking, "Well, why is there a need for synthetic data? Why do we even care about this problem?" There are a few different reasons. It depends on sort of what stage you're at. Are you a student? Are you working in a large organization? Are you working at a startup? And so your motivation for seeking synthetic data might be different, but let's cover a few of them. So one very common one is to make sensitive data accessible. So for example, let's say you're a student and you are trying to build some sort of prediction model on healthcare data. So you might go to Kaggle, the open source, sorry, the data science platform Kaggle. And you might say, "Hey, I'm looking for datasets in the healthcare field. Let me go see what I can find." And so that's data that's openly available, but it may not actually fit the problem that you're trying to solve, right? Maybe you're trying to find a dataset on a population with a rare disease, and maybe that doesn't exist on Kaggle. Maybe you can't find it on data.world and it doesn't exist.
So synthetic data would be really useful in this case where, let's say a hospital has a population of people with a certain rare disease and they're able to produce a synthetic dataset and give you access as a student to build some model. And so you don't actually need the real data. You don't need patient information. You're just trying to learn what the relationships are underlying that data. In that case, something like synthetic data would be really useful. Another really common example of this is, for example, in many large organizations, right? There can be a lot of different verticals. There can be a lot of different data silos. So you have to have sort of permissions to view certain types of datasets, right? Maybe they're collected from customers. And there are only a certain set of people who are sort of authorized to work with that data.
And maybe you are a software developer and you want to build some sort of application that uses that data. Or you're trying to sort of test an algorithm that's supposed to work in production, but you can't actually get access to the data upon which it's supposed to work. So something like synthetic data could be extremely useful in that case as well where you don't really need access to the exact data. You just need something that looks and feels like it and does the job. So you know it's going to work in production. So that's one really, really common example, but how do we just get over all these access issues and get some data that's going to help us accomplish our task?
Another one is to augment limited datasets. All right, it can be the case where you're trying to collect data and it's very, very expensive to collect this data. Maybe it is... You have a certain budget and you need to collect EEG data because you have to do a certain type of analysis. And that can get expensive really quickly. Medical measurements really get expensive. Even if you're trying to build an application, maybe you're very industrious and you started a startup and you're trying to build some sort of software tool and you need data. You can only collect a certain amount. How do you get more, right? Because there's an expense to that. So that's another example where you really need to augment limited datasets and they may be limited for any number of reasons.
Final one that's very important is the ability to balance data to reduce bias. And in many cases, improve predictions from models. And this is the one that we'll be focusing on today. So we'll go a lot more in depth into this particular point and we'll see how we can actually get better predictions from models by augmenting our dataset with synthetic samples.
Okay. So that's sort of the groundwork that I've laid for this session. Let's talk about the nuts and bolts, right? So today we're going to use Google Colab. We'll take about the next hour to run through a few examples. I'll be sort of live coding with you all, showing you all the tools and resources that I use to build synthetic data. So we'll be using Google Colab. You will need a Google account to execute code. And the links to the notebooks are just on the next slide. So you'll all have it. The second point is we'll be using, for part of this tutorial, we'll be using Gretel Cloud. So a little bit of background. So the company that I'm at, Gretel, we build tools for open-source synthetic data generation. And our console is a place where you can submit a dataset and it sort of gets processed on our Cloud and we provide you with the synthetic dataset.
So this is not exactly fully open-source because there's a whole console element to it, but all the underlying elements are open-source and we'll use those open-source tools. Okay. So please sign up for a free account on console.gretel.cloud. The third point is if you really don't want to code in Python, you haven't done that in years, you don't feel like doing that today, that's totally fine. Please just follow along with what I'm doing, but there will be a no-code example for which we'll leverage the Gretel Console. So stick around for that. It's going to be long, right? It's roughly 60 minutes from now. So we'll take a few three-minute breaks. Feel free to ask me questions during that. Please use the questions panel in GoToWebinar and ask me any questions. We can chat about those during the breaks, or really anytime if you're super stuck. There are other folks on here. Some of my colleagues are on here who can help with that as well.
So let's get some breaks in, stretch your legs, but also, I want to make sure that you all have momentum to sort of execute a task from start to finish. So feel free to jump in with questions. And if some of this doesn't make sense or you think things are going too fast or you miss a step, don't worry. In a few days, we'll publish all these notebooks and have this all available to you so you could use it. All right.
So here are some links. These are the ones that you'll need. We'll start with example one, which is we're going to use Census data to classify individuals based on income. So you all might be familiar with... There's a dataset that's called UCI Adult. And it took data from, I want to say the [inaudible 00:12:57]. So some Census data, and a researcher took this data and basically processed it and was trying to create a prediction task to predict whether an individual has an income of greater than $50,000 or less than $50,000. So that's the classification task for which this dataset was produced. What we're going to be using is a little update to that dataset that was published last year. It uses more recent data, so data from 2018. And we'll use it via a tool called Folktables that some researchers published.
So if you take this URL, I will try to drop it in the chat. Oh, there we go. All right. That is the first one. And you should have access to this page. Since I've authored this notebook, I would highly recommend that in Google Colab you go and save a copy in your own drive. You'll be able to execute the code and save your notebooks and there wouldn't be any sort of version control issues. All right.
So first thing we'll do is we'll install all the things that we need here. I'm going to do the same. I'm going to take my advice and save a copy in my drive.
And let me do something here. Maybe this URL doesn't work. So here's a direct link to the notebook. All right. Hopefully everyone has access. Just run the first cell.
Hopefully you should be able to connect with a GPU by the way, but we'll run through that. All right. Let me know if you're not able to access this notebook. I'm getting a lot of email notifications, which is totally fine, but just click on this link. And it's the one in the chat from 11:39 AM or 12:39 AM Eastern. All right. So the first few cells should be populated. And the first thing you'll do is download and store data from Folktables. And so what this is doing is it's going to the American Community Survey data, for the 2018 survey, for each individual, and for sort of a one-year horizon. These mean certain things; you can look up the documentation for Folktables. Basically, we'll go and get all this survey data for 2018 and we'll process it in such a way that it has the same format as the UCI Adult dataset. So there's a certain set of variables like age, work class, education, marital status, occupation, whatnot. And finally, the income bracket, which is a variable that's 0 or 1. And that's what this processing gives us.
I'm also choosing the state of Kansas to get this data for. So you can feel free to choose any state that you'd like. The reason I chose Kansas is because there's sort of a class imbalance that I want to demonstrate.
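A hedged sketch of that download-and-prepare step is below. The ACSDataSource and ACSIncome pieces come from the Folktables package as described above; the income_bracket column name and the output filename are approximations of what the workshop notebook writes out.

```python
# Sketch of pulling the 2018 ACS person survey (1-Year horizon) for Kansas with
# Folktables and building the ACSIncome task (the updated "UCI Adult" setup).
# Output column name and filename approximate the workshop notebook's choices.
import pandas as pd
from folktables import ACSDataSource, ACSIncome

data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["KS"], download=True)  # any state works; Kansas shows the imbalance

features, labels, _ = ACSIncome.df_to_numpy(acs_data)
df = pd.DataFrame(features, columns=ACSIncome.features)
df["income_bracket"] = labels.astype(int)  # 1 = income over $50,000, 0 = $50,000 or under

df.to_csv("NewUSAdultIncomeKansas.csv", index=False)
```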
All right. So this is, if I look in the files here, this is downloaded, data for 2018. The one-year data in here is the big CSV. And then this NewUSAdultIncome.csv is the one that I'm going to use. And that's the one that we created to sort of fit this classification task. All right. So let's look at this, right? Everything is coded. It's all [inaudible 00:17:20] because that's how the ACS survey is coded. As you'll see, there's age. And then a lot of these are categorical. We won't go into what the categories mean, but there's sort of a code book that you can look up pretty easily. And I think I've provided links to where that can be found in this cell. All right. So here's the dataset and our goal is to create a synthetic version of this dataset. Before we go any further, I just want to make sure that everybody has access. So if you don't, please feel free to add a note in the chat or add a question and folks will help you.
And so everybody should have viewing privileges and you can just make a copy for yourself in your own drive. Okay, great. It looks like things are working. Thank you everyone for chiming in. Appreciate that. Little bumpy getting started, but hopefully things should go smoothly now since you all have access. All right. So while that data is running, let's talk about what we're going to do first. So we're going to use something called Gretel Synthetics. The link to the documentation is here in one of these cells. So we're going to use Gretel Synthetics, which is an open-source library that has tools to generate synthetic data. So as you can see here, you sort of have to provide a configuration. There's a tokenization element, and then you specify a model, you train it, you use it to generate records and then there are some utilities to sort of make things run faster.
Okay. So here is Gretel Synthetics. This is the GitHub page. I just got here by clicking edit on GitHub and going to the main [inaudible 00:19:25]. And so you can see everything that's in here. So if you're interested in the internals of the algorithm, you can find that within the source files and you can see basically everything that we do in order to generate synthetic data. And so the idea is... Sort of at a high level, this model is a next-token prediction model. So it's a language model that uses an LSTM cell, which is a type of recurrent neural network. And what it does is learn sort of temporal correlations in the data. And so if you treat the entire dataset, so the whole CSV, treat it as a string of text, the task is simply, given what you saw last, what's the next thing that you're going to see?
So in this dataset right here, given that I saw one, can I predict that the next character is nine? So that's a very basic explanation of what's going on here. But the way to interact with Gretel Synthetics is through a configuration file.
So let's go look for the configuration here. I'm using something called Local Config. Just basically a TensorFlow configuration if you're familiar with TensorFlow. And we can go through and set up basically everything that's required for training this model. So I know my input dataset is going to be this one up here that I've defined that's called NewUSAdultIncomeKansas.csv. And then I'm going to define all the things that need to go in. So let's start this. We'll see our config is Local Config, and then we'll specify a ton of things. So you'll see in the doc string here, there's a lot that you can supply that's also available in the docs. But we'll add in a few things and I am going to copy paste some stuff. So it's a little bit quicker and I can talk through it, as you all are able to type this as well.
So the first one is called maximum line length, and that basically caps the number of characters in each line of the CSV. And here I'm saying that the line length can't be greater than 2048, when actually I can make it much smaller because these lines are going to be a lot shorter, right? This becomes important when you have free text and stuff in your data, because it does relate to performance slightly. Okay. The next variable is vocab size and that is sort of defining what type of tokenizer we'll use. So in any language model, like if you're looking at text, you can't just take text and pass it to a machine learning model because machine learning models are used to seeing some numeric values, right? So the whole idea of tokenization is can you take your entire text and can you identify, for example, words or maybe even individual characters? The very simple option, which is denoted by zero here, is character-based tokenization.
And it basically says, "Well, we're going to map each number and each letter and all punctuation to a certain index." So let's say the letter A gets indexed to 0, B, 1, 2 and so on. And that's what we call tokenizing. And that's important for this type of model to work.
The next is field delimiter. So I'm specifying that this is a CSV. So it's comma separated, I'm adding that. You could add a [inaudible 00:23:03] separated. Overwrite is just a convenience: if we're going to train this model many times, right, do you want to just overwrite each checkpoint? Because we are training a model using TensorFlow, it outputs TensorFlow checkpoints. Then there's the learning rate, which is sort of the rate at which gradient updates are made during the training process of a neural network. Dropout rate, which is a regularization parameter. We use dropout layers in our model architecture. If these things mean something to you, that's great and feel free to play around with them. If they don't, these are just generally good defaults and you can think of it as all sort of ways to get models that don't overfit. Gen temp, this is something that's sort of related to the diversity of examples that are generated by the model after it's trained. So if you set a higher temperature, you get sort of more wacky kind of examples. If you set a lower temperature, it's the opposite.
And epochs is the number of passes over the data. So effectively, how long do you want to train this for? Well, let's keep 50 for now. I think this should go pretty quickly. Of course then there's the input data path, which is, it should just be this file right here and the checkpoint directory. All right, I'm going to run that.
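Putting those parameters together, the configuration plus the training call that comes next look roughly like this. Treat it as a sketch: the values are illustrative, and the exact parameter names should be checked against the gretel-synthetics docs for the version you have installed.

```python
# Configuration and training call for the workshop's LSTM model, using the
# open-source gretel-synthetics package. Values are illustrative; parameter names
# follow the LocalConfig API used in the workshop and may differ in newer releases.
from gretel_synthetics.config import LocalConfig
from gretel_synthetics.train import train_rnn

config = LocalConfig(
    max_line_len=2048,       # cap on characters per CSV line
    vocab_size=0,            # 0 selects the character-based tokenizer
    field_delimiter=",",     # the data is comma-separated
    overwrite=True,          # overwrite checkpoints on re-runs
    learning_rate=0.001,     # step size for gradient updates
    dropout_rate=0.2,        # regularization to reduce overfitting
    gen_temp=1.0,            # sampling temperature at generation time
    epochs=50,               # passes over the training data
    checkpoint_dir="checkpoints",
    input_data_path="NewUSAdultIncomeKansas.csv",
)

train_rnn(config)  # trains the next-character model and writes TensorFlow checkpoints
```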
Great. So I have my configuration stored and training a model is super simple. I'm just going to use this function called train_rnn. I'm going to pass this config that we defined up here to do that and run it. And you'll see, it'll start outputting a bunch of stuff. So it's saying that it's using a character tokenizer. You're using the right input dataset. It's doing a bunch of stuff. It's shuffling the input dataset. It's creating a validation dataset. It's doing all the things that we wanted to do. If you're interested, here is the architecture of the model. So it's super simple, right?
You might be looking at this and thinking, "That's simple. What can it do?" But you'd be surprised. LSTMs, I mean, they get a very bad rep these days, especially with the advent of transformers, but this still works pretty well for a variety of examples. So we'll start here. My training might be much slower just because I've been using Colab a lot trying to prep these notebooks. So you might get much faster results. So to do that, you might want to change your runtime type to use a GPU. I'm apparently already using a GPU, but I'm out of my credits. So, yeah, that's fine. I'll just train on CPU and it'll be okay.
So you can see what this model is doing. It's basically telling you the accuracy of next token prediction. And even with, let's say one epoch, I'm getting, and it hasn't even completed, I'm at about 36% accuracy. So things are moving along and we'll let this rest go. But in the meanwhile, we can sort of do the same thing, but in a slightly different way. So if you scroll down just a little bit, you'll find this section called create a model to generate synthetic data using Gretel Console and Gretel Client. And what we'll do is we'll use Gretel Console to do sort of a point and click way of generating synthetic data.
All right, I'm just reading through the questions now. Okay. So the first question that Madou has is, can we say it's a pre-trained model for creating synthetic data? This model here that we're training is not pre-trained at all. So I'm initializing it with random weights. There was no pre-training done on this model and it's simply trained from scratch. So it's only going to learn what's in the dataset that I provided. Now, can you use a pre-trained model? Yes, you can. So for example, if you're familiar with large language models, you could use, for example, GPT-2 or another GPT variant, or any of the ones on Hugging Face, right? Those are pre-trained models and you could fine-tune them on the dataset that you have in mind. So that is certainly a way of producing synthetic data when you have fewer examples. Yep.
Okay. The next question from Madou is, is there a way to identify which model is the best fit for a given data creation task? So here we're only looking at one model, right? This is just a simple LSTM that seems to work well for a lot of use cases. Are there other models? There certainly are. So there are things like tabular GANs, there are simple sort of marginals-based methods that you can use to produce synthetic data. There are Bayesian networks. There's a lot you can do. So the model universe is very vast, right? There's obviously no single model that can produce the best data all the time. And so that's a really great question. How do you identify which model is going to work the best? Sometimes there are heuristics. So for text data, you probably want to use something that has a recurrent nature. For entirely tabular numeric data, and maybe it doesn't have that many columns, maybe you're actually better off going with a marginals-based method.
But there is no sort of formula for it, I would say. There are also trade-offs, right? So there are some architectures that are very, very good for producing a certain type of synthetic data, but you might want to think about what the compute cost of it is, right? Is it going to be a very complex model to run? Is it going to take a lot of resources? So those are sort of trade-offs that I would encourage you to think about when looking to identify a model that's the best for the particular dataset that you have.
Okay. I will [inaudible 00:29:39] back here for Max. So the config file, you use Local Config. You can also use TensorFlow Config, but I would suggest sticking with Local Config, and that's in the documentation. You can find this here in this code block. So these are all the things that I have specified, but really it comes from the documentation. Okay. I'm also going to change my screen slightly so we can see the documentation at the same time. So hold on. All right. So I've got the documentation pulled up here. Yeah. Honestly, when using anything, I always look up documentation because you don't know what updates people have [inaudible 00:30:39] and how old a notebook is and stuff. So I've gotten used to always using documentation, even for the best-maintained open source things. All right. So this model that I started, it's still training. It's at about 70% accuracy or it's getting there.
Okay. So a question from Arne about reading resources for how to select models. You could start with, honestly, I would just do a Google search for generative models. So look at generative models for the specific type of data that you're interested in, because of course there's a whole universe for image data, for audio data, video data. Any other modality that is not represented in a tabular format, even text data, there's so much out there. So if you figure out what type of data you want to create a synthetic version of, I would honestly do a Google search and read papers. So I would read things that have been published in any sort of computer science, machine learning conference.
Even medical journals have a lot because a lot of this is motivated by problems in the medical field. So I would recommend just doing a Google search. I can sort of recommend the top things that I know for tabular data generation. And those are the ones that I talked through. So conditional tabular GANs, again. There is of course GPT and all its variants. So you can look at the models on Hugging Face. There are diffusion models for images, and there are just regular GANs and variational autoencoders for images. So there's a lot out there.
All right. I'm going to keep moving and we'll leave my model to train at very, very slow pace. And here we go. All right. So let's move on to using Gretel Console. And for those of you who are really not wanting to code in Python, this is your time to shine. So if you head on over to Gretel Console, I'm going to go full screen again, you can see that I'm logged in. I have been working on Gretel Console before, so I have some projects, but you likely will not. So your interface might look slightly different, but let's go to projects and hit new project. Okay. We'll call this, let's see.
Okay. So I gave it a name. And the important thing to see here is that in this interface, you can create models. You can sort of use our cloud offering to do kind of basically what we've done in this notebook, but there's a lot that's taken care of. So there's sort of a lot of output that you won't see. There's a lot of validation and stuff that you won't see as well, but it all uses the underlying open source library. And we'll do two things here.
So let's go to the files in Google Colab, and let's download this NewAdultIncomeKansas.CSV file. We'll go back here and go to data sources. We'll upload a data source, which is going to be this file. Wait for that to get up there. So this is all getting uploaded to Gretel's Cloud. There's an option to run basically all of this in local mode as well. And we can talk about that in a bit, but the distinction is that the thing that we did before with the Local Config and train_rnn, that's fully open source. You just clone the GitHub repo or you pip install it and that's all you need.
In this case, we are sort of using a little bit more abstraction on top of that. Okay. So I've got this file in here. I'm going to go back to models and I'm going to create a new model. So here, you'll see there are a few things that you can do, and we'll be focusing on generating synthetic data. But sometimes we use these other tools in [inaudible 00:35:35] with generating synthetic data.
The first one is classify and label, and that uses named entity recognition to identify within a dataset if you have sensitive fields. So for example, you might be looking for whether maybe there's a free text field and there's an email address or social security number or a [inaudible 00:35:55] date in there and you wanted to flag that. So that's what this thing does. Transform data does more classic data anonymization. So for example, if you had a bunch of names and you wanted to replace them with fake names, you could do that in here. There are a lot of sort of sophisticated things that you can do with Transform as well. But we don't want to focus on those for today. We'll select generate synthetic data and we'll go to data source and select this file, which is NewUSAdultIncomeKansas. Continue.
Okay. So the next thing it's doing is figuring out model configuration. And that is basically that same Local Config, but there are a lot more things in here, right? So let's go through this. This is the configuration file, and it's just sort of like a YAML or pretty printed JSON representation of what we had in the Jupyter notebook. And all right, so we can see here we've got this default configuration, right? We have some sort of recommended configurations that you all can use. And those are... You can look at them in the documentation that's linked here. There's a lot of information in this documentation as well, but let's go through this page, right? So there are some epochs, there's batch size, vocab size. These are sort of all the things related to model training, things that we looked at before, like learning rate, dropout and whatnot.
The additional thing that you'll see here, which is actually also something that you can do in open source with that Local Config, is to use differential privacy. So a little bit, I'll take a step back. What differential privacy enables us to do with models is to add some mathematical guarantees around privacy. And the basic idea is if you take a dataset and you add or remove one row, or if you change one row, differential privacy guarantees that if you were to, let's say, I don't know, do some sort of aggregate on it, right? So the aggregate can be a count. It can be a sum. It can be even training a machine learning model or training a model like this that we're looking at. The guarantee is that the outputs don't change by all that much if you change the inputs by one row. And that sort of provides this level of privacy or a guarantee of privacy that you're learning general things about the data. You're not learning anything too specific about one record of the data.
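For reference, the formal version of that guarantee is (ε, δ)-differential privacy: a randomized mechanism M (here, the training procedure) satisfies it if, for any two datasets D and D' that differ in a single record, and any set of outputs S,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta.$$

Smaller ε and δ mean the outputs reveal less about any single record, which is exactly the "outputs don't change by much if you change one row" intuition above.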
And so that's something that you can turn on. We have a ton of blogs on differential privacy. There's also a lot of literature out there on it. So if it's of interest to be able to generate synthetic data with differential privacy guarantees, I would recommend that. Some of the sort of state-of-the-art methods are marginals-based methods. So that's where sort of you create a 2D histogram or an N-D histogram, and try to sample from it. And that whole process to make it differentially private, it's a lot, maybe not easier, but it's a lot better documented how to do that and how to do it well than it is with neural networks. But you can certainly try it with this neural network. So you can say, "Okay, set differential privacy to true and select some type of parameters for this." All of this is explained in our documentation if that's something that's of interest to you. We won't be using it for now though.
All right. So that's on sort of the parameters for the model itself. The next thing is validators. And that becomes really, really important. So you can train any type of generative model, but I don't think for most cases you can guarantee that all the data that comes out of it is going to be good quality, it's all consistently going to be good. And you can't always guarantee that because models can sometimes produce really terrible outputs. It just depends on where they sample from. So what these validators do is they provide sort of like an additional level on top of the generation step to see, "Hey, does this make sense as an output for the model or doesn't it?"
So in-set count basically says, "Okay, for example, let's say in this dataset, I have gender. And maybe it's just male and female and that's it. Or male, female, unknown. Or male, female, not defined." If I set in-set count to 10 here, or if I set it, for example, to three, it'll say, "Okay, so for that column, I'm going to check what the total possible options are in the original dataset. And I'm only going to provide you with rows where those options are provided." So either it's male, female, unknown or 0, 1, 2; if there's a 3 generated in that, it'll throw the sample out and generate another one. So that's what this in-set count does for categorical variables. If you set it to a large enough number, for any categorical variable with the number of categories less than or equal to this number, it'll validate that you've got the right categories in there.
So for example, gender can't have a category from race. That's an example. Pattern count is kind of similar. So if we were working with date-style columns that are formatted as MM/DD/YYYY, it'll identify what that pattern is and it'll make sure that when you're generating a date column, that's the pattern that's generated and that's what's returned. So that's an example. There's a lot more about this in the documentation, but usually these defaults work pretty well. If you have a lot of categories, so for example, if you have a state column, you might want to make this 55 or 60 just to capture everything in there, American states I mean. But, yeah, that's how you use these things.
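As a toy illustration of those two checks (this is not Gretel's actual validator code, just the idea), an in-set check and a pattern check might look like this:

```python
# Toy versions of the two validators described above: an "in set" check for a
# categorical column and a pattern check for an MM/DD/YYYY-formatted column.
import re

ALLOWED_GENDER = {"0", "1", "2"}                  # categories seen in the training data
DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def record_is_valid(gender: str, date: str) -> bool:
    return gender in ALLOWED_GENDER and DATE_PATTERN.match(date) is not None

print(record_is_valid("1", "04/21/2022"))  # True
print(record_is_valid("3", "04/21/2022"))  # False: category 3 never appeared in training
```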
And the final few things I'll talk about that are generally important are, one is generate. And that basically says, "Okay, once the model is trained, let's generate 5,000 new records and that's going to be our synthetic dataset." And the model has to do that to know if the model was trained well. So you can train a model to really high accuracy on next token prediction. But there's no guarantee that the dataset that comes out is actually going to be valid. The model could be getting really, really good at doing certain things, and sort of accuracy is just one measure of model performance. So once the model is trained, there's some generation process that needs to occur. And what we allow for is something called max invalid. And that basically says like, "Oh, this is the maximum number of invalid rows that's allowed when generating a new sample." So if you're only producing invalid rows, it probably means the model doesn't work. So let's just start afresh.
Okay. The final thing is privacy filters. And that's something that we add on top of the generative model. So one is outlier filters. And that basically says if something looks like an outlier in the generated dataset, then we'll remove that from the generated dataset. The similarity filter says if something looks too similar to a record in the original dataset, then let's remove that. And there are some algorithms that we use to figure these things out. All right. So we'll begin training here and you'll see it sort of goes through the thing. It'll print out a bunch of logs. It's printed out our config file here. It's telling us it started a worker up in the Cloud and so on. Okay. Let's check in on this model. I think. Well, I'm hoping you all have models that have trained much further than mine has. Well, this is unfortunate, but I'm going to pause training here. It'll take a few seconds.
Okay. So I stopped training. It saved a checkpoint. This is not going to be very good quality data, or at least I don't think it is. But let's see what we're going to do next to generate the text. And I apologize for the sound not being clear when I talk; I'll try to [inaudible 00:45:15] and be louder. All right. So to generate the text, we're talking about validating whether a record actually fits with the dataset that we're trying to produce. And we'll use a function called validate record here that I've written. And basically what it does is it says, "Okay, for every line that's generated, right?" So you're telling this model, "Okay, start with the new line tag, which is [inaudible 00:45:47] and generate an entire line for me until you see that tag again."
And in that line that's generated, I want to validate whether it's actually doing the right thing. So I want to validate that it's of length 11, which means that there are 11 columns in there. And for each of those, I want to do some additional things. So one of the things I want to check is that everything is an integer, right? Because in that original dataset, all the columns were integers. So I'm going to check that. The other thing I'm going to check, which we talked about in Gretel Cloud, is checking whether the values that are generated, so any of these values that are integers, whether they actually belong to the original set. So for example, again, that gender example. If there's only 0, 1 and 2 or there's only 0 and 1, maybe it shouldn't produce a 3 so we'll kick any records like that out. And if those don't happen, then we'll raise an exception and say that the record is not 11 parts. All right. So let's try to just generate one line. Hopefully you all have better luck with this than I do. It's producing... Oh.
Hold on one second. The great thing about live coding... I don't know if this is entirely right. Okay. Oh, hold on. So when I do this, I just need to make sure that for each column, I'm checking that they're integers and that I'm checking that they're in the right set. All right. It should work now. Okay. So I had quite a few invalid records that were originally produced. And finally I got one that's valid, right? It has 11 parts, everything's an integer and everything falls in the correct range. Okay. So I've got that.
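A rough reconstruction of that validator and the generation loop (including the bulk generation step that comes next) follows. generate_text and its line_validator hook come from gretel-synthetics; the column count of 11 and the "all integers, values in the original set" checks mirror what's described here, but treat the details as an approximation of the notebook rather than its exact code.

```python
# Approximate reconstruction of the notebook's validator and generation loop.
# `config` is the LocalConfig defined earlier; the CSV filename is assumed.
import pandas as pd
from gretel_synthetics.generate import generate_text

original_df = pd.read_csv("NewUSAdultIncomeKansas.csv")
valid_values = {col: set(original_df[col].unique()) for col in original_df.columns}

def validate_record(line: str) -> None:
    fields = line.strip().split(",")
    if len(fields) != 11:
        raise Exception("record is not 11 parts")
    for col, raw in zip(original_df.columns, fields):
        value = int(raw)                    # every column must be an integer...
        if value not in valid_values[col]:  # ...and fall within the values seen in the original data
            raise Exception(f"invalid value {value} for column {col}")

# Lines that make validate_record raise are marked invalid and regenerated.
rows = []
for line in generate_text(config, line_validator=validate_record, num_lines=10_000):
    if line.valid:
        rows.append(line.text.split(","))
```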
The next thing I might want to do is generate a large number of records. So maybe I want to generate 10,000 lines. Let's see what happens with that. And hopefully you all are also at this stage. Please feel free to stop me if you're not and we can take a little break. Actually looking at the time, this is a good time to take a break. So why don't we let this run wherever your model is. Just let it keep going. We'll take just a little break here. Three minutes. We'll be back at the [inaudible 00:48:54]. Well, thanks for your question, Methyl. So Methyl asked, what are the requirements for the training dataset in terms of size? Is there a better alternative to LSTMs if training data size is small? Yeah, that's exactly right. So for an LSTM, I mean, you probably want at least a few thousand examples. There are some configurations of our architecture that work with small dataset sizes.
So you can try those and you should be able to find those in our blueprints, which is linked in the chat. As far as other architectures, yeah, Transformers are definitely the way to go if you are training on a limited dataset. So, again, those are pre-trained models that you can use. There are pre-trained models you can use. You can also train from scratch with the Transformer architecture, though I don't really recommend that just given the number of pre-trained models that are available. Again, it's important with pre-trained models, in my opinion, to find one that sort of fits the [inaudible 00:49:58], right? So you need to know what type of dataset they were trained on, right?
If you're trying to produce code, you probably want a model that's trained to generate code like [inaudible 00:50:13], or maybe you're just looking for a quick sort of proof-of-concept style thing. And for that, you could use GPT-2, because that's open source. So it really depends, but I would highly recommend sort of being very careful about which one you choose. Of course there's no substitute for experimentation. So you could try literally everything. The only cost is compute, but you could certainly try it.
All right. So for [inaudible 00:50:44], hopefully your model ran and you were able to generate some new text. I tried to get 10,000 records and it looks like I was able to. I'm going to review the data distributions now. Because I didn't train this model for very long, I don't expect this to be fantastic. But let's see.
Okay. So here's the comparison of data distributions for all the different variables. This is not beautifully formatted, but that's fine. You can see there's some overlap. It's not doing particularly well on this first one, but that's okay. I suspect if you train it for about 50 epochs, you should get pretty good results. So this is one example. We got that here. All right. And let's go back to Gretel Console and you'll see hopefully everything has trained, it's generated records and we have some results. So in my case, using the configuration that I did, I have a dataset of 5,000 records, and I also got a quality report on this. So there's something called a synthetic quality score, which is at a 95. And this is a composite of a few different things. So let's look at it. I'm going to download the synthetic report.
And let's take a look. All right. So it's showing me a few things. It's showing me scores for these three individual statistics. One is field correlation stability, which is whether the correlations in the original data are maintained in the synthetic data. The second is deep structure stability, which looks at, if you were to look at the same dataset in a reduced-dimensionality space by doing something like PCA, what does it look like between the original and the synthetic? And the final is field distribution stability, where you compare histograms of all the variables. So it's like a univariate comparison.
And here you can see we're doing pretty well. So on the data correlations, there's a little difference in correlations for OCCP. I forget what that is. I think it's occupation. And the rest of the variables. But it seems okay. Looking at the PCA plots. I mean, it looks roughly good. And then looking at the distributions, you can see it's better maintained than the other one that we ran. Of course, that could just be because my model didn't train very long. But, yeah, so we have a comparison of distributions.
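The actual report uses Gretel's own scoring, but the spirit of those three checks can be approximated in a few lines. The functions below are illustrative only and assume all-numeric DataFrames with matching columns.

```python
# Rough, illustrative approximations of the three report metrics described above.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def field_correlation_gap(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    # Field correlation stability: how different are the two correlation matrices?
    return float((real.corr() - synth.corr()).abs().mean().mean())

def pca_projections(real: pd.DataFrame, synth: pd.DataFrame):
    # Deep structure stability: view both datasets in the same reduced-dimensionality space.
    pca = PCA(n_components=2).fit(real)
    return pca.transform(real), pca.transform(synth)

def distribution_gap(real: pd.Series, synth: pd.Series, bins: int = 20) -> float:
    # Field distribution stability: univariate histogram comparison for one column.
    edges = np.histogram_bin_edges(pd.concat([real, synth]), bins=bins)
    p, _ = np.histogram(real, bins=edges, density=True)
    q, _ = np.histogram(synth, bins=edges, density=True)
    return float(np.abs(p - q).mean())
```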
We have something that the Gretel Synthetic report says: "Okay, you have a quality score of 95." The other thing that I'm not spending a lot of time on is this privacy protection level. And it's just sort of a scale, right? So we did that outlier filter that said remove things that look like outliers. We said remove things that are too similar to records in the original dataset. And the thing that we don't have is differential privacy enabled. That type of training takes, I think, a little bit more parameter tweaking. It's a little finicky. So you can certainly try. Right. There's a question from Gary. Is my config variable from GitHub... Could you specify which one? Okay. Oh, we're looking at the one in Colab? So the config that I specified was imported from gretel_synthetics.config. And this is the fully open source library Gretel Synthetics.
So if you go to Gretel Synthetics, I'm just going to go to master. If you look at the examples, just look at synthetic records. That's a very basic example, and you'll see some of this stuff in here. So I'll post a link to that. That's for the fully open source route.
Okay. So maybe the next thing we should do is, I was going to walk you all through the same thing that we did in this point-and-click sort of way right here. We can do that programmatically. And we can do that by installing something called Gretel Client, which allows you to interface with sort of our Cloud offering. And it goes through and you can build the same sort of synthetic dataset and run through it. I'll share what needs to be done there, but maybe we can ignore that for right now. Just know that that's an option. You can do this all programmatically as well.
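For reference, driving that same Cloud workflow from Python with the Gretel Client looks roughly like the sketch below. This is based on the Gretel Client documentation from around the time of this workshop; the function names, the "synthetics/default" config path, and the project name here are assumptions to verify against docs.gretel.ai before using.

```python
# Hedged sketch of the programmatic route with the Gretel Client (Cloud workers).
# Names are based on the client docs at the time of this workshop; verify against
# current docs before copying.
import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config

configure_session(api_key="prompt", cache="yes", validate=True)

project = create_or_get_unique_project(name="odsc-synthetic-census")  # hypothetical project name
config = read_model_config("synthetics/default")

model = project.create_model_obj(
    model_config=config,
    data_source="NewUSAdultIncomeKansas.csv",
)
model.submit_cloud()
poll(model)  # blocks, streaming logs, until training and record generation finish

synthetic_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
```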
And let's move down to performance on a machine learning task. And so this is something that I talked about earlier, which was how can we use synthetic data to reduce bias or to balance variables? So one thing I'm doing here is importing everything from the start. So hopefully you all look here. I have a function here that just prints performance of a model. We're going to use XGBoost. And all I'm doing is reading in the original data and the synthetic data, just in case you didn't run any of the other stuff. Hopefully you have some synthetic records already and you can use those. But if you don't, you can wait for everything to complete and still follow along and just see what I'm doing. So here we go.
I've got the original data and the synthetic data. I'm just getting the variable names to look like what they look like in Kaggle. Really just for giggles. I think mostly. And the next thing we'll do is a train test split. So really not much there. I'm doing a train size of 80% or 0.8. And I'm going to stratify it by income bracket, which is now the target variable. So if you recall this dataset, I'm just going to run this and let's take a look at it.
Okay. So if you recall in the original dataset, there's this income bracket: 0 means the person earns below $50,000, and 1 means the person earns over $50,000. And then of course, there's all the other variables, age, work class, education and so on, up to race. Okay. So I've done a train test split. I've got a training dataset, a test dataset that I'm going to treat as totally out of sample. We'll see how the performance of the model is on that. And let's also take a look at the class imbalance.
So in the original dataset, right? The reason I'm suggesting that we augment it is because I believe there is class imbalance. So you'll see some 11,000 examples for folks who earned under $50,000 and 5,000 examples for folks who earn over. And so the idea here is, can we use a synthetic dataset to augment this dataset? And as a result, do better on the minority class, which is the class that we're trying to predict, right? So the reason this could be important is, this is a very contrived example and sort of used a lot as a dataset or as a classification task that you do as an intro data science problem. But the reason it's important is because maybe you're trying to predict whether somebody has an income higher than a certain amount or a projected income higher than a certain amount. And you're trying to suggest some sort of policy intervention, or you're trying to suggest some sort of [inaudible 00:59:41] intervention or things like that. And that can become really important. So you want to make sure that you maximize the precision and recall. Let's use those as metrics for today.
So we're going to create an XGBoost classifier. Sean, just to make sure that you can import it, hopefully it installed correctly. And let's define a classifier. So really simple. I think somebody is off mute. Take a look at your audio, please.
All right. So let's define a classifier. All right. That's fine, we'll leave all the defaults. No need to sort of over-engineer this, and let's go ahead and fit it. So I think for XGBoost, things need to be passed in as [inaudible 01:00:44] arrays. So we'll do that. I've basically taken everything. So this is X and this is Y. Everything up until the last column is the X. The last column is the Y. And that's it. Cool? So that's trained and I'm going to look at the performance of this model. Okay. So here's what I'm looking at. I'm going to look at the test set performance [inaudible 01:01:17]. And I'm looking at an accuracy of 80%, but as you can see, I'm doing really badly on the minority class, right? I have an F1 score of 0.64. Really, really terrible recall. It's 60% recall. That is not very good. And you can see the support is already... It's not that large and it's certainly not balanced.
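The baseline step looks roughly like this. The income_bracket column name and CSV filename are assumptions about the workshop notebook; the rest is standard scikit-learn and XGBoost usage.

```python
# Baseline: stratified 80/20 split on the original Kansas data, default XGBoost,
# and a per-class report to see how the minority (income > $50K) class does.
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

original_df = pd.read_csv("NewUSAdultIncomeKansas.csv")  # filename assumed

train_df, test_df = train_test_split(
    original_df, train_size=0.8, stratify=original_df["income_bracket"], random_state=42
)
X_train, y_train = train_df.iloc[:, :-1].values, train_df.iloc[:, -1].values
X_test, y_test = test_df.iloc[:, :-1].values, test_df.iloc[:, -1].values

clf = XGBClassifier()  # defaults are fine; no need to over-engineer
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```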
So let's see what we can do with the augmented dataset. Okay. So I'm going to make an augmented dataset, just going to concatenate these two datasets. The original training dataset, and then I'm going to look at the synthetic dataset. And I'm only going to select the samples in the minority class just as an example. You could use the whole thing, but the goal here is more to augment the dataset. Okay. So let's augment the dataset.
Let's look at income bracket. Okay. So we're getting closer to having balanced data. You could tune this to have an equal number of examples per class. That's totally fine. And I'm going to do the same thing. So I realize this is a little fast, but I'm just going to copy paste this and drop in the correct dataset. So I'll call this classifier 2. I'm going to use the augmented dataset. Okay. Going to train this classifier. Cool. It's trained and let's see the performance. So I'm going to print performance. I'm going to provide the classifier, which is the model, the dataset and the test set.
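Continuing from the baseline sketch above, the augmentation step looks roughly like this; synthetic_df stands for the synthetic records loaded into a DataFrame with the same columns as the original data, and its source filename here is hypothetical.

```python
# Augmentation: append only the minority-class rows from the synthetic data to the
# original training set, retrain, and evaluate on the same held-out test set.
synthetic_df = pd.read_csv("synthetic_data.csv")  # hypothetical filename for the generated records

synthetic_minority = synthetic_df[synthetic_df["income_bracket"] == 1]
augmented_df = pd.concat([train_df, synthetic_minority], ignore_index=True)
print(augmented_df["income_bracket"].value_counts())  # classes should now be closer to balanced

clf_aug = XGBClassifier()
clf_aug.fit(augmented_df.iloc[:, :-1].values, augmented_df.iloc[:, -1].values)
print(classification_report(y_test, clf_aug.predict(X_test)))
```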
Okay. So here I seem to be doing a little bit better on recall. I'm looking at the test set performance. I'll give it a minute so everyone can get here. Hopefully you'll have a dataset you're working with, and hopefully your results look better than mine do. All right. So what I did was I augmented the dataset, the original dataset, with minority class examples. And those were from the synthetic dataset that I produced. And what I was able to do was increase the recall. So I believe it was 59% and now I've reached a recall of 75% and that's quite an improvement, right? So I'm actually measuring that the model is correctly identifying true positives here at a much higher rate, right? That's recall. And so depending on sort of the use case here or what the downstream use of such a model is, you might want to optimize for different things and perhaps further balance the dataset or generate better quality synthetic data to be able to do this.
For example, another check would be, if you just download the 5,000 records from here that were produced in Gretel Console and drop that in here, seeing how performance changes, you might get results that are a lot better.
So that's sort of this example from start to finish. There are some steps that we didn't cover, but I'll publish the notebook with all the code that you can run through to sort of accomplish this in a programmatic fashion using Gretel Console as well. So just to recap, what we got through here was using a fully open source library, which is Gretel Synthetics right here. We generated synthetic Census data records to be able to augment a dataset and produce a machine learning model that's able to have higher recall in predicting whether somebody has an income of $50,000 or more. So that's the example that we went through. There was some back and forth. Hopefully, it was clear enough that you were able to follow along. But if not, please just look out for this notebook. We'll publish it. We'll add a lot of prose to it. So you should be able to run it from start to finish.
Okay. So that's one example. I'm going to break for about two to three minutes. We'll be back at the 28 minute mark. And we'll take about five minutes to run through really quickly a very, very simple notebook. And while we're taking a break, I'm going to update it. So, yeah. Feel free to stretch your legs. Ask questions. Yamin from my team is here as well to answer things. Thanks, Yamin. Appreciate your support. So please ask questions, take a break, whatever you need. We're almost there. And this is a really, really fun part that's coming up. I'm so excited about it. So please stick around. We'll take a few minutes to talk about synthetic time-series data. This kind of goes back to the question about what is a good model for... How do you choose a good model for a specific dataset type?
And for time-series data, things become very, very complicated because so much data out there, right, is collected sort of on a time-step basis. So if you have an application, you're collecting information about users, you'll have that information for a lot of different time steps. So this type of data is ubiquitous. And models that can handle time-series data are really important to get right. And it's important to have good ones for that because a simple LSTM often does not do it. So to that end, we have this notebook that I just dropped in the chat. And hopefully you all can access it. I'm going to open it up here.
All right. I've added basically all the code in, so we're not going to go through and live code any of this. But I'm just going to walk you through what's going on here. So you install Gretel Synthetics straight from GitHub, so not using pip, just download the repo and then install it, and install a few other libraries. Again the dataset... Oops. It's fine. All right. This dataset is a web traffic dataset from Wikipedia. And so what it has is, for many different pages on Wikipedia, across different domains, different access [inaudible 01:08:31] agents, it's got the number of page views, and that's across 550 time steps or 550 days. So once everything is installed, let me show you the dataset itself. It's a time-series dataset, and there's a lot of data. There's a lot in here.
Okay. Hopefully it should install soon. All right, so basically what we're doing here is we're using a new type of model. So I'll go in here and show you all where it is in the GitHub repo. It's called time-series DGAN. And this is something called DoppelGANger. It was published in 2020. There's a TensorFlow 1 implementation of it. But for anyone who uses TensorFlow a lot, I'm guessing you don't use TensorFlow 1 anymore. It's kind of difficult to wrangle with and doesn't always catch on easily that you have a GPU, which is pretty essential to train models with a lot of data, right? So what we did at Gretel was we re-wrote the whole thing in PyTorch. And so this is the repo that contains everything. So this is the entire model architecture in PyTorch and the notebook, oops, the notebook here basically runs through and trains a model of this particular model type to produce synthetic time-series data.
Right. So we'll just look at what the dataset looks like. And we can talk about why it's so difficult to do this with an LSTM, but we can also just run through this. You could honestly run this notebook. It'll probably take a while, but you'll get some synthetic data that looks kind of like this. I guess there are still some outputs here, but you'll see for page number, let's say 24992, here's how page views go across time. So this is from 2015 to 2016 and there's some variation in the... in the number of page views, sorry. So anyhow, this is taking a while, but basically if you go through this, super exciting. This is a model that a lot of folks have been requesting an updated implementation of, and we have it and we're going to publish some blog posts on it and how to use it.
But this is a very, very easy example to get started with. So if you were to just run this notebook from start to finish, you should be able to see this. So I'm doing a few things here. I'm just encoding the data. We have some categories here, so you'll see the domain, access and agent. I'm just encoding that ordinally, so it's not one-[inaudible 01:11:27] encoded, but just as numbers. I've created a training input dataset, and I'm going to specify a few things, namely that there are some things that vary with time, which are these page views. There are some things that don't vary with time, which is the domain of the page, the access and the agent. So I'm going to specify that these three things are discrete. The time-series observations are continuous, and here we go. I'm just formatting these as [inaudible 01:12:00] arrays to pass into the model. There's honestly nothing [inaudible 01:12:03] going on here.
And I specify another configuration file. I specify some normalization for the data as well. And then I go ahead and train this model. And this is likely going to take a while. I've suggested 400 epochs and yesterday that took me about an hour and something. So you can go through and train this. But what's really cool, and sure, maybe I will. I'll just pull up what I have out here. I trained this model for about 200 epochs and I generated some synthetic data from it. And you can see it's not perfect, but for the time-invariant attributes, it's produced this sort of distribution.
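The training call for the PyTorch DoppelGANger model looks roughly like the sketch below. The class and parameter names follow the timeseries_dgan module in gretel-synthetics around the time of this workshop and may have changed since; the array shapes and the dummy data here are stand-ins for the encoded arrays prepared in the notebook.

```python
# Hedged sketch of the PyTorch DoppelGANger (DGAN) time-series model in gretel-synthetics.
# attributes: per-page, time-invariant columns (domain, access, agent), ordinally encoded.
# features: the page-view sequences, one continuous value per day.
import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig, OutputType

# Dummy arrays standing in for the notebook's prepared data:
# 1,000 pages, 3 ordinal attributes, 550 daily page-view counts each.
attributes = np.random.randint(0, 3, size=(1000, 3))
features = np.random.rand(1000, 550, 1)

config = DGANConfig(
    max_sequence_len=550,  # 550 daily observations per page
    sample_len=10,         # time steps emitted per RNN step; must divide max_sequence_len
    batch_size=100,
    epochs=400,            # the workshop suggests 400; ~200 already gives usable samples
)

model = DGAN(config)
model.train_numpy(
    attributes=attributes,
    features=features,
    attribute_types=[OutputType.DISCRETE] * 3,
    feature_types=[OutputType.CONTINUOUS],
)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)
```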
I don't think I have any examples printed out of the synthetic sample itself, but let's just try this. Hold on. Nevermind. It won't let me run things, but basically let's not complicate life. If you run through this notebook and you run through all the way to the end, you should have a way to evaluate how the dataset was produced, what types of time-series you're getting, what types of distributions of the time-invariant things you are getting and how does that compare. And there's a lot of tweaking that needs to be done because this is a generative adversarial network and they tend to suffer [inaudible 01:13:45]. And they're just very [inaudible 01:13:47]. So I would recommend having some patience with this, but we will be publishing a lot of guidance on how to do this well. And it'll soon be available as well to do it in a point-and-click fashion like we did over here.
Okay. So there's been a lot here. I appreciate the 69 of you hanging out with me for so long, thank you for joining. One of the fun things that we do here at Gretel is use our models for kind of silly things. And one thing that we did a while ago, that's unfortunately not still operational, is we took some cocktail recipes or at least a list of ingredients and then we trained our model. So it's that LSTM. We trained it to generate cocktail recipes. We made a Twitter bot for that. So that's one fun thing that we did. You could try it out. You could try making synthetic Pokémon, you could try making synthetic Harry Potter spells. You could try literally any type of text data, try it out. There's plenty of applications for this. A lot of really cool use cases that we've seen in the last few years.
We are really excited to share this with the open source community, because we are at the core of it, open source. Everything's out there for folks to take a look at, for folks to use and give us feedback on. So hopefully this was helpful. I believe this workshop is recorded and, yeah, that's all I got. We have about four minutes. So I'm going to go through the questions and see if there's anything that I haven't answered, but please feel free to send more questions. Ask anything and you can always shoot me an email or send anyone at Gretel an email. Use this email address. All right.
A question from Mikael: Would it be correct to say that the generated data in most cases would have more false results, which have to be filtered out? Honestly, that depends. It depends on the structure of the data that you're trying to learn. So, in this case, whoops, here we go. So I'll take this example here, right? The thing that I trained. I trained it for just slightly under 11 epochs. I had this validation accuracy of 72% and I went forth and generated synthetic data, right? By any means, I wouldn't treat this as the best possible model, but I did go through and I was able to still generate some valid results, right? The reason that this has more false results is because I just didn't train the model very well.
I was limited by compute and time. And that was a choice that I made, and so here we are. But you could, in theory, train this model for 50 epochs and you'll find [inaudible 01:16:58] and you can calculate this pretty easily. How many invalid records were generated while you were trying to generate your 10,000-record sample? And so you can see all that and you should see that if you train a model to a higher accuracy for next token prediction, you'll get more valid samples just right off the bat. You wouldn't have to do so much validation and checking and discarding of records. Sorry.
Ah, okay. I don't know if you answered this, Yamini, but Arne's question on whether there is a documentation link for more background information on the metrics used in the quality report? Yes, there is. So if you look at... So this is our documentation page, docs.gretel.ai. We also publish a lot on our blog. So if you go to the blog, you can search for quality. Okay. Maybe I'm just not finding it right now. It's somewhere. I will find it [inaudible 01:18:31]. But we have a lot of information on the quality score that we use. It should be somewhere here, but I'll find it and be sure to add some information on that for you. But, yeah, if you-
Sorry, I did drop a link in the chat to the docs, which breaks down all the metrics and so on. So that is there... Also, the thing I was thinking of was privacy filters, and you can also find more information on that in the blog.
Okay. With that, I think we're at time. Thank you everybody for sticking around for 90 minutes. This is a very long time, but I hope that was helpful. Please reach out to me with any questions. We'll be publishing basically all these materials completed from start to finish. So you can run through the examples and hopefully you have a dataset or a problem that you want to apply this to and try it out. If you have any issues, reach out to us. Feel free to use the fully open source offering, feel free to use our Console or use our Client in Python as well. So many ways of accessing this. I really appreciate your time. And with that, I think we'll call this workshop done.