Presentation

Developer Workshop: Synthetic Text Generation with Gretel GPT

July 13, 2022

Learn how to use Synthetic Data to augment AI/ML datasets to train chatbots and sentiment classifiers by generating your own text examples from scratch.

Transcription

Mason:

And hello, everyone. Welcome to another Gretel workshop. My name is Mason Egger, and I'm the lead developer advocate here. And joining me today is Andrew Carr. How's it going, Andrew?

Andrew:

It's going awesome. Yeah, I'm a senior scientist here at Gretel and we're pretty excited to talk to you about what we have cooked up with text data. Yeah, it's going to be good.

Mason:

Yeah. I'm super excited for it. So today, for those of you just tuning in, we are going to be covering synthetic text data generation using one of our most recently released models, Gretel GPT. So if you're here and you're in chat, go ahead and post in chat. Let us know where you're coming from. We always like saying hi to everyone. But if we ... Yay, LinkedIn is being cranky.

Mason:

Oh, well, we'll deal with that later. Anyway, watch it on YouTube and, yeah, I don't know why I said that, but we'll go from there. So let us know if you're here in the chat. If not, we'll go ahead and get started on our presentation today, talking about text data. So text data and synthetic text data.

Mason:

So Andrew, why don't you go ahead and tell us why ... Well, no, obviously, text data is data that is text, but why is it important? Why do developers and data scientists need it?

Andrew:

Yeah, that's a great question. So something like 90% of all business data is tabular in some sense. So you'll see that in an Excel document or you'll see that in some CSVs, but a lot of the data in those tabular datasets is text. So you have messages from tech support or you have Jira tickets or customer descriptions. There's all kinds of stuff, product descriptions. And so text is everywhere and it's how we interact with the world.

Andrew:

And so to be able to properly synthesize text data is just paramount to getting business value out of your already existing tabular data and maybe your other kinds of data.

Mason:

Fantastic. Okay. So yeah, tabular data, text data, that makes a lot of sense. But what ... So I know a lot of people talk about how difficult it is sometimes to work with or just to get valuable text data. But it seems like the entire world around us is surrounded by text data. I feel like it's the number one type of data that humans consume.

Andrew:

Yeah.

Mason:

So what makes text data so hard?

Andrew:

Yeah. So you could imagine just arbitrary text, right? Just random words that you put onto a page. And that would probably be pretty easy to work with because it's random and they're just words. But language has a lot of structure and it has a lot of irregularity. So there's grammar and there's parts of speech and there's just all of this stuff that comes from the linguistics side of things. It's just hard. I mean, I don't know, me as a human, I've been speaking most of my life and I still mess up all the time.

Andrew:

And so to expect a machine to figure that out, it's tough to do. But then not only is there that structure and there's that grammar, we have a lot of interpretive ability with our language. I mean, we could say whatever we want and most of the time the meaning still comes across, which makes things challenging for machines that are expecting the same thing every single time. So a great example, if I were to write a blog about how cool I think machine learning is, I could write this thing and it'd be great, but then you ask a hundred other scientists to write blogs about how cool they think machine learning is, every blog will be very different, but they will all convey the same idea. And so how do you tackle that variation all mapping to the same idea of machine learning being cool?

Mason:

Awesome. Yeah. It's like that weird ... You said about we've been speaking our whole lives and the text data can be difficult. It's like that weird thing that I get whenever I type one word too many times and suddenly I question that that word is actually spelled like that. So I think I got stuck up on the word school the other day.

Andrew:

Yeah.

Mason:

I had just sat there staring at it. I was like, "There's no way this is how you actually spell the word school." It doesn't make any sense for that to be [inaudible 00:04:26].

Andrew:

This happens to me all the time with the word oven. For whatever reason, oven just trips me up sometimes. I'm like, "There's no way OV, what is even happening?"

Mason:

Yeah. That is so weird. And if it trips us up, I can only imagine how computers must feel if they could feel.

Andrew:

Yeah.

Mason:

Or do they feel? That's a question for the ages. Cool. So what are some of the pain points that we're trying to solve with synthetic text data? I guess the obvious answer to me would be, we're just trying to help you create more text data that's more accessible to you, but is there something deeper to that?

Andrew:

Yeah. Good question. So text data is everywhere, but it's hard to generate at times. So you may need to have an army of highly trained writers typing continuously to get a bunch of high-quality text. And that's expensive and prohibitive and, I don't know, it's not very creative for the writers. There's lots of obstacles to that.

Andrew:

And so if you could just generate a bunch of text data, even if it's not the same quality as human-generated, it's close, that would just be awesome. Huge value add. So you can automate customer support, you can explore different aspects of sentiment. A really great example is you have reviews on your website for products and you want to make a review a little bit more positive, a little bit more negative, just to see how your machine learning models would respond. You can generate synthetic positive reviews. You can generate synthetic negative reviews that still have the same grounded idea as the original text. And that's just a superpower in some sense for your business.

Mason:

Yeah. It almost sounds like it's the thousand monkeys with a thousand typewriters kind of thing.

Andrew:

Yeah. So for those who are unaware, there's this infinite monkey theorem that essentially says if you have infinite monkeys typing on infinite keyboards, one of them will randomly produce Shakespeare just because infinite is really big. Luckily, in this case, we have finite computers, just a few GPUs and they're pretty good at reproducing Shakespeare-type stuff.

Mason:

Awesome. Okay. So I guess the last question about just text data in general before we move on is what industries use text data? And I think I know the answer to this, but what industries would you say would be some of the bigger consumers of text data?

Andrew:

Yeah. So the obvious answer that I would probably skip over is everyone uses text, right? It's ubiquitous, but I think there are a few where it's really valuable, and we've touched on some of them. So e-commerce with customer reviews and product descriptions. Then we have finance. So there's lots of banking statements or documents describing some company's performance, legal documents, and then technical fields. So you have customer support, you have bug tickets, there's all sorts of things in that tech realm as well.

Mason:

Awesome. Yeah, lots of stuff. I imagine all these chatbots that I'm forced to interact with on a daily basis probably use text data as well.

Andrew:

Yeah, that's right.

Mason:

Great. Okay. So we've talked about text data. So now, let's talk about the new model that we have here at Gretel called Gretel GPT. And GPT, I think most people have maybe heard about it. The acronym GPT is relatively popular in the programmer zeitgeist right now because I think it wound up on hacker [inaudible 00:07:51].

Andrew:

Well, I think I may have lost you [inaudible 00:08:04].

Mason:

Okay. Can you hear me now?

Andrew:

I can hear you now. Yup, we're good.

Mason:

Okay. So ... Yeah, I noticed that. Okay. So what is GPT and what has Gretel done with GPT?

Andrew:

Awesome, great question. So GPT stands for Generative Pre-Training and it's a type of model that was built on the transformer architecture, and it's pretty new. So transformers are recent as of 2017 and GPT just a few years later. And it's a pretty simple idea that's really hard to get right. So we take all of the internet. You could at least take all of the internet and push it through GPT. What does that mean? You say, "GPT, here's the sentence: I go to the," and you say, "What is the next letter?" And so ... Or the next word, maybe.

Andrew:

And so GPT will guess and it will say, "Cat. I go to the cat." That doesn't make any sense. We say, "No, no, I go to the store," or "I go to the movies," right? That's more likely. And so the data exists to show GPT. And so then it will try it again. And we'll say, "I love my ball," or something, right? And whatever data exists on the internet. And eventually, GPT will get really, really good at predicting the next word in a sentence. That's all we're doing. So we're pre-training GPT by having it generate stuff.
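
The guess-and-correct loop described here can be illustrated with a toy next-word predictor. This is only a sketch: it counts which word follows which in a tiny made-up corpus, whereas GPT learns the same kind of prediction with a transformer network over far more text. The corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count, for every word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Tiny illustrative corpus; a real model sees billions of words.
corpus = [
    "I go to the store",
    "I go to the movies",
    "we walk to the store",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

Because "store" follows "the" more often than "movies" does in this corpus, the predictor prefers it, which is the same "more likely" intuition the example above relies on.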

Andrew:

Well, now, that's awesome because we don't have to label any data. We can just gather text data or we can get customer data. So say you have chat records from customer support or you have reviews on your website, you can just put that data into the training process and GPT will learn how to predict the next word. And eventually, it can just generate new stuff. So you could ask it, generate a review with five stars, and it will do a five-star review based on the product category or whatever you want. And so that's the overall idea.

Andrew:

And what we have done is we have taken a GPT model trained on a large portion of the internet, cleaned internet data. And we have wrapped it up in Gretel's console, cloud, and SDK, and given you access. So now, you can either use it as is or you can train it on your own data and get pretty good performance for whatever your text use case is.

Mason:

Okay, cool. So the way that you described it to me sounds like whenever I play with my phone and auto-correct.

Andrew:

Yeah.

Mason:

Yeah, auto-correct. So just ... I know this is completely off topic, but do phones or do any of those predictive text models on any technology device we're using, are those utilizing a form of GPT or is it another model?

Andrew:

So you could easily replicate that functionality with GPT. I don't know if your Android phone or your Apple keyboard is using GPT under the hood. They easily could and they probably are. But, yeah, I mean, you can think of it as just the world's most fancy auto complete in some ways.

Mason:

Okay. Yeah. That sounds really good.

Andrew:

Sometimes in Gmail or in Google Docs, you get suggestions or whatever, that is the exact same thing that's happening with GPT. We just let it keep running. And so it'll write for you in some ways.

Mason:

Oh, that makes a lot of sense now. That makes a lot of sense with some of the data because sometimes it gets a little bit wonky towards the end, it's because we just left it running and it didn't know where to go from there. Okay. It's always fun whenever you answer my questions.

Andrew:

There we go. So there's a few drawbacks. The ... Because it's trained on the internet, there's some concerns that maybe you will ... The internet's not a great place sometimes. Some corners of the internet, you want to stay away from. And so what if you trained on the wrong data? Or what if you leak private data that existed on the internet? And so that's something that we and the community at large take pretty seriously and are working to avoid, but it comes with the territory.

Andrew:

There's a funny xkcd comic about the Google document auto-complete, where they put in as the prompt, meet at the warehouse at, and then let it complete. And the model says 3:00 p.m. behind the docks or whatever. So you're finding out about the secret plan just from auto-complete. And ... Yeah, risk, but hopefully mitigated and we're starting to work towards that.

Mason:

Oh, okay. So it's ... Because I know that Randall Munroe is usually pretty awesome about letting people use his stuff for educational. So it's this one.

Andrew:

This is one, yep.

Mason:

The ... Okay. "Long live the revolution. Our next meeting will be at." Okay. And then it auto-completes, and it's "Aha, found them." That is hilarious.

Andrew:

So this is exactly what happened. And Randall Munroe is describing here, essentially, GPT architecture type prediction, where in the light gray, you see what's completed and in the dark black, you see the original data. In reality, this happens much less often. And if you train on customer data, it happens very rarely. And we have privacy filters and protections in place in Gretel, which is, I think, one of the benefits. You can ... There's GPT models out there that you can use, but we have privacy filters in place that at least give me peace of mind when I use models like this.

Mason:

Awesome. Yes. I ... There is an xkcd for almost everything and it's one of my favorite things to use to teach. I think that ... I've also bought his books, if you've never bought his books, on, what if you tried to do the highest jump possible? And it's like, well, if you have the proper updraft and large enough synthetic wing span, you could totally catch the updraft and jump 60 feet in the air. And that is not what anybody was talking about, but I'm glad that we did the math here to prove that that was possible.

Andrew:

Amazing.

Mason:

So it's absolutely great. Cool. So how is ... So GPT is a predictive text model. So how is it ... What would you say are probably its most common use cases? Where do you think people are getting the most use out of GPT right now?

Andrew:

Yeah. I mean, the possibilities are pretty amazing with GPT. So not with Gretel GPT specifically, but with GPT models across the industry, people use them often for assisted copywriting. So you were trying to write a blog and you maybe get stuck. You don't know what to write next. People use GPT as an idea generation tool. You can also use it to correct your own grammar. So it's in the internet, hopefully, it's learned what good and bad grammar is, and you maybe write something wrong and you can use it to correct.

Andrew:

People use GPT for summarization. So you give a large piece of text and a summary of it. And it learns almost to translate from the large text to the summary. If you've seen the code generation work, people use it to write code. I mean, you can do all kinds of stuff with it.

Andrew:

As far as what people use it for at Gretel, we have people in the financial services using it for customer service type calls. Today, we're going to be doing some fun chatbot stuff and we'll be doing some lyric generation stuff. So writing music, writing songs, and you can do all kinds of stuff with it.

Mason:

Fantastic. So let's go ahead and get into that. I think we've been jibber-jabbering for long enough. Now, let's actually pull out the code and have some fun.

Andrew:

All right.

Mason:

So we have two different notebooks that we're going to cover with you today. We're going to show you. And this is the ... I think one of the great benefits of this is that these are going to be completely different use cases. We're going to do some lyric generation, and then we're also going to do some chatbot text, that word, synthesization.

Andrew:

Got it.

Mason:

There should not be that many syllables. I disagree. Someone get me in touch with the person who created English. But I think the ... One of the things that you're really going to notice here in this code sample is that the code is probably 99% identical. Whenever you want to swap out what text you want to generate, the core setup for training the GPT model is almost the same.

Mason:

Now, there will be some fine-tuning and stuff that Andrew's going to help us with that will probably vary on a case-by-case basis. But the core setup for all of this is going to be almost identical. So you're going to see a lot of the same code, but it's going to do drastically different things, which is pretty awesome.

Mason:

So we're going to start off with our lyric generator. And we basically took a data set that we found that has a bunch of pop song lyrics from a handful of artists. I think it's like 6,000 rows in this data set. And we're going to throw it into GPT and we're going to see what happens whenever we ask it to generate song lyrics for us. Now, I've been playing with this for the last day. And let me tell you, Lady Gaga really messes up the GPT model because of all her vocalized sounds. So it's funny. It makes me laugh so much. I never thought that Lady Gaga would be hurting my model training. And that's not a sentence I ever thought I would say in my life, but here we are. So we're going to move forward with that.

Mason:

So the first thing we're going to do is we're going to go ahead and in our notebook, we're going to install the Gretel client. While we're waiting on that to go forward, I'm going to show you how to use the Gretel cloud to get an API key. And you'll see that these are really straightforward. So I'm just going to go ahead and log in with Google. And I'm actually going to delete my old projects because you can see I've been playing with this a lot. So let's click on this. We're going to delete these. So that way, we start from the ground up. I don't like having ... And I think chatbot text generator was the other one that I had.

Mason:

So these are my models and I'm just going to delete them, clear them out. But whenever you sign up for Gretel, you'll see something here. You won't have any base projects, but you'll have dashboard and you won't see projects here, but it'll be about the same. And then you'll have this API key. So you just come over here and copy your API key. And this is how you would access and use all of ... Well, basically, all of Gretel's model stuff, how you use the API key, whether you're using it in the CLI or the SDK or the REST API. That's obviously a little bit different, but it's the same key regardless.

Mason:

So what we're going to do is we have a data set stored in S3, which is clean lyrics, which Andrew went through. And as you can tell, pop songs sometimes, like he said about the internet earlier, not the nicest place. Pop lyrics, also not always the nicest things to say about people. So we've cleaned up the lyrics for it. And then we're going to do a couple of imports. We're going to import JSON, Pandas. Everybody loves Pandas. And then from our library, our configure session, which basically allows us to authenticate. Poll, which allows us to keep track of how our model training is going. Create or get unique project. I love long method names, basically. Create the project if it's not there, or get it if it's already there, and then get project. So we're just going to go ahead and run that, relatively straightforward.

Mason:

And then now, we come down here to our login to Gretel. As you can see, my previous results are here. We basically configure the session. We could supply the API key as an environment variable, but we actually can say prompt, so it will prompt you for it, which I really like. It's actually one of my more favorite features about this because if I'm using it in a notebook ... I've never tried to use environment variables in notebooks. And I don't know how. So the prompt is nice. Are you going to cache the credentials? What endpoint? Validate, clear.

Mason:

And then we're going to create a new project called lyric generator. So we do that. And then we paste in our key and it says we've logged in. So relatively straightforward. Now, we have a logged-in, authenticated notebook within Gretel, an authenticated session, I should say.

Mason:

Okay. So this is where I'm going to turn it back over to Andrew. And Andrew's going to tell us a little bit about the model configuration here.

Andrew:

Yeah. So here at Gretel, we use this idea of configs. And so you can have a lot of flexibility with how you set up your models. Additionally, we choose sane defaults. And so if you don't want to mess with it, you don't have to. And I think our defaults are actually pretty good. And we have defaults for different use cases.

Andrew:

In this case, we're going to be training the GPT model. So you can see here, we just have a dictionary. We can also do this via YAML and pull it in, whatever you like. We set up the type of model, it's called GPTX. And the pre-trained model we're using is GPT-Neo 125 million parameters. And so this is, excuse me, pre-trained by EleutherAI on the Pile, which is a huge data set. And you can see the model card about this if you're interested in some of the technical details, drawbacks, what have you. And then you get to choose now what machine learning parameters, hyperparameters you want to set up.

Andrew:

So in this case, we're going to do batch size of one and epochs of one. So the GPU that we have on our cloud supports batch size one, batch size two, it depends on how big your data is. You can run this locally if you have a super beefy GPU, and we're in the process of getting beefier GPUs as well. And then there's all kinds of stuff you can set up. These are pretty good defaults: [inaudible 00:21:58] warmups, learning rate scheduler, and the learning rate.
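
As a rough picture of the dictionary being described, the sketch below collects the settings mentioned in the talk (batch size 1, one epoch, a hundred warmup steps, a cosine scheduler, the 125M-parameter GPT-Neo model). The key names, the `schema_version` wrapper, the model identifier string, and the weight decay and learning rate values are assumptions for illustration and may not match the notebook or the current Gretel config schema.

```python
import json

# Key names and nesting are assumptions; only the values called out in
# the talk (batch size, epochs, warmup steps, cosine schedule) come from it.
config = {
    "schema_version": 1,
    "models": [
        {
            "gpt_x": {
                "data_source": "__",  # placeholder, supplied when the job is submitted
                "pretrained_model": "EleutherAI/gpt-neo-125M",  # assumed identifier
                "batch_size": 1,
                "epochs": 1,
                "warmup_steps": 100,
                "lr_scheduler": "cosine",
                "weight_decay": 0.1,  # illustrative value, not from the talk
                "learning_rate": 0.0002,  # illustrative value, not from the talk
            }
        }
    ],
}

# The same structure could be stored as JSON (or YAML) and loaded later.
print(json.dumps(config, indent=2))
```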

Andrew:

Mason, I'm curious, do you want me to talk through some of these because we can get in the weeds if we want to?

Mason:

So I actually ... Top level on them, I actually have already talked about doing another one of these workshops where we just talk about all of these because I feel like I understand very little about it and I'm loving learning about it. So brush over them and then we'll come back in a couple of months and we'll do a deeper dive into all of these.

Andrew:

Okay, perfect. We'll start from the bottom and go up. So learning rate is a parameter that determines how much GPT absorbs from the current data you show it. And so we train these things via gradient descent and we use Adam in this case, and it just says essentially for each step, how much do you pull in. The learning rate scheduler, you can change the learning rate over time. So in this case, you can think of a cosine wave just wiggling. The learning rate oscillates up and down, and that's to keep the model from overfitting too much, keep it from memorizing the data.

Andrew:

We do a hundred warmup steps because this model's pre-trained. If we start right away doing gradient steps, then things just get confused. And then weight decay, again, because this model's pre-trained and we're not training on a lot of data, we don't want it to overfit. And so we need to clamp the weights just a little bit. So these are all things that are a function of the model being pre-trained already.
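
To make the warmup and cosine scheduling concrete, here is a minimal sketch of one common variant: the learning rate ramps up linearly for the first hundred steps, then follows half a cosine wave down to zero. The talk does not say which cosine variant the service uses (a cyclic one would repeat the wave), and the base learning rate and step counts here are illustrative assumptions.

```python
import math


def learning_rate_at(step, total_steps, base_lr=2e-4, warmup_steps=100):
    """Linear warmup followed by a single cosine decay to zero."""
    if step < warmup_steps:
        # Ramp up gently so early gradient steps don't disturb the
        # pre-trained weights too much.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, f"{learning_rate_at(step, total_steps=1000):.6f}")
```

Weight decay is separate from this schedule: it shrinks the weights slightly at each update, which is the "clamping" mentioned above.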

Mason:

Awesome. So in this part of the code, though, what we do is we basically set up a dictionary with all these in here. As Andrew said, you can use JSON or you can use YAML. I feel like that's the new Vim or Emacs and JSON is the correct answer.

Mason:

As someone who loves Python, for some reason, I loathe space-delimited config files. I always mess them up. I can't write YAML. It's awful. It's a bad day. So we'll basically create this JSON file here. And then we have this function down here, which allows us to load and preview the training set. So one of the things you'll see is, we basically create the data set and what we do is we can create a combined column of artist, title, and then the clean lyrics. And then we truncate it to 256 characters.

Mason:

So I'm going to let Andrew explain this because I think he'll explain it better. So Andrew, why do we need to do this concatenation and then truncation?

Andrew:

Yeah, that's right. So the GPT model operates purely on text. And so we need to treat our data as if it were a sentence. And so the separator is actually just a comma. And so we've set that at the top. And so to train GPT on a CSV, that's a lot of acronyms, to train this text model on tabular data, we need to first make it look like a sentence. And then we truncate just to train faster and save space on our GPU. So 256 characters is pretty good truncation. You'll get to see the first verse and then the chorus and maybe the bridge. And that's about all we need to verify that our model is working.

Mason:

Yes. And then we have this dropna, which I had to learn about by Googling because, apparently, there was some random NaN data in here and the model does not like NaN data.

Andrew:

Yeah.

Mason:

So I had to figure out how to get rid of any row that had NaN data in it, which is fine. It wasn't that many. And with how big the data set is, missing a couple of them wasn't going to be a problem.

Andrew:

Exactly.

Mason:

So then we create the data set and then it basically will create a local copy of this data set called finetune.csv. It just creates a ... We can call it whatever we want to call it. We could call it kittens if we want to, but it creates just a CSV for us here locally. That way, we can submit this to the Gretel model.
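
The preparation steps described above (combine columns into one comma-separated string, truncate to 256 characters, drop rows with missing values, write a local CSV) can be sketched as follows. The workshop notebook does this with Pandas; this version uses only the standard library so it runs anywhere. The column names, the output column name `combined`, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile

SEPARATOR = ","  # the talk says the separator is just a comma
MAX_CHARS = 256  # truncation length mentioned in the talk


def build_training_rows(rows, columns=("artist", "title", "clean_lyrics")):
    """Join the chosen columns into one string per row and truncate it.

    Rows with a missing or empty value in any chosen column are skipped,
    which plays the role of the dropna step.
    """
    combined = []
    for row in rows:
        values = [row.get(col) for col in columns]
        if any(v is None or str(v).strip() == "" for v in values):
            continue
        text = SEPARATOR.join(str(v) for v in values)
        combined.append(text[:MAX_CHARS])
    return combined


def write_training_csv(texts, path):
    """Write a single-column CSV to use as the training data source."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["combined"])
        writer.writerows([text] for text in texts)


# Made-up rows standing in for the lyrics data set.
sample = [
    {"artist": "Artist A", "title": "Song One", "clean_lyrics": "la la la " * 100},
    {"artist": "Artist B", "title": None, "clean_lyrics": "missing title"},
]
texts = build_training_rows(sample)
path = os.path.join(tempfile.gettempdir(), "finetune.csv")
write_training_csv(texts, path)
```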

Mason:

So all this code did right here is just go through and get a song, take the lyrics, and then create this ... We created this combined ... Whoop [inaudible 00:25:59], that's ... I'm still learning how to scroll in notebook. Sometimes I scroll in the data set. Sometimes I scroll up here and I lose my place.

Mason:

Well, as you see, we've created this combined column and that's what's going to get submitted to the model. So there we go again, scrolling in the data. It's 6,000 records, that'll take a while to get through.

Mason:

So the next thing we're going to do is we're actually going to train the model. So we've ... These first steps, what we've done so far is we've logged in and we've basically made the data look the way it needs to for the model so we can use it. And now, we're going to train it. And this is actually probably my favorite part about Gretel and about everything in general. Whenever I hear someone say, "Train a model," I think this Herculean task, we're going up Everest. And as you can see, it's like four lines of code.

Mason:

So all we have to do is create a model object with our configuration. We're going to tell it what the data source is, tell it what the name is, and then we're just going to run, submit it to the cloud. And it's going to do the rest for us. This is the advantage you get of GPT with Gretel, is that, yes, there are other GPT models in the ecosystem. You get the ease of use of our APIs to use it, along with all of the other models. So if you're trying to do ... While we're waiting on this, because these things take some time, I would love for Andrew to talk about generating data with multiple models.
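
The four steps just listed can be sketched as a small helper. The method names (`create_model_obj`, `submit_cloud`) and the idea of a polling helper are recalled from the Gretel Python client of that period and should be treated as assumptions; check the client documentation for the version you have installed. The project object and the polling function are passed in as arguments so the sketch does not depend on a particular client version.

```python
def train_in_cloud(project, config, data_source, name, wait):
    """Submit a training job and block until it finishes.

    `project` is assumed to be a Gretel project object and `wait` the
    client's polling helper; the method names below are assumptions.
    """
    # 1. Create a model object from the configuration and data source.
    model = project.create_model_obj(model_config=config, data_source=data_source)
    # 2. Give it a name.
    model.name = name
    # 3. Submit it to the cloud for training.
    model.submit_cloud()
    # 4. Wait for the job to complete.
    wait(model)
    return model
```

With an authenticated session, a call might look like `train_in_cloud(project, config, "finetune.csv", "lyric-generator", poll)`, where `project` and `poll` come from the client library.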

Andrew:

Yeah. So there's no one-size-fits-all solution for generative modeling or for synthetic data generation, which means you need to have the model fit for your use case. So here, we have GPT that's really good at text. We have a couple of other Gretel models. Some of them are good for numerical data. Some of them are good for sparse data, small data, big data, whatever. And so our docs are always a work in progress, but they're pretty good at outlining which models are good for which tasks.

Andrew:

I mean, honestly, it's about as simple as swapping out the name in your config file and you can just train a new model, which I think is pretty cool.

Mason:

That is really cool. And I think right now, we currently have two models that are available to the general public, which is our GPT model that we're talking about today. And then our LSTM. Is that long short-term memory? Is that what that's ... I can never ... Okay, good. Acronyms. I thought I had all the acronyms whenever I was in the DevOps space and now I'm in AI/ML and all my acronyms have changed. So there's two current models with a lot more coming, which is really exciting.

Andrew:

Yeah.

Mason:

And now, we're just waiting on this model to train. So is there anything else you want to just chat about while we wait? This usually only takes about three to five minutes.

Andrew:

Yeah. I mean ... Yeah, we won't be here for very long. We have ... Our engineering team has set up a pretty nice training infrastructure in the background. So these go quick.

Andrew:

Yeah. So there's all kinds of stuff we can talk about here. One thing that I think is interesting to talk about is that while the GPT model is really good at text, it's also really good at things that are not text, surprisingly enough. So if instead of lyrics and whatever, you had numbers in there, or you had dates in there, or you accidentally had, I don't know, something else, names, latitudes, whatever you want. The model is general purpose enough that it actually wouldn't do a bad job and you could generate whatever you want, which there's some danger there, but I've always been just impressed and somewhat shocked.

Mason:

Yeah.

Andrew:

[inaudible 00:29:30]. So there's a question in the chat.

Mason:

Yes. We have a question from Saeid who asks, "What if we cannot share our raw data due to privacy?"

Andrew:

Awesome question. So you can run all of this on your own infrastructure. And so we provide a Gretel worker, the Gretel client and everything, and the code you write would look the same, but the endpoint that you point at would not be our cloud. It would be whatever your local cloud is. So, yeah, you can do totally private internal training.

Mason:

Yes. You still will need a Gretel account and API key because it still ties into the API. It doesn't submit any of the data, but it sends, like, telemetry data and usage data, but you 100% can generate synthetic data, both with GPT and with LSTM, and I think probably with every one of our models going forward on your own infrastructure.

Mason:

Now, I assume you'll probably need GPU for that, correct? If you're training the model locally, you're going to need GPU. So it can get expensive and difficult. Anytime you say the word G as in GPU, in a DevOps context, the words expensive and difficult come to mind. I did this for a while whenever I was in a past life, in a previous role. I helped set up machine learning infrastructure at a Fortune 500 company and it can get pricey really fast. So that's one of the benefits to using the Gretel cloud is you don't have to worry about setting up your own stuff, but 100%, we knew going in that there were going to be data privacy laws, that there are going to be people who would not be legally allowed to or just would not want to share their data into the cloud, rightfully so. I completely understand that. So we have a solution there for you.

Andrew:

Yep. That's a great question.

Mason:

Yeah, great question. Love answering the questions. So-

Andrew:

Yeah, it's trained.

Mason:

It's trained. So as you see, it only took about four minutes based on wall time. Pretty good. We only ran this through one epoch with a batch size of one, which we'll talk about later, because I trained this model a lot yesterday and I learned a lot about what happens when you train with different ... Is it "ee-pocks" or "epics"? Is there actually a [inaudible 00:31:34]-

Andrew:

Depends on who you ask.

Mason:

Okay, good. Then I'm going to say "ee-pock" because "epic" makes me think of, like, Jira and I don't want to think about Jira, right? Okay. So we've trained the model and now all we have to do whenever we want to actually just create more synthetic data for ourselves is use this create record handler object. So basically, it just is going to create some records for us.

Mason:

We're going to tell it that we want 20 records and we're going to give it a maximum length of 128. We set this up here to ... This is just Pandas overhead, setting how Pandas will display the data. Then we basically just take that record handler that we've created. We submit it to the cloud and we wait and then we just read the CSV of what we get back and we'll see it come up.
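
The generation step can be sketched in the same style as training. The method name `create_record_handler_obj` matches what is said in the talk, but the parameter keys (`num_records`, `maximum_text_length`) and `submit_cloud` are assumptions recalled from the client of that period. The trained model and the polling helper are passed in as arguments.

```python
def generate_records(model, wait, num_records=20, max_length=128):
    """Ask a trained model for synthetic records and wait for the job.

    `model` is assumed to be a trained Gretel model object and `wait`
    the client's polling helper; parameter key names are assumptions.
    """
    params = {"num_records": num_records, "maximum_text_length": max_length}
    handler = model.create_record_handler_obj(params=params)
    handler.submit_cloud()
    wait(handler)
    return handler
```

The returned handler is what the notebook then reads the generated CSV from before displaying it with Pandas.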

Mason:

So let's go ahead and submit this. I'm going to log in to LinkedIn real quick because I don't think we're getting ... If there are any questions in LinkedIn, they're not coming in via StreamYard. So if you have questions on any of the platforms, please feel free to ask. I don't see any on this one. But, yeah, so that was a great question from Saeid and now we wait.

Mason:

This also usually doesn't take too long. As you can see, it has allocated a worker in the backend for us and now it's going to create some synthetic lyrics. And ... Okay. So now, this is going to be like synthetic lyrics with, I want to say, no guidance, but we just said, "Give us what you got." We didn't say, "Oh, I want something that has" ... We didn't start with the auto text predict, like I want to go to the. We didn't do that.

Mason:

So would you say, Andrew, that these are probably going to be more of the Wild West responses since we didn't give them any prompts?

Andrew:

They certainly could be. Now, you could prompt it with something totally crazy and the model wouldn't know what to do. But in this case, it's just going to go with the flow in some sense. It's going to start from a blank slate and it's going to generate what it thinks lyrics should look like. And so sometimes they rhyme. Sometimes there's recognizable bits and it's pretty cool. I'm excited to see what we produce here.

Mason:

Yes. Every time we ran this yesterday, we got some really interesting stuff. Some of them were ... Okay, I don't know if they would win any Grammys. It's also hard to evaluate lyrics without music for me. But ... And then there were some where it literally just sounded like Lady Gaga was making noises into the microphone. And those made me laugh.

Andrew:

Yeah, [inaudible 00:34:03]. You mentioned at the beginning that her vocalization is pretty good. We saw a lot of just a, a, a, a, a, but it's funny that it was like three a's and then one a and then two a's and two a's. So there's a beat to it. So I bet if you were to, as you say, put music to it, it probably would sound pretty good.

Mason:

Yeah. Yeah. That's fascinating. This stuff. I'm still in the phase where this just feels like magic to me. So I just enjoy reading it. Okay. So ... Okay. Are we-

Andrew:

I think we skipped one. So we're up.

Mason:

Did I do ... Yeah, I always do that. Okay. Again, I'm still learning how to scroll on ... Okay, here we go. We've got some ... Oh, no, we've got some weird code stuff. So I told myself I'd never lose the day of my life and the world would tell you the truth and all my life would be wrong for me. That's actually not bad. I could ... That sounds very much what angsty Mason listened to in eighth grade.

Mason:

But as you can see, some of them, when you leave things to their wild devices, they go a little bit off. But, yeah, I'm pretty happy with this. This one decided it wanted to swear. I'm glad we cleaned that up.

Andrew:

Yeah. So something that's really nice about this is you can generate as many as you want and you can do quality controls on your side. So you can maybe have something that detects rhymes based on a rhyme dictionary, and you can generate just thousands and thousands of these and choose the best ones that you want. And we saw it didn't take that long. And so, yeah, it's a really nice workflow.
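
The generate-then-filter workflow Andrew describes could look roughly like this. It is a minimal sketch, not anything shown in the workshop: instead of a real rhyme dictionary it uses a crude "same last two letters" rule, and all function names and the 0.5 threshold are illustrative assumptions.

```python
import re


def last_word(line):
    """Return the final alphabetic word of a line, lowercased, or ''."""
    words = re.findall(r"[a-z']+", line.lower())
    return words[-1] if words else ""


def naive_rhyme(a, b, suffix_len=2):
    """Crude stand-in for a rhyme dictionary: different words, same ending letters."""
    if not a or not b or a == b:
        return False
    return a[-suffix_len:] == b[-suffix_len:]


def rhyme_score(lyrics):
    """Fraction of adjacent line pairs whose last words rhyme under the naive rule."""
    lines = [ln for ln in lyrics.splitlines() if ln.strip()]
    if len(lines) < 2:
        return 0.0
    pairs = list(zip(lines, lines[1:]))
    hits = sum(naive_rhyme(last_word(x), last_word(y)) for x, y in pairs)
    return hits / len(pairs)


def keep_best(candidates, min_score=0.5):
    """Keep candidates at or above the threshold, highest score first."""
    scored = sorted(((rhyme_score(c), c) for c in candidates),
                    key=lambda item: item[0], reverse=True)
    return [text for score, text in scored if score >= min_score]


candidates = [
    "I walked along the shore\nI could not ask for more",
    "The duck was very happy\nThe sky was blue today",
]
best = keep_best(candidates)
print(best)
```

Because generation is cheap, the filter can be strict: generate far more candidates than needed and keep only the ones that pass.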

Mason:

And I think you had mentioned it earlier or maybe I just read it somewhere, but usually when you're doing GPT generative stuff, there still needs to be a human in the middle to filter out stuff. Is that still the case? You need to verify that the text data that you've generated is actually valuable, and that's pretty much true of all GPT models at this point, right?

Andrew:

That's right.

Mason:

Okay. Okay, cool. So Saeid has another question for us. How different are Gretel models to the OpenAI GPT-2 or GPT-3 ones?

Andrew:

Yeah. Again, great question. So the GPT-2 model is 1.5 billion parameters trained on a proprietary data set that OpenAI holds. And GPT-3 is 175 billion parameters, again, trained on a proprietary data set.

Andrew:

The model that we have is 175 million, so it's a 10th of the size and it was trained on an open data set. So you can inspect the data that it was trained on. The combination of those two factors simply means it's much faster to fine-tune on your own data, much faster to generate, and you have a better idea of what's going to come out of it, but there's no difference architecturally aside from size.
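
As a quick check on the sizes quoted in this answer, the ratios can be worked out directly. The parameter counts below are the figures as stated in the talk, not independently verified numbers.

```python
# Parameter counts as quoted in the talk.
gpt2_params = 1.5e9     # GPT-2
gpt3_params = 175e9     # GPT-3
gretel_params = 175e6   # the Gretel GPT model, as stated by the speaker

ratio_to_gpt2 = gretel_params / gpt2_params
ratio_to_gpt3 = gretel_params / gpt3_params

print(f"vs GPT-2: {ratio_to_gpt2:.3f} of the size")
print(f"vs GPT-3: {ratio_to_gpt3:.3f} of the size")
```

On these figures, "a 10th of the size" is a rounded comparison against GPT-2; against GPT-3 the model is about a thousand times smaller.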

Mason:

Fantastic. Awesome. So now, let's go play with the prompts. So ... And we're going to do something different today because I feel like this is weird. I feel like I don't know why corn kept coming to my brain. So I would say let's do a Lady Gaga song here, but I'm really afraid of getting just all the vocalizations.

Mason:

So I think the next thing we're going to do is we're going to actually prompt the model. So Andrew, can you explain just what prompting the model is really quick?

Andrew:

Yeah, we've done some examples of it so far, but to get a little bit technical and exact with it, if we want the model to have some context, just have something that it's already thinking about, and thinking is maybe the wrong word to use here, but we need to prompt the model. And so you prompt it in the same way you structured the data. So we have artist, song name, and then the clean lyrics from before.

Andrew:

And so in this case, we say, "Well, we want a Taylor Swift song and we want the song name to be something that we choose." And the first few words will be something else, right? Yeah. So we want Taylor Swift to sing a song about happy ducks. And you can say nothing or you can add a few lyrics.

Mason:

Okay, let's try this. Let's go with ... We're going to say Bilbo was a happy duck ...

Andrew:

Yep.

Mason:

... who, and I want to see if it's going to verb properly here. This is what happens when you make me do things off the top of my head. You get weird stuff like this. So now, let's go ahead and just run this and see what we get.
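
A prompt that mirrors the training columns (artist, song name, lyrics) can be assembled as a single string. This is a sketch under assumptions: the comma separator, the `build_prompt` helper, and the song title "Happy Ducks" are all hypothetical, since the transcript does not show how the notebook joins the fields.

```python
def build_prompt(artist, song, opening_words="", sep=","):
    """Join fields in the same order as the training columns: artist, song, lyrics."""
    fields = [artist.strip(), song.strip(), opening_words.strip()]
    return sep.join(fields)


# Prompt with a few opening words, as in the demo.
prompt = build_prompt("Taylor Swift", "Happy Ducks", "Bilbo was a happy duck who")

# Prompt with artist and title only, leaving the lyrics entirely to the model.
open_prompt = build_prompt("Taylor Swift", "Happy Ducks")

print(prompt)
print(open_prompt)
```

The important part is that the prompt follows the same field order and separator as the fine-tuning data, so the model continues in the lyrics position.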

Andrew:

So you can see in the config now, there's a few extra options. So we already talked about the number of records and the maximum text length, but then there's top P, top K, number of beams, what have you. We try to expose as much control as we can to the user of these models. And so if you have some specific hyperparameters that you want to try out, you can do that through the Gretel config as well.

Andrew:

So in this case, top P, we've chosen a number that we've found to be really, really good in general. But if you think top P of 0.9 is better, you're welcome to use that in your generation.

Mason:

Sounds good. I have no idea what top P and top K are.

Andrew:

Yeah. It's just a way to steer the model a little better to hopefully keep it from going off the rails. But sometimes it works, sometimes it doesn't.
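
For readers who, like Mason, have not met top K and top P before: both restrict which tokens the model may sample next. The toy sketch below applies each filter to a made-up next-token distribution. It illustrates the general technique only and is not Gretel's implementation; the tokens and probabilities are invented.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Invented distribution over the next token.
next_token_probs = {"duck": 0.5, "pond": 0.3, "sky": 0.15, "if": 0.05}

by_top_k = top_k_filter(next_token_probs, k=2)
by_top_p = top_p_filter(next_token_probs, p=0.9)

print(by_top_k)
print(by_top_p)
```

Both filters cut off the unlikely tail (here the stray `if` token), which is what keeps generation from "going off the rails"; a lower `p` or `k` is more conservative, a higher one more varied.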

Mason:

But I definitely would say that one of the benefits of this is, like ... So I'm able to generate synthetic data that you can see is actually pretty valuable. I'm not a machine learning engineer or a data scientist. This ... One of the really big benefits of the Gretel platform itself is it is giving access to synthetic data to more people who probably wouldn't have understood it before. That's why we have Andrew here because I don't know enough to explain all this deep-level stuff to you. I'm just an average old DevOps developer. But as you can see, I can create synthetic data and I can see use cases where I would've used this if I was still in an engineering world.

Mason:

So it's one of my favorite parts about Gretel is whether you're an expert or a novice, you can still get tremendous amounts of value out of it. It's like the ... You get a high level of value and then it only goes up from there the more that you know.

Mason:

Okay. So Bilbo was a happy duck who got a chance to shine in the sunlight, so happy. They can be happy. Okay. This does sound like Taylor Swift. Whenever I can ... I don't know, it's still because [inaudible 00:40:14]-

Andrew:

So I'm a huge Taylor Swift fan. And so [inaudible 00:40:15].

Mason:

You tell me, you go right ahead. Go right ahead.

Andrew:

Yeah, I would jam to this. This is so good. Bilbo was a happy duck who could talk about love like a duck who could kiss a happy duck [inaudible 00:40:26] world.

Andrew:

So I think there's something important here that I want to point out. So we're doing this fun example, it's a little bit silly, but the idea here is you can get the model on topic and it can generate things on topic. So these are song lyrics. They're repetitive, of course, but they are about a duck, they're by Taylor Swift, and they're happy, right? And I think we're so used to magic on the internet that we skim over just how groundbreaking this is in the past few years. This was impossible just five years ago. And now, you can just, at the touch of a button, get it to work.

Andrew:

And so there's tons of other use cases. I highlighted one at the beginning, but say instead of artist name, song name, whatever, you have number of stars in a review, sentiment of that review, and then the review text. So you can change the sentiment how you want. You can generate a three-star review that's positive. You can generate a three-star review that's negative. You can do anything like that.

Andrew:

Really, it's up to your imagination. And it's just cool that it works. I mean, as you say, it's not going to win any awards, but it's coherent and it's relevant.

Mason:

I would like it known that if Taylor Swift wins a Grammy for any of these songs, everyone keep an eye out.

Andrew:

Came here first.

Mason:

We at least get the credit. And maybe half, maybe we cut the Grammy in half, put it somewhere. But if she comes in and steals these lyrics, then we know that ... That would actually be really interesting. I would be really curious to see. I know there has to be some techno grunge musician out there that's generating all their lyrics via GPT or something. But that would be really interesting to do. But, yeah.

Mason:

So we have this. Now, there is one question that I wanted to ask because this is something that I experienced yesterday and we briefly touched on earlier is that epochs are ... I think you had mentioned earlier, but I'm going to reiterate, epochs are the number of times that the model sees the data.

Mason:

So in my brain, my naive brain was like, "Oh, that would make the model better." But yesterday, as I added more and more epochs, the quality of the text actually devolved to a point where ... I went with one, five, and 10. And by the time I got to 10 epochs, literally, it was spitting out source code, legit if and Boolean statements and stuff. So why would the model seeing the data more ... Why would that cause the text quality to maybe decrease in some instances?

Andrew:

Yeah. I'm glad you asked that. So the first answer is a lot of the machine learning world is still somewhat of an art instead of a science. And we're working really hard to develop first principles approaches to understanding why models do what they do. But we have a lot of intuition, a lot of answers there.

Andrew:

In this case, you mentioned there's about 6,000 rows. In the grand scheme of things, that's not very many. And so we're training this new model that was pre-trained already on the internet on a small amount of data. And the learning rate is also pretty high. And so essentially, what's happening is we are forcing the model to forget a lot of what it's already learned. And if you do that over and over and over again, there's not enough data to be reabsorbed into the model. And so it will just degenerate.

Andrew:

Now, in this case, just doing one pass over the data, it still gets to remember a lot of what it knows about the world that it got from its pre-training. And then we just add a teeny bit of extra lyrics, Taylor Swift, blah, blah, blah, so that it can use all of that together to generate this stuff.

Andrew:

Now, if we just push its brain full of lyrics, it will get confused and there's not enough lyrics. And so it degenerates in that way.
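
One way to see why extra epochs hit so hard here is to count weight updates. The sketch below uses the figures mentioned in the session (about 6,000 rows, batch size of one, and runs of 1, 5, and 10 epochs); it assumes one optimizer step per batch and ignores details such as gradient accumulation.

```python
import math


def optimizer_steps(num_rows, epochs, batch_size):
    """Number of weight updates, assuming one update per batch."""
    return epochs * math.ceil(num_rows / batch_size)


rows = 6000  # approximate size of the lyrics data set mentioned in the talk

steps = {epochs: optimizer_steps(rows, epochs, batch_size=1) for epochs in (1, 5, 10)}
for epochs, count in steps.items():
    print(f"{epochs:>2} epoch(s): {count} updates on the same {rows} rows")
```

With a batch size of one, ten epochs means ten times as many updates drawn from the same small data set, each one nudging the model further from what it learned in pre-training.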

Mason:

Oh, that's absolutely fascinating. This is still magic to me. And it's a pleasure every day to get to play with this, but I never would've thought of that. It's so cool.

Andrew:

Yeah. I think it's surprising how some of these models behave at times.

Mason:

So basically, just play with the parameters, see what happens. If it goes the wrong direction, then go back the way you came. Maybe one pass is better than 10 passes as we could see here. We don't have time to retrain because the more epochs you add, the longer it takes to train.

Mason:

So ... Okay, so we have this question here from Laura just came in. Is there a minimum and/or maximum data size you recommend using and still being effective?

Andrew:

Yeah, really great question. So one of the surprising findings of the original GPT paper was that you maybe don't have to actually train the model at all. You can just prompt it. And so they found that as few as 32 examples was sufficient to prompt the model in a certain direction. So we got down to 25 and it still was working a little bit, but, yes, you can prompt a model and use very few examples.

Andrew:

Now, if you're actually going to train the model instead of just prompting, yeah, we recommend something in the high hundreds, low thousands as a very, very minimum. And then you want to do a low, really low learning rate, really low number of epochs. As far as a maximum, in some sense, as long as you're willing to wait. But ... Yeah. So adding more data just increases the training time. And so if you're doing it locally, really, as long as you want. There are some time limits on the cloud that are in the documentation. And I don't have a good number there for you, I'm sorry, but you could do quite a few.

Mason:

Yeah. We actually saw that yesterday when we first started playing with this model. I tried ... And so you saw how we truncated. We didn't really talk about this, but we truncated at 256. I originally sent all of the song data to it and it started allocating gigs upon gigs upon gigs of data, which was just too much for what we were trying to do. And we were getting errors because we had sent too much data. So you can choke it with too much data for certain.
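
The truncation step Mason mentions can be sketched as a small helper. This assumes the 256 limit is a character count (the transcript does not say whether it was characters or tokens), and `truncate_text` is a hypothetical function, not the notebook's code.

```python
def truncate_text(text, max_chars=256):
    """Cut text to at most max_chars, backing up to the last space when there is one."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    head, _, _ = cut.rpartition(" ")
    # Fall back to a hard cut when the slice contains no usable space.
    return (head if head else cut).rstrip()


lyrics = "la " * 200          # 600 characters of stand-in lyrics
shortened = truncate_text(lyrics)
print(len(lyrics), "->", len(shortened))
```

Applying a limit like this to the lyrics column before training keeps every record short, which is what avoided the memory errors described above.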

Mason:

I would imagine the more data you add, the more GPU and more hardware resources you're going to need, which makes it an interesting problem to solve, for sure.

Andrew:

Yeah. But the original model was trained on a ton of data. And so, really, the answer is however long you're willing to wait.

Mason:

Yes. How long did ... Out of random curiosity, do you know how long it took for us to train Gretel GPT?

Andrew:

Yeah. So Gretel GPT was trained by EleutherAI, which is an open-source collective of researchers who train these models. And ... Yeah. So this particular model was trained for 300 billion tokens, which you can think of as words, which typically takes something on the order of weeks to train.

Mason:

Okay.

Andrew:

So it was trained for quite some time on quite a lot of data.

Mason:

That's really cool. Okay. So now, we're going to move on. We're done making Taylor Swift say silly things. But what we're going to do really quickly, I ... And we're going to go ahead and go over the chatbot example that we have. And as you're going to see, this code looks almost identical. I mean, this line, obviously, changed because we changed what CSV we're getting.

Mason:

So we're going to go ahead and do the exact same thing, but we're going to do it with some slightly more relevant or maybe more useful chatbot data. And it's not even really chatbot data as you'll see. This data is a full conversation. So it's basically a call and response back and forth. Maybe think like Facebook messenger or something.

Mason:

So we have that. We've done the pip install. As you can see, none of this code here ... We're actually going to do side-by-side comparisons. I love doing side by side.

Mason:

So if you look at this code from this code, it basically, excuse me, doesn't change at all. I need to get my Gretel key again. I probably still had it in my clipboard, but never just paste random stuff into a thing. Always verify you have what you need. We're going to log in to Gretel. Again, we haven't changed the config either for this. So it's the exact same config. So we're going to go ahead and create that. The data set is the same.

Mason:

Now, the part of the code that we had to change here is we did have to change this line of code. For one, we didn't have the same columns, so you're going to have to change this to create your own columns. So instead of it being artist and song, it's call and response. And then we ... Because these are relatively short, we didn't have the need to truncate this time. And because there was no random null data that was making me cry, I decided that we didn't need to do it here either. But as you can see from the rest of it, it's the exact same code.

Andrew:

Yeah. And we have blueprints on our website on our GitHub for these sorts of things.

Mason:

Yes. And I will definitely link and go over those here in a second. Well, actually, we'll probably do that right now. We're going to train the model and we're going to go ahead and do that right now. So if we go to, let me pull it up over here, github.com/gretelai/gretelblueprints, we have a whole plethora of notebooks that are, basically, almost production ready, where you can swap in your own data set. And you can do local ... So Saeid had asked earlier about what to do if you can't send your data to the cloud due to privacy. Well, we have these local ones which allow you to do it locally on your machine.

Mason:

We have notebooks on doing walkthroughs, using differential privacy, retain values with conditional data, and a whole bunch of other really interesting ones. Is there anyone in here that you want me to click on or explore [inaudible 00:50:15]?

Andrew:

We don't have to click on any, but we had a community question earlier.

Mason:

Oh, yes. Which one ... From our chats.

Andrew:

Yeah. So the community question about using three years of data to generate 10 years of data.

Mason:

Okay. Yeah. Let me go ahead and get that here really quick, so people can ... There we go. So we have this question, how can we use three years of data to create the previous 10 years of data? For example, I have data from 2019 to 2022, and I want to be able to create synthetic data from 2010 to 2022.

Andrew:

Yeah. So we have something in our model or in our suite of tools called seeding. And it's essentially just conditional generation or, as we've seen today, prompting. And so what you do is you train what's called a seed model on your data, where the seed field is your dates. And then you prompt the model with new dates from 2010 to 2022.

Andrew:

And so the model can learn all of the distributions and patterns and everything from your three years of data and then generate back to 2010. And it will look realistic and it will have a lot of the same qualities that you care about. And so it's as simple ... And so you use the blueprint on conditional synthetic generation by maintaining values or by seeding, and it works great. So, yeah, this was a great question and, honestly, a really great use case for Gretel.
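
For the seeding approach Andrew outlines, the seed values are simply the dates you want records for. The sketch below builds one seed row per day for the years missing from the asker's data (2010 through 2018). Daily granularity and the `date` column name are assumptions, and feeding these rows to a seed model is left to the conditional-generation blueprint he refers to.

```python
from datetime import date, timedelta


def daily_seeds(start, end):
    """Yield ISO-formatted dates from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


# The asker has 2019-2022, so the gap to fill is 2010-01-01 through 2018-12-31.
seed_dates = list(daily_seeds(date(2010, 1, 1), date(2018, 12, 31)))

# One seed row per date; "date" is a hypothetical name for the seed field.
seed_rows = [{"date": d} for d in seed_dates]

print(len(seed_rows), seed_rows[0], seed_rows[-1])
```

If the real data is weekly or monthly, the step size should match, so the seeded dates follow the same pattern the model saw in training.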

Mason:

Yes, that's awesome. I love it. And then let's go ahead and check back and see. It looks like we're still training a little bit, so we'll continue to go through. So we have these notebooks. So gretel.ai/gretelblueprints, you can also find them from our docs, docs.gretel. ... I would think I could type, but that's a skill that's beyond me.

Mason:

So for anything you want to do, so for synthetics and stuff, we have these SDK notebooks in here, which basically link to the same thing. So these are another way of finding them. You can find all of our documentation notebooks on both synthetics, transforms, and classify, which we haven't really talked about transform and classify. These are privacy, basically, helper ... You can think of them as helper functions or helper products. Transform allows you to transform PII into a different format. So that way, you're not leaking any data. And then classify helps you detect it, I think, using natural language processing.

Mason:

So these are really cool. They're really cool actually. I love transform. Whenever I was working in DevOps, that's the one I would've loved the most is just transforming customer data so I don't have to deal with it. And ... Yeah.

Andrew:

So there's a natural question here. When would you use transforms over synthetics?

Mason:

Yes, it's a great question.

Andrew:

Because it seems like they do the same thing.

Mason:

Yes.

Andrew:

Transforms is just so much faster because you're using your data already and just transforming it directly into new data. The drawback is, of course, you can't generate more of the same, but it's guaranteed in some sense to be protected.
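
The distinction Andrew draws can be illustrated with a toy transform: each input record is rewritten in place, so you get exactly as many records out as you put in. This is not Gretel Transform and offers no privacy guarantee; the regular expression and the hash-based replacement are illustrative assumptions only.

```python
import hashlib
import re

# Simplified email pattern, good enough for a demonstration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize_email(match):
    """Replace an email with a stable fake address derived from its hash."""
    digest = hashlib.sha256(match.group(0).lower().encode("utf-8")).hexdigest()[:10]
    return f"user_{digest}@example.com"


def transform_record(text):
    """Rewrite one record, swapping every email address for a pseudonym."""
    return EMAIL_RE.sub(pseudonymize_email, text)


records = [
    "Ticket from jane.doe@corp.com: login fails",
    "No personal data in this line",
]
transformed = [transform_record(r) for r in records]
print(transformed)
```

A transform is a single pass over existing rows, which is why it is fast; a synthetic model has to be trained first but can then produce as many new rows as needed.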

Mason:

Fantastic. And then ... But synthetics is you want more of the same or when you use them together, you get an extra layer of privacy, right? Transforms is great by itself. Synthetics is great by itself. Together, they're even better by themselves. That's a ... It didn't make sense, but it worked in my brain. So that's all that matters. So we want to use those. And then something like classify, I would've loved to have used classify on incoming log streams and stuff. It would've been really cool.

Andrew:

That's great. Yep.

Mason:

So ... And then there's another repository where you'll be able to find this lyrics generator, which I haven't uploaded yet. We have a "fun with synthetic data" repository. We have another GPT model in here, which we call Gretel Bartender, which allows us to create synthetic cocktail recipes. It's pretty fun. It was a fun little one. But I will also be uploading this lyric generator to this repository as well. So if you want to get that after the fact, so the Gretel blueprints repository, I will drop this in chat for anyone who wants it.

Mason:

So it's there in the YouTube chat. And then see if there's ... We'll drop it here in the LinkedIn chat as well.

Mason:

Okay. Let's go back and check on our model. Looks like we're done. Okay. So we've created our model. And now, all we're going to do is we're just going to generate five data sets, just to make this go a little bit faster, because we're almost coming up on time. But we're going to go ahead and generate five rows of this data and then we could also prompt it. We'll also try to prompt it here in a second, too.

Mason:

If you have any questions, please feel free to post them in the chat. We're happy to answer them.

Andrew:

Yeah. I just want to call back to something you said at the very beginning. You said we're going to go through two examples. The code's essentially the same.

Mason:

Yeah.

Andrew:

And I just think I have seen these before, but I'm like, "Oh, wow, the code really is-"

Mason:

Yeah, no, I 100% copy-pasted. And I was like, "This would be ... If I got five of these assignments in school, I'd be like hahahahaha."

Andrew:

Yeah, exactly.

Mason:

Change the CSV file, change a couple of variables, done.

Andrew:

That's it.

Mason:

They really are exactly the same. Okay, I want to go ... Okay. So really, I want to go to the beach. That sounds like a great idea. I like it. I don't want to go to the beach like that. I love it. It sounds so bad. Okay.

Mason:

So as we see, as it gets a little bit longer, gets a little bit more loopy, but this is legitimately valid. I mean, are you going to win orator of the year? No. But is it valid text data? 100%. So now, let's come down here and let's prompt it with, how are you today? I'm doing fine. Thank you. And then let's see what it spits out after that.

Andrew:

Yeah. So one use case we've been seeing is people want to generate text data for their logging infrastructure. And so logs are text, right? So you have something in your application that is spitting out, this event happened at this time, blah, blah, blah. If you wanted to generate more logs that fit some criteria, this is absolutely the way to do it. And they may not be the perfect logs, but they will look like logs. And that's really what matters in this case.

Mason:

Very valuable, too. That's awesome. What if ... I'm just going to keep hitting with the hard questions. What if there's a really rare log event that happens and I want more of that specific log event so I can test against it? Is that something that can happen here, too?

Andrew:

Yeah, absolutely. So you can just prompt it to generate more of the rare log event. So say info is a really common log event you see all the time, but maybe error is less common, you can just prompt it to generate error logs. And given that there are sufficiently many error logs in your actual training data, it will easily be able to generate more varied error logs for you.
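
Prompting for a rare log level could be set up like this. The sketch assumes the level is the first token of each log line and that a prompt is just the beginning of a line; the log lines and helper names are invented for illustration.

```python
from collections import Counter


def level_counts(log_lines):
    """Count lines per level, assuming the level is the first whitespace-separated token."""
    return Counter(line.split(maxsplit=1)[0] for line in log_lines if line.strip())


def prompts_for_level(level, n):
    """Build n prompts that each start a log line with the desired level."""
    return [f"{level} "] * n


# Invented training logs: INFO is common, ERROR is rare.
training_logs = [
    "INFO user logged in",
    "INFO page rendered",
    "INFO cache hit",
    "ERROR database timeout",
]

counts = level_counts(training_logs)
rarest = min(counts, key=counts.get)
prompts = prompts_for_level(rarest, 3)

print(counts, rarest, prompts)
```

As Andrew notes, this only works when the training data contains enough examples of the rare level for the model to have learned what such lines look like.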

Mason:

Yeah. Awesome. I love that. It's so cool. I'm going to play with that next. So as we can see, we've pre-seeded the conversation with, like, "How are you today?" "I'm doing fine. Thank you." And we just get more basic ... Well, this poor one got laid off. I feel sorry for this synthetic data worker. But everything else is pretty spot on, pretty straightforward.

Andrew:

Yeah, we're not going to lay you off, GPT. Don't worry.

Mason:

No, we're not. No, you're good. Okay. So we have one more question from the community and then we have some swag giveaway for those of you that have stayed or those of you that are watching this after the fact. So the last question that we have in the community, and as you saw, we tried to sneak these in, but this one didn't really fit in, but we're going to answer it anyway. Can data be loaded onto the platform to constantly update our input into the synthetic data sets?

Andrew:

Yeah. So I think this community member is talking about something like active learning or online learning. So let's say, for example, that people are still writing new songs and how do you incorporate those songs into your data set? And so the easiest way is to append them to your CSV and retrain your model. And we saw it only took four minutes. And so this is something where you can run a cron job every hour to update the model. And it takes almost no time.

Andrew:

And honestly, that's what I would do if I were trying to build a system like that. Just append to your CSV, retrain, and you're good to go.
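
The append-and-retrain loop can be sketched with the standard library. Everything here is illustrative: the column names follow the lyrics example, the file lives in a temporary directory, and `retrain` is a placeholder that only counts rows where a real script would submit the CSV as the training data source.

```python
import csv
import os
import tempfile


def append_rows(csv_path, rows, fieldnames):
    """Append dict rows to a CSV, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


def retrain(csv_path):
    """Placeholder for submitting a training job; returns the row count it would use."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.DictReader(f))


# A cron entry such as "0 * * * * python retrain.py" would run a script hourly.
path = os.path.join(tempfile.mkdtemp(), "lyrics.csv")
fields = ["artist", "song", "lyrics"]
append_rows(path, [{"artist": "A", "song": "S1", "lyrics": "la la"}], fields)
append_rows(path, [{"artist": "B", "song": "S2", "lyrics": "na na"}], fields)
row_count = retrain(path)
print(row_count)
```

Retraining from the full CSV each time is simple and, at four minutes per run, fits comfortably inside an hourly schedule.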

Mason:

Fantastic. Okay. So let's zoom this in. Thank you for all your help today, Andrew. And we're almost out of time. So everyone, thank you for attending. We hope ... If there's any more questions, we've got about two minutes left, you can drop them in the chat. But if you don't have any questions, you can go to this QR code, you can scan it, or go to the link that you'll see here in the banner box below, grtl.ai/gpt-workshop. And we'll send you some stickers.

Mason:

You may have to wait a teensy bit because we're still getting them printed. But if you've enjoyed it today and you want some Gretel stickers, we have a lot of really fun designs that we've come up with that people, I think, are really going to enjoy. I've taken them to one conference already and people loved them. So go ahead and you can either scan this QR code or go to grtl.ai/gpt-workshop. You have a week.

Mason:

So before July 20th at 12:00 Eastern Time. So that way, for anyone who's watching this, if you didn't get a chance to watch it live, but you did get a chance to watch this and made it by the end of the week, you're still able to get some stickers as well. And if you aren't able to get them this time or you miss out, we're going to be doing more of these live events in the future. So keep coming back, we'll probably keep giving away some stickers and some cool swag. Who knows what fun things we'll come up with? I enjoy different things from time to time.

Mason:

Ah, well, we made it, Andrew. That was pretty fun. How's ...

Andrew:

I love it. See, the thing is, I could talk about this for five more hours. This is what I've lived and breathed. So I really appreciate you having me on. I appreciate you asking me the tough questions.

Mason:

Hey, I'm glad that the questions that I asked actually made sense. I feel like I understand a little bit more about this space every day, which is really fun because I've been in DevOps for so long. You learn some new things, but when you do a hard shift over into machine learning and data science, it's a completely new world. It's literally the Aladdin GIF of a whole new world.

Andrew:

There we go.

Mason:

So that's how it's been for me, but I'm glad that I was able to do this. I don't see any other questions in any of the chat. So I think we're going to go ahead and exit out.

Mason:

Thank you, everyone, again for attending. And we will see you ... Keep an eye out. If you haven't already followed the YouTube channel, follow the YouTube channel, get notifications. You'll get notified when we go live, when we add more stuff. We'll definitely be going live again here within a couple of weeks to a month. So we'll be back around and we hope to see you next time. See you all later.

