Conditional Text Generation with Gretel GPT
Video description
In this video, Alex walks through an example of how to use a large language model to create additional text that you can use to augment a machine learning dataset.
Transcription
Alex: Hey, today we're going to walk through an example of using a large language model to create additional text, which we can use to augment a machine learning dataset.
Alex: Go ahead and take a look here. First, let's start with the dataset. We're loading a dataset here called Banking77. If you hover over and look at it in the Hugging Face datasets repository, you see examples of a label, which could be card arrival, or card is late, or things like that, and the corresponding text that's part of this dataset.
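For reference, loading Banking77 in a notebook looks roughly like this; the "text" and "label" column names come from the dataset's published schema, and the intent-name mapping is just for readability:

```python
from datasets import load_dataset
import pandas as pd

# Load the Banking77 intent-classification dataset from the Hugging Face Hub.
dataset = load_dataset("banking77", split="train")
df = dataset.to_pandas()

# Labels are stored as integers; map them back to readable intent names
# such as "card_arrival".
label_names = dataset.features["label"].names
df["intent"] = df["label"].map(lambda i: label_names[i])

print(df[["intent", "text"]].head())
```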
Alex: So let's say we don't have enough text examples for a given intent or label, like what we have here. This is where we can use models like the one we're using today, trained on millions of documents, to create realistic additional examples we can add to that dataset in a much more cost-effective way than having a human generate and label those examples themselves.
Alex: So here we go ahead and log into Gretel and configure our project. Grabbing an API key is really all you need to do. Go back to your notebook and click Run All. It goes through and creates our project. Take a look at our model configuration. You can read more about these parameters on docs.gretel.ai, but fundamentally, we're selecting a language model to run on, and we're having it run for a single epoch with a relatively low learning rate. What we're trying to do is very lightly fine-tune this model on the new dataset that we're working with.
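A minimal sketch of this setup step, assuming the gretel_client Python SDK; the config keys and the base model name below are illustrative, and the exact parameter names are documented at docs.gretel.ai:

```python
import yaml
from gretel_client import configure_session
from gretel_client.projects import create_or_get_unique_project

# Authenticate with your Gretel API key and create (or reuse) a project.
configure_session(api_key="prompt", validate=True)
project = create_or_get_unique_project(name="banking77-gpt")

# Illustrative GPT fine-tuning configuration: a single epoch and a
# relatively low learning rate, so the pretrained model is only lightly
# adapted to our dataset. Key names here are assumptions; check
# docs.gretel.ai for the exact schema.
config = yaml.safe_load("""
schema_version: "1.0"
models:
  - gpt_x:
      data_source: __tmp__
      pretrained_model: "EleutherAI/gpt-neo-125M"  # placeholder base model
      epochs: 1
      learning_rate: 0.0002
""")
```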
Alex: The GPT models don't really understand tabular data, so what we're going to do here is combine these two fields, card arrival and "am I waiting on my card", into a single field. So here we've got the intent, a comma, and then the text that we're seeking to generate. This is what we're going to train the model on.
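Combining the two columns into a single training string takes a couple of lines of pandas; the comma separator below mirrors the format described above:

```python
# Join the intent label and the example text into one string per row,
# since the GPT model trains on free text rather than tabular columns.
SEPARATOR = ","
train_df = pd.DataFrame({"text": df["intent"] + SEPARATOR + df["text"]})

# Save the fine-tuning dataset to CSV for upload to Gretel.
train_df.to_csv("finetune.csv", index=False)
print(train_df["text"].iloc[0])  # e.g. "card_arrival,I am still waiting on my card?"
```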
Alex: In the next step we call the Gretel APIs. We save this fine-tuning dataset to a CSV file, and we tell Gretel Cloud, essentially, to fine-tune this language model on our particular dataset. What that's going to do is teach the language model about the structure that we're looking for. You can see it train there; it took about five minutes to run through. Next, we have our newly trained and fine-tuned model. What we need to do is prompt the model to generate more records of a certain class. This is one of the harder things to get language models to do, but it's incredibly powerful in practice when you get it working. What we're going to do is create a prompt, so essentially we're going to seed or bootstrap the model with some text that looks just like what we're looking for.
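Submitting the fine-tuning job might look like the following sketch; the create_model_obj / submit_cloud / poll workflow is based on the gretel_client project API and may differ across SDK versions:

```python
from gretel_client.helpers import poll

# Create a model object from the config above and point it at our CSV.
model = project.create_model_obj(model_config=config, data_source="finetune.csv")

# Ship the fine-tuning job to Gretel Cloud and wait for it to finish
# (about five minutes in the video).
model.submit_cloud()
poll(model)
```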
Alex: That way it can continue generating meaningful and realistic patterns that match that text. To do that, we're taking our original data frame, grabbing 25 records from it for the given intent, which is card arrival, and using that to create a single string. We can see that below. So we create the prompt and we submit it to our model. You can see this here: essentially, as the worker loads up the model, we prompt it with a bunch of examples of card arrival, and we ask it to generate more examples for us to work with. So it's completed here; it's prompted the model, and the model returns another text string. Just like we compressed two columns into one, now we need to separate those two columns back out. We have a function here called get_intents, and what it does is split on the intent and the separator and then pull out the examples.
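The prompting and post-processing steps might be sketched as follows. Building the prompt from 25 seed records is plain pandas; the record-handler parameters are an assumption about the generation API, and get_intents is an illustrative helper rather than an official Gretel function:

```python
INTENT = "card_arrival"

# Seed the prompt with 25 real examples of the intent we want more of,
# formatted exactly like the fine-tuning data.
seed = df[df["intent"] == INTENT].head(25)
prompt = "\n".join(seed["intent"] + SEPARATOR + seed["text"]) + "\n"

# Assumed generation call: prompt the fine-tuned model for more records.
record_handler = model.create_record_handler_obj(
    params={"prompt": prompt, "num_records": 25}
)
record_handler.submit_cloud()
poll(record_handler)

def get_intents(generated_text: str, intent: str = INTENT, sep: str = SEPARATOR) -> pd.DataFrame:
    """Split generated lines back into separate intent and text columns."""
    rows = []
    for line in generated_text.splitlines():
        if line.startswith(intent + sep):
            rows.append({"intent": intent, "text": line.split(sep, 1)[1].strip()})
    return pd.DataFrame(rows)
```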
Alex: So here we can see the newly generated synthetic text: "How do I track my card?", "Will I be able to see my new card from my sites?", "I'm still waiting on my card. Can you track it?" These are examples being generated by our model that are labeled with a given intent. Sometimes it takes a level of tuning here; sometimes you see some repetitive data being created by the transformer-based model. That's very normal, and we have parameters you can tune to minimize that or generate more examples. Hopefully this has been helpful.