Conditional Text Generation by Fine Tuning Gretel GPT

Copyright © 2022

Augment machine learning datasets with synthetically generated text and labels using an open-source implementation of GPT-3.

Whether you are training a chatbot, creating a model to perform sentiment analysis on customer reviews, or even building the next voice model for Alexa, there will always be new commands or utterances that the model struggles to understand.

In this post, we will demonstrate how synthetic data creates a scalable, fast, and cost-effective alternative to manual techniques for data augmentation. To get started, we will fine-tune a Generative Pre-trained Transformer (GPT) on a financial intent classification dataset called `banking77`. With some clever encoding and a technique called conditional generation, we can then use the GPT model to generate new and unique annotated examples for any of the intent classes. 

Why not just collect more real-world data?

Using synthetically generated data is more privacy-preserving, scalable, and orders of magnitude less expensive than traditional approaches such as collecting and labeling data from customers, or manually generating and annotating data. In fact, Gartner estimates that by 2030, the use of synthetic data will completely overshadow the use of real-world data for AI.

Training setup

Let’s start by loading the `banking77` dataset into a pandas DataFrame. The dataset comprises 13,083 customer service queries labeled with 77 intents from the banking domain. There are two columns, “intent” and “text”. For example, for the user text query “What can I do if my card has not arrived after two weeks?”, the corresponding intent would be “card arrival”.

Sample of the banking77 dataset
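To make the two-column structure concrete, here is a toy DataFrame with the same schema (the rows are illustrative examples, not drawn from the real dataset):

```python
import pandas as pd

# Illustrative rows only; the real banking77 dataset has 13,083 labeled
# queries spanning 77 intents. Column names match the dataset.
df = pd.DataFrame({
    "intent": ["card_arrival", "card_arrival"],
    "text": [
        "What can I do if my card has not arrived after two weeks?",
        "How long does it take for my card to arrive?",
    ],
})
```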

Our goal is to generate coherent text examples for any of the 77 different intent labels in the training data. We’ll need to encode the intent labels and text into a single field to train the GPT model. To do this, we will concatenate the two fields with a special character as a separator.

For example:

[intent] + “,” + [text]
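As a sketch, the encoding step might look like this (the rows here are hypothetical examples, not real dataset entries):

```python
# Hypothetical encoding step: concatenate each label and query with a
# comma separator so GPT sees a single text field per training record.
rows = [
    ("card_arrival", "What can I do if my card has not arrived after two weeks?"),
    ("card_arrival", "Is there a way to track my card in the mail?"),
]
encoded = [f"{intent},{text}" for intent, text in rows]
# encoded[0] -> "card_arrival,What can I do if my card has not arrived after two weeks?"
```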

Pro-tip: This example relies on the keywords in the intent to prompt the model to generate related text. One way to improve this model could be to use BERT to extract relevant keywords for each class, which could be automatically added to the intent keywords. Let us know on our Slack community if you would be interested in seeing a more detailed example!

Model Configuration

In this experiment, we will use GPT-Neo as our base model via Gretel’s APIs, but you can run any GPT-based model from the HuggingFace model repository that will fit on an NVIDIA T4 GPU, including OpenAI’s `gpt2`, `distilgpt2`, and others.

config = {
    "schema_version": 1,
    "models": [
        {
            "gpt_x": {
                "data_source": "__",
                "pretrained_model": "EleutherAI/gpt-neo-125M",
                "batch_size": 4,
                "epochs": 1,
                "weight_decay": 0.1,
                "warmup_steps": 100,
                "lr_scheduler": "cosine",
                "learning_rate": 1e-6,
            }
        }
    ],
}

Train the synthetic model

Save the combined intents + texts to a single-column CSV, and submit it to the Gretel API service using the config above.
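The save step might be sketched as follows (the `encoded` records here are hypothetical; the real file would contain all 13,083 encoded rows):

```python
# Hypothetical sketch: write the encoded "intent,text" records out as a
# single-column CSV (one record per line, no header) for fine-tuning.
encoded = ["card_arrival,Where is my card?"]
with open("finetune.csv", "w") as f:
    for record in encoded:
        f.write(record + "\n")
```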

model = project.create_model_obj(model_config=config)
model.data_source = "finetune.csv"
model.name = "banking77-gptx"

Generate annotated data from the model

Finally, we can use our newly fine-tuned model to generate new text examples for a given intent by seeding model generation with examples from the class. Hint: We have found that prompting the model with ~25 examples of the class you wish to generate works well in practice. Once again, since GPT does not natively handle columnar data, we will need to encode our intents and texts using a comma as a separator, and an `\n` newline between records.

prompt = create_prompt(df=df, intent=INTENT, recs=25)
record_handler = model.create_record_handler_obj(
    params={"num_records": 1, "maximum_text_length": 1000, "prompt": prompt}
)

In this case, each `record` is prompted with 25 examples from the source data and then tasked with generating up to 1,000 additional tokens of similar examples. We create a record handler to submit the job to the Gretel APIs.
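The `create_prompt` helper is not shown in this post; a minimal sketch, assuming a DataFrame with “intent” and “text” columns, might look like:

```python
# Hypothetical helper: build a seed prompt from up to `recs` examples of
# the chosen intent, encoded as "intent,text" records separated by newlines.
def create_prompt(df, intent, recs=25):
    sample = df[df["intent"] == intent].head(recs)
    lines = (sample["intent"] + "," + sample["text"]).tolist()
    return "\n".join(lines) + "\n"
```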

Load GPT results into a tabular format

We can now pull down the data generated in the previous step. It is returned by the Gretel API as a text string, which we can now load into a DataFrame (tabular) format. To do this, we split the generated text into records on newlines, then separate each record back into its intent and text.

gptx_df = pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')
syn = get_intents(df=gptx_df, offset=len(prompt))
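Likewise, `get_intents` is left undefined in this post; one possible sketch that decodes the generated string back into labeled rows (the `offset` skips the seed prompt that the model echoes back):

```python
import pandas as pd

# Hypothetical helper: split generated output into (intent, text) rows.
def get_intents(df, offset=0):
    text = df.iloc[0, 0]  # generated output returned as one string
    rows = []
    for line in text[offset:].strip().split("\n"):
        if "," in line:
            intent, query = line.split(",", 1)
            rows.append({"intent": intent, "text": query})
    return pd.DataFrame(rows)
```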

There you have it! The final result and synthetic texts are coherent and human-like. Quite impressive! If you’d like to take this a step further, try generating additional keywords for each intent category, and try running with different GPT models to compare their outputs. Here is a snapshot of our final, synthetically generated and annotated data.

Sample of Gretel’s synthetically-generated and annotated dataset.


We are super excited to launch support for generative pre-trained transformer models at Gretel, and we have found the capabilities of these models to be really promising! If this is exciting to you, feel free to reach out or join our Slack community and share your ideas. If you’d like to explore other sample notebooks, check out Gretel Blueprints.