Conditional Text Generation by Fine-Tuning Gretel GPT

Augment machine learning datasets with synthetically generated text and labels using an open-source implementation of GPT-3.

Published by

Alex Watson

Updated

May 26, 2022

Updated on June 15, 2023 to reflect updates to Gretel GPT.

Whether you are training a chatbot, creating a model to perform sentiment analysis on customer reviews, or even building the next voice model for Alexa, there will always be new commands or utterances that the model struggles to understand.

In this post, we will demonstrate how synthetic data creates a scalable, fast, and cost-effective alternative to manual techniques for data augmentation. To get started, we will fine-tune a Generative Pre-trained Transformer (GPT) model on a financial intent classification dataset called `banking77`. With some clever encoding and a technique called conditional generation, we can then use the GPT model to generate new and unique annotated examples for any of the intent classes.

Why not just collect more real-world data?

Using synthetically generated data is more privacy-preserving, scalable, and orders of magnitude less expensive than previous approaches of collecting and labeling data from customers, or manually generating and annotating data. In fact, Gartner estimates that by 2030, the use of synthetic data will completely overshadow the use of real-world data for AI.

Training setup

Let’s start by loading the `banking77` dataset into a pandas DataFrame. The dataset comprises 13,083 customer service queries labeled with 77 intents from the banking domain. There are two columns, “intent” and “text”. For example, for the user text query “What can I do if my card has not arrived after two weeks?” the corresponding intent would be “card arrival”.

Our goal is to generate coherent text examples for any of the 77 different intent labels in the training data. We’ll need to encode the intent labels and text into a single field to train the GPT model. To do this, we will concatenate the two fields with a special character as a separator.

For example:

[intent] + “,” + [text]
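
The encoding above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's own code: the helper names `encode_record` and `decode_record` are hypothetical, and the sketch assumes that intent labels never contain a comma, so the first comma on a line always marks the boundary between intent and text.

```python
from typing import Tuple

SEPARATOR = ","  # the separator character used in this post


def encode_record(intent: str, text: str) -> str:
    """Join an intent label and its text into a single training line."""
    if SEPARATOR in intent or "\n" in intent:
        raise ValueError(f"intent must not contain the separator or a newline: {intent!r}")
    # Collapse runs of whitespace (including newlines) so each record stays on one line.
    return f"{intent}{SEPARATOR}{' '.join(text.split())}"


def decode_record(line: str) -> Tuple[str, str]:
    """Split an encoded line back into (intent, text) at the first separator."""
    intent, sep, text = line.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"no separator found in {line!r}")
    return intent, text
```

Splitting on the first separator only matters because the query text itself often contains commas. With a pandas DataFrame, the same encoding can be written as `df["intent"] + "," + df["text"]`.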

Pro-tip: This example relies on the keywords in the intent to prompt the model to generate related text. One way to improve this model could be to use BERT to extract relevant keywords for each class, which could be automatically added to the intent keywords.

Model Configuration

In this experiment, we will use MPT-7B as our base model via Gretel.ai’s APIs. Check out our Gretel GPT docs for a list of models we support. We'll use our default natural language configuration, though you can modify parameters as you need.

config = read_model_config("synthetics/natural-language")
config['models'][0]['gpt_x']['pretrained_model'] = "gretelai/mpt-7b"
config['models'][0]['gpt_x']['epochs'] = calc_epochs(len(df))
config['models'][0]['gpt_x']['generate'] = {'num_records': 1}
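
The `calc_epochs` helper is not shown in this post. One plausible shape for such a helper, offered purely as an assumption, is a time-budget heuristic: given a measured training throughput, pick the number of epochs that fits in a fixed number of minutes. The function name `estimate_epochs` and both of its extra parameters are hypothetical, and the throughput must come from your own measurements.

```python
def estimate_epochs(num_rows: int, rows_per_minute: float, minutes: float) -> float:
    """Estimate how many epochs fit in a time budget (hypothetical heuristic).

    num_rows: number of training records, e.g. len(df)
    rows_per_minute: training throughput you have measured yourself (assumption)
    minutes: the training time you are willing to spend
    """
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    if rows_per_minute <= 0 or minutes <= 0:
        raise ValueError("rows_per_minute and minutes must be positive")
    # Total rows processed in the budget, divided by the rows in one epoch.
    return round((rows_per_minute * minutes) / num_rows, 2)
```

A heuristic like this keeps training time roughly constant as the dataset grows, at the cost of fewer passes over larger datasets.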

Train the synthetic model

Save the combined intents and texts to a single-column CSV, and submit it to the Gretel API service using the config above.

model = project.create_model_obj(model_config=config, data_source=df)
model.name = f"{PROJECT}-mpt-7b"
model.submit_cloud()

poll(model)

Generate annotated data from the model

Finally, we can use our newly fine-tuned model to generate new text examples for a given intent by seeding model generation with examples from the class. Hint: In practice, we have found that prompting the model with ~25 examples of the class you wish to generate works well. Once again, since GPT does not natively handle columnar data, we need to encode our intents and texts using a comma as a separator, with a newline (`\n`) between records.

prompt = create_prompt(df=df, intent=INTENT, recs=25)
 
record_handler = model.create_record_handler_obj(
   params={"num_records": 1, "maximum_text_length": 1000, "prompt": prompt}
)
record_handler.submit_cloud()
poll(record_handler)

In this case, each `record` is prompted with 25 examples from the source data and then tasked to create up to 1000 additional tokens of similar examples. We create a record handler to submit the job to the Gretel APIs.
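
The `create_prompt` helper is not shown in this post. A minimal sketch of how such a prompt could be assembled follows, assuming the source data is available as `(intent, text)` pairs. The function name `build_prompt`, its `seed` parameter, and the choice to end the prompt with a newline are all assumptions made for illustration.

```python
import random
from typing import Optional, Sequence, Tuple


def build_prompt(
    records: Sequence[Tuple[str, str]],
    intent: str,
    recs: int = 25,
    seed: Optional[int] = None,
) -> str:
    """Build a few-shot prompt of up to `recs` encoded examples for one intent."""
    texts = [text for label, text in records if label == intent]
    if not texts:
        raise ValueError(f"no examples found for intent {intent!r}")
    # Sample without replacement so the prompt contains no repeated source rows.
    rng = random.Random(seed)
    chosen = rng.sample(texts, k=min(recs, len(texts)))
    lines = [f"{intent},{text}" for text in chosen]
    # End with a newline so the model starts its output on a fresh record.
    return "\n".join(lines) + "\n"
```

Sampling a different subset for each generation request is one way to vary the prompt and nudge the model toward more diverse outputs.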

Load GPT results into a tabular format

We can now pull down the data generated in the previous step. The Gretel API returns it as a text string, which we can load into a tabular DataFrame. To do this, split the generated text on newlines, then separate each line into its intent and text.

gptx_df = pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')
 
syn = get_intents(df=gptx_df, offset=len(prompt))
syn
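
The `get_intents` helper is not shown in this post. The parsing step it performs might look roughly like the sketch below, which works on a single generated string. It assumes the generated text begins with an echo of the prompt (which is what passing `offset=len(prompt)` suggests) and that records follow the `intent,text` encoding. The function name `parse_generated` and the `valid_intents` filter are hypothetical.

```python
from typing import List, Set, Tuple


def parse_generated(generated: str, offset: int, valid_intents: Set[str]) -> List[Tuple[str, str]]:
    """Parse model output into (intent, text) rows, skipping the prompt and malformed lines."""
    rows = []
    # Drop the first `offset` characters, assumed to hold the echoed prompt.
    for line in generated[offset:].split("\n"):
        # Split at the first comma only, since the text may contain commas.
        intent, sep, text = line.partition(",")
        intent, text = intent.strip(), text.strip()
        # Keep only well-formed lines with a known intent and non-empty text.
        if sep and intent in valid_intents and text:
            rows.append((intent, text))
    return rows
```

The resulting rows can be loaded with `pd.DataFrame(rows, columns=["intent", "text"])`. Note that the last generated line may be cut off when the model reaches its length limit; this sketch does not try to detect that case.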

There you have it! The final result and synthetic texts are coherent and human-like. Quite impressive! If you’d like to take this a step further, try generating additional keywords for each intent category, and try running with different GPT models to compare their outputs. Here is a snapshot of our final, synthetically generated and annotated data.

*Sample of Gretel’s synthetically generated and annotated dataset.*

Conclusion

We are super excited to launch support for generative pre-trained transformer models at Gretel, and we have found the capabilities of these models to be really promising! If this is exciting to you, feel free to reach out at hi@gretel.ai or join our Discord community and share your ideas. If you’d like to explore other sample notebooks, check out Gretel Blueprints.