Synthesizing dialogs for better conversational AI
In the world of natural language processing (NLP) and conversational AI, having access to high-quality training data is essential. This data fuels the training of language models, chatbots, and virtual assistants, enabling them to understand and generate human-like text. However, obtaining real conversational data that is both sufficiently diverse and privacy-safe can be challenging.
This blog post shows how Gretel-GPT can be leveraged to create realistic synthetic dialogs, turn-taking data, and QA datasets, enriched with metadata tags or labels. These synthetic conversations serve a multitude of purposes, from training language models to enhancing conversational agents and providing valuable insights into transcribed conversations, all while preserving privacy.
Datasets
In this blog post, we demonstrate the capabilities of Gretel-GPT on the following conversational datasets (note: all datasets are publicly available; a loading sketch follows the list):
Daily-dialog: this dataset contains high-quality multi-turn dialogs that reflect everyday communication and cover a variety of daily-life topics. The dataset was manually labeled with communication intention and emotion information. The intention and emotion labels are defined as follows:
- act: a list of classification labels, with possible values including __dummy__ (0), inform (1), question (2), directive (3) and commissive (4).
- emotion: a list of classification labels, with possible values including no emotion (0), anger (1), disgust (2), fear (3), happiness (4), sadness (5) and surprise (6).
Commonsense-Dialogues: a crowdsourced dataset of ~11K dialogues grounded in social contexts that involve the use of commonsense. The social contexts were sourced from the train split of the SocialIQA dataset, a multiple-choice question-answering benchmark for social commonsense reasoning.
Counsel-chat: this dataset is a scrape of Counselchat.com's forum, an expert community and platform that helps counselors build their reputation and make meaningful contact with potential clients. Therapists respond to questions posed by clients, and users can like the responses they find most helpful. The dataset contains expert responses by licensed clinicians to questions posed by individuals.
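The snippet below is a minimal loading sketch. The Hub ids are assumptions based on the public copies of these datasets, and Commonsense-Dialogues is released as JSON files in its GitHub repository, so we read it with the standard json module:

```python
import json
from datasets import load_dataset

# Hub ids are assumptions; point them at the copies you actually use.
daily_dialog = load_dataset("daily_dialog", split="train")
counsel_chat = load_dataset("nbertagnolli/counsel-chat", split="train")

# Commonsense-Dialogues ships as JSON in its GitHub repo (file path assumed);
# the file is a dict keyed by dialog id.
with open("Commonsense-Dialogues/data/train.json") as f:
    commonsense = list(json.load(f).values())
```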
Gretel-GPT training
Getting Started: Installation
Follow along with our complete notebook in Colab or GitHub.
First, install dependencies.
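In a notebook cell, that looks like this (the exact package set is an assumption based on the workflow below):

```python
# Gretel SDK for training/generation, plus datasets and pandas for data handling.
%pip install -U gretel-client datasets pandas
```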
Paraphrasing preprocessing step
Gretel-GPT is a powerful tool for generating synthetic conversations. It can produce realistic dialogues that preserve the structure and turn order within a paragraph while sounding convincingly human. However, because Gretel-GPT operates on free text, a preprocessing step is necessary when your source data is in a structured format such as JSON objects or tabular data.
The examples below show how we converted each record of each dataset into a single paragraph for Gretel-GPT model training.
Daily-dialog
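Here is a minimal sketch of that conversion, assuming the `dialog`, `act`, and `emotion` fields of the Hugging Face copy; the line template is illustrative, not the only format that works:

```python
# Label ids as documented for daily_dialog.
ACT = {0: "__dummy__", 1: "inform", 2: "question", 3: "directive", 4: "commissive"}
EMOTION = {0: "no emotion", 1: "anger", 2: "disgust", 3: "fear",
           4: "happiness", 5: "sadness", 6: "surprise"}

def daily_dialog_to_paragraph(record: dict) -> str:
    """Flatten one daily_dialog record into a single labeled paragraph."""
    lines = []
    for i, (utt, act, emo) in enumerate(
        zip(record["dialog"], record["act"], record["emotion"])
    ):
        speaker = "A" if i % 2 == 0 else "B"  # speakers alternate turns
        lines.append(f"{speaker} ({ACT[act]}, {EMOTION[emo]}): {utt.strip()}")
    return "\n".join(lines)
```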
Commonsense-dialogues
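A sketch along the same lines, assuming the `context`, `speaker`, and `turns` fields of the dataset's JSON release:

```python
def commonsense_to_paragraph(record: dict) -> str:
    """Prefix the social context, then alternate the named speaker and a partner."""
    lines = [f"Context: {record['context']}"]
    for i, turn in enumerate(record["turns"]):
        who = record["speaker"] if i % 2 == 0 else "Other"  # partner name assumed
        lines.append(f"{who}: {turn.strip()}")
    return "\n".join(lines)
```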
Counsel-chat
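And for the question-answer pairs, assuming the `topic`, `questionTitle`, `questionText`, and `answerText` fields of the Hugging Face copy:

```python
def counsel_chat_to_paragraph(record: dict) -> str:
    """Pair one client question with one therapist answer, keeping the topic tag."""
    title = record.get("questionTitle") or ""
    question = record.get("questionText") or ""
    return (
        f"Topic: {record['topic']}\n"
        f"Question: {title} {question}\n"
        f"Answer: {record['answerText']}"
    )
```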
Configure the Gretel-GPT model
Now, we will configure the Gretel-GPT model. In this case, we use gretel/mpt-7b and set the number of epochs such that training takes approximately 60 minutes.
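A sketch of that configuration step; we assume Gretel's natural-language blueprint as the starting point and the Hugging Face id of the base model:

```python
from gretel_client.projects.models import read_model_config

# Start from Gretel's GPT blueprint and adjust it.
config = read_model_config("synthetics/natural-language")

gpt_params = config["models"][0]["gpt_x"]
gpt_params["pretrained_model"] = "gretelai/mpt-7b"  # Hugging Face id assumed
gpt_params["epochs"] = 3  # tune so training runs ~60 minutes on your data
```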
Training the Gretel-GPT model
We kick off Gretel-GPT model fine-tuning. Note that for each dataset, we limited the training data to 1,000 randomly sampled records.
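A minimal training sketch with the Gretel SDK; the project name is hypothetical, and `paragraphs` stands for the output of the preprocessing step above:

```python
import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

configure_session(api_key="prompt", cache="yes", validate=True)
project = create_or_get_unique_project(name="synthetic-dialogs")  # hypothetical name

# Limit training data to 1000 randomly sampled single-paragraph records.
train_df = pd.DataFrame({"text": paragraphs}).sample(n=1000, random_state=42)

model = project.create_model_obj(model_config=config, data_source=train_df)
model.submit_cloud()
poll(model)  # blocks until fine-tuning finishes
```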
Generate records
Generate as many records as you want. Note that you can run Gretel's record handlers in parallel to speed up generation.
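A generation sketch using the record handler; the parameter names follow Gretel's GPT generation options, but treat them as assumptions and check the current docs:

```python
# Ask the fine-tuned model for 100 synthetic paragraphs.
record_handler = model.create_record_handler_obj(
    params={"num_records": 100, "maximum_text_length": 1000}
)
record_handler.submit_cloud()
poll(record_handler)

# Results come back as a gzipped CSV artifact.
synthetic = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print(synthetic.head())
```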
Results
Here are some examples of synthetically generated paragraphs for each dataset. As the examples show, the Gretel-GPT model preserves the paragraph structure, which allows the generated text to be converted back into the original data-source format automatically.
daily_dialog
commonsense-dialogues
counsel-chat
Conclusion
In the rapidly evolving landscape of NLP and LLMs, clean, high-quality training data is essential yet hard to obtain. Gretel-GPT offers a solution by generating realistic synthetic conversations and datasets enriched with metadata, all while preserving privacy. These resources serve diverse purposes, from fine-tuning language models to improving conversational agents and analyzing transcribed dialogues. Here, we showed how you can leverage Gretel-GPT on structured conversational data to generate convincingly human text.