Synthesizing dialogs for better conversational AI

Create high-quality synthetic datasets of conversational dialogs with Gretel, safely fine-tuned on your private, sensitive data.

In the world of natural language processing (NLP) and conversational AI, having access to high-quality training data is essential. This data fuels the training of language models, chatbots, and virtual assistants, enabling them to understand and generate human-like text. However, obtaining real conversational data with the required diversity and privacy considerations can be challenging.

This blog post shows how Gretel-GPT can be leveraged to create realistic synthetic dialog, turn-taking, and question-answering datasets, enhanced with metadata tags or labels. These synthetic conversations serve a multitude of purposes, from training language models to enhancing conversational agents and providing valuable insights into transcribed conversations, all while preserving privacy.

Datasets 🗃️

In this blog post, we demonstrate the capabilities of Gretel-GPT on the following conversational datasets (note: all datasets are available on Hugging Face):

Daily-dialog: this dataset contains high-quality multi-turn dialogues that reflect everyday communication and cover a wide range of daily-life topics. Each dialogue was manually labeled with communication intention and emotion information. The intention and emotion labels are defined as follows (see the loading sketch after this list):

  • act: a list of classification labels, with possible values including __dummy__ (0), inform (1), question (2), directive (3) and commissive (4).
  • emotion: a list of classification labels, with possible values including no emotion (0), anger (1), disgust (2), fear (3), happiness (4), sadness (5) and surprise (6).
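
To make the label structure concrete, here is a minimal sketch of loading and inspecting the dataset with the Hugging Face datasets library (assumed to be installed separately; newer datasets releases may additionally require trust_remote_code=True for this dataset):

from datasets import load_dataset

# Load the DailyDialog training split from Hugging Face
dd = load_dataset("daily_dialog", split="train")

# Each record holds the utterances plus parallel lists of act and emotion labels
example = dd[0]
print(example["dialog"][0])   # first turn of the conversation
print(example["act"][0])      # e.g. 2 -> question
print(example["emotion"][0])  # e.g. 0 -> no emotion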

Commonsense-Dialogues: a crowdsourced dataset of ~11K dialogues grounded in social contexts that involve commonsense reasoning. The social contexts were sourced from the train split of the SocialIQA dataset, a multiple-choice question-answering benchmark for social commonsense reasoning.

Counsel-chat: this dataset is a scrape of the Counselchat.com forum, an expert community and platform that helps counselors build their reputation and make meaningful contact with potential clients. The dataset contains questions posed by individuals together with expert responses from licensed clinicians, and users can like the responses they find most helpful.

Gretel-GPT training 🚀

Getting Started: Installation

Follow along with our complete notebook in Colab or GitHub.

First, install dependencies.

!pip install -Uqq gretel-client
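
Next, import the helpers used throughout this post and configure your Gretel session. This is a minimal sketch; the import paths follow the gretel-client releases current at the time of writing and may differ in newer versions.

import numpy as np
import pandas as pd

from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config

# Log in to Gretel Cloud; "prompt" asks for your API key interactively
configure_session(api_key="prompt", cache="yes", validate=True)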

Paraphrasing preprocessing step

Gretel-GPT is a powerful tool for generating synthetic conversations. It can create realistic dialogues that maintain the structure and order within a paragraph while generating text that sounds convincingly human. However, a preprocessing step is necessary to work effectively with Gretel-GPT, especially when your source data is in a structured format such as JSON objects or tabular data.

The examples below show how we converted each record of each dataset into a single paragraph for Gretel-GPT model training.

Daily-dialog

Figure 1. Paraphrased daily dialog records for model training.
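
The exact template is the one shown in Figure 1. As an illustration only, a hypothetical formatting helper along these lines turns each daily_dialog record (loaded above) into one tagged paragraph and collects the results into the single-column data_source DataFrame used in the training code below; the function name, speaker tags, and label formatting are illustrative assumptions, not the notebook's exact code.

import pandas as pd

ACT_LABELS = ["__dummy__", "inform", "question", "directive", "commissive"]
EMOTION_LABELS = ["no emotion", "anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def paraphrase_daily_dialog(record: dict) -> str:
    """Flatten one daily_dialog record into a single tagged paragraph (hypothetical template)."""
    lines = []
    turns = zip(record["dialog"], record["act"], record["emotion"])
    for i, (turn, act, emotion) in enumerate(turns):
        speaker = "A" if i % 2 == 0 else "B"
        lines.append(f"{speaker} ({ACT_LABELS[act]}, {EMOTION_LABELS[emotion]}): {turn.strip()}")
    return "\n".join(lines)

# Single-column training DataFrame expected by the Gretel-GPT config below
data_source = pd.DataFrame({"text": [paraphrase_daily_dialog(r) for r in dd]})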

Commonsense-dialogues 

Figure 2. Paraphrased commonsense dialogues for model training.

Counsel-chat

Figure 3. Paraphrased counsel chat records for model training.

Configure the Gretel-GPT model

Now, we will configure the Gretel-GPT model. In this case, we will use gretelai/mpt-7b and set the number of training steps so that training runs for approximately 60 minutes.

def calc_steps(num_rows, batch_size, minutes=60) -> int:
    """Estimate the number of training steps that fit in the given time budget."""
    rows_per_minute = 102.0  # empirical throughput for this model
    epochs = (rows_per_minute * minutes) / num_rows
    return int(epochs * num_rows / batch_size)


def calc_text_length(df, max_tokens=2048) -> int:
    """Estimate the maximum generation length in tokens from the training text."""
    chars_per_token = 3  # rough approximation of characters per token
    max_token_length = int(df.str.len().max() / chars_per_token)
    # Round up to the nearest hundred, capped at the model's context limit
    return min(int(np.ceil(max_token_length / 100) * 100), max_tokens)


config = read_model_config("synthetics/natural-language")
config['models'][0]['gpt_x']['pretrained_model'] = "gretelai/mpt-7b"
config['models'][0]['gpt_x']['steps'] = calc_steps(len(data_source), config['models'][0]['gpt_x']['batch_size'])
config['models'][0]['gpt_x']['generate'] = {
    'num_records': 3,
    'num_beams': 5,
    'maximum_text_length': calc_text_length(data_source["text"]),
}
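
For intuition: with 1000 training records, the default 60-minute budget, and an assumed batch size of 4 (the actual batch size comes from the natural-language config template), calc_steps computes 102 × 60 / 1000 ≈ 6.12 epochs and therefore int(6.12 × 1000 / 4) = 1530 training steps.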

Training the Gretel-GPT model

We kick off Gretel-GPT model fine-tuning. Note that for each dataset, we limited the training data to 1,000 randomly sampled records.

# Create project
GRETEL_PROJECT = "project-gretel-gpt"
project = create_or_get_unique_project(name=GRETEL_PROJECT)


# Create and submit model
model = project.create_model_obj(model_config=config, data_source=data_source)
model.submit_cloud()
poll(model, verbose=False)

Generate records

Generate as many records as you want. Note that you can run multiple Gretel record handlers in parallel to speed up generation; see the sketch after the code below.

# configure the generation parameters
NUM_SYNTHETIC_RECORDS = 100
params = {
    "maximum_text_length": calc_text_length(data_source["text"]),
    "top_p": 0.95,
    "num_records": NUM_SYNTHETIC_RECORDS,
}


# generate synthetic records
record_handler = model.create_record_handler_obj(params=params)
record_handler.submit_cloud()
poll(record_handler, verbose=False)


# retrieve the synthetic data
synth_data = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
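
As noted above, you can submit several record handlers against the same model and poll them independently to generate records in parallel. Here is a minimal sketch; NUM_WORKERS and the concatenation step are illustrative additions, not part of the original notebook.

# submit several record handlers against the same trained model
NUM_WORKERS = 4
handlers = [model.create_record_handler_obj(params=params) for _ in range(NUM_WORKERS)]
for handler in handlers:
    handler.submit_cloud()

# wait for all jobs to finish, then concatenate their outputs
for handler in handlers:
    poll(handler, verbose=False)

synth_data = pd.concat(
    [pd.read_csv(h.get_artifact_link("data"), compression="gzip") for h in handlers],
    ignore_index=True,
)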

Results 🔎

Here are some examples of synthetically generated paragraphs for each dataset. As the examples show, the Gretel-GPT model preserves the paragraph structure, which allows the generated text to be converted back into the original data source format automatically (see the parsing sketch after these examples).

Daily-dialog

Figure 4. Synthetically generated daily dialog records for model training.

Commonsense-dialogues

Figure 5. Synthetically generated commonsense dialogues for model training.

Counsel-chat

Figure 6. Synthetically generated counsel chat records for model training.
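
Because the synthetic paragraphs follow the same template as the training data, they can be parsed back into structured records. Continuing the hypothetical daily-dialog template from the paraphrasing sketch above (the real template in the notebook may differ), and assuming the generated CSV keeps the single text column, a minimal parser could look like this:

import re

# matches lines like "A (inform, happiness): Hello there." from the hypothetical template
TURN_PATTERN = re.compile(r"^[AB] \((?P<act>[^,]+), (?P<emotion>[^)]+)\): (?P<turn>.+)$")

def parse_daily_dialog(paragraph: str) -> dict:
    """Invert the paraphrasing step: recover turns, act labels, and emotion labels."""
    dialog, acts, emotions = [], [], []
    for line in paragraph.splitlines():
        match = TURN_PATTERN.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the expected template
        act, emotion = match.group("act"), match.group("emotion")
        if act not in ACT_LABELS or emotion not in EMOTION_LABELS:
            continue  # drop turns with hallucinated labels
        dialog.append(match.group("turn"))
        acts.append(ACT_LABELS.index(act))
        emotions.append(EMOTION_LABELS.index(emotion))
    return {"dialog": dialog, "act": acts, "emotion": emotions}

# ACT_LABELS and EMOTION_LABELS are the lists defined in the paraphrasing sketch
records = [parse_daily_dialog(p) for p in synth_data["text"]]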

Conclusion ☝️

In the rapidly evolving landscape of NLP and LLMs, clean, high-quality training data is essential yet challenging to obtain. Gretel-GPT offers a solution by generating realistic synthetic conversations and datasets enriched with metadata, all while preserving privacy. These resources serve diverse purposes, from fine-tuning language models to improving conversational agents and analyzing transcribed dialogues. Here, we showed how you can leverage Gretel-GPT on structured conversational data to generate convincingly human text.