How to Create High Quality Synthetic Data for Fine-Tuning LLMs

Gretel Navigator’s synthetic data generation outperformed OpenAI's GPT-4 by 25.6%, surpassed Llama3-70b by 48.1%, and exceeded human expert-curated data by 73.6%.

We're at a bit of an AI crossroads: publicly available data has largely been consumed by existing LLM training. For teams looking to fine-tune a model, or to augment pre-training data to improve task performance, access to diverse, labeled datasets is limited. This challenge is particularly acute for organizations seeking to adapt models for specific domains or tasks.

In this blog, we'll show you how Gretel Navigator, a versatile compound AI system, enables users to easily create high-quality synthetic data for AI/ML model training. Using an agent-based approach, including techniques such as co-teaching and evolutionary iteration, Navigator’s synthetic data can outperform its own underlying LLMs and even much larger models such as OpenAI’s GPT-4, as shown below. Navigator is the same service we use at Gretel to create open datasets like the popular Text-to-SQL dataset.

Can’t wait to jump in? Try our low-code synthetic data generation Streamlit app using Navigator.

Today we'll cover:

  1. The challenge of data scarcity in AI training
  2. An introduction to Gretel Navigator and its key features
  3. Step-by-step implementation of Navigator for generating synthetic question-answer pairs
  4. Keys to success and best practices
  5. Interactive Streamlit app for easy iteration and experimentation

The Data Scarcity Challenge in AI Training

The scarcity of high-quality, diverse datasets for training and fine-tuning LLMs presents a significant hurdle in the advancement of AI capabilities. This is particularly problematic when trying to adapt models for specific domains or tasks where specialized data is required.

Synthetic data has emerged as a promising technique for enhancing the performance of LLMs, whether during pre-training or when fine-tuning for a particular task. Teams behind many leading models, such as DBRX, Phi-3, Aya, and Gemini, are beginning to recognize synthetic data as a crucial element in developing more advanced and capable models:

Figure 1. Quotes from leading AI teams leveraging synthetic data.

Generate High Quality Synthetic Data with Gretel Navigator

Gretel Navigator is a compound AI system designed to generate high-quality synthetic data for training AI and LLMs. Navigator leverages agentic workflows, task planning, and a multitude of tools & models to iteratively review and improve synthetic data. As was called out in the AgentInstruct paper, “Agentic workflows can generate high quality data that surpasses the capabilities of the underlying LLMs, by using flows with reflection and iteration, where agents can look back at solutions, generate critiques and improve solutions”. Combined with inexpensive token generation, "agentic workflows for large language models (LLMs) open up intriguing new possibilities for training LLMs on synthetic data," as correctly predicted by Andrew Ng and now implemented in Navigator.

Navigator can be used to create a wide variety of synthetic data, including instruction fine-tuning datasets, RLHF/RLAIF datasets, retrieval-augmented generation (RAG) examples, natural language to SQL, and code generation data (see "Versatile Applications" below).

In this blog post, we'll demonstrate generating high-quality input/output pairs for LLM training.

Key features of Gretel Navigator include:

  1. Compound AI System: An approach where many tools work in tandem. (see BAIR blog)
  2. Evolutionary Algorithms: An iterative approach to improve data quality over multiple generations. (see Wizard LM 2's evolutionary improvements to synthetic data)
  3. Multi-LM Collaboration: Integration of multiple language models for co-teaching and self-improvement. (inspired by papers like Textbooks Are All You Need II and Meta's Self-Play)
  4. Agent-based Generation: As outlined in the AgentInstruct paper.
  5. Self-alignment: The model can fine-tune itself without human annotation. (see StarCoder2)
  6. Comprehensive Governance: Rigorous evaluation metrics and quality assurance at each step, controlling for bias, toxicity, accuracy, utility, and groundedness.
  7. Customizable Design: Easily tailored to specific needs, supporting various output formats.

Getting Started

Let's walk through an example of generating synthetic question-answer pairs based on a given context. For this demonstration, we'll be using the Databricks Dolly 15k dataset, specifically focusing on the closed question-answering task.

This dataset is particularly valuable for our example because:

  1. It provides ground truth data in the 'context' field.
  2. We can compare our synthetically generated results to the human expert-generated instructions and responses created by Databricks employees.
  3. It allows us to demonstrate Gretel Navigator's ability to generate high-quality synthetic data that matches or even exceeds human-curated datasets.
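If you prefer to assemble this seed data yourself rather than use the pre-filtered CSV shown in Step 2, a minimal sketch using the Hugging Face datasets library looks like the following. It assumes the standard databricks/databricks-dolly-15k schema, with a category column whose closed_qa value marks the closed question-answering task:

from datasets import load_dataset

# Load Dolly 15k and keep only the closed question-answering examples,
# all of which include a grounding 'context' passage.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
closed_qa = dolly.filter(lambda row: row["category"] == "closed_qa")

# Navigator will only be given the 'context' column, so the human-written
# instruction and response remain available as ground truth for comparison.
df = closed_qa.to_pandas()[["context", "instruction", "response"]]
print(len(df), "closed_qa examples")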

Step 1: Configuration

First, we set up the configuration using the InstructionResponseConfig class:

from navigator_helpers import InstructionResponseConfig

# GRETEL_API_KEY and the NAVIGATOR_TABULAR, NAVIGATOR_LLM, and CO_TEACH_LLMS
# model handles are assumed to be defined earlier in your environment.
config = InstructionResponseConfig(
    input_fields=["context"],  # Note: we are not providing access to human instructions/responses
    output_instruction_field="synthetic_instruction",
    output_response_field="synthetic_response",
    num_generations=3,
    population_size=5,
    mutation_rate=0.5,
    temperature=0.7,
    max_tokens=200,
    api_key=GRETEL_API_KEY,
    navigator_tabular=NAVIGATOR_TABULAR,
    navigator_llm=NAVIGATOR_LLM,
    co_teach_llms=CO_TEACH_LLMS,
    system_prompt="You are an expert in generating balanced, context-rich questions and comprehensive answers based on given contexts. Your goal is to create question-answer pairs that are informative, detailed when necessary, and understandable without prior knowledge, while not revealing the answer in the question.",
    instruction_format_prompt="Generate a specific and clear question directly related to a key point in the given context. The question should include enough background information to be understood without prior knowledge, while being answerable using only the information provided. Do not reveal the answer in the question. Ensure the question is focused and can be answered concisely if the information allows, but also accommodate for more detailed responses when appropriate.",
    instruction_mutation_prompt="Refine this question to include necessary context for understanding, without revealing the answer. Ensure it remains clear and can be comprehensively answered using only the information in the given context. Adjust the question to allow for a concise answer if possible, but also consider if a more detailed response is warranted based on the complexity of the topic.",
    instruction_quality_prompt="Evaluate the quality of this question based on its specificity, inclusion of necessary context, relevance to the original context, clarity for someone unfamiliar with the topic, and ability to be answered appropriately (either concisely or in detail) without revealing the answer:",
    instruction_complexity_target=0.7,
    response_format_prompt="Generate an informative answer to the given question. Use only the information provided in the original context. The response should be as concise as possible while fully addressing the question, including relevant context and explanations where necessary. For complex topics, provide a more detailed response. Ensure the answer provides enough background information to be understood by someone unfamiliar with the topic.",
    response_mutation_prompt="Refine this answer to balance conciseness with comprehensiveness. For straightforward questions, aim for brevity while ensuring accuracy. For complex topics, provide more detail and context. Add relevant information from the context as needed. Verify factual accuracy and correct any inaccuracies or missing key information. Ensure the answer can be understood without prior knowledge of the topic.",
    response_quality_prompt="Evaluate the quality of this answer based on its accuracy, appropriate level of detail (concise for simple questions, comprehensive for complex ones), relevance to the question, clarity for someone unfamiliar with the topic, inclusion of necessary background information, and whether it provides a satisfactory response using only the information from the given context:",
    response_complexity_target=0.8,
    use_aaa=True,
)

Understanding InstructionResponseConfig Parameters

The parameters in InstructionResponseConfig are inspired by concepts from evolutionary algorithms for improving language model outputs, as introduced in papers like Meta’s Self Play and Wizard LM. Let's break down each parameter:

  1. input_fields: List of column names from your input data to use as context. Example: ["context"].
  2. output_instruction_field: Name of the column for generated instructions (questions). Example: "synthetic_instruction".
  3. output_response_field: Name of the column for generated responses (answers). Example: "synthetic_response".
  4. num_generations: Number of evolutionary generations to run (integer, typically 1-10). More generations can lead to better results but increase computation time.
  5. population_size: Number of candidates in each generation (integer, typically 3-10). Larger populations increase diversity but also computation time.
  6. mutation_rate: Probability of mutation for each candidate (float between 0 and 1, typically 0.1-0.9). Higher rates increase diversity but may lead to less stable results.
  7. system_prompt: High-level instructions for the LLM about its role and task. This sets the overall context for the generation process.
  8. instruction_format_prompt: Specific instructions for generating the initial question. This guides the initial population creation.
  9. instruction_mutation_prompt: Guidelines for mutating/improving questions. This is crucial for the mutation step in the algorithm.
  10. instruction_quality_prompt: Criteria for evaluating question quality. This informs the selection process.
  11. instruction_complexity_target: Desired complexity level for questions (float between 0 and 1). This parameter helps guide the evolution towards a specific complexity level.
  12. response_format_prompt, response_mutation_prompt, response_quality_prompt, response_complexity_target: These parameters serve the same functions as their instruction counterparts but for the response generation process.
  13. use_aaa: Boolean to enable/disable the AI Align AI process for further refinement. This incorporates additional AI alignment techniques including co-teaching, suggestions, and self-teaching.

By adjusting these parameters, you can fine-tune the evolutionary process of synthetic data generation to meet your specific quality and diversity requirements. The interplay between these parameters allows for a sophisticated balance between exploration (generating diverse candidates) and exploitation (refining high-quality outputs) in the data generation process.
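To make the interplay between num_generations, population_size, and mutation_rate concrete, here is a minimal, illustrative sketch of a generic generate-score-select-mutate loop. This is not Navigator's actual implementation (which layers co-teaching, agentic planning, and governance on top); the llm and score callables are hypothetical stand-ins for model calls guided by the format, mutation, and quality prompts.

import random
from typing import Callable, List

def evolve_text(
    context: str,
    format_prompt: str,
    mutation_prompt: str,
    quality_prompt: str,
    num_generations: int,
    population_size: int,
    mutation_rate: float,
    llm: Callable[[str], str],           # hypothetical: prompt -> generated text
    score: Callable[[str, str], float],  # hypothetical: (quality_prompt, text) -> quality score
) -> str:
    """Illustrative evolutionary loop: generate, score, select, mutate."""
    # Initial population seeded from the format prompt and the source context.
    population: List[str] = [
        llm(f"{format_prompt}\n\nContext:\n{context}") for _ in range(population_size)
    ]
    for _ in range(num_generations):
        # Rank candidates against the quality criteria and keep the top half.
        ranked = sorted(population, key=lambda text: score(quality_prompt, text), reverse=True)
        survivors = ranked[: max(1, population_size // 2)]
        # Refill the population, mutating survivors with probability mutation_rate.
        offspring = []
        while len(survivors) + len(offspring) < population_size:
            parent = random.choice(survivors)
            if random.random() < mutation_rate:
                offspring.append(llm(f"{mutation_prompt}\n\nContext:\n{context}\n\nText:\n{parent}"))
            else:
                offspring.append(parent)
        population = survivors + offspring
    # Best candidate from the final generation.
    return max(population, key=lambda text: score(quality_prompt, text))

Navigator runs this kind of loop for both the instruction and the response fields, which is why each has its own set of format, mutation, quality, and complexity parameters.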

Step 2: Data Preparation and Generation

Next, we prepare our input data and generate synthetic data:

import pandas as pd
from navigator_helpers import TrainingDataSynthesizer

# Load the Databricks Dolly dataset
df = pd.read_csv(
    "https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/llm-training-data/databricks_dolly_instruction_set.csv",
    nrows=10,
)

synthesizer = TrainingDataSynthesizer(
    df,
    config,
    output_file="results.jsonl",
    verbose=True,
)
new_df = synthesizer.generate()

print(new_df.head().to_json(orient="records", indent=2))
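After generation, the synthesized pairs can be inspected either from the returned DataFrame or by reloading the JSONL output file. A minimal sketch, assuming results.jsonl retains the input context column alongside the synthetic_instruction and synthetic_response fields configured above:

import pandas as pd

# Reload the records written during generation.
results = pd.read_json("results.jsonl", lines=True)

# Spot-check the generated question-answer pairs against the source context.
for _, row in results.head(3).iterrows():
    print("CONTEXT:    ", row["context"][:200], "...")
    print("INSTRUCTION:", row["synthetic_instruction"])
    print("RESPONSE:   ", row["synthetic_response"])
    print("-" * 80)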

Benchmarking Synthetic Data Quality

To assess the quality of synthetic data generated by Gretel Navigator, we conducted a benchmark study comparing it to human expert-generated data and outputs from state-of-the-art LLMs.

Methodology

We used LLM-as-a-Judge, leveraging OpenAI's `gpt-4o` as an impartial judge (though we could have used any LLM as the judge), to compare the quality of data generated by the models below against human expert-curated data from the Databricks Dolly-15k dataset, across all 1,629 closed question-and-answer task examples.

Why use LLM Judges? Recent LLM-as-a-judge research demonstrated that “strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans”. This makes LLM-as-a-judge a scalable approach to approximating subject matter expertise. More importantly, in this case, it provides a way to impartially evaluate quality.

  1. Gretel Navigator 
  2. OpenAI's gpt-4 
  3. OpenAI's gpt-3.5-turbo 
  4. Llama3-70b  

The AI judge was asked to choose a winner and provide an explanation for each comparison. Link to AI-Feedback generation prompt.
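The exact evaluation prompt is the one linked above; the snippet below is only an illustrative sketch of the pairwise setup, using the OpenAI Python client with gpt-4o and a hypothetical judging prompt:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_pair(context: str, pair_a: str, pair_b: str) -> str:
    """Ask gpt-4o to pick the higher-quality question-answer pair (illustrative prompt only)."""
    prompt = (
        "You are an impartial judge. Given the context below, decide which "
        "question-answer pair is higher quality in terms of accuracy, clarity, "
        "and groundedness in the context.\n\n"
        f"Context:\n{context}\n\nPair A:\n{pair_a}\n\nPair B:\n{pair_b}\n\n"
        "Reply with 'A' or 'B' followed by a one-sentence explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content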

Comparing AI vs. Human expert curated data

Figure 2. Comparing compound AI generated datasets vs. human expert curated data.

Key Results

  • Gretel Navigator exceeds human expert-generated data 73.6% of the time
    • This is huge: we are clearly able to generate high-quality data that outperforms expert-generated data. But we can’t look at this result in a vacuum. How does it compare to other tools out there?
  • Compared to other solutions, Gretel Navigator outperforms:
    • GPT-4 by 25.6% in synthetic data generation comparisons 🚀
    • GPT-3.5-turbo by a staggering 97.3%
    • Llama3-70b by 48.1% in the same benchmarks

Performance Comparison

Figure 3. Comparison of Navigator performance to current frontier models.

Versatile Applications

Navigator can take text and tabular formats as inputs and excels at generating high-quality synthetic data for:

  • Instruction fine-tuning
  • RLHF/RLAIF dataset creation
  • Retrieval-augmented generation (RAG)
  • Natural language to SQL
  • Code generation

How It Works

  • Employs multiple specialized LLMs, each <10B parameters
  • Orchestrates 20-30 LLM calls per generation via evolutionary and co-teaching approaches
  • Balances longer processing time with superior output quality

Why It Matters

  • Demonstrates intelligent agent coordination outperforms raw model size
  • Generates synthetic data surpassing human experts and leading LLMs
  • Ideal for tasks requiring premium training datasets

Gretel Navigator proves that smart orchestration of smaller, specialized models can outperform even the largest language models in synthetic data generation across a variety of crucial AI tasks.

Keys to Success

To get the most out of Gretel Navigator, consider these best practices:

  1. Know Your Task: Clearly define the objective of your synthetic data generation. In our case, we're creating question-answer pairs based on provided contexts.
  2. Refine Your Prompts: The quality of your output heavily depends on the prompts you provide. Iterate on these to achieve the desired results:
    • Format Prompts: Guide the initial generation of text. For questions, you might specify "Generate a clear, concise question based on the key information in the context."
    • Mutation Prompts: Direct how the text should be varied or improved. For example, "Rephrase the question to be more specific while maintaining clarity."
    • Quality Prompts: Define what constitutes high-quality output. For instance, "Evaluate the question's relevance, clarity, and potential to elicit an informative response."
    • Complexity Targets: Balance simplicity and depth. A target of 0.7 might aim for moderately complex output that's still accessible.
  3. Iterate Quickly with our Streamlit App in Hugging Face Spaces: Use the provided space to experiment rapidly with your prompts and configurations.
  4. Compare with Ground Truth: Since the dataset we are using provides ground truth data (seed data for generation), compare your synthetic output with the seed data to ensure that the generated responses are on topic and relevant (see the sketch after this list).
  5. Scale Thoughtfully: Once you've refined your approach in Streamlit, export the code to run in your own environment for larger-scale data generation.
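As a minimal sketch of the ground-truth comparison in point 4 above, assuming the seed DataFrame keeps Dolly's instruction and response columns and that row order is preserved in the generated output:

# Side-by-side spot check of human-written vs. synthetic pairs.
for i in range(min(3, len(new_df))):
    print("HUMAN Q:    ", df.loc[i, "instruction"])
    print("SYNTHETIC Q:", new_df.loc[i, "synthetic_instruction"])
    print("HUMAN A:    ", df.loc[i, "response"])
    print("SYNTHETIC A:", new_df.loc[i, "synthetic_response"])
    print("-" * 80)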

Your Experimentation Playground

Gretel provides a space that simplifies the process of iterating on prompts and generating diverse training data. Here's how you can try it out:

  1. Access the space at https://huggingface.co/spaces/gretelai/enhance-ai-training-data
  2. Upload your seed data (in this case, the Dolly dataset) in CSV or JSON format, or link to your Hugging Face dataset
  3. Experiment with different prompts and configuration settings
  4. Inspect the logs to understand the generation and iteration process
  5. Compare the synthetic output with the original Dolly dataset to assess quality
  6. Export the SDK code when you're satisfied with the results

This interactive environment allows you to quickly refine your approach before scaling up your synthetic data generation.

Conclusion

Gretel Navigator offers a versatile and powerful solution to the challenge of generating high-quality, domain-specific data for AI training. Whether you need synthetic text, instruction-response pairs, or conversational step-by-step data, Gretel Navigator can adapt to your needs. 

By leveraging advanced techniques inspired by recent research and providing an intuitive interface for experimentation, it enables researchers and developers to create the large, diverse, and high-quality datasets needed to train and fine-tune advanced AI models.

In our example with the Databricks Dolly 15k dataset, we've demonstrated how Gretel Navigator can generate synthetic question-answer pairs that outperform human expert-created data, and even beat SOTA models like GPT-4 by 25.6%. This capability is invaluable for creating or augmenting training datasets for a wide range of LLM applications.

Remember, the key to success lies in clearly defining your objectives, iteratively refining your prompts, and leveraging the provided tools to experiment and scale effectively. With Gretel Navigator, you're well-equipped to tackle the data scarcity challenge and push the boundaries of AI capabilities in your specific domain.

If you'd like to work with us on creating high-quality synthetic data for your use case, send us a note.