How to Create Synthetic Data at High Quality for Fine-Tuning LLMs
We're at a bit of an AI crossroads, where publicly available data has largely been consumed by existing LLM training. When looking to fine-tune, or even add to pre-training data to improve task performance, access to diverse, labeled datasets is limited. This challenge is particularly acute for organizations seeking to adapt models for specific domains or tasks.
In this blog, we'll show you how Gretel Navigator, a versatile compound AI system, enables users to easily create high-quality synthetic data for AI/ML model training. Utilizing an agent-based approach, including techniques such as co-teaching and evolutionary iteration, Navigator’s synthetic data can outperform its own underlying LLMs and even much larger models such as OpenAI’s GPT-4 as shown below. For a quick intro, Navigator is the same service that we use at Gretel to create open datasets like the popular Text-to-SQL dataset.
Can’t wait to jump in? Try our low-code synthetic data generation Streamlit app using Navigator.
Today we'll cover:
- The challenge of data scarcity in AI training
- An introduction to Gretel Navigator and its key features
- Step-by-step implementation of Navigator for generating synthetic question-answer pairs
- Keys to success and best practices
- Interactive Streamlit app for easy iteration and experimentation
The Data Scarcity Challenge in AI Training and Fine-Tuning LLMs
The scarcity of high-quality, diverse datasets for training and fine-tuning LLMs presents a significant hurdle in the advancement of AI capabilities. This is particularly problematic when trying to adapt models for specific domains or tasks where specialized data is required.
Synthetic data has emerged as a promising technique for enhancing the performance of LLMs, whether in pre-training or in fine-tuning for a particular task. Teams behind many leading models, such as DBRX, Phi-3, Aya, and Gemini, are beginning to recognize synthetic data as a crucial element in developing more advanced and capable models.
Generate High Quality Synthetic Data for Fine-Tuning LLMs with Gretel Navigator
Gretel Navigator is a compound AI system designed to generate high-quality synthetic data for training AI and LLMs. Navigator leverages agentic workflows, task planning, and a multitude of tools & models to iteratively review and improve synthetic data. As was called out in the AgentInstruct paper, “Agentic workflows can generate high quality data that surpasses the capabilities of the underlying LLMs, by using flows with reflection and iteration, where agents can look back at solutions, generate critiques and improve solutions”. Combined with inexpensive token generation, "agentic workflows for large language models (LLMs) open up intriguing new possibilities for training LLMs on synthetic data," as correctly predicted by Andrew Ng and now implemented in Navigator.
Navigator can be used to create a wide variety of synthetic data, including:
- Synthetically generated text
- Instruction data or question-answer truth pairs that are popular for fine-tuning LLMs
- Step-by-step evaluation data
- Question-answer pairs generated from documents
In this blog post, we'll demonstrate generating high-quality input/output pairs for LLM training.
Key features of Gretel Navigator include:
- Compound AI System: An approach where many tools work in tandem (see the BAIR blog).
- Evolutionary Algorithms: An iterative approach to improve data quality over multiple generations (see WizardLM 2's evolutionary improvements to synthetic data).
- Multi-LM Collaboration: Integration of multiple language models for co-teaching and self-improvement (inspired by papers like Textbooks Are All You Need II and Meta's Self-Play).
- Agent-based Generation: Data generation through agentic flows, as outlined in the AgentInstruct paper.
- Self-alignment: An approach in which the model can fine-tune itself without human annotation (see StarCoder2).
- Comprehensive Governance: Rigorous evaluation metrics and quality assurance at each step, controlling for bias, toxicity, accuracy, utility, and groundedness.
- Customizable Design: Easily tailored to specific needs, supporting various output formats.
Getting Started
Let's walk through an example of generating synthetic question-answer pairs based on a given context. For this demonstration, we'll be using the Databricks Dolly 15k dataset, specifically focusing on the closed question-answering task.
This dataset is particularly valuable for our example because:
- It provides ground truth data in the 'context' field.
- We can compare our synthetically generated results to the human expert-generated instructions and responses created by Databricks employees.
- It allows us to demonstrate Gretel Navigator's ability to generate high-quality synthetic data that matches or even exceeds human-curated datasets.
Step 1: Configuration
First, we set up the configuration using the InstructionResponseConfig class:
Understanding InstructionResponseConfig Parameters
The parameters in InstructionResponseConfig are inspired by concepts from evolutionary algorithms for improving language model outputs, as introduced in papers like Meta's Self-Play and WizardLM. Let's break down each parameter:
- input_fields: List of column names from your input data to use as context. Example: ["context"].
- output_instruction_field: Name of the column for generated instructions (questions). Example: "synthetic_instruction".
- output_response_field: Name of the column for generated responses (answers). Example: "synthetic_response".
- num_generations: Number of evolutionary generations to run (integer, typically 1-10). More generations can lead to better results but increase computation time.
- population_size: Number of candidates in each generation (integer, typically 3-10). Larger populations increase diversity but also computation time.
- mutation_rate: Probability of mutation for each candidate (float between 0 and 1, typically 0.1-0.9). Higher rates increase diversity but may lead to less stable results.
- system_prompt: High-level instructions for the LLM about its role and task. This sets the overall context for the generation process.
- instruction_format_prompt: Specific instructions for generating the initial question. This guides the initial population creation.
- instruction_mutation_prompt: Guidelines for mutating/improving questions. This is crucial for the mutation step in the algorithm.
- instruction_quality_prompt: Criteria for evaluating question quality. This informs the selection process.
- instruction_complexity_target: Desired complexity level for questions (float between 0 and 1). This parameter helps guide the evolution towards a specific complexity level.
- response_format_prompt, response_mutation_prompt, response_quality_prompt, response_complexity_target: These parameters serve the same functions as their instruction counterparts but for the response generation process.
- use_aaa: Boolean to enable/disable the AI Align AI process for further refinement. This incorporates additional AI alignment techniques including co-teaching, suggestions, and self-teaching.
By adjusting these parameters, you can fine-tune the evolutionary process of synthetic data generation to meet your specific quality and diversity requirements. The interplay between these parameters allows for a sophisticated balance between exploration (generating diverse candidates) and exploitation (refining high-quality outputs) in the data generation process.
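To make the numeric ranges above concrete, here is a minimal sketch of a configuration object that validates them. The class `EvolutionConfigSketch` and its `candidate_budget` helper are hypothetical stand-ins written for illustration; they are not the Gretel SDK's InstructionResponseConfig, and the default values are assumptions chosen from the "typical" ranges listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvolutionConfigSketch:
    """Hypothetical stand-in mirroring the documented numeric parameters.

    This is NOT the Gretel SDK class; it only illustrates the fields and
    the ranges described in the text.
    """

    input_fields: List[str]
    output_instruction_field: str = "synthetic_instruction"
    output_response_field: str = "synthetic_response"
    num_generations: int = 3          # assumed default, within "typically 1-10"
    population_size: int = 5          # assumed default, within "typically 3-10"
    mutation_rate: float = 0.5        # assumed default, within 0-1
    instruction_complexity_target: float = 0.5
    response_complexity_target: float = 0.5
    use_aaa: bool = True

    def __post_init__(self) -> None:
        if not self.input_fields:
            raise ValueError("input_fields must name at least one column")
        if self.num_generations < 1 or self.population_size < 1:
            raise ValueError("num_generations and population_size must be >= 1")
        # All three of these are documented as floats between 0 and 1.
        for name in (
            "mutation_rate",
            "instruction_complexity_target",
            "response_complexity_target",
        ):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def candidate_budget(self) -> int:
        """Rough count of candidates considered: generations x population.

        An illustrative estimate of how cost grows, not a Navigator metric.
        """
        return self.num_generations * self.population_size


config = EvolutionConfigSketch(
    input_fields=["context"], num_generations=3, population_size=5
)
print(config.candidate_budget())
```

The product of generations and population size is a simple way to see why raising either parameter increases computation time: both multiply the number of candidates that must be generated and scored.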
Step 2: Data Preparation and Generation
Next, we prepare our input data and generate synthetic data:
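As a minimal sketch of the preparation half of this step, the snippet below filters records down to the closed question-answering task and keeps only rows with a usable context. It assumes the public Dolly 15k field names (`instruction`, `context`, `response`, `category`) and the `closed_qa` category label; the three rows are made-up placeholders rather than real dataset entries, and `select_closed_qa` is a hypothetical helper, not part of any SDK. The generation call itself is not shown.

```python
from typing import Dict, List

# Illustrative rows only (NOT real Dolly data). Field names follow the
# public Dolly 15k schema: instruction, context, response, category.
dolly_rows: List[Dict[str, str]] = [
    {
        "instruction": "When did the example bridge open?",
        "context": "The example bridge opened to traffic in 1950.",
        "response": "It opened in 1950.",
        "category": "closed_qa",
    },
    {
        "instruction": "Write a haiku about rain.",
        "context": "",
        "response": "Soft rain on the roof.",
        "category": "creative_writing",
    },
    {
        "instruction": "Who wrote the report?",
        "context": "   ",  # whitespace only: nothing to ground a question on
        "response": "Unknown.",
        "category": "closed_qa",
    },
]


def select_closed_qa(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep closed-QA rows that have a non-empty context field."""
    return [
        row
        for row in rows
        if row["category"] == "closed_qa" and row["context"].strip()
    ]


seed_rows = select_closed_qa(dolly_rows)
print(len(seed_rows))  # prints 1
```

Dropping rows with an empty context matters here because the context is the only input field the generator sees; a blank context gives it nothing to ground a question or answer on.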
Benchmarking Synthetic Data Quality
To assess the quality of synthetic data generated by Gretel Navigator, we conducted a benchmark study comparing it to human expert-generated data and outputs from state-of-the-art LLMs.
Methodology
We used the LLM-as-a-judge approach, with OpenAI's `gpt-4o` acting as an impartial judge (any capable LLM could fill this role). The judge compared data generated by each of the models listed below against the human expert-curated data in the Databricks Dolly-15k dataset, across all 1629 closed question-answering examples.
Why use LLM Judges? Recent LLM-as-a-judge research demonstrated that “strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans”. This makes LLM-as-a-judge a scalable approach to approximating subject matter expertise. More importantly, in this case, it provides a way to impartially evaluate quality.
- Gretel Navigator
- OpenAI's gpt-4
- OpenAI's gpt-3.5-turbo
- Llama3-70b
The AI judge was asked to choose a winner and provide an explanation for each comparison. Link to AI-Feedback generation prompt.
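The per-comparison verdicts can be rolled up into a single win rate. The sketch below assumes each verdict has already been reduced to a label such as "synthetic" or "human"; the label names, the `win_rate` helper, and the sample verdicts are illustrative assumptions, not the benchmark's actual code or data.

```python
from collections import Counter
from typing import Iterable


def win_rate(verdicts: Iterable[str], candidate: str = "synthetic") -> float:
    """Fraction of comparisons the judge awarded to `candidate`.

    Any other label (for example "human" or "tie") counts against the
    candidate, because it stays in the denominator.
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return counts[candidate] / total


# Made-up verdicts for illustration only.
sample_verdicts = ["synthetic", "human", "synthetic", "synthetic"]
print(f"{win_rate(sample_verdicts):.1%}")  # prints 75.0%
```

Keeping the judge's written explanation alongside each label is useful in practice, since it lets you spot-check whether the verdicts are being decided on substance or on surface features such as length.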
Comparing AI vs. Human expert curated data
Key Results
- Gretel Navigator's output was judged better than human expert-generated data 73.6% of the time.
- This is a strong result: the generated data clearly outperforms expert-generated data. But it should not be read in a vacuum. How does it compare to other tools?
- Compared with other solutions, Gretel Navigator outperforms:
  - GPT-4 by 25.6% in synthetic data generation comparisons 🚀
  - GPT-3.5-turbo by a staggering 97.3%
  - Llama3-70b by 48.1% in the same benchmarks
Performance Comparison
Versatile Applications
Navigator can take text and tabular data formats as inputs, and excels in generating high-quality synthetic data for:
- Instruction fine-tuning
- RLHF/RLAIF dataset creation
- Retrieval-augmented generation (RAG)
- Natural language to SQL
- Code generation
How It Works
- Employs multiple specialized LLMs, each <10B parameters
- Orchestrates 20-30 LLM calls per generation via evolutionary and co-teaching approaches
- Balances longer processing time with superior output quality
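The evolutionary part of this process can be sketched as a generic mutate, score, and select loop. This is a toy illustration under stated assumptions, not Navigator's implementation: `toy_mutate` and `toy_score` are hypothetical stand-ins for what would be LLM calls (a rewrite prompt and a quality judgment), and the word-count target is an arbitrary proxy for a complexity target.

```python
import random
from typing import Callable, List

TARGET_WORDS = 8  # arbitrary stand-in for a complexity target


def toy_mutate(question: str) -> str:
    """Stand-in for an LLM rewrite: appends a qualifying phrase."""
    return question.rstrip("?") + " in the passage?"


def toy_score(question: str) -> float:
    """Stand-in for an LLM quality judgment: closer to the target is better."""
    return -abs(len(question.split()) - TARGET_WORDS)


def evolve(
    seeds: List[str],
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    num_generations: int,
    population_size: int,
    mutation_rate: float,
    rng: random.Random,
) -> str:
    """Run a simple elitist evolutionary loop and return the best candidate."""
    population = list(seeds)[:population_size]
    for _ in range(num_generations):
        # Mutate each candidate with probability `mutation_rate`.
        offspring = [
            mutate(c) if rng.random() < mutation_rate else c for c in population
        ]
        # Pool parents and offspring, drop duplicates, keep the top scorers.
        pool = list(dict.fromkeys(population + offspring))
        pool.sort(key=score, reverse=True)
        population = pool[:population_size]
    return max(population, key=score)


seeds = ["Who founded it?", "When was the company founded?"]
best = evolve(
    seeds,
    toy_mutate,
    toy_score,
    num_generations=3,
    population_size=4,
    mutation_rate=0.8,
    rng=random.Random(0),
)
print(best)
```

Because parents stay in the selection pool, the best score never decreases from one generation to the next. That is one common design choice (elitism); a real system could equally select from offspring only, at the cost of occasionally losing a good candidate.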
Why It Matters
- Demonstrates intelligent agent coordination outperforms raw model size
- Generates synthetic data surpassing human experts and leading LLMs
- Ideal for tasks requiring premium training datasets
Gretel Navigator proves that smart orchestration of smaller, specialized models can outperform even the largest language models in synthetic data generation across a variety of crucial AI tasks.
Fine-Tuning LLMs > Keys to Success
To get the most out of Gretel Navigator, consider these best practices:
- Know Your Task: Clearly define the objective of your synthetic data generation. In our case, we're creating question-answer pairs based on provided contexts.
- Refine Your Prompts: The quality of your output heavily depends on the prompts you provide. Iterate on these to achieve the desired results:
- Format Prompts: Guide the initial generation of text. For questions, you might specify "Generate a clear, concise question based on the key information in the context."
- Mutation Prompts: Direct how the text should be varied or improved. For example, "Rephrase the question to be more specific while maintaining clarity."
- Quality Prompts: Define what constitutes high-quality output. For instance, "Evaluate the question's relevance, clarity, and potential to elicit an informative response."
- Complexity Targets: Balance simplicity and depth. A target of 0.7 might aim for moderately complex output that's still accessible.
- Iterate Quickly: Use our Streamlit app in Hugging Face Spaces to iterate on your prompts and configurations quickly.
- Compare with Ground Truth: Since the dataset we are using provides ground truth data (seed data for generation), compare your synthetic output with the seed data to ensure that the generated responses are on topic and relevant.
- Scale Thoughtfully: Once you've refined your approach in Streamlit, export the code to run in your own environment for larger-scale data generation.
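For the ground-truth comparison, one cheap first pass is a lexical overlap check between each generated response and its seed context. The sketch below is a simple heuristic chosen for illustration, not a method described by Gretel; the `context_overlap` helper is hypothetical, and any threshold you apply to its score is an assumption to tune on your own data.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_overlap(response: str, context: str) -> float:
    """Fraction of distinct response tokens that also appear in the context."""
    response_tokens = _tokens(response)
    if not response_tokens:
        return 0.0
    return len(response_tokens & _tokens(context)) / len(response_tokens)


context = "The Eiffel Tower was completed in 1889 in Paris."
print(context_overlap("It was completed in 1889.", context))  # prints 0.8
print(context_overlap("Bananas are yellow.", context))        # prints 0.0
```

A low score flags responses worth reading by hand. A high score does not prove correctness, since a response can reuse the context's words and still state something false, so treat this as a filter for obvious drift rather than a quality measure.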
Your Experimentation Playground
Gretel provides a space that simplifies the process of iterating on prompts and generating diverse training data. Here's how you can try it out:
- Access the space at https://huggingface.co/spaces/gretelai/enhance-ai-training-data
- Upload your seed data (in this case, the Dolly dataset) as CSV or JSON, or link to your Hugging Face dataset
- Experiment with different prompts and configuration settings
- Inspect the logs to understand the generation and iteration process
- Compare the synthetic output with the original Dolly dataset to assess quality
- Export the SDK code when you're satisfied with the results
This interactive environment allows you to quickly refine your approach before scaling up your synthetic data generation.
Conclusion
Gretel Navigator offers a versatile and powerful solution to the challenge of generating high-quality, domain-specific data for AI training. Whether you need synthetic text, instruction-response pairs, or conversational step-by-step data, Gretel Navigator can adapt to your needs.
By leveraging advanced techniques inspired by recent research and providing an intuitive interface for experimentation, it enables researchers and developers to create the large, diverse, and high-quality datasets needed to train and fine-tune advanced AI models.
In our example with the Databricks Dolly 15k dataset, we've demonstrated how Gretel Navigator can generate synthetic question-answer pairs that outperform human expert-created data, and that outperform SOTA models like GPT-4 by 25.6%. This capability is invaluable for creating or augmenting training datasets for a wide range of LLM applications.
Remember, the key to success lies in clearly defining your objectives, iteratively refining your prompts, and leveraging the provided tools to experiment and scale effectively. With Gretel Navigator, you're well-equipped to tackle the data scarcity challenge and push the boundaries of AI capabilities in your specific domain.
If you'd like to work with us on creating high quality synthetic data for your use case, send us a note.