Teaching AI to Think: A New Approach with Synthetic Data and Reflection
Introduction
Artificial intelligence (AI) models, especially large language models (LLMs), have made significant strides in solving a variety of tasks—everything from basic question-answering to complex decision-making. But when it comes to multi-step reasoning, there's still a gap. How can we teach AI to think step by step, just as a human would solve a complex math problem or make a nuanced decision?
In this blog, we’ll explore a new technique called Reflection, which helps AI simulate a human-like thought process. We’re releasing a synthetic dataset generated using Gretel Navigator to demonstrate this technique. By capturing not just the final answer, but the AI’s reflections along the way, we can better train models to reason in more robust and explainable ways.
Check out the dataset here:
The Problem: AI Struggles with Complex Reasoning
While models like Llama 3.1 and beyond are excellent at handling straightforward tasks, they can fall short on more complex, multi-step reasoning tasks. Think of a classic algebra problem, where the solution requires multiple operations to be performed in sequence. Models trained only on simple data tend to either shortcut or guess without fully processing the reasoning.
This is where synthetic training data comes in. By generating a training dataset that simulates these multi-step problems and carefully curates the AI’s thought process, we can give the model a better foundation for tackling complexity. Beyond just creating more data, we need a way to teach the model how to think, not just answer.
The initial results for Reflection-based synthetic data generation are quite promising, full details below:
- Our evaluation, using GPT-4o as a judge, shows that Reflection-based synthetic data meets or exceeds the performance of the same synthetic data generation pipeline that was not using Reflection by 84.3% (63.7% wins, 20.6% ties, 15.7% losses) of all comparisons across various criteria and difficulty levels.
- For problems classified as "very hard,"-- an area where current LLMs most struggle in benchmarks today, the Reflection approach showed even more pronounced advantages- even without including the reflection tags in the output:
- 92.3% wins in problem complexity
- 82.7% wins in educational value
Introducing Reflection: Capturing the AI's Thought Process
Reflection is a technique that makes the AI simulate step-by-step reasoning as part of its internal process. Instead of jumping straight to the answer, the AI is prompted to “think out loud” using <thinking> and <reflection> tags.Â
Example Synthetic GSM8k (Grade School Math) Example
OpenAI released the Grade School Math, 8000-example dataset in 2022 on HuggingFace, which they describe as “a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.” This has become a core AI dataset for both training and evaluating LLM performance around AI reasoning, and a great starting point for synthesizing new examples to show how Reflection works and to improve on AI reasoning.
‍Problem:
Maya is making a collage for her art class and wants to mix paint to create a specific shade of purple. The shade requires a ratio of blue to red paint of 3:5. If Maya needs a total of 24 ounces of paint for her project, how many ounces of blue paint should she use?
AI’s Step-by-Step Thought Process with Reflection:
Final Output:
Why Reflection is Important
This example shows how the Reflection technique guides the AI through step-by-step reasoning before arriving at the final answer. The AI breaks the problem down into manageable parts and verifies each step with reflections before generating the final solution. This method improves both explainability and accuracy, making the AI’s thought process more transparent and reliable.
By capturing the model’s intermediate reflections, we don’t just end up with an answer, we also capture how the model arrived at that answer. This offers several advantages:
- Improved Explainability: Users can see the reasoning behind the answer.
- Enhanced Fine-Tuning: With detailed reasoning steps, future models can be fine-tuned more effectively.
- Better Handling of Complexity: Models are nudged toward thoughtful, multi-step reasoning, rather than relying on memorization.
Introducing gretelai/synthetic-gsm8k-reflection-405bt: A Synthetic Alternative to GSM8K
If you’re interested in training or evaluating language models on mathematical reasoning tasks, we’ve made our synthetic GSM8K-inspired dataset publicly available on Hugging Face. You can find it here.
This dataset was constructed from a prompt with Navigator using a synthetic data pipeline and does not use any real-world data from the GSM8K dataset. While OpenAI's GSM8K dataset has been a cornerstone for AI benchmarks in natural language reasoning, our synthetic GSM8K dataset offers several key improvements:
- Enhanced Complexity: Our dataset is designed to include more complex reasoning problems, pushing the boundaries of what AI models can tackle.
- Increased Diversity: We've expanded the range of examples to include:
- More diverse demographics and ethnicities
- A broader spectrum of topics and real-world scenarios
- Labeled Difficulty Levels: Unlike the original GSM8K, our dataset includes difficulty labels for each problem. This feature allows for:
- More nuanced evaluation of model performance across different complexity levels
- Better insights into where models excel or struggle‍
- Reflection-Based Generation: By incorporating the Reflection technique, our dataset encourages step-by-step reasoning, potentially leading to more robust and explainable AI solutions.‍
- Synthesized Nature: As a high quality synthesized dataset, it can (in theory) be even smaller than the original gsm8k dataset, allowing for more efficient training of models.
Our dataset is an alternative to OpenAI’s GSM8K (https://huggingface.co/datasets/openai/gsm8k, (paper)) and provides unique insights into how high quality synthetic datasets can be leveraged for AI reasoning.
How the Dataset Was Designed
To ensure diversity and accuracy, we used:
- Reflection for all field generations, but only stored tags for the “answer” field.
- Contextual tags for varied problem settings and real-world diversity.
- LLM-as-a-judge to verify that the generated outputs were of high quality.
- The SymPy Python library to validate all annotated calculations.
This combination of techniques ensures that the dataset contains high-quality, diverse, and mathematically accurate problems.
Token Length Comparison: What’s the Overhead?
When using Reflection, the natural question arises: What overhead does it introduce during inference? Since the Reflection technique forces the model to process step-by-step thoughts, the generated tokens are longer. The graph below compares the token lengths of reflections vs outputs in our dataset.
As you can see, reflections tend to be longer than the final outputs. This makes sense—reflection involves breaking down each decision step, whereas the output only includes the final answer. The additional overhead can be viewed as a cost of obtaining greater transparency in the model’s decision-making process.
Despite this overhead, the benefits of capturing the model’s reasoning can be invaluable for tasks requiring explainability, multi-step reasoning, and debugging complex AI systems. By training models on this reflective reasoning process, we can shift models from answering questions to thinking before they answer.
Reflection in Action: Evaluation Results
To validate the effectiveness of our Reflection-based approach, we conducted a comprehensive evaluation using GPT-4o as an impartial judge. The results are quite promising:
- Overall Dominance: Reflection won or tied 84.3% (63.7% wins, 20.6% ties, 15.7% losses) of all comparisons across all criteria and difficulty levels.
- Enhanced Problem Complexity and Educational Value:
- Problem Complexity: Reflection won 76.3% vs 17.2% for non-reflection.
- Educational Value: Reflection dominated with 77.9% wins vs 14.0% for non-reflection.
- Scaling with Difficulty: In "Very Hard" problems, Reflection's advantage grew even more pronounced:
- Problem Complexity: 92.3% wins for Reflection vs 3.8% for non-reflection.
- Educational Value: 82.7% wins for Reflection vs 9.6% for non-reflection.‍
- Consistent Performance: Reflection maintained its lead across all difficulty levels, from medium to very hard problems.
These results demonstrate that the Reflection technique not only enhances the quality of AI-generated problems but also scales effectively with increasing difficulty. This is particularly crucial for teaching LLMs advanced reasoning skills, a key challenge in AI development.
Here's a video version of this tutorial, so you can watch the reflection process in action:
Next Steps: What’s Coming Next?
This is just the first part of our blog series. In future posts, we will:
- Evaluate Model Performance: Using models fine-tuned on this reflective dataset, we’ll explore whether Reflection improves model performance on complex reasoning tasks.
- Compare Different Models: How do smaller base models for generation (8B parameters) compare to larger models (405B) when it comes to reflective reasoning?
- Real-World Applications: We’ll discuss how Reflection can be applied to more than just math problems, exploring domains like code generation, data science tasks, medical diagnosis, and legal reasoning.
Conclusion: A New Era of Thoughtful AI
With the Reflection technique and this synthetic dataset, we aim to take AI reasoning to the next level. Instead of focusing solely on generating answers, we’re encouraging models to reflect, process, and think through their steps. By embracing this human-like approach to problem-solving, we’re setting a foundation for AI systems that are not just reactive, but thoughtful and deliberate in their responses.
Stay tuned for the next part of this series where we’ll dive deeper into the results of fine-tuning models using this dataset.