Teaching AI to Think: A New Approach with Synthetic Data and Reflection

Gretel's synthetic GSM8k dataset shows an 84% improvement for AI Reasoning tasks vs synthetic data generated without reflection.

Introduction

Artificial intelligence (AI) models, especially large language models (LLMs), have made significant strides in solving a variety of tasks—everything from basic question-answering to complex decision-making. But when it comes to multi-step reasoning, there's still a gap. How can we teach AI to think step by step, just as a human would solve a complex math problem or make a nuanced decision?

In this blog, we’ll explore a new technique called Reflection, which helps AI simulate a human-like thought process. We’re releasing a synthetic dataset generated using Gretel Navigator to demonstrate this technique. By capturing not just the final answer, but the AI’s reflections along the way, we can better train models to reason in more robust and explainable ways.

Check out the dataset here:

The Problem: AI Struggles with Complex Reasoning

While models like Llama 3.1 and beyond are excellent at handling straightforward tasks, they can fall short on more complex, multi-step reasoning tasks. Think of a classic algebra problem, where the solution requires multiple operations to be performed in sequence. Models trained only on simple data tend to either shortcut or guess without fully processing the reasoning.

This is where synthetic training data comes in. By generating a training dataset that simulates these multi-step problems and carefully curates the AI’s thought process, we can give the model a better foundation for tackling complexity. Beyond just creating more data, we need a way to teach the model how to think, not just answer.

The initial results for Reflection-based synthetic data generation are quite promising, full details below:

  1. Our evaluation, using GPT-4o as a judge, shows that Reflection-based synthetic data meets or exceeds the performance of the same synthetic data generation pipeline that was not using Reflection by 84.3% (63.7% wins, 20.6% ties, 15.7% losses) of all comparisons across various criteria and difficulty levels.
  2. For problems classified as "very hard,"-- an area where current LLMs most struggle in benchmarks today, the Reflection approach showed even more pronounced advantages- even without including the reflection tags in the output:
    • 92.3% wins in problem complexity
    • 82.7% wins in educational value

Introducing Reflection: Capturing the AI's Thought Process

Reflection is a technique that makes the AI simulate step-by-step reasoning as part of its internal process. Instead of jumping straight to the answer, the AI is prompted to “think out loud” using <thinking> and <reflection> tags. 

Example Synthetic GSM8k (Grade School Math) Example

OpenAI released the Grade School Math, 8000-example dataset in 2022 on HuggingFace, which they describe as “a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.” This has become a core AI dataset for both training and evaluating LLM performance around AI reasoning, and a great starting point for synthesizing new examples to show how Reflection works and to improve on AI reasoning.

‍Problem:
Maya is making a collage for her art class and wants to mix paint to create a specific shade of purple. The shade requires a ratio of blue to red paint of 3:5. If Maya needs a total of 24 ounces of paint for her project, how many ounces of blue paint should she use?

AI’s Step-by-Step Thought Process with Reflection:

 
  To find the amount of blue paint Maya should use, we first need to determine the total parts of the ratio.
  
    The ratio of blue to red paint is 3:5, which means there are 3 parts blue paint and 5 parts red paint, making a total of 3 + 5 = 8 parts.
  
  
  Next, we need to find out how many ounces each part represents.
  
    Since Maya needs a total of 24 ounces of paint, and there are 8 parts in the ratio, we divide the total ounces by the total parts to find the value of each part: 24 / 8 = 3 ounces per part.
  
  
  Now, we calculate the amount of blue paint needed.
  
    The ratio requires 3 parts of blue paint, and each part is 3 ounces, so we multiply the number of parts by the value of each part: 3 parts * 3 ounces/part = 9 ounces.
  
  
  Therefore, Maya should use 9 ounces of blue paint for her project.

Final Output:


  To solve this, let's break it down step by step:
  1. Calculate the total parts of the ratio: Total parts = 3 (blue) + 5 (red) = 8 parts.
  2. Determine the value of each part in ounces: Each part = Total ounces / Total parts = 24 / 8 = 3 ounces per part.
  3. Calculate the amount of blue paint needed: Blue paint = Number of parts * Value of each part = 3 * 3 = 9 ounces.
  
  Therefore, Maya should use 9 ounces of blue paint.
#### 9 

Why Reflection is Important

This example shows how the Reflection technique guides the AI through step-by-step reasoning before arriving at the final answer. The AI breaks the problem down into manageable parts and verifies each step with reflections before generating the final solution. This method improves both explainability and accuracy, making the AI’s thought process more transparent and reliable.

By capturing the model’s intermediate reflections, we don’t just end up with an answer, we also capture how the model arrived at that answer. This offers several advantages:

  1. Improved Explainability: Users can see the reasoning behind the answer.
  2. Enhanced Fine-Tuning: With detailed reasoning steps, future models can be fine-tuned more effectively.
  3. Better Handling of Complexity: Models are nudged toward thoughtful, multi-step reasoning, rather than relying on memorization.

Introducing gretelai/synthetic-gsm8k-reflection-405bt: A Synthetic Alternative to GSM8K

If you’re interested in training or evaluating language models on mathematical reasoning tasks, we’ve made our synthetic GSM8K-inspired dataset publicly available on Hugging Face. You can find it here.

This dataset was constructed from a prompt with Navigator using a synthetic data pipeline and does not use any real-world data from the GSM8K dataset. While OpenAI's GSM8K dataset has been a cornerstone for AI benchmarks in natural language reasoning, our synthetic GSM8K dataset offers several key improvements:

  1. Enhanced Complexity: Our dataset is designed to include more complex reasoning problems, pushing the boundaries of what AI models can tackle.
  2. Increased Diversity: We've expanded the range of examples to include:
    • More diverse demographics and ethnicities
    • A broader spectrum of topics and real-world scenarios
  3. Labeled Difficulty Levels: Unlike the original GSM8K, our dataset includes difficulty labels for each problem. This feature allows for:
    • More nuanced evaluation of model performance across different complexity levels
    • Better insights into where models excel or struggle‍
  4. Reflection-Based Generation: By incorporating the Reflection technique, our dataset encourages step-by-step reasoning, potentially leading to more robust and explainable AI solutions.‍
  5. Synthesized Nature: As a high quality synthesized dataset, it can (in theory) be even smaller than the original gsm8k dataset, allowing for more efficient training of models.

Our dataset is an alternative to OpenAI’s GSM8K (https://huggingface.co/datasets/openai/gsm8k, (paper)) and provides unique insights into how high quality synthetic datasets can be leveraged for AI reasoning.

How the Dataset Was Designed

To ensure diversity and accuracy, we used:

  • Reflection for all field generations, but only stored tags for the “answer” field.
  • Contextual tags for varied problem settings and real-world diversity.
  • LLM-as-a-judge to verify that the generated outputs were of high quality.
  • The SymPy Python library to validate all annotated calculations.

This combination of techniques ensures that the dataset contains high-quality, diverse, and mathematically accurate problems.

Token Length Comparison: What’s the Overhead?

When using Reflection, the natural question arises: What overhead does it introduce during inference? Since Reflection forces the model to process step-by-step thoughts, the generated tokens are longer. The graph below compares the token lengths of reflections vs outputs in our dataset.

As you can see, reflections tend to be longer than the final outputs. This makes sense—reflection involves breaking down each decision step, whereas the output only includes the final answer. The additional overhead can be viewed as a cost of obtaining greater transparency in the model’s decision-making process.

Despite this overhead, the benefits of capturing the model’s reasoning can be invaluable for tasks requiring explainability, multi-step reasoning, and debugging complex AI systems. By training models on this reflective reasoning process, we can shift models from answering questions to thinking before they answer.

Reflection in Action: Evaluation Results

To validate the effectiveness of our Reflection-based approach, we conducted a comprehensive evaluation using GPT-4o as an impartial judge. The results are quite promising:

  1. Overall Dominance: Reflection won or tied 84.3% (63.7% wins, 20.6% ties, 15.7% losses) of all comparisons across all criteria and difficulty levels.
  2. Enhanced Problem Complexity and Educational Value:
    • Problem Complexity: Reflection won 76.3% vs 17.2% for non-reflection.
    • Educational Value: Reflection dominated with 77.9% wins vs 14.0% for non-reflection.
  1. Scaling with Difficulty: In "Very Hard" problems, Reflection's advantage grew even more pronounced:
    • Problem Complexity: 92.3% wins for Reflection vs 3.8% for non-reflection.
    • Educational Value: 82.7% wins for Reflection vs 9.6% for non-reflection.‍
  2. Consistent Performance: Reflection maintained its lead across all difficulty levels, from medium to very hard problems.

These results demonstrate that Reflection not only enhances the quality of AI-generated problems but also scales effectively with increasing difficulty. This is particularly crucial for teaching LLMs advanced reasoning skills, a key challenge in AI development.

Next Steps: What’s Coming Next?

This is just the first part of our blog series. In future posts, we will:

  • Evaluate Model Performance: Using models fine-tuned on this reflective dataset, we’ll explore whether Reflection improves model performance on complex reasoning tasks.
  • Compare Different Models: How do smaller base models for generation (8B parameters) compare to larger models (405B) when it comes to reflective reasoning?
  • Real-World Applications: We’ll discuss how Reflection can be applied to more than just math problems, exploring domains like code generation, data science tasks, medical diagnosis, and legal reasoning.

Conclusion: A New Era of Thoughtful AI

With the Reflection technique and this synthetic dataset, we aim to take AI reasoning to the next level. Instead of focusing solely on generating answers, we’re encouraging models to reflect, process, and think through their steps. By embracing this human-like approach to problem-solving, we’re setting a foundation for AI systems that are not just reactive, but thoughtful and deliberate in their responses.

Stay tuned for the next part of this series where we’ll dive deeper into the results of fine-tuning models using this dataset.