RAG Model Evaluation with Azure AI and Gretel Navigator

Leveraging Gretel Navigator to Create Diverse and Quality-Driven Question-Truth Pairs for RAG Evaluation

In the evolving landscape of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) models have emerged as powerful tools that combine the strengths of retrieval-based and generative AI approaches to respond to user queries. However, the efficacy of these models is heavily dependent on the quality and diversity of the data used to build and evaluate them. In this notebook, we explore how Gretel Navigator can be leveraged to generate diverse, high-quality evaluation datasets of question-truth pairs to build and evaluate more robust RAG models.

Background: The Need for Quality Evaluation Data

Evaluating RAG models effectively demands thorough analysis across several key metrics, including coherence, groundedness, relevance, fluency, and similarity. Tools like Azure AI Studio offer built-in evaluators for these metrics, but creating the right evaluation datasets poses its own challenge: producing data that is both high in quality and diverse takes significant time and effort.
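To make one of these metrics concrete, here is a minimal sketch of a similarity-style score: a token-overlap F1 between a generated answer and its ground truth. This is a rough, illustrative proxy only; the evaluators in Azure AI Studio use LLM- or embedding-based scoring rather than lexical overlap.

```python
from collections import Counter

def token_f1(prediction: str, truth: str) -> float:
    """Rough token-overlap F1 between a generated answer and ground truth.

    A naive stand-in for a 'similarity' metric; production evaluators
    (e.g. those in Azure AI Studio) use LLM- or embedding-based scoring.
    """
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    if not pred_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

Even a crude score like this is useful for sanity-checking an evaluation pipeline end to end before plugging in heavier-weight metrics.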

This is where Gretel Navigator comes in, offering a streamlined way to create question-truth (Q-T) pairs with the desired variety and depth, mimicking the nuanced and rich content found in real documents. By breaking a larger knowledge dataset into smaller, manageable chunks, we can use Gretel Navigator to generate relevant question-truth pairs grounded in each chunk. We can also tailor these pairs to specific needs by conditioning on attributes such as the user's expertise level, topics of interest, and preferred language. This customized approach ensures our synthetic datasets are not only broad in scope but also deeply aligned with real-world use cases.

The synthetically generated datasets offer a strong testing ground for RAG models, challenging them to retrieve relevant information and craft precise answers across a wide range of scenarios, including adversarial and ambiguous cases.
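The chunking step described above can be sketched as a simple word-window splitter with overlap. The chunk size and overlap values here are illustrative assumptions; real pipelines often split on tokens or on document structure (headings, paragraphs) instead. Each resulting chunk would then be passed to Gretel Navigator along with a natural-language prompt describing the desired attributes of the Q-T pair.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-window chunks.

    Each chunk holds up to `chunk_size` words; consecutive chunks share
    `overlap` words so Q-T pairs near a boundary keep their context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk reaches the end of the document
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlapping windows are a deliberate trade-off: they cost some duplicated text but prevent a fact that straddles a chunk boundary from being unanswerable from any single chunk.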


Before diving into the notebook, there are a couple of prerequisites:

  1. Gretel API Key: You'll need an API key from Gretel. If you don't have one already, you can obtain it from Gretel's console. This key will enable us to use Gretel's services for generating our synthetic datasets.
  2. Domain-specific training data: To try this approach with your own data, you'll need a domain-specific dataset in a standard text format. Or, you could get started quickly with the example below using the product-info dataset, provided in the aistudio-copilot-sample repository.
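With those in hand, setup looks roughly like the following. This is a hedged sketch: the package name `gretel-client` and the `GRETEL_API_KEY` environment variable are assumptions about Gretel's Python tooling, so check Gretel's documentation for the exact setup your SDK version expects.

```shell
# Install Gretel's Python client (package name assumed: gretel-client)
pip install -U gretel-client

# Make the API key from the Gretel console available to the notebook
# (environment variable name assumed)
export GRETEL_API_KEY="<your-api-key>"
```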

Let's get started!

You can watch my brief walkthrough of this notebook from Gretel's recent workshop with Microsoft, where we explored how Gretel Navigator can be used with Azure AI Studio to create an end-to-end RAG chatbot using an open-source framework and public datasets. Our approach combined the power of synthetics with human-in-the-loop review to design synthetic datasets that effectively test RAG performance. You can do the same with your own data and this notebook. Happy synthesizing!