Fast Tracking RAG Model Evaluation with Synthetic Data
Video description
Explore how synthetic data allows for the safe evaluation of RAG techniques while safeguarding data privacy
Transcription
Alex Watson:
Hello, everyone. I'm Alex Watson, co-founder and chief product officer at Gretel. Welcome to our workshop on fast-tracking retrieval augmented generation with synthetic data. Today we will give an introduction and an overview of synthetic data and retrieval augmented generation. Next, we'll build an end-to-end enterprise search example using Azure AI Studio Copilot. We'll use Gretel to quickly generate synthetic evaluation data for the app. This lets us test search performance across a wide range of scenarios we may see in production with just a few lines of code, and at a fraction of the cost and time of traditional data labeling and curation techniques.
Let me briefly introduce our speakers today. Next slide. We have Sam Kemp, principal AI platform architect from Microsoft. Nicole Pang, product manager for our Tabular LLM model at Gretel. Martin Van Segbroeck, principal applied scientist at Gretel.
Next slide. A brief introduction to Gretel: we're a synthetic data platform that uses in-house developed generative AI models, including LLMs and GANs, along with privacy-enhancing technology to help anyone from individual developers to some of the biggest companies in the world build with synthetic data. Please drop any comments or questions that you have into the Q&A section. We'll try to answer interactively, and we also have dedicated time for Q&A at the end. For now, I'll hand it over to Nicole to start with an overview of synthetic data.
Nicole:
Thanks, Alex. Next slide please. So here at Gretel, we hear from customers over and over again that issues in data, challenges in data, become a bottleneck to their enterprise use cases. These use cases are varied. They include things like improving ML models, creating rich dev or test environments, and sharing data with other teams or using data for analytics. But that data has private information or PII in it, and so they're unable to do so because of privacy or other compliance regulations.
So these might be familiar to you. In collected data, some of the data problems you can see here on the slide include data that is missing classes. These missing classes make data sets imbalanced, and therefore they can lead to poor performance in ML models. Other incomplete data sets could be missing values, and in ML use cases, unlabeled data sets can often take a team of data scientists days, weeks, or months to label, process, and make usable. So if you've encountered challenges like this, let us know. Maybe drop a note in the chat. I'd love to hear some of the issues in data that you might've encountered. Next slide please.
So what is a solution for all those data problems that you saw on that slide? Well, synthetic data is a really common and popular solution. Synthetic data is data that's generated by a model as opposed to data that's collected, and often it can actually be better than collected data in many ways. It can address the problems that we saw on the slide just above. For one example, in the case of needing to share data that has personal identifying information, or PII, a synthetic version of that data would make the data shareable and also usable for things like analytics, because correlations and distributions in the real data are kept in the synthetic data.
So up until now, most customers of Gretel have had some existing data. We've seen customers use Gretel to great success synthesizing their data sources or even entire databases based on existing data, creating a synthetic version of it with Gretel. But recently, along with the growth of generative AI, we've seen a new need growing: needing more and more specific data satisfying a schema or a description. Basically, the need is generating data from a cold start, where there isn't existing collected data to begin with. You're probably here for the RAG workshop, so let us know if this is something you've heard of or a need you're experiencing. Next slide please.
Okay, so this is why we created Tabular LLM here at Gretel. Tabular LLM is an AI system that generates high quality, realistic data at scale from just a prompt. In this prompt you can use natural language, the way you would normally use text to describe data. You can also use a combination of a data schema and a description, or even code, in this prompt. Tabular LLM allows for zero-shot prompting, so you can generate data from scratch, or you can also add new data to your existing data sets. It's available for use through the Gretel SDK, which you'll see a demo of shortly from my colleague Martin. But I also wanted to briefly show you the power of Tabular LLM, which is something you can try right now. It's in open preview, so feel free to follow along with me if you want to go to Gretel AI and sign in to the dashboard, or you can just watch. This demo will be really short, just to introduce you to what Tabular LLM can do.
So I will start sharing my screen. This is going to be a live demo, one of many today, so bear with us. Live demos are always great fun. Okay, so I'm here on our Gretel dashboard. This is what you'll see if you log on to Gretel. I'm going to go to the model playground. This is a really fast way to show, and for you to follow along on your own computer and see, what Tabular LLM can do. Since in the next part of the workshop we're going to be doing some question-answer pairs, I thought it might be fun to show a quick demo of what that could look like in the model playground.
So I'll go over here, we're all probably familiar. We're all in the AI space. I'll just go ahead and copy, just do a control A of all the text here on the H200 page and I'll paste it here. And for a quick prompt I'll just say something really quick. So bear with me while I type this out.
Okay, so as you can see I've written a really, really quick prompt. So hopefully this goes well, but if it doesn't, I'll mention what's also really cool about this model playground. Okay, so great. So this is giving me a two column table, but this didn't give me what I wanted. What I can do with the model playground is I can see the results come through. Oh, actually I think something might be... Okay, since this is a live demo, I'm just going to try generating again because... Oh, okay. So live demo issues. Well, usually you should see about 50 rows come in immediately. Maybe... Let me just see what's going on here. Let me just try again. Just going to paste the same exact thing here and generate. Okay, well I guess we [inaudible 00:07:48].
Alex Watson:
Nicole, try moving the two column table question part you just asked it to do and put output pairs.
Nicole:
Sure. Thanks, Alex. Yeah, it's so odd that worked earlier but something's going on. Well this is... Okay.
Alex Watson:
I think the model thought that you only wanted two responses.
Nicole:
Oh, two rows. Yeah. Okay, so this isn't two columns. But yeah, this is a great... Thank you, Alex. So as you can see, it's easy to get a lot of descriptions here. So this is giving me a bit more than what I asked for, but I can also iterate on the prompt. This is why we created the model playground, of course for dogfooding, as well as for users to iterate on the prompt and get the dataset that they want. So in this case, I've got a few more columns here because the model is interpreting what I'm asking, but yeah, it looks like it's correct information about the H200 GPU.
So okay, I'll stop sharing now with this demo. Thank you for bearing with live demo fun. And that's a really quick look at Tabular LLM and now we'll kick it off over to my colleague Martin to talk more about RAG.
Martin:
Yeah, thanks, Nicole. So I'm going to talk a little bit more about what this webinar is about: using Tabular LLM for RAG-based modeling. Let's go to the next slide, please.
So this slide shows the architecture of a RAG model. So just a little recap here. So RAG combines the power of large language models with real time data retrieval from external knowledge sources. And with external, I mean external to the training data on which the LLM was trained. So based on the user query, the retrieval process will search and select the most relevant context from these knowledge databases. The retrieved context is then augmented to the prompt to provide an enriched input to the large language model. The LLM will then use that context to generate a more accurate and context rich response to the user. Next slide, please.
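For readers following along, here is a minimal, self-contained Python sketch of that retrieve-augment-generate loop. It is an editorial illustration, not code from the demo: the word-overlap retriever and the echoing `llm` callable are stand-ins for a real embedding model, vector DB, and LLM.

```python
from typing import Callable, List

def retrieve(query: str, documents: List[str], k: int = 3) -> List[str]:
    # Toy retriever: rank documents by word overlap with the query.
    # A real RAG system embeds the query and searches a vector DB instead.
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer_with_rag(query: str, documents: List[str],
                    llm: Callable[[str], str]) -> str:
    # Augment the prompt with retrieved context, then let the LLM answer.
    context = "\n".join(retrieve(query, documents))
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm(prompt)

# Stand-in "LLM" that just echoes its prompt, so the sketch runs end to end.
docs = ["The tent costs $200 and includes a two-year warranty.",
        "The CozyNight sleeping bag is rated for cold-weather camping."]
print(answer_with_rag("How much is the tent?", docs, llm=lambda p: p))
```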
So what are some benefits of RAG models over LLMs? First there is the response quality. When an LLM is provided with a question it has only seen limited information about during training, it often tries to come up with an answer anyway, which may not be accurate. RAG models, on the other hand, can source up-to-date data or domain specific data, which makes the generated responses more accurate and less prone to hallucinations.
So secondly, with LLMs, each time you want to add new information or data, you need to retrain or fine tune the LLM again, which can be costly. RAG models more easily adapt to the new information. You just need to add or replace the external data source with new data to improve the quality and make the responses up-to-date.
So lastly, RAG models also enable a modular approach, allowing you to improve each of those models in isolation for a better overall experience. Next slide, please.
Okay, so this slide shows how synthetic data can play an important role in improving and building RAG models, from initial data collection all the way to evaluating RAG models. Just highlighting a few things here: synthetic data can jumpstart the development process. You can generate diverse scenarios where real data may be limited or too sensitive to use. Synthetic data can also help to optimize the retrieval, the fine-tuning process, and the response generation, allowing those responses to be more accurate and more insightful while maintaining privacy.
And then the third point I want to make is about the evaluation stage, which is actually the part of this seminar we're going to dive into a little bit more: we can generate high quality synthetic query-response pairs to assess RAG model performance across various scenarios, including edge cases or adversarial challenges, with the goal of improving the generalizability of the model and its robustness. So in short, synthetic data can make those RAG models more dynamic, accurate, and versatile, ready for real world deployment.
I'm going to hand it over now to Sam who's going to talk about how to build RAG models in Azure AI and how to evaluate them.
Sam:
Thank you, Martin. Yeah, so for RAG in Azure AI Studio, this really is a unified interface, including an SDK and CLI, and it unifies the whole Azure AI portfolio. That includes things like Azure OpenAI and content safety, and we also have a bunch of vision and speech models as well. It integrates AI Search, which is a vector DB, and that kind of enables this RAG pattern, and it also has the Azure Machine Learning service elements as well, to evaluate these RAG patterns and LLMs. And underneath all of this it's kind of using all the Azure infrastructure. So if we go to the next slide, please.
So what I'm going to show you in a second is how, in just a few lines of code in the CLI, we do this kind of RAG ingestion. But as Martin points out, there are kind of two parts to RAG. There is a cold path, which is creating the index. That can take time. You are taking in various data sources, cracking those open, maybe doing custom transformations, chunking that up into text, sending that off to an embedding model, getting the embedding back, and then stuffing that into a vector DB like Azure AI Search. But then we have querying the index. This is what happens at inference time. A user query comes in, we need to go and embed that query, go find the three nearest or the k nearest neighbors to it in our vector DB, pull that back, and then stuff that into the context back into the prompt to give an answer. So what I'm going to demonstrate now, if we go to the next slide, I think it should say demo, is how we build this index. I'm just going to share my screen.
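As a rough Python sketch of that cold path, assuming a naive character-based chunker and a placeholder embedding function (a real pipeline would call an embedding model such as text-embedding-ada and upsert the records into Azure AI Search):

```python
import glob
from typing import Dict, List

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    # Naive character-based chunking; the real ingestion pipeline does
    # smarter, token-aware chunking.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text: str) -> List[float]:
    # Placeholder embedding; a real pipeline calls an embedding model here.
    return [float(len(text)), float(sum(map(ord, text)) % 997)]

def build_index(md_glob: str) -> List[Dict]:
    # Cold path: crack documents open, chunk them, embed each chunk, and
    # collect records ready to upsert into a vector DB such as AI Search.
    records = []
    for path in glob.glob(md_glob):
        with open(path, encoding="utf-8") as f:
            for i, chunk in enumerate(chunk_text(f.read())):
                records.append({"id": f"{path}-{i}", "text": chunk,
                                "vector": embed(chunk)})
    return records

# records = build_index("product_info/*.md")  # the demo's local markdown files
```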
Okay.
Cool. So hopefully you should see my screen. I'm on the Azure AI Studio homepage, and here's where you can kind of see all the different models. So we have Azure OpenAI models, but you can also go and deploy Llama 2 models, Falcon, there's a whole bunch of open source models as well. And when you come in here you can go and deploy these models. So you give it a kind of endpoint and you can deploy that out into what we call a project, which is a container for the application that you are building. So when I hit deploy, it will go off and I'll actually get the deployment back pretty quickly. And you can set the kind of capacity, like the tokens-per-minute limit, on each of these deployments. And you can see here I've got two: one for gpt-4, and I also have one for the text-embedding-ada model, which is going to do the embedding that will eventually go off into our vector DB.
So I'm going to show this. Like I said before, we support this in a UI so you can create indexes with a few clicks, and we also have a Python SDK, but what I'm going to show you is how we do this in the CLI. And I've kind of prepared some of these bits earlier.
So what I have on my desktop here is a bunch of markdown files. These markdown files could also be in cloud storage, but for this purpose they're just local on my desktop. And what these markdown files contain is product information for a [inaudible 00:16:54] example of outdoor clothing and outdoor gear, like tents and stuff like that. Here I'm just running head on one of those markdown files. You can see the brand and it's got all the features in there. So there's a bunch of these files here.
And can I move this? Oh yes, I can. That makes it easier. So I'm just going to move that over to full screen.
So here, in this single command line, I'm doing this. So ai is the command, and it's ai search index update; we give it a name and the files we want to go and index. What it does here is it goes through and runs that cold path pipeline. It takes those documents, cracks them open, chunks them up, does the embedding, and puts that into a vector DB, all in one command line. At the end of this I have all of that sitting in a vector DB, and once that's done we can go and chat on that data effectively. And again, our CLI supplies a simple command line, ai chat, with the index name, and here it's interactive.
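The two commands described here look roughly like the following. This is an approximate reconstruction from the spoken description; flag names may differ between preview versions of the Azure `ai` CLI, so check `ai --help` for exact syntax.

```
# Cold path: chunk, embed, and index the local markdown files in one command.
ai search index update --index-name product-info --files "product_info/*.md"

# Query path: chat interactively over the data grounded in that index.
ai chat --index-name product-info --interactive
```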
So you can see I've asked this a bunch of questions. What tent options are available from the retrieved documents? It goes in and says these are the options available to me. It also sources the documents where they came from, like doc4, doc9. So I can ask it all these different questions. How much is the tent? And it comes back with $200. What is the warranty? All of these things. But this is very manual. So if I'm trying to assess the quality, you've got to sit here and enter questions, and I can kind of look at it, but it's a bit subjective.
One of the things that we kind of need is a way to do this at scale. How can we test potentially different user profiles, whether it's a beginner user or an advanced user? Can we also test for different languages and how people might interact with our chatbot? All of these things are quite hard to do manually, and this is where Gretel comes in to generate this. So I'm going to hand this back to Martin, who's going to show you how to do this in Gretel and also then make more of an objective evaluation around whether this RAG has done a good job or not. So handing back to you, Martin.
Martin:
Thanks, Sam. Let me... I'm going to share the screen. So let me... [inaudible 00:19:56]. Okay, now I can do it. There we go.
So if all is well, you can now see my screen showing a notebook. And so the first step I'm going to do in that notebook is basically going to clone the... Can you see my notebook, Nicole?
Nicole:
Yeah.
Martin:
Okay, so I'm going to clone the AI Studio Copilot sample here, which is basically what Sam has just been talking through, and where all these markdown files are stored, and I'm going to use that for feeding into our Tabular LLM model to create question-truth pairs. I'm going to install the Gretel client, our SDK. And then the first step is basically to take this bunch of markdown files and read and chunk them into pieces. So we found 20 files and extracted 41 chunks from those files.
The next step is, for each of those chunks, create a meaningful question, or a set of meaningful questions, and the corresponding response based on a text snippet found in that chunk. In order to do that, we first authenticate with Gretel and initialize our Tabular LLM model, and then we basically prompt the model. I construct my prompt in two parts here. First I ask the model to create a dataset with the following columns, and then I provide the columns I'm interested in having the model generate. So it gives me a tabular dataset with unique questions about that snippet of text and the corresponding response, or truth, that I'm going to use. This question-truth pair we'll then use as an evaluation dataset for my RAG model. I'm also going to ask it to provide the context, the exact paragraph from which the Tabular LLM model generated that question-truth pair.
But let's say I want to go a step further. I don't want just random question-truth pairs; I want to provide a little bit more guidance to the model. So I'm going to update my prompt with three different sections. First, I want the model to select a topic, which can be basic information, pricing and warranty, usage, or technical details, or anything you want. I also would like it to generate a user profile, so whether the question comes from a beginner user who has never used the product in the database, or an expert user who has already used that product before and wants to ask some additional questions. And thirdly, I want the user to ask a question in a particular language and get the response back in that language. Remember that all these documents are written in English, so there should be some translation happening here.
So this is the final prompt that I actually put together for this use case: we give the model the text, and we're going to generate a pandas data frame in this case, where it adds a topic column, a user profile column, and a language column. And based on those three it generates a question and truth pair for me.
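A hedged sketch of what such a prompt and SDK call might look like. The `initialize_navigator_api` entry point and the `generate()` signature are assumptions based on recent gretel_client versions, not a transcript of the demo notebook, so treat the exact calls as illustrative.

```python
import textwrap
from gretel_client import Gretel  # pip install gretel-client

# Exact Tabular LLM entry points vary by gretel_client version; the calls
# below are assumptions, not copied from the demo notebook.
gretel = Gretel(api_key="prompt")
tabular = gretel.factories.initialize_navigator_api()

def qa_prompt(chunk: str) -> str:
    # Prompt in the spirit described above: ask for a dataset with topic,
    # user profile, language, question, context, and truth columns.
    return textwrap.dedent(f"""\
        From the product text below, create a dataset with the columns:
        topic (basic information, pricing and warranty, usage, technical details),
        user_profile (beginner or expert), language, question, context, truth.
        Questions must be answerable from the text, and the question and truth
        must be written in the chosen language.

        Text:
        {chunk}""")

# chunk = one of the markdown chunks extracted earlier
# df = tabular.generate(qa_prompt(chunk), num_records=10, temperature=0.8)
```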
So we're going to run our Tabular LLM model. We can provide some parameters like temperature to make the model more creative or more consistent. And once it has generated, it returns a tabular dataset as follows. If we look into this, the first row has a topic around technical details, the user profile is a beginner, and the language is Dutch, so it's asking if the product is waterproof. It provides the context, a little paragraph or sentence that it found in those documents, and it responds back, generating the truth in Dutch.
Another question here is around sustainability, from a beginner user in English. It just asks if the backpack is eco-friendly, right? In this case, and it's very interesting, the model finds in the documents that the backpack is made of durable nylon material, but it actually responds in a very smart way. It says information about the eco-friendliness of the backpack is not provided in the given text; the text only says it's made of durable nylon material.
Another, expert-level question is whether there will be any future updates. Also here it didn't find anything in the documents, and it provides you with that answer. There are other questions, like whether there are security measures, and it will provide those security measures in the truth field.
The next step I want to do is a little bit of human evaluation, where we actually go over all these examples one-by-one and decide if we want to keep them or not. Remember, this replaces a process where a user would have to read those documents, ask questions, and write out a truth; here it's just a sanity check on the questions that were generated.
Once I've gone through all these question-truth pairs and I'm satisfied with them, I can write them out to a JSON file. And now we get to the part that Sam actually presented, where I can use those samples with the deployed RAG model that I've built with Azure AI. I can take those generated question-truth pairs and assess the RAG model based on them, and it generates coherence, groundedness, and relevance scores for each question-truth pair that was generated.
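A minimal sketch of that export step, assuming the generated pairs are held in a pandas DataFrame; the sample row here is made up, and the Azure evaluation call itself is omitted because its exact signature depends on the SDK version in use.

```python
import pandas as pd

# Illustrative pairs standing in for the Tabular LLM output reviewed above.
pairs = pd.DataFrame({
    "question": ["Is the backpack waterproof?"],
    "truth": ["The provided text does not say whether the backpack is waterproof."],
})

# Write the evaluation set as JSON Lines: one {"question": ..., "truth": ...}
# object per line, which downstream evaluation tooling can consume.
pairs.to_json("rag_eval_set.jsonl", orient="records", lines=True)

# The next step (not shown) points the Azure AI evaluation tooling at the
# deployed RAG endpoint plus this file to get coherence, groundedness, and
# relevance scores for each question-truth pair.
```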
So this is a very easy way to generate test evaluation sets for a RAG model. I'm going to quickly stop sharing, but I'm just going to switch screens here.
I will share my screen here. The next step is: once you have our Tabular LLM model and our Azure AI evaluation tool, you can now build easy applications around them. So this is just a [inaudible 00:26:54] app that we can build. It starts by ingesting those markdown files here. We can chunk them here, into pieces that we can provide in the prompt to the model. Showing some random chunks here, you can see these are snippets, chunks of those documents, that will be used as input for the Tabular LLM model. I can show a couple of those chunks here. And then we can generate question-truth pairs.
Now we go to the Gretel part, where I take my API key and paste it into the session here. Once I have done that, I can continue and basically ask the model to generate question-truth pairs around pricing and warranty, usage, and technical details, and they all have to be from an expert user.
Now we can start generating those synthetic records here. I already ran this beforehand, I don't want to wait too long here, but now you can see all these topics here, the user profile, and the language, and then we see all the question and truth pairs. So there's a question about what's the return policy, what's the temperature rating of the CozyNight sleeping bag, and all these questions are then answered accordingly here in a well-paraphrased sentence.
The next step I can do is use those generated question-response pairs to evaluate my RAG. Under the hood this will call the ai chat API and then create the relevance, correctness, and coherence scores, and there's a histogram where we can show, for each of the questions that were generated, how many have a score of four, how many have a score of two, and we can also show the average numbers here.
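To make the score summary concrete, here is a small illustrative snippet showing how the per-score counts and averages behind such a histogram could be computed; the scores are made-up sample values, not output from the actual run.

```python
import pandas as pd

# Made-up scores standing in for the evaluation output.
scores = pd.DataFrame({
    "relevance": [4, 5, 3, 4, 5],
    "coherence": [5, 5, 4, 4, 3],
})

for metric in scores.columns:
    print(metric, "average:", scores[metric].mean())
    print(scores[metric].value_counts().sort_index())  # histogram-style counts per score
```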
So that's a very easy way to go from a very manual and labor-intensive task to a fully-fledged application in just a couple of hours of deployment work.
I think that's the end of my demo here. I'm going to hand it back over to the rest of the panel.
I believe that's the final part of the demo. So I think I'll hand it over to Alex now to lead the question and answer part of the webinar.
Alex Watson:
Thanks, Martin. Thanks, Sam. We had a few questions come in. Maybe I'll start off the top with: can we list some differences about how Gretel might compare to ChatGPT? That's a question we hear very commonly. Maybe I'll go through a couple of differentiators there, how to think about them separately. I consider them to be very complementary, different tools. One area, I think the most obvious with Gretel Tabular LLM, is that people are giving data as an input and getting data as an output. So as you saw here, even when working with natural language examples like Martin was using, we are able to take labeled data, such as the expertise level and the language, in tabular format, and then output natural language text in tabular format as well.
Other areas too is that under the hood Gretel's model has been fine-tuned on industry-specific data. So when it comes to understanding specialized terminology or data patterns that you might have, like healthcare, finance, e-commerce, Gretel is going to give you really high quality output for those particular domains.
The third and fourth areas to talk about, I think: the Gretel model has been built from the ground up to ensure diversity in responses. And for what we were doing here in this example, sometimes you might want to create a thousand different input-output pair examples for your evaluation data. What Gretel can do is scale to any size you need it to. So it's not limited to the context window of the LLM. Plus, you can be sure of high levels of diversity in responses. So if you give it the same prompt, you're going to get different answers each time, versus eventually repeating data or narrowing in on one particular area.
The last piece, and we didn't get a chance to show it today, but we'll provide some links on this, is around evaluation of that data. So when you're generating, for example, using Tabular LLM to generate synthetic evaluation data at scale, or input-output pairs for your RAG model, you don't want to have hallucinations in there, and you want to make sure that you can reason about each individual one. This is something that we've really worked on since the beginning of Gretel, generating, at times, millions of records. You need to know that you can reason about each one of those records, because you may not have the resources to go through and manually inspect each one. So we have both the synthetic quality score, which is a report that helps you compare all the outputs you have against the original intent or inputs, as well as validation. What validation does is drop records that are too far outside of the distribution of your prompt. So it gives you much higher confidence in the quality of the output data.
It looks like there are a few questions there, so maybe I'll take this next one. Are there other Tabular LLMs out there? How do you differentiate? A question from Emile. Nicole is answering this; maybe you have more information. We have not seen it. I do think it's the first Tabular LLM of its kind out there, the idea of an LLM application built from the ground up to work with tabular data. Nicole, did you have any other context?
Nicole:
No, I was just going to mention the same. You're probably familiar with other LLMs out there; they're great as chatbots, but as you saw from both demos, we really are focusing on scalable, high-quality tabular data. Yeah, I just wanted to add that part.
Alex Watson:
Thanks. John was asking a question about LangChain: it has the ability to generate question-answer pairs from documents for Q&A purposes, so how does that compare? We're big fans of LangChain. We've done a couple of webinars with them as well, and we'll eventually be launching Gretel Tabular LLM as a synthetic data generator for the LangChain platform. So we're fans; I would say it's complementary, as Nicole just mentioned. I think the differentiation is that it's built to extract high-quality, diverse Q&A pairs that you don't have to spend that kind of manual human review time going over and curating yourself. So it should be simpler, higher quality data, and then output in a tabular format that you can work with; you don't need to extract the result into a tabular format yourself.
Steven Clemons was asking a question: how do you ensure that the LLM can conclude that the requested information is not found in the vector database? Often the LLM starts hallucinating instead of admitting that an answer cannot be found. I refer to the sustainability question that Martin was showing. Maybe, Sam, you could jump in here on how the Azure AI Studio application handles this.
Sam:
Yeah, so we do that by default. So you can limit the responses to the vector DB. So if it doesn't retrieve anything from there, it will come back and say, "I do not have that information to answer the question." And that's done through some meta prompting.
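For illustration only, a system ("meta") prompt in the spirit Sam describes; the actual wording Azure AI Studio injects is not shown in the session, so this string is a hypothetical example.

```python
# Illustrative meta prompt that restricts answers to retrieved context.
SYSTEM_PROMPT = (
    "You are a product assistant. Answer only from the retrieved documents in "
    "the provided context. If the context does not contain the answer, reply: "
    "'I do not have that information to answer the question.'"
)
```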
Alex Watson:
From Hinish: how about data privacy? Is it approved for HIPAA and GxP? I would assume GxP is... Is that a government certification? I can speak to HIPAA. HIPAA has an 18-identifier de-identification standard. So they have a very prescriptive, and it's really nice, set of rules you can use for working with private data and anonymizing it for wider distribution. So you can use Tabular LLM, along with Gretel's other suite of tools, to anonymize the data to meet or exceed HIPAA standards.
We'll link to a blog. Natural language text can be trickier than working with tabular data, where there are really kind of clear definitions for things. We recently did a blog on creating what's called differentially private synthetic text. Essentially what we're doing is training a model on the data in a way that guarantees it won't memorize secrets inside of the data, and we can have it create synthetic text that gives you really high confidence.
For example, even cutting edge named entity recognition techniques, things like that, still only have about a 90% accuracy rate. So with traditional de-identification of text, you have to do a second pass, kind of a manual review of it. The use of synthetic data has the promise of removing that second step. So we'll include a link to that as well.
Nicole, we had a question coming in. Does Gretel allow for performing complex NLQ that requires multi table join or running subqueries to answer a plain text question?
Nicole:
Sorry. I don't see that question.
Alex Watson:
It is in the chat section. Typically, and this is a question from Nikhil, when interacting with multiple tables from an LLM, you use an agent-based application. There are several frameworks that people use to do this, but essentially the agent searches across the tables, does a quick description of the tables, and then formulates a SQL query you can use to answer that question.
So in this sense, Gretel works well with an agent that would be querying across the multiple databases. Also, for Tabular LLM, one of the roadmap items we're working on is the ability to fine-tune it on your own data. So just like Gretel's other models, this model could be adapted to your company's specific data in a model that you own, so you have the ability to work with it there and answer questions that way. But I think that when you're trying to get an exact response, for example going across tabular, numerical, and categorical data, the agent-based approach works very well, because it uses the underlying SQL engine to do aggregations or sums and things like that, which is nice.
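A rough sketch of that agent pattern, under stated assumptions: the `llm` callable is a placeholder for whatever model or agent framework drafts the SQL, and the hard-coded query in the usage example just keeps the sketch runnable end to end.

```python
import sqlite3
from typing import Callable, List, Tuple

def answer_over_tables(question: str, conn: sqlite3.Connection,
                       llm: Callable[[str], str]) -> List[Tuple]:
    # 1. Collect a lightweight description of the schema.
    schema = "\n".join(row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table'"))
    # 2. Ask the model to formulate SQL for the plain-text question.
    sql = llm(f"Schema:\n{schema}\n\nWrite one SQLite query answering: {question}")
    # 3. Let the SQL engine do the joins and aggregations, and return rows.
    return conn.execute(sql).fetchall()

# Toy usage with a hard-coded "LLM" so the example runs without a model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('a', 10), ('a', 5), ('b', 7)")
print(answer_over_tables(
    "total spend per customer", conn,
    llm=lambda p: "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))
```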
Martin, do you want to take the question from Christian here in the webinar chat?
Martin:
Sure. So the question is how models made with synthetic data compare with federated learning approaches. Federated learning is a different way to tackle privacy: you basically train and fine-tune models on devices and then share the weights with a central model over the cloud.
Now, what synthetic data does is basically allow you to generate data sets that are private and eliminate the privacy concerns there, and those data sets can then be used in any application, machine learning or not. So I think what synthetic data does differently than federated learning is it allows you to be more flexible in terms of the downstream use cases. Federated learning is a deployment or implementation strategy to deal with privacy, but not all applications or use cases allow federated learning to be a solution strategy for them. So although both approaches deal with tackling privacy concerns, the way that synthetic data works is basically it takes your data set and generates a private version of that dataset. And in case you don't have much data, Tabular LLM, for instance, is a model that, given just a few examples of data, can upsample your data set and generate more records. So that's also a capability synthetic data has that federated learning doesn't address.
But if you want to build a very highly private application, federated learning and synthetic data are two complementary things and should not be seen as two competing approaches, in my opinion.
Nicole:
Just looking back through the chat history, it looks like there was a question about whether Gretel can generate synthetic questions and answers from coding examples, for example Python functions or Java classes. I think that's a great question. From our experimentation, I think that's definitely possible. You saw Martin show an example just now with question-answer pairs, and also in the demo on the model playground you saw you can paste a lot into that context window, into the input prompt. So I definitely recommend you try it. We're also in open preview with Tabular LLM, so we want to get your feedback. There's a link on the model playground interface for submitting feedback in case you run into issues. So please do give it a try. We'd love to hear how it works for you.
Alex Watson:
Emile was asking a second question about what foundational model we use, so getting into Tabular LLM's architecture a little bit. Under the hood, Gretel uses a tabular model that we have fine-tuned on industry specific data sets. It's an agent-based, plan-and-execute architecture. So if it gets, for example, a natural language domain question that would be better served by an NLP-specific foundation model, for example the Azure OpenAI GPT-4 model, it can automatically route to the foundation model in your cloud of choice. So you have a level of flexibility there. Whenever the task would be better served by the underlying model that we have fine-tuned ourselves at Gretel, which we call Tabular LLM, it will route to that model automatically.
We have another question from Magda: when we say the synthetic data generated by Gretel is validated, how is it validated? How can we make sure that the data is good quality, and which large language model is behind Gretel? To the second part of the question there, it's a combination of language models, both our own and then a foundation model which can be configured to run with Gretel.
For validation, how is it validated? We look at the input prompt that's coming to the model as well as any input data that you may or may not provide with it; sometimes you're just generating data from a prompt, sometimes you're saying I want you to augment or create more records like this. What we do behind the scenes is construct a set of rules. We call these validators. Actually, since the very beginning of Gretel, our first model was a language model back in 2020, so we've had a fair amount of experience building this out. It constructs some boundaries; you could think of it as a composable set of rules based on the user query, and it ensures that the output of the model matches that. So for example, if you're asking for an age, we'll look at the age and say a reasonable age for a human is somewhere between zero and a hundred years old, and it will drop things that are outside of that distribution.
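A toy illustration of that validator idea, assuming one composable rule per column applied to generated records; the rule and data here are made up for the example and are not Gretel's internal implementation.

```python
import pandas as pd

# One rule per column; rows that fall outside the expected range are dropped.
rules = {"age": lambda v: 0 <= v <= 100}

df = pd.DataFrame({"name": ["a", "b", "c"], "age": [34, 250, 61]})
mask = pd.Series(True, index=df.index)
for column, rule in rules.items():
    mask &= df[column].apply(rule)

validated = df[mask]  # the record with age 250 is dropped
print(validated)
```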
A question from Rajeev: how big can the tabular data it provides get, and can we ask Tabular LLM to give us 1 million synthetic data records for a customized dataset requirement? The answer is yes, that is possible. In the early preview right now, we have some limitations: each individual batch is limited to five or ten thousand records that you can generate at a time. With the higher tiers we can unlock more usage, or you can break it up into smaller pieces. But we do see cases, for example for model pre-training, where you may have a set of input-output pairs you want to train your own LLM on from scratch. In that case you do see NLP use cases at that scale, and you might use the synthetic model, you could use Tabular LLM, to go through what sometimes could be dirty, low-quality, or poor-input-quality data you might have, for example call center transcripts, things like that, and improve the quality. So that is a case where we do see data generation using Tabular LLM in the millions of records.
Magda had a follow-on question asking, are those validators manually defined? At the moment they're composed automatically from the incoming prompt and the input data. We're looking at ways to expose that to our users. To give a concrete example, whether you're working with tabular or natural language text, sometimes you have business constraints that you want to see always enforced in the data. A really simple way to think about this: if you're looking at a patient medical records data set like EHR, and you have a time series that the model's generating, you always want to make sure patient check-in happens before the prognosis or the initial validation. Similarly, in the HR world, if you're looking at onboarding, you always want to make sure the employee's hire date is before the employee's start date. So in addition to the automated ways to do that, we are looking at allowing you, via natural language once again, or maybe even some lightly structured code, to define those kinds of business restrictions that you want to see on the data.
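And a similarly toy sketch of the cross-column business rule mentioned here (hire date before start date); the column names and rows are illustrative, not anything from Gretel.

```python
import pandas as pd

# Enforce that an employee's hire date precedes their start date.
df = pd.DataFrame({
    "hire_date": pd.to_datetime(["2023-01-05", "2023-03-01"]),
    "start_date": pd.to_datetime(["2023-01-20", "2023-02-15"]),
})
df = df[df["hire_date"] <= df["start_date"]]  # drops the second, inconsistent row
print(df)
```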
We have a question from John: please comment on use cases for anomaly detection. Martin, maybe that would be a good one to dive in on.
Martin:
Sure. So my answer is, in a very broad way, about how synthetic data can help you do anomaly detection. Oftentimes, if you want to build a machine learning model that detects anomalies in your data, you're presented with a dataset where you only have a limited number of anomalies, right? So you can take those few examples of anomalies you have and ask the model to generate a synthetic version, upsample, and create more versions of the anomalies that can feed into an anomaly detection model. So that is one thing.
If you want to combine it with our Tabular LLM approach, you can ask the model to generate more anomalies that inherit certain principles. It uses the foundational knowledge on which the Tabular LLM model has been trained; the Tabular LLM model has seen lots of tabular data, and it's able to understand the type of anomalies you want and generate more diverse anomalies that you can then use in your anomaly detection model. So that's one way synthetic data can help there.
Alex Watson:
Daniel has a question: in the scenario where additional synthetic records are being created from a dataset, do you expect the synthetic data to accurately reflect the relationships between fields found in the original dataset? Or does the validation stop at fields containing reasonable values?
The answer here, I wish there was a single answer; the truth is we have multiple models in Gretel. Any of the models that are fine-tuned on your own data, so any of our tabular models, for example, will retain those relationships for any data that gets created past that point. When you're using Tabular LLM without fine-tuning, so you're having Tabular LLM, for example, look at your dataset and augment fields, augment records, create new data, things like that, it will keep a limited context that it compiles about each dataset. And you can also adjust that via the prompt. So in a case where you want to make sure it has context from the entire dataset instead of just the chunk it's looking at at a time, you can accomplish that by ensuring that some of the overall statistics you're looking to maintain are included in the prompt. So happy to discuss that in more detail. If you want to jump on, we have a Discord community channel; that's a great place to ask questions like this. Our applied scientists are in there and they can help answer those questions directly.
Looks like we have one more question here. Malik was asking if we do anything in the sports analytics area. Interesting question there; I would love to hear more about that. It's hard to answer it for you specifically, but I will say we've got several examples in the gaming industry that we've worked with quite a bit around synthetic data. So happy to discuss that at a high level or dive in on that question in our Discord.
The last question here is from Marcus, which just came in: how do you ensure that there's no re-identification risk with the data? And this is where privacy-enhancing technologies give you a unique ability to protect against this. There is a technique I would look at, if you haven't, called differential privacy. Specifically, with Gretel, across most of our models we support a technique called DP-SGD, which is an implementation of differential privacy that happens during model training.
So what happens as data comes in and we train on the distribution of a dataset with DP-SGD is that it inserts noise into the gradient updates and also clips the gradients coming out of the neural network. And what that gives you is a mathematical bound, called epsilon, along with a failure probability, called delta, that your privacy is maintained. And how DP works at a super high level, we've got some great workshops we've done on this, but essentially it ensures that any one entity or human or row could be taken out of the dataset and it would not meaningfully change the output of the model. That's been, in the formal privacy community, really the state-of-the-art approach to prove that, yes, I could remove this person from my dataset, ask it the same question, and get essentially the same answer. So that helps establish proof that no one individual's privacy can be compromised in this data; you wouldn't be able to re-identify a user in the dataset based on knowing some set of attributes about them.
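For readers who want to see the mechanics, here is a generic DP-SGD sketch using the open-source Opacus library. This is an illustration of the technique described above, not Gretel's internal implementation; the model, data, and hyperparameters are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine  # pip install opacus

# Placeholder model and synthetic training data for the illustration.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
data = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # noise added to each clipped gradient update
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

criterion = nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

# Epsilon (for a chosen delta) summarizes the privacy guarantee achieved.
print(privacy_engine.get_epsilon(delta=1e-5))
```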
Thank you so much for attending today. Thank you, Sam, for joining us and walking us through that great Azure AI Studio Copilot example. Please feel free to follow up with any questions you might have in our Discord or at hi@gretel.ai, and we look forward to running many more webinars just like this. Thank you.