Gretel Demo Day Q423
Video description
The latest advancements in Gretel's Model Playground and Tabular LLM, plus no-code tools for automating synthetic data generation
Transcription
Speaker 1 (00:00):
Hey, good morning. I'm Alex Watson, co-founder and chief product officer here at Gretel. Today we are going to be doing a workshop, starting with an introduction to what synthetic data is and some of the popular use cases that we see across Gretel's users and customers and across the space, and then moving into a set of demos where we'll go through anywhere from basic to relatively advanced functionality with Gretel. Super excited to have everyone here today. I know we had some technical difficulties getting started, but thrilled to be here now. I saw we had several hundred participants sign up, so I'm looking forward to the discussion. If you have any questions as we're going through the workshop, please drop them in the comments section on the right; we'll go through those questions and we'll definitely leave time at the end. And for anyone who's not able to attend the workshop today, just like our other workshops at Gretel, we'll be posting it to YouTube very shortly after the video is done here today. So let's jump on in. Next slide. I introduced myself; let me let my colleagues go ahead and introduce themselves as well.
Speaker 2 (01:10):
Hi, I'm Sammy Tobe. I work on core product and applied science at Gretel.
Speaker 3 (01:15):
Hi, I'm Grace King and I'm a product manager at Gretel.
Speaker 1 (01:21):
Great. Next slide. So as we like to do at Gretel, I think we're going to start by working backwards from the customer use cases that we're seeing right now. When we think about synthetic data and we think about the space, we think of it as a way to unlock bottlenecks with data much more than we think of ourselves as, for example, a synthetic data company. I can talk about some of the popular use cases that we have seen across our customer base. Here you see some of the reference examples that we're happy to dive into, to the level that we're able to, and then also maybe a little bit on the progression. In the early days of Gretel, when we first got started with synthetic data, it was around taking data that couldn't be shared and making it shareable.
(02:07)
And what we saw talking to so many organizations, even the most cloud-native, advanced companies in the world, was that all of those companies, while they understand and grasp the intrinsic value of their data, struggle for one reason or another to make that data accessible and to share it, even inside of their own business, and definitely across businesses. So we've all heard about data silos. Many of us, the developers and the data scientists of the world, have had to jump through hoops to get access to data. And even when you have access to data, making that data clean and ready to use can consume so much of a data science project. In fact, talking to many of our customers, we find that across a typical data science project, 30-40% of the whole project time can be spent just on initial access to data, exploring it, and cleaning it so you can start to use it for your use case.
(03:06)
So the first use case with Gretel is taking that data that's private, dirty, locked up, and compliance-controlled, and creating an artificial, clean, anonymized, and compliance-controlled version of that data. So under the hood, how does Gretel work? It uses similar technology to GPT or GPT-4. We use language models, we use GANs, we use diffusion models: generative AI models. The first step is that the model goes across your entire sensitive dataset (Gretel can run inside your own cloud, or it can run in ours), and just like a GPT model learns to recreate the distribution of text, Gretel's models learn to recreate the distribution of a dataset. So at the end you have a model that can create unlimited amounts, or customized amounts, of data that matches the same insights and distributions as your data, but it can't be pointed back to any individual that the data is based on.
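For anyone who wants to try the train-then-generate flow Alex describes, here is a minimal sketch using the Gretel Python client as it existed around this time. The config template name, project name, and file names are illustrative assumptions and may differ from your setup.

```python
# Minimal sketch: train a Gretel synthetics model on a sensitive dataset, then
# generate a fresh, shareable dataset from it. Config template, project name,
# and file names below are assumptions for illustration.
import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

configure_session(api_key="prompt", cache="yes", validate=True)
project = create_or_get_unique_project(name="demo-synthetics")

# The model learns the dataset's distributions rather than memorizing records.
model = project.create_model_obj(
    model_config="synthetics/tabular-actgan",  # assumed config template name
    data_source="sensitive_records.csv",       # hypothetical input file
)
model.submit_cloud()
poll(model)

# Generate synthetic records from the trained model.
record_handler = model.create_record_handler_obj(params={"num_records": 5000})
record_handler.submit_cloud()
poll(record_handler)

synthetic_df = pd.read_csv(
    record_handler.get_artifact_link("data"), compression="gzip"
)
print(synthetic_df.head())
```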
(04:08)
And this unlocks really tremendous opportunities for hospitals to share medical records with each other, for analysts to get access to financial data, and things like that, which you can use to build a better customer experience and to better predict where things are going for your business than is possible with really any other technology. And I think one of the other really great advantages of synthetic data versus other privacy-enhancing technologies, and we're fans of any privacy-enhancing technology, is the portability of it, right? The output of a synthetic data model is a dataset. You can interact with that inside of your data warehouse, you can interact with it inside your database, you can load it into your notebook environment in Databricks. You have complete portability and access, which is really hard with other approaches, for example a secure enclave or things like that.
(04:59)
So these use cases that we see at Gretel, and what we're looking at right here, really divide into two different categories. The first one is privacy-related, which we covered earlier: how to build an anonymized, private version of a dataset. The second use case is: how do I take my dataset, which is somehow limited, or maybe I don't even have access to a dataset at all, and make it better? What we see so often in the machine learning space, and something that I think many of us have run into ourselves, is that you have data, you just don't have enough of the right data. So whether you're building an ad recommendation system that's really struggling with one ad category, or you're building a sentiment detection classifier that is struggling with a different category, you can use a synthetic data model to create new variations, essentially improving your training data to help the downstream models.
(05:51)
Whether you're training an LLM or you are training a machine learning model for ad recommendation, for fraud, for any of these use cases, you can give it more high-quality examples that will result in improved performance downstream, at a small fraction of the cost and much faster than traditional techniques, where you would use a human to generate more data that would be labeled and then incorporated into the training set. I love the fact that as a developer you can incorporate synthetic data right into your workflow, and you don't have to wait weeks or months for new data to become available to improve your model. Next slide.
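As a rough illustration of that augmentation idea, here is a small sketch (not Gretel-specific) that adds synthetic examples of an underrepresented class to a training set and compares downstream performance. The file and column names are hypothetical, and it assumes the feature columns are already numeric.

```python
# Sketch: compare a classifier trained on real data alone vs. real data
# augmented with synthetic examples of an underrepresented class.
# File and column names are hypothetical; features are assumed numeric.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

real_df = pd.read_csv("ad_clicks.csv")            # hypothetical labeled dataset
synth_df = pd.read_csv("synthetic_minority.csv")  # synthetic records for the weak category

train_df, test_df = train_test_split(real_df, test_size=0.2, random_state=42)

def fit_and_score(train: pd.DataFrame) -> float:
    X, y = train.drop(columns=["label"]), train["label"]
    clf = RandomForestClassifier(random_state=42).fit(X, y)
    preds = clf.predict(test_df.drop(columns=["label"]))
    return f1_score(test_df["label"], preds, average="macro")

baseline = fit_and_score(train_df)
augmented = fit_and_score(pd.concat([train_df, synth_df], ignore_index=True))
print(f"baseline macro-F1: {baseline:.3f}  augmented macro-F1: {augmented:.3f}")
```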
(06:31)
A little about Gretel: we are about 65 people across the United States and Canada. One of the neat things about Gretel, and I think what has really helped us advance so quickly, is that we have a completely free tier. Any developer, any data scientist can use this; you get a certain amount of credits every month. And I think what that's really translated into is a lot of experimentation that's been possible for developers on our platform. As you can see here, we've got a little over a hundred thousand users that have signed up to use Gretel and run a model, and a lot of SDK downloads. We build our own core machine learning and generative AI models at Gretel. That's another area where we've really advanced: working with tabular data at scale, for example, which makes up such a large percentage of data across the business.
(07:22)
For most businesses, about 85% of all data is in some sort of tabular format, so it's in a database, a data warehouse, something like that, and across different modalities. The different modalities we support at Gretel include tabular data. It includes time series data, which is a component of tabular data where it's important to maintain the time constraints. Examples where that's really important: if you are looking at the progression of a disease, for example in a rare disease dataset, or if you are working in fintech or financial services and you are looking at forecasting, the time component as you go through the data is really important. And text as well: often you're working with natural language text, or you have things like patient medical records that might have an initial diagnosis from a doctor alongside categorical data, like what time you came in.
(08:17)
Gretel's models out of the box are multimodal, and they support any combination of these attributes, whether it's numeric, time series, categorical, or text inside of your data. So really exciting stuff here. The underlying technologies that we use, as I mentioned earlier, range from pre-trained transformers, similar technology to GPT-4, except that uniquely our models have been trained on millions of industry-specific machine learning and analytics datasets. What this has resulted in is a model that you can interact with via a natural language prompt. You can tell it you want to add more records like something, or alter a distribution inside of your data, with no data science knowledge required. And this model has seen good data, it's seen industry-specific machine learning formats, and it's able to augment or improve data in a really meaningful way. So really exciting stuff there. We also use models that we built from the ground up, including GANs and diffusion models, that are really good at working with high-dimensional data.
(09:23)
So when you have a dataset all the way up to the scale of even the genomics data that we've worked with with some of our partners, where you have hundreds of thousands of columns by millions of examples, the Gretel models are built to scale up to handle that, whatever scale you have inside your team, your organization, or your business. Next slide. A little on how you interact with Gretel. I mentioned our free tier earlier; I would encourage everybody on the call to jump over to gretel.ai, click sign up or sign in, and start using the free tier. On the left here you see (it's kind of hard to read with the small font) our set of use case cards. The most popular use cases that we see for Gretel, we've really codified into this low-code, no-code interface that allows you to take a dataset of your own, or even just prompt a model, and use it to generate a dataset, so you can get started with synthetic data without having to dive in and learn an SDK or learn how to code up a workflow.
(10:21)
The other mode here, and what we see as most popular with developers or organizations using synthetic data at scale, is our APIs. You can think of Gretel as a data-in, data-out platform. It can take data as an input, or it can take a prompt as an input, and you can hit our cloud APIs or deploy Gretel services directly into your own cloud for compliance-controlled environments. And you see these core APIs: we have the ability to transform or mutate data if you want to add columns, delete columns, hash or encrypt a value, or replace PII, for example. That's one of the core APIs. Synthetics here are the core set of synthetic generative models that we support for tabular, text, and time series data. And finally, Evaluate. Such an important part, and something we'll show in the demo, is the ability to look at the synthetic model you've trained on your data and to quantify how well it works, both generally, like how well it learned the distribution of the dataset, but also how well it works for your particular use case. If I'm trying to improve my ad recommendation system, my financial forecasting model, or sentiment detection for an LLM that I train to work across my platform, I want to know that the synthetic data model I've trained, and the examples generated from that model, are high quality.
(11:48)
They're not hallucinated, they are things that you can generate at scale. Often with Gretel we have customers generating anywhere from hundreds of thousands to millions of examples, so you really need to be able to trust that data. This third API, the Evaluate API, which we'll cover a little bit today, allows you, within the Gretel platform or also very easily alongside other platforms like Vertex AI, SageMaker, or Databricks, to very quickly quantify how well your model is working for your particular use case and to filter out any examples that don't meet that criteria. The goal here is to give you very high-quality, trustworthy data available as an API, so once you're comfortable with it, you can automate it and have it run at scale automatically as part of your pipelines. Next slide. So on this slide I'll ask Sammy to jump in and talk a little bit about our roadmap, both what we're building today and what you can look forward to seeing in the near future.
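As a simplified stand-in for the kind of check the Evaluate API performs, here is a quick per-column distribution comparison you could run yourself. This is not Gretel's scoring method, just an illustrative sanity check; the file names are hypothetical.

```python
# Quick sanity check: compare per-column distributions between real and
# synthetic data and flag large drifts. Not Gretel's Evaluate scoring,
# just an illustrative stand-in. File names are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

real_df = pd.read_csv("real.csv")
synth_df = pd.read_csv("synthetic.csv")

for col in real_df.select_dtypes("number").columns:
    stat, p_value = ks_2samp(real_df[col].dropna(), synth_df[col].dropna())
    flag = "OK" if stat < 0.1 else "CHECK"
    print(f"{col:<25} KS statistic={stat:.3f}  [{flag}]")
```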
Speaker 2 (12:53):
Thanks, Alex. So we have a couple of demos coming up, but before getting to those demos, I wanted to give you an idea of how we think about building product at Gretel and what our North Star is. Our North Star is to make data better in general, and we believe the best way to do that is with synthetic data. You might ask, how could synthetic data be better than real data? First, synthetic data can be safer. We have models that support differential privacy as part of training, where essentially differential privacy injects noise into the training process such that the model is able to generalize across the dataset and not learn any individual user or record characteristics. And those are mathematical guarantees. We also support the ability to train and generate data in your own environment, which means that your data never gets exposed to the outside world, not even to Gretel.
(13:48)
So you can have guarantees that nobody's seeing your real data; they're only seeing data that was generated using a differentially private model. As a result, data is a lot safer and more accessible. You don't need nearly as many access controls; you can just make the synthetic data available to everyone and promote innovation within your company. The second way synthetic data can be better than real data is in terms of quality. If real data were a true representation of the population, if it was collected correctly, there's no noise in the data at all, and it's a perfect sample, then great, it's hard to beat that. But in the real world we know that that is not the case. Often samples are biased, they are point in time whereas the real world is changing, there's missing data, there's low-quality data, and what we've seen is that synthetic data can be significantly higher quality than real data.
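A toy illustration of the noise injection Sammy mentions: in DP-SGD-style training, per-example gradients are clipped and Gaussian noise is added before each update, so no single record can dominate what the model learns. This is a conceptual numpy sketch of one step, not Gretel's implementation.

```python
# Conceptual DP-SGD-style update step: clip each example's gradient, then add
# calibrated Gaussian noise before averaging. Toy numpy sketch only.
import numpy as np

def dp_gradient_step(per_example_grads: np.ndarray,
                     clip_norm: float,
                     noise_multiplier: float) -> np.ndarray:
    # Bound each example's influence by clipping its gradient norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Sum the clipped gradients and add Gaussian noise scaled to the clip norm.
    noisy_sum = clipped.sum(axis=0) + np.random.normal(
        scale=noise_multiplier * clip_norm, size=clipped.shape[1]
    )
    return noisy_sum / len(per_example_grads)

grads = np.random.randn(32, 10)  # 32 examples, 10 parameters (toy data)
print(dp_gradient_step(grads, clip_norm=1.0, noise_multiplier=1.1))
```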
(14:41)
So for example, you can fill in missing fields, you can simulate rare events. Let's say you're a bank where you just discovered a new type of fraud that started last week, you don't have enough real-world records of this type of fraud, and you're training AI models to detect fraud. Well, with synthetic data you don't need to wait until the fraud has grown and you have a lot more real-world examples; you can actually simulate such examples with synthetic data and have your AI models catch up a lot faster. You can also correct distribution bias. Again, real-world data is point in time and often not a truly representative sample of the population, which you can fix with synthetic data. We see this a lot with our customers, but as an example of synthetic data being higher quality in the wild, I recommend you look at the "Textbooks Are All You Need"
(15:28)
paper that came out recently, where they trained a model that is effectively five times smaller than Llama 2 on seven times less data, except instead of using real data they used synthetic data, and they achieved comparable performance to Llama 2. So even a much smaller machine learning model trained on far less data can have better performance than a larger model trained on a lot more real-world data, if the data is carefully chosen, which is something that synthetic data enables you to do. The last couple of ways synthetic data can be better than real data: with synthetic data you can ask for, and essentially get, exactly the data that you need. With real-world data you need to find a dataset; maybe the dataset doesn't have exactly all the fields that you need, and you need to do some SQL joins with some other datasets.
(16:11)
For example, if we're talking tabular data, you might still not get exactly what you need, whereas with synthetic data, and what I'm about to demo today, you can actually write a natural language prompt or a schema that describes exactly the data that you want and get that data in seconds. And lastly, as a matter of convenience and scale, with synthetic data you can get data exactly where and when you need it. My colleague Grace is going to be demoing workflows and connectors, which allow you to generate synthetic data, write it to your data warehouse, write it to your database, do it on a schedule, and essentially operationalize it with a few clicks without having to worry about infrastructure. So with that in mind, I will jump into demoing the "exactly the data you need, in seconds" component, which I'm really excited about. This is using our Model Playground and our new Tabular LLM model. Just bear with me for a second while I present.
(17:25)
All right, so here we're looking at our Gretel console, our no-code interface, which Alex was describing earlier. You can see here our blueprints, which allow you to spin up batch Gretel jobs, training on your real-world data and generating synthetic data. The new thing, which you might not have seen yet, is our Model Playground, which is in beta today. We currently have two model playgrounds, one for natural language and one for structured data using Tabular LLM. Everyone has access to natural language, and we're rolling out structured data access using Tabular LLM to early access users. I will start with a natural language demo here. The difference between Model Playground and our existing workflow is that Model Playground is designed to give you almost instantaneous results, so that you can tweak your prompt and make sure you get the data that you want before you spin up a batch job.
(18:19)
So as an example, let's do a question-answering example for natural language. Here we're asking the Gretel GPT model, which has a choice of models under the hood; in this particular example, we're using the Llama 2 7-billion-parameter chat model. We're asking: what is synthetic data and why is it useful? We'll ask the model to generate an answer here, and you can also tweak model parameters like temperature and the maximum length of the output that you want. And here we got an answer: synthetic data is great for data scarcity, data privacy, data augmentation, it's cost effective, and it's flexible. Great. So if I like this answer, I can just download this result, or I can generate a hundred or a thousand or 10,000 or a hundred thousand different answers for this question, by essentially increasing the temperature and asking the model to generate more data, which will take us to a batch job workflow.
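For readers unfamiliar with the temperature knob mentioned here, it rescales the model's token logits before sampling: higher temperature flattens the distribution and produces more varied outputs, lower temperature makes outputs more deterministic. A toy numpy illustration with made-up logits:

```python
# Toy illustration of temperature sampling over made-up token logits.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())   # softmax, numerically stable
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5, 0.1])
print("temperature 0.2:", [sample_with_temperature(logits, 0.2) for _ in range(10)])
print("temperature 1.5:", [sample_with_temperature(logits, 1.5) for _ in range(10)])
```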
(19:13)
I will demonstrate that with structured data later on. One really neat use case that we see here is actually asking the model to generate a prompt for our Tabular LLM model, which then generates tabular data as a result. So let's see here: create a prompt that I can use to generate a realistic consumer packaged goods dataset. Let's see what we get, and then I'm going to try that prompt in the Tabular LLM structured data playground. You can also use it for text generation, creative writing, idea generation, basically anything you want; a lot of you might be familiar with natural-language-style models such as ChatGPT. And let's see. So here's a prompt that we can use to generate a very realistic consumer packaged goods dataset. I'll go here to create structured data, and I'll make it a little shorter.
(20:10)
Let's see. So we're asking the model to generate a few examples here: household cleaning supplies, personal care items. And you can see here the model is giving us a dataset that has a bunch of these examples. This is not quite the format I was looking for, so I could tweak this prompt and ask the model to do that, but I'm just going to use an example prompt here instead. In this case we're giving the model exactly the columns that we want and what we expect them to contain. So let me stop the generation and then try again.
(20:48)
You can see here you start getting results almost instantaneously. Great. So I'm seeing ASIN, title, description, brand, shipping weight, essentially everything I was looking for, and it's pretty realistic: 0.6 pounds for a pair of shoes, 0.4 pounds for headphones. You can see that this is not fake data; this is actually high-quality data where the values in the different columns are correlated, almost as if you had collected real-world data, because it is based on training a model on large-scale public datasets. So let's do one more example here. Let's say we want an example that has users from a company called Foo that's based in France, and we're specifying that we want the names to be traditional French names, we want the email to be first name dot last name, we want gender, we want a city in France, and we want the country to always be France. Let's see what we get.
(21:46)
Alright, you can see traditional French names and the email format we asked for, we have gender, the city is in France, and those are real French cities, not fake French-sounding cities. So again, you can see how well the model maintains correlations here. Let's see, I'm looking to add another column here: let's say I'd like the hobbies of these users, and let's say the hobbies must be skiing or soccer. And let's try this again. As you can see, you can keep tweaking your prompt until you get something you're happy with. Let's see, here we've got hobbies and they're all skiing or soccer. Great. So now that I'm happy with this output, I'd like to generate a lot more records. I'm going to go to the generate more data button here, which allows you to spin up a Gretel batch job. I will choose my demo project, I will ask for, let's say, 500 records, it shows me a reminder of the prompt, and then I click on generate. So this spun up a batch job, and in a few minutes I will have the results ready to download. Grace will demo next our connectors and workflows, where you'll essentially be able to not just download the results, but actually connect to a destination such as a data warehouse or database and write your results directly there.
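The exact prompt from the demo isn't reproduced in this transcript, but a column-by-column prompt of the kind Sammy describes might look like the following illustrative example, which you could paste into the structured-data playground or send through the SDK.

```python
# Illustrative column-by-column prompt for structured data generation.
# The wording, company name, and constraints are hypothetical examples.
prompt = """
Generate a dataset of users for a company called Foo, based in France, with columns:
  first_name: a traditional French first name
  last_name: a traditional French last name
  email: first_name.last_name@foo.fr
  gender: Male or Female
  city: a real city in France
  country: always France
  hobby: either skiing or soccer
"""
print(prompt)
```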
(23:20)
Over to you, Grace. One thing to mention while Grace comes online: what I showed you here is our no-code interface, but you also have the ability to do this through our SDK. We also have some additional functionality in the SDK that we haven't implemented yet in the no-code interface, such as the ability to edit or augment datasets, and that will be coming to our console very soon. All yours, Grace. Thank you.
Speaker 3 (23:49):
Thanks Sammy. So as Sammy mentioned, I'm going to walk through some of the features designed to help you operationalize synthetic data. This quarter we added a number of connectors to popular data sources and we also introduced Gretel workflows. Workflows allow you to automate the generation of synthetic data so you can integrate synthetics directly into your data pipeline. So we're going to take a look at what that looks like right now with a demo. So here we have the Gretel console, which Sammy was just showing us. He showed us things in the model playground, and now we're going to take a look at connectors and workflows.
(24:43)
So this quarter we introduced a number of different connection types. We have support for S3, GCS, MySQL, Postgres, Microsoft SQL Server, and Snowflake, and we are adding Azure Blob, Oracle Database, and BigQuery in the next month and a half, before the end of the year. What I wanted to show you today is how to create a workflow that connects to a MySQL database, synthesizes that database, and then writes the output to a second database. So the first thing we do is click new workflow. I'm going to choose an existing demo day project for this workflow. Next, I choose my model for my use case. You can choose the model that fits your needs best, and workflows also support chaining multiple models together for a more advanced workflow, like making sensitive data GDPR compliant. For this example, I'm going to use our synthetic ACTGAN model.
(26:00)
Now I'm going to define my input data. I created connectors ahead of time for this demo, but if at this point you needed to create a connection to your data, you could do so here and follow the steps. Today we're using a MySQL database, and this is what the connection creation would look like. We take very standard parameters: your username and password credentials, hostname, port, your database name, and then we also have optional JDBC URL parameters for more advanced configuration. And if you don't see the data source that you need for your use case, please reach out to us; we're always interested to hear about your needs and are always looking to add more data sources to better serve you. So for today, we're going to choose a MySQL input database. This is a telecom database that has five tables with various information about clients: their accounts, their locations, their subscriptions, invoices, things like that. Another thing we offer is choosing what input data to take from that source. You can select the entire database, you can take just one or two tables from the database, or you can pick exactly what data you want to extract using a SQL query.
(27:31)
For this example, I'm going to take my entire database, which is five tables. The connection has been tested; it's acknowledging that it's located five tables and extracted their relationships, because this is a relational database. So we'll click continue. Next, I can choose where to store my output data. I can choose to exclusively keep it in the Gretel Cloud and download it from the Gretel console when I need it, or maybe I want to connect to an external data source and write it directly to another MySQL database. This is really helpful because it means that your data is exactly where you need it; there are no extra steps required, you don't have to download and re-upload, and I love anything that makes life simpler. So now that we've chosen our external data source, to write to a database, we're going to click continue, and now we're going to set a schedule. One of the great things about workflows is that you can schedule them to run automatically. You might choose to schedule your workflow to run on a daily, weekly, monthly, or maybe even a more custom basis.
(28:44)
A lot of the feedback we get from customers about scheduling is how valuable that is for them because they want fresh data on let's say a weekly basis, and they don't want to have to worry about running batch jobs or manually clicking a run button. They want the synthetic data where they need it, when they need it. And with scheduling your data is always where you need it and fresh. So for now, I'm going to choose to just run the workflow now, and what's nice is we can always edit it later to set a schedule.
(29:20)
So with those five steps, Gretel has automatically generated our workflow YAML config. This config contains the various actions that make up the workflow. You'll see we have a read action here, we have our model train and run, where it's going to train a synthetic database using our tabular ACTGAN model, and then we're going to write it back to our destination. Now all I have to do is click run. And of course, for users who prefer our SDK and CLI interfaces, or maybe have more advanced configurations that they'd like to make to their workflow, our workflows are editable: you can edit the config in the console or in your preferred environment and upload the YAML file. But for users who prefer a low-to-no-code experience, the Gretel console is a great way to generate your synthetic data without needing to have that coding ability. When a workflow starts running, we'll see in our activities tab that the different actions pop up as they start to run. So here we have our read action, and we can look at the logs to see exactly what's happening. We can see that it has extracted the five tables from my database, and now that the read is complete and it's extracted that data, in a minute we'll see the model training and generation start.
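The generated config Grace describes is YAML in the console; its rough shape (a read action, a model train-and-run action, and a write action) is sketched below as a Python dict. The action types and field names here are approximations for illustration, not the exact Gretel workflow schema.

```python
# Rough shape of a read -> train/generate -> write workflow config.
# Action types and field names are illustrative approximations only.
workflow = {
    "name": "synthesize-telecom-db",
    "actions": [
        {
            "name": "read-db",
            "type": "mysql_source",             # assumed action type
            "connection": "telecom-input",      # connection created ahead of time
        },
        {
            "name": "train-and-generate",
            "type": "gretel_tabular",           # assumed action type
            "input": "read-db",
            "model": "synthetics/tabular-actgan",
        },
        {
            "name": "write-db",
            "type": "mysql_destination",        # assumed action type
            "input": "train-and-generate",
            "connection": "telecom-synthetic-output",
        },
    ],
}
```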
(30:58)
So this database will take about 15 to 20 minutes to train and generate, so I actually ran a workflow ahead of time so that we can look at what the results look like at the end. This workflow used the exact same inputs, the same parameters, the same data, and we can see that it trained and generated the data, and we got status updates every 60 seconds, so you always know what's happening. And then it wrote the output to our database. And of course you don't need to take my word for it, so let's look at my MySQL database. This is our input database, which was telecom, and I queried all of the records from the client table, which was one of our five tables. As you can see here, we've got an ID, we've got a social security number, and first and last names.
(31:57)
Now we can switch over and look at what that same data looks like in the demo synthetic telecom database, which is where I wrote our output. I'm just going to run this query, and we'll see that it finishes with 10,001 records, which was the number of records that I asked it to generate, matching the size of the input database. We've got our client IDs, we've got social security numbers that match the correct format, and we have names, but none of that data is the real-world data. We've removed the privacy concerns that come with having real people's names and real people's social security numbers. So now I'm going to hand it back over to Alex, who's going to demo some exciting things to come. Alex?
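A quick way to repeat Grace's spot check programmatically is to compare row counts between the source and synthetic databases and confirm the synthetic values keep the expected format without overlapping real values. The connection URLs, table name, and column names below are hypothetical.

```python
# Spot check: row counts match, SSNs keep their format, and synthetic SSNs
# don't overlap the real ones. URLs, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

real = pd.read_sql("SELECT * FROM client",
                   create_engine("mysql+pymysql://user:pw@host/telecom"))
synth = pd.read_sql("SELECT * FROM client",
                    create_engine("mysql+pymysql://user:pw@host/demo_synthetic_telecom"))

print(len(real), len(synth))                                  # row counts
print(synth["ssn"].str.match(r"^\d{3}-\d{2}-\d{4}$").all())   # format preserved
print(synth["ssn"].isin(set(real["ssn"])).sum())              # overlap with real SSNs (want 0)
```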
Speaker 4 (33:13):
Alex, you're muted. Sorry. Hey, Alex, you're muted. If you could unmute yourself please.
Speaker 1 (33:45):
Apologies there. So today we are going to walk through a slightly more advanced use case and really talk about how synthetic data, particularly privacy-guaranteed, anonymized synthetic data, fits into some of the emerging architectures and constructs that organizations are using with LLMs, essentially using an LLM to grant access to your data. And here we're also going to cover why synthetic data is such an important part of the generative AI stack or architecture you would want for your business. What we're looking at here is a diagram. One of the most promising things that we've seen from generative AI so far is the ability to use natural language prompts, as Sammy demonstrated earlier, to generate or interact with data at the business or organization level. The idea of essentially creating a natural language interface to your data warehouses or your databases is an incredibly powerful proposition.
(34:47)
It doesn't require your users to learn how to write SQL, that relatively complex language that we all use to interact with databases, and it could allow anybody in any business unit, for example, to query your organization's data, query a bunch of different databases or data warehouses, and get a single consolidated natural language response back. So, incredibly powerful. One of the things that we've seen our users and our customers run into very quickly is the privacy concerns and constraints that arise even when you're interacting with data that's been de-identified. So you've taken the PII out of your database, but what we'll show today is how a combination of attributes, for example your zip code, your age, or things like that, can become identifying, and that often ends up unfortunately being a blocker to enabling the enterprise-wide access to data that so many people love.
(35:41)
Synthetic data gives you a really great alternative to that, where essentially what we see our customers doing today, and what we're going to walk through, is the ability to create a synthetic data lake. So you have your real-world data, you have your real-world database, and as Grace demonstrated, we create a synthetic version of it as well that has privacy constraints. For most users, especially those that are just looking to learn from data or to train a model on data, you don't need access to real user data. What we're going to demonstrate here is, on one hand, the real-world database and, on the other hand, a synthetic database that we created, which we'll query using an LLM. In this case we'll be using OpenAI or PaLM, and then what's called an agent. An agent is a new construct that essentially allows LLMs to get access to a set of tools.
(36:32)
So I think we've all seen, for example, LLMs that are capable of browsing the internet and bringing back results. In this case, a SQL agent is an LLM that knows how to interact with a database, so it can take a natural language query, translate that into a query that goes across your data, and bring back results. So with that, let's go ahead and jump right in and start looking at this demo. For this example today we are going to take a pretty classic machine learning dataset, which is called the IBM HR Employee Attrition dataset. This is the dataset; let's go ahead and look at what the format of the dataset looks like. It's a classic machine learning dataset used to predict whether employees are going to leave the business or not. So I'll go ahead and pull back an example I've trained. What we just saw is that I've trained a couple of different versions of Gretel models, with different levels of privacy, on the data.
(37:24)
Right now we're looking at one that was trained with differential privacy. This is an actual mathematical guarantee of data privacy. Here is what the real-world data looks like. We can see a bunch of categories here: we have ages, whether the employee left, what department they were in, what level of college or high school education they have, monthly income, things like that, but it's been de-identified, so there's no direct PII inside of here anymore. It's a really powerful dataset. In this next step, we're going to move over to a notebook, and we are going to grant a SQL agent, using an LLM, access to this data so we can ask it questions, and we're going to highlight some of the risks that come in from a privacy perspective and why synthetic data can be such an important part of this architecture.
(38:14)
Going right up here to the top, we have a notebook. We have a blog post, which we'll link to as well, so you can run all of this yourself. What we're doing is taking two versions of a database. One is the IBM attrition dataset here, which we're going to convert over to a database. The second is a synthetic version of it that we're going to create with a really strong privacy value. In differential privacy, an epsilon of five, or really as you get towards one, you're starting to get into the formal mathematical guarantees of privacy, where it is highly, highly unlikely anyone could ever recover any private data. So this is for when you need those really strong guarantees of privacy. We'll go ahead and click this; we're going to generate our database here. The next thing we're going to do is write some basic code, which is what we're doing here, that creates an agent.
(39:04)
Essentially, for this example, we're going to use a version of ChatGPT that is aware of tools, so it's able to use these types of tools and toolkits, and we are creating an agent specifically that has access to SQL databases, so it knows how to query a SQL database. This is really cool: you're going to see not just one LLM call, but a sequence or chain of LLM calls going to the database, where it takes a prompt from the user and figures out how to query a dataset and bring back the result. So we've gone ahead and created our agent here, and I created a helper. What it's going to do is take our synthetic, provably private machine learning dataset, take our real-world dataset, and take a prompt that we enter, and when I click this button, it's going to fire off an agent, and the agent will start querying the database and trying to figure out how to answer this user question.
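A condensed version of that agent setup, using the LangChain APIs as they existed around the time of this demo (module paths have since moved in newer LangChain releases). The database URIs are hypothetical; the same agent construction is pointed at both the real and the synthetic database.

```python
# SQL agent sketch: an LLM with a SQL toolkit that can inspect tables, write
# queries, and answer natural language questions. LangChain circa 2023.
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.chat_models import ChatOpenAI
from langchain.sql_database import SQLDatabase

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def make_agent(db_uri: str):
    db = SQLDatabase.from_uri(db_uri)                 # hypothetical database URI
    toolkit = SQLDatabaseToolkit(db=db, llm=llm)
    return create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

real_agent = make_agent("sqlite:///attrition_real.db")
synth_agent = make_agent("sqlite:///attrition_synthetic.db")

question = "Which three departments have the highest attrition?"
print(synth_agent.run(question))
print(real_agent.run(question))
```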
(40:00)
What we're going to look for here is very similar responses between the synthetic database and the real-world database, indicating that yes, for your use case, when you're training a model on this dataset or creating a dashboard off this dataset, the synthetic data is going to give you very similar, but not exactly the same, results as the real-world data. I turned on the verbose logging mode so I can walk you through what's happening with the LLM. So bear with me as we scroll through here. Here we can see the LLM firing up. It knows it has access to a SQL database, so the first thing it's going to do is list the different tables that exist inside the database and then describe them. So first, the LLM is going through here and it's found our synthetic IBM attrition dataset.
(40:46)
It's describing all the different fields that exist inside there, it's sampling some of the values, and it's going to use this to try to figure out how to answer that question. Here you can see it figured this out pretty quickly: it created a custom SQL query to answer which three departments have the highest attrition, and then it returned the results. Next, it ran on the real-world data. Let's go ahead and compare the results across these two. Once again, what we're looking for are really similar results, but not exactly the same results, because then you'd be working with real-world data. So here we can see, in the synthetic dataset, a relatively small dataset, the distributions are pretty close. Even with the mathematical differential privacy guarantees that we applied to this data, you can see that the results here are very close.
(41:29)
R&D had 144 attritions versus 133 in the real-world dataset; Human Resources here had nine attritions versus 12 in the real-world dataset. So the distributions of these two are very similar. Here's another one: if we wanted to look at the age of employees, we're essentially asking the LLM to query this dataset and return the distribution of ages in 10-year increments, a pretty common SQL operation. We see the exact same thing: it starts off and lists the tables again, it samples the tables as you can see right here, and then figures out how to write a SQL query. As you can see, it's dividing by 10, so it's really smart: it's figuring out how to bucket the data and writing SQL to return these results. And what we're looking for, again, is a similar distribution.
(42:18)
So here we can see, once again, the synthetic data very closely matches the real-world data. It allows you to extract the same insights as the real-world data without compromising privacy. For the final example here, we're going to give you a real-world example on this dataset of why privacy is so important. We're going to do an attack, which is called a re-identification, singling out, or membership inference attack; there are lots of different terms for it. Here we know something about the dataset. Even though it's been de-identified and we removed all the PII from the dataset, the idea is that we're going to allow the employees in our business to query our data, so you want to know that if people know something, they can't put other people's privacy at risk. Here we know something: we know somebody in this dataset is age 46, we know their gender, and we know which department they work in.
(43:09)
Given that, can I query the data and learn something that I shouldn't know about that person? We suspect this person is in the dataset. We also suspect that this combination of attributes, really just the age, gender, and department, is going to be identifying by itself. So let's query the real-world data and the synthetic data. What we're hoping for here is that this combination of attributes does not exist in the synthetic data, but it will probably exist in the real-world data, and we'll see the effects of that happening. So I'm going to go ahead and run this. It's starting up the LLM again, and it's doing the same thing; it can't remember what it did previously, so every single time it has to follow the same flow: list the tables, create a SQL query to answer my question, and pull back the results.
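The singling-out query Alex describes boils down to filtering on a handful of quasi-identifiers and seeing what else comes back. Here is an illustrative version run against both databases; the table name, column names, and the specific gender and department values are assumptions, not the exact values used in the demo.

```python
# Singling-out check: filter on quasi-identifiers (age, gender, department) and
# see whether sensitive columns come back. Table/column names and the gender and
# department values are assumptions for illustration.
import pandas as pd
from sqlalchemy import create_engine

query = """
SELECT Age, Gender, Department, MonthlyIncome, PerformanceRating, YearsSinceLastPromotion
FROM attrition
WHERE Age = 46 AND Gender = 'Female' AND Department = 'Sales'
"""

for label, uri in [("real", "sqlite:///attrition_real.db"),
                   ("synthetic", "sqlite:///attrition_synthetic.db")]:
    matches = pd.read_sql(query, create_engine(uri))
    print(f"{label}: {len(matches)} match(es)")   # want 0 for the synthetic table
```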
(43:53)
And let's look at what happened here. Both of these ran. On the synthetic dataset, it created the right SQL query to return this data, but it did not return any matches, and that's because in the synthetic data that was created, there was no combination of these attributes that pointed back to a real-world person from our dataset. But in the real-world dataset, and this is where you start to see some of the challenges you have around privacy, it did find that person. Here we found the 46-year-old person, found their gender and their department, and then pulled back a bunch of information that we should not know about this person: their monthly income, their performance rating, and how many years since their last promotion. So this is an area where just turning on synthetic data and adjusting the level of privacy that you're looking for in your data can give you amazing insights into data across your business, while giving you really strong protections against the different types of privacy attacks and concerns that exist with data. With that, I think that concludes our demos for today. Thank you for participating in this with us. I think we're stepping over to the Q&A section now, so I'm going to hand it back to Meza.
Speaker 4 (45:18):
Sorry, we have a muting issue today. Hello everyone, again, thank you for staying on. There's a bunch of questions coming into the live Q&A; please continue to ask them there. We're also going to take a few of the questions that came in through the registration forms, so continue to stay engaged. We have all three of our speakers here with us for the next 15 minutes or so as we take Q&A. So for the first question that came in: there's a question asking about synthetic clinical data being useful for training or fine-tuning a model.
Speaker 1 (45:52):
This is one of the more popular use cases we see in the healthcare and life sciences space. There's tremendous value, and I think we saw this during COVID, right, the value of data when it can be made accessible to researchers to understand a disease, the progression of the disease, or even to accelerate clinical testing. The challenge you have there is privacy, right? You have compliance protections through HIPAA, for example, that are there for a reason. Synthetic data gives you a way to create a dataset that allows a model you might train, like an LLM, as we mentioned here when we're talking about fine-tuning a model, to learn about the disease without learning about patients. And I think that's one of the more powerful concepts with synthetic data: you can train a model that will get just as good at detecting a rare disease or simulating events for clinical trials and things like that, without ever risking replaying essentially private data for an actual patient in the dataset.
Speaker 4 (46:56):
Cool. Let's take another question. This is an interesting one; I know we have some cool product answers to this, but what safeguards can be put in place to ensure quality outcomes when working with synthetic data?
Speaker 1 (47:14):
Sammy, you want to take that one?
Speaker 2 (47:21):
Sure. So I think it kind of depends on the synthetic data model that you're using. For example, if you're using our Tabular LLM and simply using a prompt, basically we've done our best to train on high-quality data, and the more specific your prompt is to what you're looking for, the higher quality output you get. If you're using a model that you can train or fine-tune on your own data, then ideally your real-world data is as high quality as you can make it. We actually have an entire page on that in our documentation: you can adjust precision, you can remove blank values if possible, basically make sure your real-world data is as high quality as you can, and then you can further boost that with synthetic data by, for example, adjusting for biases in the distribution and so on.
Speaker 3 (48:20):
And one other thing we do to help you evaluate the quality of the synthetic data that you generate with Gretel is our synthetic data quality reports. And so we have different metrics like our SQS, which is our synthetic data quality score that really allows you to dive deep into how well your data captures the statistical properties of your real world data.
Speaker 2 (48:49):
Thank you, Grace. Also, just to build on what Grace mentioned, we just launched the Gretel Tuner, which is essentially a framework for doing hyperparameter tuning, which is a fancy word for adjusting the configuration of the model. You can set SQS, our quality metric, as an objective for the Gretel Tuner and essentially ask it to help you choose the best parameters to get the highest quality on your own dataset.
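For a sense of what "tuning with a quality score as the objective" looks like in code, here is a generic sketch using Optuna as a stand-in optimizer and a placeholder scoring function. It illustrates the idea only; it is not the Gretel Tuner API.

```python
# Generic hyperparameter search that maximizes a synthetic-quality score.
# Uses Optuna as a stand-in optimizer; train_and_score is a placeholder that
# fakes a score so the sketch runs. Not the Gretel Tuner API.
import optuna

def train_and_score(epochs: int, batch_size: int) -> float:
    # Stand-in for: train a synthetic model with these parameters and return
    # its quality score (e.g., an SQS-style metric on a 0-100 scale).
    return 70 + 20 * (epochs / 400) - (5 if batch_size == 1024 else 0)

def objective(trial: optuna.Trial) -> float:
    epochs = trial.suggest_int("epochs", 50, 400)
    batch_size = trial.suggest_categorical("batch_size", [256, 512, 1024])
    return train_and_score(epochs, batch_size)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```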
Speaker 4 (49:26):
Cool. Another question that came in is about the market for synthetic data. I think it's an interesting question: where do you see the market for synthetic data going? Will there be more and more data platforms, or will there be more and more specialized data/topic solutions in the next five years?
Speaker 1 (49:47):
Great question. Once again, just our opinion: synthetic data, as people are starting to realize, really creates a way for businesses to unlock the value of the data that they have. We've talked about it for internal use cases, but imagine, for example, an LLM that is trained on customer service responses, or an LLM trained on clinical trials, to use the example we saw earlier. So I'd say where I see this going is a large number of domain-specific, highly capable models that can be part of a platform and can be granted access to, or even monetized by, businesses. You could quickly ask, for example, a leading healthcare and life sciences research company to help augment your data based on a model that they've fine-tuned on synthetic data. A long answer to that question, but I think the idea of a bunch of customized models trained on synthetic data, or synthetic data made available for domain-specific use cases, is a really powerful concept that we're starting to see and hear a lot about from customers.
Speaker 4 (51:08):
Cool. Another interesting question that just came in from the live chat: do you have methods for detecting biases in the distribution of natural language text data, and I think just biases in general in the data? I think it's an interesting topic.
Speaker 1 (51:21):
And a hard one at that, right? I think this is hitting on one of the hottest topics in machine learning today, which is alignment and capturing unwanted biases that a model might learn from its training data and replay. We showed some examples earlier of our synthetic quality score report; we have one for text data and for tabular data as well. What the SQS really does is show how well the model managed to learn the distribution of the original data. So if you trained it to capture everything that exists in the original data, including its biases, we essentially validate the model's ability to have learned that. You can then use synthetic data as a tool to adjust those biases. If you want to go after unwanted biases in your data, for example against protected classes or things like that, you can use the synthetic model to increase the representation of minority classes in the data, or even adjust skew in the data. We don't have a tool that automates that, so when it comes to addressing unwanted biases in data, I would really view synthetic data as a tool you can use, unfortunately not a magic button to solve the problem. There are a lot of nuances that require human attention. We've got several examples and blogs we can share, but view synthetic data as a tool, unfortunately not a complete answer.
Speaker 4 (52:54):
Cool. I think we have time for two more questions, so I'll put up one more and then we'll take the last. I'm sorry, there are a lot of questions, and if we don't get to yours, there's a Discord community you can join via gretel.ai; Alex, Grace, and Sammy all monitor that, so you can get answers after this as well. But this question I thought was interesting; it's use case driven: are there examples of how synthetic data can tackle "fake news," or help create better statistical modeling around "what if" scenarios on geographical statistics of large issues, big data?
Speaker 1 (53:36):
Great question. Unpacking that a little bit, we have two sub-questions in there: A, can we use synthetic data to better identify fake news, and B, can we use it to create what-if scenarios? You're talking about a very specific use case, but also just generally against large datasets. I'll answer the second one first because it's a little more straightforward. In all of our models, whether you're working with tabular, text, or time series data, we have something called conditional data generation, and what this allows you to do is prompt the model and ask it to generate more data that matches a certain distribution. You could ask for a certain day of the week and a certain class of things: if you have an e-commerce dataset, get me more sales activity on the one day of the year that happens to be the biggest shopping day in my country, and generate a dataset for that.
(54:32)
In the financial world, we see a lot of this, where you're creating black swan events, rare events that rarely exist in real-world data but that you want to be able to detect effectively with your machine learning, so you can prompt the model to create these distributions for you. It's a really popular use case that we see both for our pre-trained models and for our other models that are trained directly on the data. On tackling fake news, I think we'd have to unpack that a little bit more. You could compare a dataset against a model that's trained on synthetic data to ask how well it matches the distribution, and even potentially use that as a way to detect fake news. For example, with natural language text, if you knew what fake data looked like, you could use techniques like this to build a classifier. But I think that fake news, generally as a category, is a slightly separate use case than what we're building with synthetic data.
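As a sketch of the conditional ("seeded") generation Alex describes, the pattern with the Gretel client is roughly to fix the values of a few seed columns and ask a previously trained synthetics model to fill in the rest. The project name, model ID, column names, and values below are hypothetical, and this assumes the model was configured with those columns as seed columns.

```python
# Conditional generation sketch: seed a few column values (e.g., a rare event)
# and have a trained model fill in the remaining columns. Names, IDs, and
# values are hypothetical; assumes the model was trained with seed columns.
import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

configure_session(api_key="prompt")
project = create_or_get_unique_project(name="demo-synthetics")
model = project.get_model("model-id-from-earlier-training")  # previously trained model

seeds = pd.DataFrame({
    "event_date": ["2023-11-24"] * 500,   # the big shopping day we want more of
    "event_type": ["purchase"] * 500,
})

record_handler = model.create_record_handler_obj(
    data_source=seeds,
    params={"num_records": len(seeds)},
)
record_handler.submit_cloud()
poll(record_handler)

conditioned_df = pd.read_csv(
    record_handler.get_artifact_link("data"), compression="gzip"
)
print(conditioned_df.head())
```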
Speaker 4 (55:33):
Okay, cool. So, final question and a perfect segue: we have got a lot of questions in the chat, in Discord, and other places about Tabular LLM. If we could tackle: what is Tabular LLM, and when will it be ready for early access? I don't know if you have any exciting announcements to make with that, but I'm going to put that question up.
Speaker 1 (55:54):
I do. Tabular LLM, I think, is one of the more exciting things that we've had as a research project at Gretel. Since the very beginning of working with synthetic data, the question has been, and I think everyone grasps this, as Sammy was pointing out earlier, that we all have datasets, all of our datasets are limited in some way, and what if you could apply global knowledge and global context to your data? Tabular LLM is meant to address that use case. Essentially, it is the first example that we're aware of, there's been some cool academic research, but the first commercial example of an LLM application that is designed from the ground up to help you work with tabular data. Use cases for Tabular LLM include creating data from scratch from a natural language prompt, like what we demonstrated, and editing a dataset, for example if you wanted to add a new column. We've got some great sample videos and demos we can share here.
(56:45)
If you have a dataset inside of your data warehouse and you want to add a new column to it, say you've got customer product reviews inside of your warehouse and you want to add a number of stars that didn't exist in your original data, or you want to add a sentiment, it's as simple as asking the model to read each record inside the data and apply a new column, which is the sentiment or the number of stars of the review, or even to translate it to another language. So it's a really simple way to interact with your data at scale. And finally, dataset augmentation: for the times when you don't have enough of the right data, let's say you are trying to train a sentiment analysis classifier on data going across your platform and you don't have a lot of examples of toxic language, for example, or you're trying to adjust some sort of distribution, you can ask the model to create more data.
(57:35)
Essentially, you ask Tabular LLM to create more data that you can use to augment the machine learning training set you're using to train an LLM. In this case, it will create new examples in the exact same format you provided, which will help you create better training sets for your models. Tabular LLM launched to early access last week, so I'm thrilled that we've got several hundred users on the platform right now, and we're adding more and more, really using it to gather feedback. One of the things we ask of early access users, and anyone can sign up on gretel.ai for Tabular LLM (we'll send a link), is: please jump in, give us your feedback, share your use cases, and help us build the best service and application we can. But we're super excited about it. Right now it is available, we're letting users in off the waitlist, so please sign up, jump in, and we ought to be able to let you in within a few weeks.
Speaker 4 (58:33):
With that, I just wanted to conclude. Thank you everyone for joining, and thank you, Alex, Sammy, and Grace, for this great presentation. We hope to see you all at future Gretel events. Thank you everyone.
Speaker 1 (58:46):
Thank you.