Synthesize Relational Databases with Gretel
Video description
A deep dive on synthesizing relational databases with Gretel
Transcription
Speaker 1 (00:08):
Hi everyone. Thanks so much for joining us today in this webinar about synthesizing relational databases. My name is Murtaza. I work in product marketing here at Gretel. Super excited for you all to join. We have a great agenda today. I'm joined by Alex Watson, co-founder and CPO of Gretel, and Grace King, relational product lead. Super excited for them to run the show. Just setting expectations here: we're going to have a brief introduction to relational databases and why they have historically been so difficult to work with. Grace and Alex are going to work through a couple of examples and walk you through a notebook, and at the end we have a ton of time for Q&A. So please do continue to comment if you have any questions, and there are a number of questions that have come in through the forms. Thank you all for submitting them. We'll have time at the end to address each one of them individually, and without further ado, I'm going to hand it over to Grace and Alex.
Speaker 2 (01:09):
Thanks, Murtaza, for connecting us here. I'm Alex Watson. I am the co-founder and chief product officer here at Gretel. As Murtaza mentioned, today we're going to spend a little bit of time on a super fast overview of what generative AI is and how it relates to synthetic data, and then we're going to dive right into working with relational datasets, and Grace will answer the tough questions that we have coming from the audience. Grace, you want to do a quick introduction?
Speaker 3 (01:36):
Yeah. Hi folks. I'm Grace. I work in product at Gretel.
Speaker 2 (01:41):
So starting out from a thousand feet, talking about generative AI. One of the questions that we get all the time, and that we've really heard in the past couple of months in particular, has been: how do I leverage generative AI within my business? And there are so many opportunities here. It's such a transformational shift in the area, so it's a really exciting space. Where Gretel focuses is the creation of synthetic data, and that's really the topic we're going to be talking about today. So whether you're working with tabular data, text, time series data, or images, really any type of data that you would use for machine learning or analytics purposes, that's what we focus on here at Gretel. An important part, and I think something critical to how this works well in the enterprise scenario, is what we refer to as the last mile.
(02:31)
That last part of taking one of these incredible generative models, like what we offer, or OpenAI and ChatGPT and things like that, and making it work for your enterprise data. One of the hardest problems is: how do you get that model to learn about your enterprise data without learning anything that it shouldn't? So privacy is such an important construct there. So here is a quick overview that you can see on the slides of the different modalities that we support currently with Gretel, all available via a simple developer interface or via our console, which is a no-code interface. You really see here the reflection that there's no one model to rule them all. Each different type of model, whether you're working with a GAN or a diffusion model or a language model, has different characteristics and different advantages. So we offer a suite of models that allow customers to pick the best model for their particular use case.
(03:27)
There's a lot of text here, there's a lot of stuff that you can read about, but really I think the TLDR is that it's very simple today to use Gretel to work with tabular data or text or time series, whatever type of data you have. And really our vision is that while these models will always be here, most people on our platform, most developers and data scientists that are using our platform, shouldn't even have to know or care or spend the time learning which model to work with. We are increasingly automating the selection of these models and the selection of the hyperparameters, and really giving you the most accurate and private data that we possibly can. With that, let's jump right into talking about relational data. So Grace, maybe we'll kick off here with some high-level questions that we can use to frame this for our audience. Today we're talking about relational databases, and a question for you is: how does generative AI create synthetic versions of relational databases, and what techniques do we use to do that?
Speaker 3 (04:31):
So synthesizing relational databases is super interesting. Unlike single-table tabular synthetics, you have to maintain the correlations both within a table and also across tables, because these tables are related to each other. So we have five available models and use a combination of key frequency preservation, conditional generation, and model seeding, so that the synthetic databases maintain the statistical quality that you would see in your original database while also adding the levels of privacy that are inherent with synthetics.
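For readers who want a concrete picture of the key-frequency idea Grace mentions, here is a rough conceptual sketch in plain pandas. It is not Gretel's internal implementation, just an illustration of preserving how often each parent key appears in a child table while swapping in brand-new key values:

```python
import pandas as pd

# Original child table: each row references a parent record via "client_id".
orig_child = pd.DataFrame({"client_id": [1, 1, 1, 2, 3, 3]})

# Learn how many child rows each parent key has in the real data.
freq = orig_child["client_id"].value_counts()   # client 1 -> 3 rows, 3 -> 2, 2 -> 1

# When building the synthetic child table, reuse that frequency profile with
# new, fake parent keys, so the one-to-many "shape" of the relationship is
# preserved without reusing any real key values.
synthetic_parent_keys = [101, 102, 103]
seeded_fk_column = [
    key
    for key, count in zip(synthetic_parent_keys, freq.values)
    for _ in range(count)
]
print(seeded_fk_column)  # [101, 101, 101, 102, 102, 103]
```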
Speaker 2 (05:18):
That's really great, and I think that mention of privacy and quality leads into a question around what are some of the real-world use cases that we see from customers. Why would you want to synthesize a database using Gretel? Why not just copy your production data?
Speaker 3 (05:36):
I think there are so many use cases, but to name one, a lot of times databases can have sensitive data, maybe things that you wouldn't want to share in lower environments, but lower environment teams might need that data for analysis. And so you could subset your database and share a private and smaller version with a development team. Can you think of any others, Alex?
Speaker 2 (06:06):
Yeah, I think that's one that stands out. Often the customers we talk to will have databases that can be in the tens to hundreds of gigabytes in size. This is the production data; this is kind of the lifeblood of the data coming in from their application, and there are a variety of use cases you might use it for inside the walls of your business. That said, customer privacy and trust is so important. Also, maintaining the uptime of your production database is really important. So a common pattern that we see with our customers is creating what we call a pre-production, staging, or development version of the database whose access can be made available internally. What you want to have is all of the insights, the dynamism, and the accuracy of your production data without it being real customer data.
(07:00)
It allows you to open up access for data analytics, for application development, and even machine learning, which is an area where we've seen quite a bit of uptick. We'll be talking about that a little bit later today. So those are big, and I think Grace, you mentioned subsetting, which is another really common pattern that we see, where it's very expensive to run a 500 gigabyte database, especially depending on your use case. So what if you could run a five gigabyte or a 10 gigabyte version that could run on a single box inside of your cloud and had the same level of insight, the same type of distribution of data? So that's a really big one. Conversely, sometimes we see people even wanting to scale the data up. You might have an early application and want to do load testing; you say, what happens if we had a hundred times the activity or a thousand times the activity? So another use case that we see pretty often is developers coming in, training a synthetic version of a database using Gretel, and then using some of our generative models like Amplify or ACTGAN, which are extremely fast at generation, to generate sometimes tens of billions of records to load test their systems.
Speaker 3 (08:13):
Yeah, I think that's such a big use case and load testing is so important. You want to know when your application is going to break before it actually does.
Speaker 2 (08:24):
Yeah, so taking a look at the quality aspects that we're talking about here, how do we measure the quality of a synthetic database and how do we compare it to the original real world data that it was based on?
Speaker 3 (08:38):
So we have a data validation tool for relational; we call it the Gretel Relational report. We have unique accuracy and privacy scores that measure both how well the synthetic data matches the statistical properties of your real-world database and also how private your data is. And what's great about this report is it's super consumable. You can look at it at a high-level database view, and then you can also get scores for the individual tables and drill down on really deep-level metrics. So whatever view you want to look at, you have that data available.
Speaker 2 (09:30):
This has been, for me, one of the surprising learnings that we've had over our time at Gretel. We started with this overall report and built a score that you can see reflected there; we call it the Synthetic Data Quality Score, which is a zero-to-100 score, very high level, but really useful for looking at your overall dataset. And then the report allows you to dive into whatever level of detail, even looking at correlations and PCA distributions and things like that. So the power of a high-level score like that to help you reason about your data is, I think, a pretty cool thing that we learned as we were building out Gretel.
Speaker 3 (10:06):
And a little bit later on we will show a live version of the report and how you can click in to see more details in different tables.
Speaker 2 (10:17):
That's great. Also today, I think some of the things we'll be talking about are some of the real-world use cases that we've seen from customers and developers that we're able to reference, using our platform to work with synthetic datasets. Grace, if I'm correct, you've got an example from an educational institution that we'll be talking about later today?
Speaker 3 (10:40):
Yes. Do we want to talk about that now?
Speaker 2 (10:42):
Let's jump in. I think it's good framing for the rest of the discussion.
Speaker 3 (10:45):
Yeah, so we have been working with a top research university. Their team is doing clinical research for cancer patients, and they came to us with a database that contains clinical studies but also some private information about their cancer patients. And for them this was posing a problem because it was limiting access to the data. The researchers whose job it is to analyze the data and develop more targeted cancer treatments were being slowed by having to jump through hoops for data access, concerns about preserving patient privacy, and respecting healthcare regulations. So they came to us and we worked with them on a solution. They used Gretel Relational to synthesize their clinical database. And what really stood out is that they were able to preserve both the statistical properties and the cross-table correlations in this database that are really critical for something like cancer clinical research. They now have a synthetic cancer treatment dataset that's safe to share across researchers. And the feedback they gave us is that this was invaluable at speeding up this critical research by simplifying their data accessibility.
Speaker 2 (12:20):
That's so cool. I think we're so used to consuming machine learning datasets on Kaggle or Hugging Face, where we're looking at highly formatted and feature-engineered datasets that you can use to train a model. But so often the starting point for a machine learning training set is actually a database, for example like what we have here with the educational institution. We've got several examples with customers in the gaming vertical as well, where they're building anti-cheat or anti-fraud mechanisms and it starts with extracting data from a database. So by being able to create a synthetic, artificial version of that really sensitive database to start with, it really speeds up the process. And I think one of the unique things about Gretel is that the high level of accuracy we're able to get using generative models allows you to do downstream use cases including machine learning training, which opens up a lot of doors for businesses.
Speaker 3 (13:19):
And another great thing that we hear from customers, both with the gaming company you referenced and this research institute, was just the ease of use, really regardless of your technical level. Whether you're an ML/AI expert or someone without a lot of coding experience, we really strive to make our products as easy to use as possible.
Speaker 2 (13:47):
Most definitely. Well let's go ahead and let's jump on in.
Speaker 3 (13:51):
Yeah, so we're going to do a demo showing you our sample notebook for relational synthetics. So I am going to move over to our Gretel console. If you don't already have an account, create one for free; it's super easy and fast to do. I am going to show you our relational synthetics notebook. On the dashboard, we have a bunch of use case cards, so whatever your synthetics use case is, we've got a way to do it. And this can be a great jumping-off point to either find a use case that you already know you have or maybe get inspiration for how synthetic data can work for you. So today
Speaker 2 (14:41):
Before Grace jumps in there, I was going to say if anyone wants to just go to gretel.ai and click sign in or sign up. All you need is a GitHub address or a Gmail or your corporate email to sign in. You can do everything that Grace is doing here using our free tier.
Speaker 3 (14:56):
And additionally, we've dropped the link to the notebook that we're going to be running here. So if you hop on late and you've already missed the console part, you can copy that address and follow along. I will cover more of the notebook details about our sample database when I open the notebook, so it's a little bit easier for folks to read, but this page gives an overview of the database format and what we're going to show. To run our notebooks, you'll need your Gretel API key. It's super easy to just copy it here, and then I'm going to launch the notebook.
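For readers following along in code, the setup step Grace describes looks roughly like the sketch below, assuming the gretel-client Python package; argument names reflect the package around the time of this webinar and may differ in newer releases:

```python
# Install the client first, e.g.:  pip install -U gretel-client
from gretel_client import configure_session

# Prompts for the API key copied from the Gretel console and validates the session.
configure_session(api_key="prompt", validate=True)
```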
Speaker 2 (15:41):
A couple of questions for you, Grace. We're using generative models here. Do you need to have a GPU to run this model? And what kind of deployment mechanisms do you have for running in the cloud or running it inside your own cloud?
Speaker 3 (15:55):
So you can run in the Gretel cloud or you can run locally on your own infrastructure. We have support for both. We also have five different relational models. The model that we're going to be demonstrating today is our Amplify model and you do not need GPUs.
Speaker 4 (16:16):
Great.
Speaker 3 (16:18):
So this is a sample telecommunications database. The format is simple in terms of the number of tables, but you do have some depth in the complexity of the relationships. In this use case, there's a team that's tasked with analyzing patterns in invoices and payments, but you do have some sensitive data in terms of street addresses, SIM numbers, telephone numbers, stuff like that. And so the goal here is to create a database that really maintains the statistical properties, so that their analysts can extract accurate information, while also getting that privacy protection from synthetics. So to start the notebook, I'm just going to click run all. The great thing about this notebook is it will automatically install any packages you need. When it finishes installing those packages, you'll be prompted to enter the API key we copied, and at that point it is off and running. We have integrated connectors to, I think, over 30 different databases, so that you can connect directly to your database and don't have to worry about manually defining the schema or the relationships between the tables. And if you can't do that, or aren't comfortable doing it, you can also provide the table data using individual files.
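For viewers providing table data as files rather than connecting to a database, here is a rough sketch of declaring the schema by hand, assuming the gretel-trainer relational API available around the time of this webinar; class and method names (such as add_foreign_key) may have changed in newer releases, and the CSV file names are hypothetical:

```python
import pandas as pd
from gretel_trainer.relational import RelationalData

# Hypothetical CSV exports of two tables from the sample telecom database.
client = pd.read_csv("client.csv")
account = pd.read_csv("account.csv")

relational_data = RelationalData()
relational_data.add_table(name="client", primary_key="client_id", data=client)
relational_data.add_table(name="account", primary_key="account_id", data=account)

# Declare the parent/child relationship: account.client_id references client.client_id.
relational_data.add_foreign_key(
    foreign_key="account.client_id", referencing="client.client_id"
)
```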
(18:19)
So my network connection is not great at the moment, so I'm going to flip over to a notebook that has already been run; this notebook takes approximately 15 minutes to run. So I'm entering my API key and we are now off and running, connecting to the database
(18:50)
And starting training. A really important part of Gretel Relational, and something that sets us apart, is maintaining relational integrity, so that the key relationships are still there and the key frequencies match your original data. A really fast way to check that referential integrity has been maintained is to join a child table with its parent and compare the number of records in the child table with the number of records in the joined table; if the integrity has been maintained, those will match. So this is our source data, but we'll show at the end of the notebook run that those numbers still match and that referential integrity has been preserved.
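The integrity check Grace describes can be reproduced on your own tables with a simple pandas join; this is a generic sketch of the technique, not Gretel-specific code, and the column names follow the sample telecom schema as an assumption:

```python
import pandas as pd

def referential_integrity_holds(child: pd.DataFrame, parent: pd.DataFrame,
                                fk: str, pk: str) -> bool:
    """Join the child table to its parent and confirm no child rows were lost,
    i.e. every foreign key value has a matching parent record."""
    joined = child.merge(parent, left_on=fk, right_on=pk, how="inner")
    return len(joined) == len(child)

# Example: every synthetic account row should reference an existing client.
# referential_integrity_holds(synth_account, synth_client,
#                             fk="client_id", pk="client_id")
```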
Speaker 2 (19:45):
Do you have some flexibility, Grace, with this notebook, should you want to create fewer or more synthetic records? Is that something you can do from within the notebook?
Speaker 3 (19:54):
Totally. So here you can change which model you're using; as I mentioned, we have five available model options. And then later on, after the model has trained, you can choose how much data to generate. In this case, we're creating a one-to-one synthetic match of your original database. But as Alex was talking about earlier, for some use cases you might want to subset your data, and so you can adjust that parameter to create a database half the size or one-tenth of the size. On the flip side, some people need more data: you can double, triple, or 10x your data, and it's just as easy as adjusting that parameter. So I'll flip over now to a completed notebook. As we saw, we have trained the model and generated the data; this takes about 15 minutes, and so we can look at the results. First I wanted to show you just an individual table so that we can compare what the original data looked like with the synthesized data. And you'll notice here that not only do the formats of these records match and the synthesized data look like the real data, but if you look closer, the date ranges are within the ranges that we saw in the original dataset, and the statuses and amounts, in this case how much the customers were paying the telecom company, are also consistent.
(21:57)
Right now you're just taking my word for that because we're looking at five records. But later on we'll look at the Gretel relational report that gives you hard numbers so that you can be confident in the quality of the synthetic data.
Speaker 2 (22:12):
This is one of the things that's so cool, too: you don't have to define a field as being a date field, or another one as being a person field that you want to replace with something else. All of that gets detected, and all the numeric attributes get modeled automatically. So really, as Grace just showed, it's very hands off. You just click run and go.
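As a rough sketch of the train-and-generate step Grace just walked through, again assuming the gretel-trainer relational API of the time: the MultiTable class, the gretel_model argument, and record_size_ratio may be named differently in newer versions, and relational_data is the schema object from the earlier sketch:

```python
from gretel_trainer.relational import MultiTable

# "amplify" is the CPU-friendly model demonstrated in the webinar; other
# relational models can be selected here instead.
multitable = MultiTable(relational_data, gretel_model="amplify")
multitable.train()

# record_size_ratio controls how much data to generate relative to the source:
# 1.0 = same size, 0.5 = subset to half, 10.0 = scale up 10x for load testing.
multitable.generate(record_size_ratio=1.0)

# The synthetic tables and the Gretel Relational report are produced as
# outputs of the run.
```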
Speaker 3 (22:31):
So as we talked about earlier, we want to make sure that the key relationships, that referential integrity which is critical to a database being usable, have been maintained. A nice, fast, easy way to do that is to compare those record counts. As we looked at before, we were joining the account table with its parent table, client, and those numbers are still the same, which shows that those relationships have been preserved. And while this notebook doesn't show it, we also make sure that the frequencies of those foreign keys remain the same. So if one client has only made one payment, you want to see that in your synthesized data as well. As I mentioned, we have the Gretel Relational report. The report automatically renders in this notebook, but you can also view it as HTML or download it as a PDF. Right at the top, as we talked about earlier, we've got that quality score that tells you the accuracy of your synthetic data, and we have a privacy score. Alex, is there anything you want to add before we dive into the table-level scores?
Speaker 2 (24:07):
Just one thing that stood out to me: the privacy protection level here is normal. What that means is we don't have any of the additional privacy-enhancing technologies, beyond synthetic data itself, turned on. So we expect this to work really well for internal use cases. Were you to take this database and want to share it externally, for example post it on the internet or share it with a trusted partner, something like that, you might want to look at more privacy protections. This is something you can configure directly in your synthetic config. So it's as simple as saying, I want to turn on privacy filters, which give you really nice protections, all the way up to differential privacy, where I want formal mathematical guarantees for the privacy of this data. All different options. Really we want to work with developers to find the right balance between utility, meaning the accuracy of your data, and privacy. So in this case it's definitely a utility-focused use case: privacy is good, but not necessarily the formal mathematical guarantees that you would get with differential privacy. You get to pick where it lands. This run is really just optimized for utility, which works great for most internal use cases.
Speaker 3 (25:20):
And another great option for adding that additional level of privacy protection, one that allows you to share data externally and feel really comfortable with the privacy level, is to combine our Transform model with synthetics. That's what we use in our GDPR compliance examples. And with Gretel Relational, you can create that workflow to transform and then synthesize, again just by clicking run and letting it do its thing. So that's a great way to add that extra level of reassurance that you have privatized your data.
Speaker 2 (26:10):
I have a quick example of why you'd want to run those two in a row. So here, if you're looking at this schema and you see account ID and client ID, the synthetic model on its own is not going to recognize those as being PII, for example. What Transform gives you is the ability to search for over 40 different types of known PII with built-in detectors, or for you to even build your own. And you could say, for example, this account ID is private information; I want to replace it with a hash, a fake account ID, or an auto-incrementing numerical attribute. That can be done via a single configuration across all tables to make sure that the synthetic model never gets exposed to, never even sees, what a real account ID looks like. So the synthetic model gives you protections across all the other fields in your data; Transform gives you the ability to hash, encrypt, remove, or fake the known PII examples inside of your dataset.
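As a plain-Python illustration of the idea Alex describes (not the Gretel Transform API itself), consistently replacing a known identifier before synthesis might look like this; the salt value and column names are hypothetical:

```python
import hashlib
import pandas as pd

def pseudonymize_column(df: pd.DataFrame, column: str, salt: str = "demo-salt") -> pd.DataFrame:
    """Replace a known identifier column with a salted hash so the synthetic
    model never sees a real value. The same input always maps to the same
    output, so joins across tables on this column still line up."""
    out = df.copy()
    out[column] = out[column].astype(str).map(
        lambda value: hashlib.sha256((salt + value).encode()).hexdigest()[:16]
    )
    return out

# Apply the same replacement to every table carrying the identifier, e.g.
# client.client_id and account.client_id, before training the synthetic model.
```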
Speaker 3 (27:04):
Yeah, I love the flexibility of Transform, not only to find that sensitive information for you, but also to let you choose how you want to transform it, be it a hash, or maybe you have a use case where you want to create data for a sales pitch and you want it to look real but not be private, sensitive data. One of my favorite things is using the feature where you can replace names with realistic-looking names, so that you're not exposing anyone's data. So in addition to those top-level scores, we also have two quality scores for each table. The individual SQS looks at that table in isolation, comparing the statistical accuracy, that data quality. And then for relational we also have what we call a cross-table SQS, which looks at the table in the context of the database as a whole and analyzes how well the relationships and statistical correlations were maintained across that whole database. And in addition to the scores here, for folks who really want a deep dive on quality, an additional report is generated for each SQS, and you can click through to get a really deep dive on the quality of individual tables.
(28:52)
Alex, do you have anything to add or do we want to move back over to our slide deck and continue on? Oh, Alex is muted.
Speaker 2 (29:01):
We could maybe, sorry about that, scroll down and look at some of the more advanced parts of the report there, if you don't mind going down. Yeah, some of the areas that I find myself looking at a lot are the correlations that we have across different fields, a little bit lower down, as well as the PCA distribution, all kind of neat data-sciencey views. So this gives you a real quick look. This is a very small dimensional table, so at the top graph here you can see very weak correlations between those different fields. Not a lot to look at, but it was captured pretty well. PCA is a data science technique, and it's kind of a favorite in the toolkit for a lot of data scientists, to look at the distribution of data and make sure we didn't overfit. So here, when you see the synthetic data graph looking very similar to the dispersion of the original real-world data, it shows that the model not only learned the characteristics of the data but also recreated the same kind of distribution across the dataset.
(30:03)
So these are all reflected in the top score that you see, but it's a really nice ability to dive in and just get a feeling for whether your model overfitted or whether it learned the different characteristics that it was supposed to. And it looks like it succeeded on both right here.
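The PCA comparison Alex walks through can also be reproduced outside the report with scikit-learn; this is a generic sketch of the technique, not the report's internal code, and the dataframe names are placeholders:

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_projections(real: pd.DataFrame, synthetic: pd.DataFrame, n_components: int = 2):
    """Fit PCA on the real data's numeric columns and project both datasets
    into the same low-dimensional space. Similar dispersion between the two
    point clouds suggests the model learned the overall distribution rather
    than memorizing (overfitting to) the training records."""
    numeric_cols = real.select_dtypes("number").columns
    pca = PCA(n_components=n_components).fit(real[numeric_cols])
    return pca.transform(real[numeric_cols]), pca.transform(synthetic[numeric_cols])

# real_proj, synth_proj = pca_projections(real_df, synthetic_df)
# Plot both projections (e.g. with matplotlib scatter plots) and compare the spread.
```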
Speaker 3 (30:17):
I really appreciate these reports personally, just for the confidence they give me in the synthetic data, because it's hard to get this kind of statistical confidence just by, I don't know, scanning the first thousand records of your data. And especially as datasets get bigger, not having to do this analysis manually is a huge plus.
Speaker 2 (30:41):
Yeah. Do we have the per-column distributions down towards the bottom too? Is that right, Grace? We can go a little bit lower. So these are ones you can look at too, to understand how well the model did at recreating these specific distributions. Looking at this, you had a really even distribution across the training data when we look at the histogram, and, probably because this is a very small dataset that we're working with right here, more of a Gaussian-type distribution across the synthetic data. So if you're trying to improve this one, this would be a sign that you want to add a little bit more training data; you can also do some model parameter tuning to get a little bit closer. But my guess here, without looking at the dataset itself, is that this is a really small dataset that didn't have a lot of correlations, as we saw earlier, so not a lot for the model to pick up on.
Speaker 3 (31:28):
And another option here, we used our Amplify synthetic model. We have other synthetic models and so if you're seeing maybe data that isn't as accurate as you'd like it to be, we have a number of tips and tricks like Alex mentioned, adding more data, cleaning up your data, but another option is to maybe test out another model and see if that is a better fit for your use case.
Speaker 2 (31:57):
Yeah, my feeling on accuracy there is that something like ACTGAN is extremely good at capturing these types of distributions, so that would be a thing to try.
Speaker 3 (32:07):
Yeah, maybe after the webinar I'll go and test again. So, moving back over to our deck. I think at this point we are happy to start taking questions. I don't know if our lovely moderator Murtaza wants to hop back on.
Speaker 1 (32:28):
Yeah, I'll magically reappear here. So we have a bunch of questions coming in. I'm going to read some of them in the order that they came in. Our first question here is from Mike. Mike's asking: does Gretel have to ingest the raw dataset, and what protections exist?
Speaker 2 (32:50):
So I'll take this one. Gretel needs access to the real-world data to train the model. As soon as model training is done, the real-world data is no longer required. You have a model that can generate either the distribution of the real-world data, or you can tell it to modify the distributions. And maybe some clarifying or next-level points on that one: if you are using the cloud service, that means the training data needs to be sent to the cloud service, in which case it'll be deleted immediately afterward and never gets stored. But you also have the option of running in what we call hybrid mode. What that does is deploy Gretel workers inside your own environment. This is something we see as really popular with companies in the regulated space, where essentially the machine learning training takes place in their cloud using containers that are shipped from Gretel. So your data never leaves your environment; the model is created there, and the data stays inside of your VPC or your cloud. So you have total assurance that no data nor metadata is going to be sent up to the Gretel service.
Speaker 3 (33:55):
And just to add to that, for folks using the cloud, we do have end-to-end encryption, and Gretel never sells your data. In the era of everyone selling your data, that adds just another level of confidence.
Speaker 1 (34:13):
Very cool. We have another question here. I think we've somewhat touched on this, but I'll reiterate it in case you want to add anything else. We have a question in from a viewer asking: how do you maintain correlations between tables?
Speaker 3 (34:29):
Yeah, so we touched on this a little bit earlier, but we have a variety of techniques that we use to maintain the key frequency, and then we also use some smart seeding and conditional data generation to really ensure that those cross-table correlations are maintained. That's part of why synthesizing relational data is so complex and why historically there haven't been great generative AI tools for databases. Those relationships add an extra level of complexity, and we spent a lot of time figuring out how to do it well and how to make it work as easily as possible for folks.
Speaker 2 (35:25):
To add a little bit to what Grace was saying: conditional data generation is when we ask a model to generate a record given a certain set of attributes. So when you want to make sure, for example, that a certain frequency of orders or products exists in a table because it existed at that frequency in the real-world data, rather than relying on the model to statistically generate something that matches the same distribution, we can actually learn the distribution from the real-world data and then prompt the model: hey model, I need you to generate a t-shirt with these characteristics, of this size, from this vendor; complete the rest of the record yourself. So essentially we're able to take advantage of characteristics that we learned from the original dataset to enforce that any data created by our model matches that distribution. So it helps quite a bit.
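A conceptual sketch of the seeding idea Alex describes (not Gretel's internal implementation): fix some columns from known values or learned frequencies, and ask a generator to complete the rest of each record. Here `model.complete` is a hypothetical stand-in for any conditional generator:

```python
import pandas as pd

# Seed columns we want honored exactly, e.g. product types and sizes sampled
# at the frequencies observed in the real data.
seeds = pd.DataFrame({
    "product": ["t-shirt", "t-shirt", "hoodie"],
    "size":    ["M",       "L",       "M"],
})

def conditional_generate(seeds: pd.DataFrame, model) -> pd.DataFrame:
    """For each seed row, ask the generator to fill in the remaining columns
    (price, vendor, order date, ...) conditioned on the fixed seed values."""
    completions = [model.complete(row.to_dict()) for _, row in seeds.iterrows()]
    return pd.concat(
        [seeds.reset_index(drop=True), pd.DataFrame(completions)], axis=1
    )

# The key point: the seed values are copied through unchanged while the rest
# of each record is synthesized to match the learned distribution.
```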
Speaker 1 (36:21):
Super interesting. Adding to that discussion, we have a question in from Alberto, who's asking: how do you control both fidelity and privacy when generating synthetic data for relational databases?
Speaker 2 (36:35):
Yeah, we always kind of refer to this as a balance or a scale, right? On one side you have utility, on the other side you have privacy, and synthetic data has the ability to meet the utility of the original data. But one of the things that's so cool is that in many cases we actually have the ability to exceed the utility of the real-world data. So the question is: how do you get something better than real-world data? I think a lot of people that work with data can sympathize with this: data is very limited. The data you have is really an approximation of the real world, and it's often lacking; you either have a poor balance of distributions for the classes that you care about, or not enough examples for a class. And so one of the things synthetic data can do is introduce new examples that would help a downstream model or analytic that you're training on that data respond better to things it hasn't seen.
(37:36)
A super simple example: if I have a patient medical dataset of people between the ages of five and 45, and I want to train a machine learning model to detect some sort of rare disease, and then a patient is introduced who is 47 years old, I've never seen anyone that's 47 years old. A lot of machine learning models will get tripped up by this. The synthetic data model has the ability to learn what an age is, since it's seen this before, and introduce new examples there that would help a model trained on it to do better. So that's where you have the possibility of doing better, and in many cases with some tuning doing better, than the real-world data it was based on. That said, you have other constraints. You want to make sure that no combination of attributes inside of your data could become identifying; this is speaking to privacy.
(38:24)
So for example, if you know that someone of a certain age in a certain zip code had a certain preference, you could infer whether they were inside the dataset or not. All of the fake data replacements and transformations in the world really aren't going to protect you against this. What the synthetic model does is create a thousand fictitious new patients, where none of them is based on a real individual patient. So if you query the overall data, you're going to get an answer very similar to what it was based on, but you're not going to be able to extract whether one specific patient you knew about was in that dataset. So it's a huge advantage. We have ways to layer even more on top of that; we actually just released a new model this week using a technique called differential privacy, which can give you mathematical guarantees that that won't happen.
(39:11)
With these privacy-enhancing technologies, though, it often comes at the expense of utility. For you to get those differential privacy guarantees, you may have to insert additional noise into the data, and if you were to query it or train a machine learning model on it, that may reduce the accuracy of that model because you had to put noise in to protect privacy. So we really want to work with the developers on our platform to find that right balance, how much privacy you need for your use case, and make sure that we deliver on that.
Speaker 1 (39:46):
Awesome. Those are really the two pillars, right? Privacy and utility. We have another question that came in live, actually, from Prashant, who's asking where they can learn more about the Amplify model, whether there are any constraints on the number of tables that can be synthesized, and whether Gretel Relational can be used for very large databases like knowledge graphs, where there are multiple hierarchical classes and complex interrelationships. I'll display it here so you can see it.
Speaker 2 (40:20):
Grace, you want to go ahead and take that?
Speaker 3 (40:22):
Sure. So a couple of great places to learn more about Amplify are our docs at docs.gretel.ai, or you can play around with it in our console. We have examples for both single-table Amplify and relational Amplify, as we showed earlier. In terms of constraints, we do not have constraints on the number of tables. You will need more compute for more tables, but we can handle whatever number of tables you have. Sorry, just reading the last part of the question. Alex, do you want to take the last part?
Speaker 2 (41:11):
Yeah, so the last question here is: can this be used for large databases like a knowledge graph with complex interrelationships? We are very quickly seeing the scale of databases being used grow into the hundreds of gigabytes and into the terabytes. So it definitely can be used for large databases. Our underlying generative models are built to scale both to datasets that might have thousands of columns that are all related, and to millions or tens of millions of rows. So there's definite scaling there. To answer the second part of your question about the complex interrelationships, you mentioned a graph database, and that might be a good place to start. We are just starting to look at graph databases like Neo4j, for example. One area where we've seen success with our customers is when that graph can be exported as a series of nodes and edges, where a node is a logical entity and an edge is the strength of a connection between two entities. That is essentially a two-table structure you can use, where the relationships can be captured pretty well. So graph databases are not among the 30 connectors that we have right now, but we're excited about that. If you have any questions, definitely reach out; we'd love to hear more about your use case.
Speaker 1 (42:37):
Awesome, super cool. Another live question here from Eric. Eric's saying: I know this must depend on the application, but in general, at what ratio of synthetic data versus real-world data are we seeing the max gains in performance? I don't know if there's a simple answer to that, but Alex and Grace?
Speaker 2 (42:57):
There's not; I'll try to answer this one without just saying "it depends." There's no kind of magic ratio that tells you, if you synthesize as many records, or 2x as many records, as you have in the real-world data, whether you're going to see gains or what those gains are going to be. So there's no magic ratio; I would encourage you to try augmenting your dataset with twice as many synthetic records and see how that works. The areas where we've seen the most improvement are when you have a class imbalance, where some class that you care about inside of your dataset is underrepresented. To use that heart disease detection example: an early use case with Gretel, working with University of California Irvine researchers, was a dataset that was highly imbalanced between males and females. We typically see a lot of gains when you balance out the classes that you care about, so essentially augment that machine learning training set with additional synthetically generated female patient records. There's no substitute for having real-world data, but in many cases there's no way to go back and collect more of these examples or recreate the experiment. So they got significant gains by taking the classes that were underrepresented in their dataset and boosting them up. That's something I would definitely recommend experimenting with.
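As a rough sketch of the boosting pattern Alex mentions (generic, not tied to a specific Gretel API): compute how many extra records each underrepresented class needs, then request that many records conditioned on the class label. The dataframe and column names here are hypothetical:

```python
import pandas as pd

def synthetic_records_needed(df: pd.DataFrame, label: str) -> dict:
    """Return how many synthetic records to generate per class so that every
    class matches the size of the largest one."""
    counts = df[label].value_counts()
    target = counts.max()
    return {cls: int(target - n) for cls, n in counts.items() if n < target}

# Hypothetical example for the heart-disease study referenced above:
#   synthetic_records_needed(patients, label="sex")  ->  {"female": 120}
# Generate that many records conditioned on sex="female" and append them to
# the training set before training the downstream model.
```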
Speaker 1 (44:23):
Okay, we're just about to hit 45 minutes. In the interest of time, I'll end with one last question, and then I'll encourage folks to join our Discord community, where you can ask as many questions as you'd like. We have almost 700 folks there; our applied research team and all of Gretel look at those questions, take them seriously, and reply. But I'll end with this one last question, which asks about the quality of input data: how much does the quality of input data affect the quality of synthetic data? It's coming from Edward.
Speaker 3 (45:02):
Yeah, this is a great question, and it's something that we're asked a lot. Our machine learning models, like the name says, learn, and so if you have high-quality input data, you will get high-quality output data. And if the quality of your input data is a concern, we have a number of tips and tricks for cleaning it up and really improving that quality. That's another case where the Synthetic Data Quality Score comes in handy, so that you can see the synthetic quality and decide if you need to make adjustments for your use case. Another example, as Alex mentioned with the conditional data generation and the heart study: the quality of the input data wasn't necessarily great, it was unbalanced; they had too many male data points and not enough female data points. And so that's a great example of using conditional generation to improve the quality of your output data.
Speaker 2 (46:10):
One thing to add to Grace's point, and it's always fun to talk about roadmap and things that our research team is working on right now; we actually just did a hackathon last week on this. One area we're really excited about for improving input data quality is the ability to use these generative models, LLMs for example, to automatically clean or improve your dataset, or to walk you through it in an interactive manner. So it's definitely an early area, but for example, detecting and adjusting for skew or missing values inside your dataset: these are things that we think generative models will be able to walk you through in the future, in natural language, in a very conversational manner, to help you improve the dataset that you have. So we're pretty excited about that.
Speaker 1 (46:59):
Awesome. As we conclude here, I just wanted to thank everyone for joining and for all the great questions. There were so many questions coming in; I was very humbled to see all the interest. I wanted to remind everyone that you can go to gretel.ai to sign up for a free account and try all of this out for yourself; there's really nothing like actually, tangibly trying it rather than treating it as a theoretical exercise. You get 15 free credits just for joining, no conditions attached, no credit card required, all that type of great stuff. And continue the conversation in our Discord community, where you can ask questions for Grace and Alex anytime. Thank you so much, Grace and Alex, and thank you everyone for joining.
Speaker 2 (47:38):
Thanks, Murtaza. Thank you everyone. Bye.
Speaker 4 (47:40):
Thanks guys.