Empowering Test Data Management with Synthetic Data
Video description
How synthetic data can empower various software development and testing workflows
Transcription
Speaker 1 (00:33):
Hi, my name is Murtaza. I work in product marketing at Gretel, and we're excited to have you all here today. In this webinar, we're going to explore software development, testing, and test data management workflows, and how synthetic data can empower them. I'm joined today by my colleagues Mike Knepper, senior software engineer; Grace King, product manager; and our senior engineering manager. We're going to explore different synthetic data use cases today. We'll start with a brief overview of what synthetic data is, what Gretel is, and how we've seen synthetic data being used across industries. After that, we'll give a brief overview of different test data management and software development workflows and the various applications we've seen for synthetic data across the field. And then at the end, we're going to go through a couple of deep dives into specific workflows, what they actually look and feel like, with some examples for us to explore together.
(01:38)
And at the very end, we're going to have an open question and answer section. A bunch of you have submitted questions already through the webinar registration form; thank you for that. Please feel free to continue to send your questions in the chat. There's a small caret icon there that you should all see, and you can use it to change your name: you can either keep the name you registered with or anonymize it if you prefer not to be identified. So we're going to get going here. Gretel is a synthetic data platform built for developers, with multimodal synthetic data generation capabilities. Since our product went GA in February 2022, we've seen massive adoption from both users and enterprises. We have more than 75,000 downloads of our SDK alone, more than 65,000 active users, and more than 150 billion records synthesized to date.
(02:36)
For those who are new to synthetic data, synthetic data workflows are probably similar to other generative AI workflows you've seen across the field, where you enter a prompt and get a response. In the case of synthetic data, most people start with a real-world dataset and want to generate a dataset that has the same statistical properties as the original data without any of the privacy concerns. So you start with the original dataset, you configure what you want to generate, you train a model, and then you validate and refine that model until it's ready to embed in your workflows. Then you use it to generate as much synthetic data as you like. And we've seen three main uses of these general synthetic data workflows.
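To make that train, validate, generate loop concrete, here is a minimal sketch using the Gretel Python SDK roughly as it existed around the time of this webinar. The project name, blueprint name, file names, and record count are assumptions for illustration, not code shown in the talk.

```python
# A minimal sketch of the train -> validate -> generate loop with the Gretel Python SDK.
# Project name, blueprint name, and file names are illustrative assumptions.
import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config

configure_session(api_key="prompt", validate=True)
project = create_or_get_unique_project(name="test-data-demo")

# Train a synthetic model on the real dataset.
model = project.create_model_obj(
    model_config=read_model_config("synthetics/default"),
    data_source="production_sample.csv",
)
model.submit_cloud()
poll(model)
# The finished job also produces quality and privacy report artifacts used to validate
# and refine the model before embedding it in a workflow.

# Generate as many synthetic records as the workflow needs.
record_handler = model.create_record_handler_obj(params={"num_records": 5000})
record_handler.submit_cloud()
poll(record_handler)

synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
print(synthetic_df.head())
```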
(03:21)
Largely, we've seen people create provably private versions of their datasets that share the same distributions and statistical insights as the original data. It's important to realize that these are not only the insights that you already know about, but also underlying distributions and correlations that deep learning and statistical algorithms can catch even when humans aren't cognizant of them, and those get captured in the generated dataset. So they're really representative versions of your data without any of the privacy risks. Synthetic data is also used to generate effectively infinite volumes of your data in cases where you don't have enough of it. And lastly, synthetic data is used to generate infinite variations of your data: small permutations that might account for different types of events and so on. My colleagues are going to speak a lot more to each one of these and how they apply to software development.
(04:11)
So I'll let them do that. At Gretel, we take a multimodal approach to synthetic data generation. We believe there's no one true model that rules them all; there are multiple models out there for tabular data, text data, relational data, image data, and time series data, and we provide the infrastructure and the models to generate any of these data types. We also have enterprise features that ensure data quality, preserve data privacy, and make infrastructure and scaling super easy. And we have multiple deployment models: you can deploy in your own cloud, or you can use the Gretel Cloud or the secure Gretel Cloud; both of those options are available. At the end of the day, all of that complexity is abstracted away into three very simple APIs: you can transform your data, you can generate more data using synthetics, and you can evaluate the quality of that data using our evaluate model. All of that is accessible through either our SDK or our public-facing cloud GUI. I'm now going to hand it over to Grace King, who will be exploring how synthetics apply to test data management workflows.
Speaker 2 (05:25):
Thanks, Murtaza. Hi everyone. So first we're going to look at an architecture diagram that really highlights, for lack of a better word, the madness of many modern cloud environments and how many different environments exist. While we only have one production environment, there are often multiple dev, multiple test, and multiple pre-prod environments, all of which have different data needs and different security permissions. When it comes to maintaining data hygiene, only the prod environment should have prod data.
(06:01)
When it comes to test and dev, data is often the bottleneck. Your data is your organization's most valuable asset, but it's also one of the most challenging things to operationalize in test data management. This bottleneck can be due to a number of factors: privacy and security policies that limit data access in lower environments, a lack of data, or, on the flip side, too much data for the budget and resources of those lower environments. An IBM study found that during testing, 30 to 60% of a developer's time is spent searching for test data rather than actually testing. A survey of DevOps professionals found that on average, 3.5 days and 3.8 people are needed to fulfill a single test data request. So let's talk a little bit about how people combat this bottleneck today. One approach used in the industry is weak data masking. Data masking is a manually applied approach where sensitive data is masked in an attempt to protect privacy. However, this has significant risks and limitations: often the masked data still contains information or patterns that can be used to reveal the sensitive information it was attempting to disguise, making it prone to re-identification and other attacks.
(07:31)
Randomized data generators provide low-quality data that doesn't truly represent the relationships and distributions of your production data and will not recreate those unseen correlations that Murtaza talked about. Randomized generators struggle to capture the complexities and the nuances of real-world data. And then some enterprises choose to simply use production data in lower environments. While that does capture the data quality, it creates significant privacy risks, can delay access to data, and increases costs and resource requirements in lower environments. At the end of the day, these solutions force enterprises to make the impossible choice between data quality and data privacy. This is not a choice that can or should be made in test data management. At Gretel, we believe synthetic data is the solution that eliminates this data bottleneck and allows access to high-quality data without compromising on privacy. Now let's look at some concrete use cases for synthetics in test data management. Today we're going to discuss how to use synthetic data to accelerate safe access to sensitive data in lower environments, to promote realistic unit testing, to enable load and stress testing by augmenting small datasets, to reduce the time and cost of development cycles using data subsetting, and to unlock scenario-based testing using conditional data generation. Now I'm going to pass it over to iGen, who will walk us through an example of using synthetic data to accelerate access.
Speaker 3 (09:22):
Thank you, Grace. Let's start by talking about how we can accelerate safe access to production-level data for software development and testing in different stages of software projects. As we know, teams need realistic data to test their applications and avoid bugs before deploying their changes. With this need, they turn to production data that in many cases has been collected from internal or external users. Understandably, this data is protected in production databases under strict privacy, security, and regulatory compliance requirements, and the data often resides in separate tenants or accounts from where the actual testing and development is happening, which brings additional challenges. So as a solution, using high-quality synthetic data instead can prove to be immensely valuable. Let's take a look at an example.
(10:17)
Let's imagine you're part of a software team developing new features for an in-house payroll application. This application allows the payroll team to enter and view employee information and related payroll data such as names, gender, and salary data. This data is in a tightly controlled environment, and access is limited to only a few individuals. As we near the completion of the project, we want to test the functionality with realistic employee data, which presents the challenge of providing sensitive payroll data to a wider set of employees. On the other side, without this testing, bugs could be discovered after launch, such as not handling special cases like contract versus full-time employees or the wide range of salaries; the sheer number of employees and their data also needs to be tested. So this is definitely an area where we don't want to see any bugs. On the right-hand side, you'll see a sample of the database that we're going to use for generating the synthetic data. For this use case, we used Gretel's relational synthetic data generation capabilities to create a synthetic version of this employee database, which you see a sample of.
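For readers who want to reproduce something like this, here is a rough sketch of defining two related tables and their keys with Gretel's relational tooling (the gretel-trainer package). The class and method names reflect how that interface looked around the time of this webinar and, along with the table and column names, are assumptions rather than details taken from the talk.

```python
# A rough sketch of multi-table synthesis with gretel-trainer's relational interface.
# Table names, column names, and exact method names are illustrative assumptions.
import pandas as pd
from gretel_trainer.relational import MultiTable, RelationalData

employees_df = pd.read_csv("employees.csv")
payroll_df = pd.read_csv("payroll.csv")

relational_data = RelationalData()
relational_data.add_table(name="employees", primary_key="employee_id", data=employees_df)
relational_data.add_table(name="payroll", primary_key="payroll_id", data=payroll_df)
# Each payroll row references an employee; declaring the key preserves referential integrity.
relational_data.add_foreign_key(
    foreign_key="payroll.employee_id", referencing="employees.employee_id"
)

multitable = MultiTable(relational_data)
multitable.train()                           # train a synthetic model per table
multitable.generate(record_size_ratio=1.0)   # generate a same-sized synthetic database

synthetic_employees = multitable.synthetic_output_tables["employees"]
synthetic_payroll = multitable.synthetic_output_tables["payroll"]
```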
(11:32)
So let's take a brief look at that. On the left side you can see the same sample data from the employee database that I had on the previous slide, and the tables on the right are the synthetic versions of those tables, the output of our jobs. Next slide, please. So now we have the data, but at this point, how can we make an informed decision on (a) whether the synthetic data is realistic enough to share with the team to start testing with, and (b) whether privacy is actually protected? There are known techniques to validate the quality of synthetic data: for example, teams can perform ground truth comparisons, statistical analysis, or even bring in domain experts to validate the new data, mainly looking at its properties and patterns against the original data. At Gretel, the jobs we ran already produced a report that will help us make the right determination.
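As a lightweight version of the statistical analysis mentioned above, you can compare column distributions yourself before trusting the data. Here is a sketch with pandas and SciPy; the file and column names are hypothetical.

```python
# Compare real vs. synthetic column distributions (file names are hypothetical).
import pandas as pd
from scipy.stats import ks_2samp

real = pd.read_csv("employees.csv")
synthetic = pd.read_csv("synthetic_employees.csv")

# Two-sample Kolmogorov-Smirnov test per numeric column: a small statistic means the
# synthetic distribution tracks the real one closely.
for column in real.select_dtypes("number").columns:
    result = ks_2samp(real[column].dropna(), synthetic[column].dropna())
    print(f"{column:>20}: KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}")
```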
(12:35)
So at a high level, here we can look at two scores. On the left we have the synthetic data quality score, which came out pretty high in our case, which means that the statistical properties of the original dataset are preserved and we have a realistic dataset that we can share for testing. On the right, you can see we have a high privacy protection score, which means that the data is safe to share outside of the organization, in this case even externally. Let's move to the next slide. In some cases we would want to dig deeper, though. For example, our team wants to ensure referential integrity of the data so they can test different parts of the application with one big set of data. Here we can validate from our report that the primary and foreign keys were identified correctly and that our new data maintains these relationships.
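The same kind of check is easy to script outside the report as well. Here is a minimal sketch, assuming a hypothetical employee_id key that links a synthetic payroll table back to the synthetic employees table.

```python
# Verify referential integrity of the synthetic database (column names are assumptions).
import pandas as pd

employees = pd.read_csv("synthetic_employees.csv")
payroll = pd.read_csv("synthetic_payroll.csv")

# Primary keys must stay unique after synthesis.
assert employees["employee_id"].is_unique, "duplicate primary keys in employees"

# Every foreign key in the child table must resolve to a row in the parent table.
orphans = ~payroll["employee_id"].isin(employees["employee_id"])
assert not orphans.any(), f"{int(orphans.sum())} payroll rows reference missing employees"
print("referential integrity checks passed")
```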
(13:34)
For other use cases, we can even check results for individual tables. For example, imagine our payroll application has a feature to analyze the gender pay gap among employees. We need to ensure the synthetic data also maintains the same statistical nature as the source data, such as the distribution of gender in these records. This is something we can also look at and verify at the table level. So, empowered with our results, we can now provide the synthetic version of the employee database to our team and unblock them to continue with their testing phase. With this, I'll hand it over to Mike, who will walk us through some more examples.
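For a table-level check like the gender distribution described here, a simple side-by-side comparison is often enough. This sketch uses hypothetical column names for the employee table.

```python
# Compare a categorical distribution and a grouped statistic between real and synthetic tables.
import pandas as pd

real = pd.read_csv("employees.csv")
synthetic = pd.read_csv("synthetic_employees.csv")

# The share of each gender should be close in both datasets.
gender_share = pd.concat(
    {
        "real": real["gender"].value_counts(normalize=True),
        "synthetic": synthetic["gender"].value_counts(normalize=True),
    },
    axis=1,
)
print(gender_share.round(3))

# Mean salary by gender, the quantity a pay-gap feature would actually compute.
pay_gap = pd.concat(
    {
        "real": real.groupby("gender")["salary"].mean(),
        "synthetic": synthetic.groupby("gender")["salary"].mean(),
    },
    axis=1,
)
print(pay_gap.round(2))
```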
Speaker 4 (14:17):
Thanks, GaN. Hi everyone. In this next section I'm going to talk about how we can use synthetics to intelligently subset production data for our testing purposes. The challenge here is that our production database is simply too large to mimic in our lower environments at a one-to-one scale. We need our test environments to be populated with less data, but we need to maintain the complexity and the nuance of that production dataset while, of course, ensuring privacy. We might be constrained by resources, storage capacity, processing power, or simply budget constraints like data storage costs. And really what we're looking for is faster feedback loops, quicker iterations, and easier debugging. So synthetics is going to be how we subset that data. In this scenario, we've got data from a bunch of third parties, and we want to transform it into a common format and internal schema that we're going to use for our downstream analysis.
(15:22)
So we know we're going to have a bunch of ETL jobs, and in order to implement those jobs, we need to know what the data looks like. But on top of the general security and access problem, we have this problem of scale. The warehouse stores inception-to-date data from all these vendors, so the production costs are exorbitant and out of control. Our development teams, meanwhile, want to have multiple non-production environments for testing: they want a dev environment where they can pretty aggressively test changes, and they want a staging environment that's a little more stable, a little more prod-like, and we just can't afford to incur that cost two or three or four times over.
(16:04)
So our strategy here, if we go to the next slide, is to train on our full production data, but then generate synthetic tables that are only 5% the size of those in production, load that synthetic data into smaller databases, and rely on our integration tests to start exposing nuanced edge cases in that data. On the right here we've got a graph. This is a principal component analysis (PCA) graph of both the training data and the subset synthetic data. PCA is a technique for analyzing data that reduces a highly dimensional, complex dataset to its most important factors while trying to maintain as much information as possible. For our purposes here, we don't need to go into too much detail. The important thing is that these graphs look generally the same at this high level. These graphs can often form clusters and various patterns; in this case our data is pretty scattered, but the two don't look wildly different from one another. That's good: it shows that our smaller synthetic dataset still resembles the larger production data.
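This kind of comparison can be reproduced on your own tables with scikit-learn by fitting the projection on the production data and applying it to both datasets. File names below are hypothetical.

```python
# Project production data and the 5%-sized synthetic subset onto the same principal components.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

production = pd.read_csv("production_table.csv")
synthetic = pd.read_csv("synthetic_subset.csv")  # roughly 5% of the production row count

numeric_cols = production.select_dtypes("number").columns
prod_numeric = production[numeric_cols].dropna()
synth_numeric = synthetic[numeric_cols].dropna()

# Fit the scaler and PCA on production data only, then project both datasets.
scaler = StandardScaler().fit(prod_numeric)
pca = PCA(n_components=2).fit(scaler.transform(prod_numeric))
prod_2d = pca.transform(scaler.transform(prod_numeric))
synth_2d = pca.transform(scaler.transform(synth_numeric))

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].scatter(prod_2d[:, 0], prod_2d[:, 1], s=3)
axes[0].set_title("production data")
axes[1].scatter(synth_2d[:, 0], synth_2d[:, 1], s=3)
axes[1].set_title("synthetic 5% subset")
plt.show()  # similar scatter shapes suggest the subset preserves the overall structure
```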
(17:11)
This next graph is a similar comparison with a different representation. You can imagine the fields from a given table going both across as the columns and up and down as the rows, and what we're checking here is how much each field correlates with every other field. Again, the patterns look very similar when you compare the two on the left side by side. Then we have the third version, which subtracts one from the other, so we've got the difference, and that's great because it's very pale green: there isn't much difference between our training data and our synthetic data. What's great here is that despite being so much smaller, the synthetic data is still representative of that original production data, whereas some of the other techniques, like the ones Grace mentioned earlier (say, taking only the last 30 days' worth of data to get a smaller size), might not capture as many of the original characteristics of that full dataset.
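The correlation comparison works the same way: compute the field-to-field correlation matrix for each dataset and look at the difference. Again, file names are hypothetical.

```python
# Compare field-to-field correlations between production data and the synthetic subset.
import matplotlib.pyplot as plt
import pandas as pd

production = pd.read_csv("production_table.csv")
synthetic = pd.read_csv("synthetic_subset.csv")

numeric_cols = production.select_dtypes("number").columns
prod_corr = production[numeric_cols].corr()
synth_corr = synthetic[numeric_cols].corr()
difference = prod_corr - synth_corr  # values near zero mean correlations were preserved

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
panels = [("production", prod_corr), ("synthetic", synth_corr), ("difference", difference)]
for ax, (title, matrix) in zip(axes, panels):
    image = ax.imshow(matrix, vmin=-1, vmax=1, cmap="RdYlGn")
    ax.set_title(title)
fig.colorbar(image, ax=axes.tolist(), shrink=0.8)
plt.show()

print(f"mean absolute correlation difference: {difference.abs().values.mean():.3f}")
```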
(18:18)
So at the end of the day, what we've been able to do is reduce our storage costs in our lower environments compared to production, while still being able to identify those edge cases earlier in development and avoid failures in production. Instead of the classic case of hitting a bug in prod, writing a unit test to recreate the issue you hadn't seen before, and then adjusting the ETL job, we saw all that complex data earlier and were able to implement V1 in a way that handled those edge cases. And really what we're trying to get to, at the end of the day, is minimizing the amount of time we spend setting up our test environment and maximizing the amount of time we spend actually using that environment. We've got a lot of tools in our toolkit these days for provisioning resources in non-prod environments that are basically identical to what we have in production, but at a smaller scale. When you combine what cloud providers offer with modern infrastructure-as-code tools, it's fairly easy to say, hey, I want this exact set of resources in this other account, just dialed down. But there hasn't really been any equivalent for scaling the data that gets loaded into those databases for those different environments, until now. So synthetics, I think, is a really nice tool to have alongside those infrastructure tools. Next I'm going to bring Grace back, and she's going to talk more about scaling synthetic data.
Speaker 2 (19:56):
Thanks, Mike. So Mike just walked us through the benefits of subsetting, and now let's discuss scaling in the opposite direction. For some software application tests, like performance testing, massive amounts of data are required to identify failure points. Ideally, this testing is completed before any failure occurs in production, which means the production environment doesn't yet contain the volumes of data needed for this testing. For performance testing activities like load and stress testing, you can use synthetics to augment production data, generating unlimited amounts of test data. To look at a concrete example that might be vaguely familiar to some of you: we have a popular pop artist who's going on tour, and those tickets are about to go on sale. Anticipating truly never-before-seen demand, the ticket vendor needs to conduct some performance testing on their platform to ensure that it doesn't crash. They need to augment their production relational database to create a test database that allows for load and stress testing.
(21:16)
So our strategy here is to generate a synthetic database that is 50 times larger than the production database, and we need to make sure that referential integrity is maintained, so the many-to-one and one-to-one relationships between entities in those tables, and that the key frequencies are maintained: how often the artist is visiting different cities, how many tickets are available in each city, things like that. Once we've generated that database, we're going to use it to conduct performance testing. On the chart on the left, we have the key frequency for a single table in the training data, and on the right, the synthetic version. As you can see, the shapes of these graphs are identical, showing that the key frequency was maintained, but on the Y axis we see that the numbers for the synthetic data are much larger, in fact 50 times larger. So what we've been able to do is generate a database that's 50 times the size of what was available in production.
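To confirm the same shape, 50-times-the-volume property on your own tables, you can compare foreign-key frequencies directly. The table and column names below are hypothetical stand-ins for the ticketing schema.

```python
# Check that tickets-per-event keeps its shape while total volume scales roughly 50x.
import pandas as pd

tickets = pd.read_csv("tickets.csv")                      # production child table
synthetic_tickets = pd.read_csv("synthetic_tickets.csv")  # synthetic version, ~50x larger

print("overall scale factor:", round(len(synthetic_tickets) / len(tickets), 1))

# The distribution of key frequencies should have a similar shape, with counts ~50x higher.
prod_freq = tickets["event_id"].value_counts()
synth_freq = synthetic_tickets["event_id"].value_counts()
print("production tickets per event:\n", prod_freq.describe().round(1))
print("synthetic tickets per event:\n", synth_freq.describe().round(1))
```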
(22:30)
So let's talk a little bit about the impact. Why does this matter? How does it benefit this ticket vendor? Having a database that's 50 times the size of their production data, and that also maintains the statistical quality of that data, enabled the vendor to conduct load testing that identified a critical issue: they realized that their checkout page was unable to support the anticipated number of concurrent users. It also enabled spike testing to ensure that the platform could handle sudden, steep ramp-ups of users, and they discovered it couldn't. But this testing means that they had time to make the necessary improvements and fix those issues before tickets went on sale, and ultimately it saved the company from a total meltdown of their software, a PR nightmare, maybe some public humiliation, the wrath of enraged users, and, in the worst-case scenario, an eventual investigation by the Department of Justice, which really nobody wants. Taking a look at another example, we have conditional data generation. Looking at the next slide, conditional data generation is used in situations where there is simply not enough production data to properly test for new, unique, maybe unusual events: scenarios where a lack of scenario-specific production data can limit testing capabilities and slow dev cycles. This is applicable in use cases like testing for edge cases, boundary conditions, maybe a new feature that has just launched, or some unusual bugs and errors. In these cases, you can use synthetics to boost that limited data using a technique called conditional data generation.
(24:38)
To look at a concrete example: in this case, we have a software application where, after a new version release, payment failure errors started to occur. We have a production dataset that captures these payment events, but it is 97% success events and only 3% payment-failed events. The limited data for that error event really restricts the ability to test for this scenario, and we need to boost the number of error records in order to identify the cause of these failed payments. To do this, we train a Gretel ACTGAN model on that production data, the incredibly unbalanced dataset we just looked at, and then we conditionally generate synthetic data to create as many payment-failed records as we need. And just to back up a second, the Gretel ACTGAN model is one of our deep learning synthetic models, which we used in this case because it's great for highly dimensional data. I didn't show all of the different columns in our production data, but this would be a table that has maybe 30, 50, or a hundred columns with various event information, and we really want to identify what the patterns are for those payment-failed conditions. So, going back, we've now conditionally generated as many payment-failed records as we need, and now we're going to use a dataset that has 10 times more error events for testing.
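Here is a rough sketch of what the conditional step could look like with the SDK, assuming model is an already trained Gretel ACTGAN model object (as in the earlier training sketch) and that the status column is called payment_status. The column name, file name, and the seeded record handler call are assumptions, not code shown in the talk.

```python
# Conditionally generate extra "failed" payment events from a trained ACTGAN model.
# `model` is assumed to be a trained Gretel model object; names here are illustrative.
import pandas as pd
from gretel_client.helpers import poll

events = pd.read_csv("payment_events.csv")
print(events["payment_status"].value_counts(normalize=True))  # ~97% completed, ~3% failed

# Ask for ten times the current number of failed events by seeding the status column.
n_failed = int((events["payment_status"] == "failed").sum())
seeds = pd.DataFrame({"payment_status": ["failed"] * (n_failed * 10)})

record_handler = model.create_record_handler_obj(
    params={"num_records": len(seeds)},
    data_source=seeds,  # the seed column conditions the generated records
)
record_handler.submit_cloud()
poll(record_handler)

failed_synthetic = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
boosted_events = pd.concat([events, failed_synthetic], ignore_index=True)
print(boosted_events["payment_status"].value_counts())
```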
(26:29)
Looking at this graph on the right, we have another principal component analysis, the same type of chart that Mike showed us earlier. The green dots are the synthetic failed-payment events, and the orange are the production payment-failed events, and you can see that the green shows a good overlap with the orange. There are a lot more green dots because we've created a lot more synthetic cases, but we don't see a lot of overlap between the green and the blue, the blue being the payment-completed events. So this gives us a strong indication that the data we've generated matches the properties of those minority cases, the payment-failed events. So what does all of this mean? Why does it matter? What's the impact here?
(27:27)
Using conditionally generated synthetic data for the failed-payment errors improves the results of the testing by providing more samples of highly accurate data for this outlier event. This saves hours of manually creating statistically inferior test data, and it ultimately expedites the identification and fix of the bug responsible, avoiding significant loss in revenue. Conditional generation allows you to create test data for virtually any test case. Here we talked about its use for bug and error testing, but this strategy is also helpful when testing edge cases and boundary conditions for new features, and really anywhere the real-world data is limited.
Speaker 1 (28:22):
Thank you so much, Grace. I'm going to invite everyone back to the screen here for our live Q&A. We're going to be taking questions that were submitted through the form. There are a lot of questions, so hopefully we can get through all of them, but if there's a question that doesn't get answered, you can always use this QR code to go to our website and reach out to any one of us on our Discord. We're going to start with one of the most frequently asked questions from our form, which is about mimicking production data patterns. So, can Gretel, or synthetic data in general, mimic production-grade patterns?
Speaker 4 (29:09):
Yeah, the short answer here is yes. I think we've talked about this throughout the talk, but the synthetic data that you generate will maintain a lot of those characteristics of your production data. What's maybe important to note here is that we knew at Gretel from the start that this was going to be critical for people to prove and verify. All the graphs that we showed today in our slides are taken from our evaluate reports; that's one of the three APIs that Murtaza mentioned at the top of the show. So yeah, I think we put a lot of effort into providing ways to validate that your synthetic data is of sufficient quality and resembles your production data, and into visualizing that in a variety of ways.
Speaker 1 (30:12):
Very cool. We're going to move on to another question here. This is actually an interesting one, tangentially related to some of the content, but we get this question a ton. It's about creating credible time series data. So I'm assuming the question here is whether generative models can be used to create credible synthetic time series data.
Speaker 4 (30:36):
Yeah, I can take this one too. Time series definitely adds an extra level of complexity to generating synthetic data. You're no longer only looking at the correlations between fields in a single record; you also have cross-record trends over time that you want to capture as well. We talked earlier about how Gretel is a multimodal platform, and we don't want to put all of our eggs in one type of model basket, so we actually have a dedicated model for time series data. We call it DGAN, or DoppelGANger, and it's definitely a unique beast. We have a couple of blog posts about it that I can share in the comments if you're interested in reading more about how we built that model. It's a really, really good question.
Speaker 1 (31:42):
Very cool. Okay, and we have another question here about starting the process. Someone's asking, and we actually got this in a few different permutations: how do you get started with generating synthetic data, and how is bias remediated?
Speaker 2 (32:00):
Yeah, I can speak to this one. This is a great question, and I think a lot of us struggle at the beginning with just where to start. This is where I find the Gretel blueprints in our console super helpful, because they speak to a specific use case and walk through the steps to get there. We do have a use case card on balancing biased datasets that I can drop in the comments. Additionally, our last webinar was actually on using conditional data generation to de-bias healthcare data, so that is a great resource if you want to learn more about bias. Our blog is also another great place to look.
Speaker 1 (32:49):
Cool. So we're going to head over to another question here. This one, again, came in multiple different forms in our pre-registration form. A couple of folks are asking: what is the relationship between generative AI and synthetic data?
Speaker 2 (33:23):
I can take it. So, the relationship between generative AI and synthetic data: at Gretel, we use generative machine learning models to generate our synthetic data, and we take a multi-model, multimodal approach. We have models for various types of data, as Murtaza spoke to: tabular, image, natural language, et cetera. And then we use those various generative AI models to, for lack of a better word, generate the synthetic data.
Speaker 1 (34:05):
Okay, cool. We have another question which is slightly related, but an interesting question in general about synthetic data. Someone's asking: are there any top use cases that we've seen in the financial services industry? I know we have a couple of different use cases we keep seeing, but I'll let GaN or whomever take this one.
Speaker 3 (34:28):
Yeah, I can take this one. I think the examples we have talked about so far also apply to creating realistic test environments for financial software development, especially given the sensitive nature of financial data and the amount of private information that you would see in that domain. Some use cases that come to mind: definitely stress testing and simulating different economic and financial scenarios, where using synthetic data would be very helpful. That way you would be able to evaluate the impact of changing market conditions, stock market fluctuations, and changes in interest rates, for example. We could also use synthetic data to test algorithmic trading strategies; the advantage here would be that we could do it in a controlled environment without risking real capital. If you look at the customer side of things, you can analyze customer behavior without exposing sensitive personal information when developing various personalized financial products and services. And lastly, there's definitely the case of training ML models for fraud detection and prevention without exposing customer data; since fraud is a prevalent issue in the financial industry, that would prove to be beneficial.
Speaker 1 (36:09):
Cool. I'm going to bring up another question here. This is actually an interesting question that comes up a ton, and I know we've written about this as well. The question came in a few different ways: how can synthetic data help with data privacy compliance? I know that's a broad question, but if anyone here wants to take it, maybe break it down a little bit for the audience.
Speaker 2 (36:31):
Yeah, I think that's a broad but great question, and privacy was certainly a theme that we touched on throughout this talk. Synthetic data is inherently more private than real-world data; one of the huge advantages is that it maintains those statistical correlations and relationships in the data while removing concerns about privacy. We have a number of privacy resources and privacy techniques specifically for regulatory compliance, like GDPR compliance: we have a blog post, a white paper, and a blueprint in our console that walk you through how to generate GDPR-compliant data. When privacy is really the highest concern for you and your data, we recommend our new Tabular DP model, which provides mathematical guarantees of privacy, something that I, being a huge numbers-oriented person, find really comforting. I love being able to see a number that tells me, hey, this is how private my data is. We also recommend applying double layers of protection, combining our transform model with a synthetic model that fits your needs. That really gives you the comfort of knowing that your data is safe to be shared internally or externally, if that's your use case, and provides that sense of security against the awful event of a privacy breach or whatever your concern might be.
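As a sketch of that double-layer idea: run a transform model first to fake or redact direct identifiers, then train a synthetic model on the transformed output. The blueprint name, file names, and the assumption that you have already written a transform policy file are illustrative, not details from the talk.

```python
# A hypothetical two-layer pipeline: de-identify with a transform model, then synthesize.
import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project
from gretel_client.projects.models import read_model_config

configure_session(api_key="prompt", validate=True)
project = create_or_get_unique_project(name="two-layer-privacy")

# Layer 1: transform. Assumes transform_config.yml defines policies that fake or
# redact direct identifiers such as names and email addresses.
transform = project.create_model_obj(model_config="transform_config.yml", data_source="customers.csv")
transform.submit_cloud()
poll(transform)

transform_run = transform.create_record_handler_obj(data_source="customers.csv")
transform_run.submit_cloud()
poll(transform_run)
deidentified = pd.read_csv(transform_run.get_artifact_link("data"), compression="gzip")

# Layer 2: train a synthetic model on the de-identified data and generate shareable records.
synthetic_model = project.create_model_obj(
    model_config=read_model_config("synthetics/tabular-actgan"),
    data_source=deidentified,
)
synthetic_model.submit_cloud()
poll(synthetic_model)
```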
Speaker 1 (38:24):
Okay, I think we've gone through a lot of the questions and we're about at time, so we're going to call it here. I just wanted to thank everyone again for joining us for this live webinar. Just as a reminder, you can visit gretel.ai and join our Discord community, where you'll have access to all the people you see here as well as our broader applied science team, in case you have any questions. Thank you again for joining, and we hope to see you next time. Thanks.