Using synthetic data to power machine learning while protecting user privacy

On this episode, we talk to John Myers, CTO and cofounder of Gretel. The company provides users with synthetic data that can be used in machine learning models, generating results comparable to the real data, but without exposing personally identifiable information (PII). We talk about how data outliers can identify individuals, demo data that feels real but isn’t, and skewing patterns by skewing dates.

Episode notes:

Gretel uses machine learning to create statistically similar data that contains no personally identifiable information (PII).

Think your commits are anonymous? Think again: DefCon researchers figured out how to de-anonymize code creators by their style.

We published an article about the importance of including privacy in your SDLC: Privacy is an afterthought in the software lifecycle. That needs to change.

Our Lifeboat badge shoutout goes to 1983 (the year Ben was born) for their answer to Why can I not use `new` with an arrow function in JavaScript/ES6?

Transcript

John Myers I'm the only person in the Philadelphia area in my company. And if you look at our Google call logs, and you took out my name, and my email address, you're like this person is calling in from Philadelphia, like, that's John, like there he is. Bow do you mitigate that? That becomes a really challenging problem. And so what our synthetic data capability does is goes in and builds a machine learning model on the original data, at which point you can throw out the original data. And then you can use that model to create records that look and feel like the original records. And then we have a bunch of post-processing that removes outliers or overly similar records, and what we call privacy filtering. So it's kind of like the easy button, although it's not so easy. But we want to make it easy for the developers to use.

[intro music]

Ben Popper Stop overpaying for Big Tech cloud - Try Vultr Instead! Vultr offers powerful cloud compute and bare metal at a fraction of the cost. Visit Vultr-V-U-L-T-R-dot-com-slash-stack to redeem $100 in credit today!

BP Hello everybody, welcome to the Stack Overflow Podcast, a place to talk all things, software and technology. I am Ben Popper, the Director of Content here at Stack Overflow. And I am joined today, as I often am by my terrific co-hosts, Cassidy Williams and Ryan Donovan. How's it going y'all?

Cassidy Williams Glad to be here.

Ryan Donovan Hey Ben, how you doing?

BP So let's put this out there, before today, before we set up for this podcast, either. Have you heard of privacy engineering?

CW I have heard of it. I've heard some talks given around it. Not so much privacy engineering as it is, like figuring out how to privatize your code, because everybody has a fingerprint on the internet and fingerprint way of doing things. And I saw a really interesting talk once about how someone was able to determine who was coding a particular code sample based on the style of code. The simple example is like, do they put spaces around parentheses or put a curly bracket on another line? But it got much more deep into coding styles. And her research was all around how to anonymize your code a bit more.

RD Yeah, we published a blog post a while back about why your privacy should shift left into the software engineering cycle.

BP Yeah, it makes a lot of sense to me, I feel like we've had a bunch of conversations on the podcast about how these days, even at sort of almost the MVP level, you are worrying about things to do with consumer privacy and data, especially around GDPR, and how you help people auth-in or log in and keep a password or share some of their data with you, without compromising your users in some way. And having that come back to bite you. Alright, well, let's get into the meat of the episode. Our guest today is John Myers, who is CTO and co- founder at Gretel.ai, an organization, a company that specializes in privacy engineering. John, welcome to the show.

JM Hey, thanks for having me, excited to be here.

BP So how would you define at a high level what privacy engineering is all about? And tell us a little bit about how you ended up in the the gig you're at right now?

JM Sure, yeah. So to summarize what I see privacy engineering, as you know, I came from a background of doing a lot of data engineering. And I view it as kind of an extension of that, where as part of doing what you'd normally do in data engineering, whether it's like ETL, or doing some type of enrichment or modification to the data, you need to really consider what types of steps you need to take to make the data safe. From a privacy perspective. You know, I've spent my entire career building a variety of data pipelines mostly for you know, some end state where you're trying to do some analysis on it. And oftentimes, I found myself spending more time figuring out how to make the data safe to insert into some downstream data storage that was going to be queried by, you know, a whole different user base. I would say most of the people that Gretel have a similar story. And, you know, as an engineer, myself, I never really found tools that could do it. Or if I did find tools, they were pretty scattered across different places, right? Like, I would have to go use some library to do one thing and another library to do another thing, and nothing was really unified together. So when we launched Gretel, we said, what if we took like, all these capabilities, like put them behind a singular set of APIs with like a very, you know, similar developer experience so that you don't have to go reach out to six different vendors to do what is basically a related task of taking data and making it safe to share.

RD How do you make it safe to share?

JM Yeah, so we have kind of a buffet of capabilities. So I would say we start off with a couple of tools that we consider to be table stakes. So some of the capabilities we have focus around the ability to detect sensitive information and data sets. And then after that, you can detect it you can apply various transforms to it. Those capabilities are out there. That's basically an area known as like data loss prevention. There's a lot of API's that are baked into different cloud vendors for that, like AWS has won Google has won Azure has won basically have to be in their ecosystem to use them. And then there's a bunch of vendors standalone that you can install in your ecosystem. But they're really designed to work at a high level, like they're either doing some type of like data governance type of function, where they're trying to identify and create pointers to where you have sensitive information, we're really we wanted to do was kind of bring that down to a lower level and enable developers to actually execute those types of tasks directly in their workflows, whether it's the ICD or some type of ETL process. So that's one of the big capabilities that we have. And that gives you a lot of flexibility to identify what I would call like discrete privacy risks, right, you have phone numbers, or email addresses or addresses, stuff like that, you can detect those. And then you can apply a variety of transforms to it. Tokenization and encryption, you can do partial redaction, we have a capability to do like a surrogate replacement with a fake value that maintains consistent across the data set, you could do a lot of that relative to other foreign keys on the data. And then we kind of move up in complexity from that into what we call synthetic data, where you might not know all the risks that are within your data set. So outside of having those types of discrete risks, where you have all those identifiers to individuals, there's a whole myriad of risks around like re-identification attacks and aggregate risks and being able to you know, identify someone just because I'm the only person in the Philadelphia area in my company. And if you look at our Google call logs, and you took out my name, and my email address, you're like this person is calling in from Philadelphia, like, that's John, like there he is, how do you mitigate that, that becomes a really challenging problem. And so our what our synthetic data capability does is goes in and builds a machine learning model on the original data, at which point you can throw out the original data. And then you can use that model to create records that look and feel like the original records. And then we have a bunch of post processing that removes outliers or overly similar records, and what we call privacy filtering. So it's kind of like the the easy button, although it's not so easy. But we want to make it easy for the developers to use. But under the hood, we have a whole bunch of complex things going on that been developed by our ML team. And then we've, you know, put it into our actual API ecosystem. And that's really kind of where we're going with it is, how do you make it so developers don't necessarily have to go through this, like discrete identification process and know what's exactly what's in their data and have to be like experts in the schema.

CW That's amazing. And I feel like it's something that when you describe it, it totally makes sense that something like this needs to exist for developers to maintain that level of privacy. But it feels like it's something that not enough people are aware of.

RD This is a really cool application of machine learning. But I would suspect there's going to be some folks who look at this and say, oh, this is a over engineered solution, right? For data privacy.

JM Maybe, yeah, I put an extension on my deck. And they said it was over engineered. I was like, what does that mean, they're like, well I won't ever fall down. [Cassidy laughs] So, I don't know if that's a problem.

BP Right. It cost you a little money, but won't ever break on you.

RD And I think the fact that you're creating statistically similar data that is usable. That's fascinating. And I think there was one of the things you sent over talked about a particular attack that can be performed on either the hash data or real data.

JM Yeah, it might be like a, I don't know, if it was like one of the model inference attacks or just a re-identification attack that you can run on, on data. Both of those are, you know, basically, using different properties of the data set to piece, you basically take two or three other columns, and you basically are recreating the same statistical relevance as my name in the dataset, right? There's a study out there, I can't remember who it was from, but you know, they say given like, you know, even like zip code, and gender or some other characteristic, you can get close to 80%, of just identifying a singular person to that. So, that's one of the things that it's really hard to figure that out, like, from a brute force attempts on a data because if you're highly dimensional data that's really big, you're going to run out of resources, trying to find all the permutations of columns that allow you to do that type of inference. So on the synthetic data side, you know, we have different privacy levers, you can pull from differential privacy, to privacy filtering, to injecting additional types of noise into the data that kind of takes care of that for you without having to run those exhaustive types of like, searches on the dataset.

BP And let me ask the dumb questions. I guess before I do that, let me say one thing, which is Cassidy reminded me is talking about spaces versus tabs and a curly bracket, you know, that would identify some of these code. And what you were just saying about how you could de-anonymize somebody with a few simple traits that you might appear at a zip code a gender and an age. Ah, and from there, you're going to work back and figure out that person's name and social security number. I remember reading a story once about this. And it was about things that I don't usually think about. But that make a lot of sense. It was like, they look at what browser you use and what size your screen is. And then, you know, they know that you end up at this GPS point and that GPS point everyday because you're going to and from work, and from there, boom, it's like trivial to own you and have everything that, you know, they can they know who you are, and they can probably even open up some of your less well protected accounts, right? Because they've got your name, your address, and bunch of other stuff. So yeah, this stupid question I wanted to ask was, if you create synthetic data, you mentioned you can throw away the other stuff. But if you need the original data, how do you work backwards when you want to, you know, like, reach out to the customer with some of that identifying stuff, their phone number, their address, like this synthetic data, I can understand how would let you identify them as a person and therefore work with them on a bunch of different, you know, areas, but how can you throw out the original, and then just work from the synthetic. Are there not things in the original that are a value to you, that the customer or the user has given?

JM So what I meant when I throw it out, meaning you don't need the original data to then go ahead and create the new safe versions of it, because the model is going to take care of that for you.

BP Gotcha.

JM What we recommend and to kind of, you know, bleed into a little bit how you talk about operationalization of this and see ICD data pipelines that we recommend is you basically want to get as closest to the data creation as possible, right? So if argument's sake, let's say you have a stream of records, and that stream is like, you know, the ground truth, right, that's like the first stop, where let's say, a transaction record falls into, at that point, you want to probably start capturing some of that real data and start creating your safe version of it, whether you're gonna do transforms or synthetics, or whatever. And then you can keep a copy of that somewhere. And that repository of that synthetic data is probably the one you want to start forking the data from when people start coming to you and be like, Hey, can you dump a table from two weeks ago, because we want to send it to the biz analyst to do something with right, that's you just want to pull from that table, because you know, it's ready safe, and it's good to go from there, the original data should stay behind some, you know, really big vault, so to speak with a key and only have access to who needs it. Because most and a lot of other industries are going to require that for compliance or legal or FBI comes knocking for one reason or another, like you can't just be like, cool, here's a bunch of synthetic data, they were they will need the real data. But for the use case, where you want to just share that data for collaboration, or you want to share it especially over with teams that are doing like machine learning and data science on it. Right most of the time that that is more valuable in aggregate than it is knowing I need to know who made this purchase? Exactly, because they're gonna go to jail, right? Like most companies aren't doing that's an edge case that you have to still retain that original data for.

BP Right, I'm on the data privacy team, the privacy engineering team, we've got a lot of customers, and they've shared with us their credit card and address and phone number. But we've also got other user metrics, and we want to send that over to a team that's going to look at that user behavior and use that to make good product decisions. We synthesize the stuff we don't want need them to see we send over the stuff they do want to see. Now they get the value without the privacy risk.

JM Right. Yeah. And in our experiments and our customers have given us is actually back to, you know, we've had folks that have taken whatever their data is the created a machine learning model on it, because they want to predict a user behavior, they want to predict something. And they have whatever accuracy of that model. And they put that same original data set through Gretel to take the synthetic data set train is same exact machine learning model, same outcome, right. So really, what you can do there is you can, you can build a machine learning model on synthetic data, take that model, put it back over like that safe, like firewall, so to speak, and then start running on your production data, you get the same output, because most times the data owners are not the ones that are doing that heavy machine learning work, it's usually got to get that data over to an ML team, sometimes even a third party if you're a small business, right? Like you're going to bring on a company that has expertise of doing machine learning. And so do you really want to give them all that sensitive information that they don't even need access to? Right? That's one of the biggest use cases we see for the ML side of the house.

CW When you're generating all of this synthetic data, how do you measure the quality of it? How are you confident to be able to say like this dataset gives you the same value as the original?

JM Yeah, so when the data gets created, when the model gets created, we generate a sample of data, we use that that immediately compare it against the training set. And we have this whole synthetic quality report that comes out. And it shows you the breakdown of how well the distributions in a field hold up, the correlations across the field. We do other checks like principal component analysis to kind of look at like underlying structures. And then we also have a bunch of things that run that do semantic validation on the dataset. Those are kind of guarantees you get that don't have to be analyzed. So like a whole column of names, you know, will identify what the character set is right like only these letters use these numbers are used. We'll make sure that those only come out so you don't get like the oh, John, zero, right? Like you don't want that to happen. So we enforce that type of thing. And so that report comes out. And then they can basically look at that report. And they'll know, okay, based off of the report, we kind of have a spectrum and we say, Well, if your score is this good and excellent, you can probably use this data for the exactly exact same thing you're going to do with the original data, like train a machine learning model of it's kind of like, the highest level of like, proof that the data was good quality. Now, if you start turning on, like the privacy knobs, and that's the thing is you have utility versus privacy. And that's kind of a trade off, right? If you if you say I want maximum privacy, we're going to filter out all the outliers, we're going to filter out similar records, or we're going to do it like very aggressively, you might not get the utility you're looking for. But some companies are just looking for the data to kind of look the same, because they want to use it for demo data. If you're testing like a user interface, right? Like, I don't care if it's statistically sound, but I just need the first names to look like first names, you know, and in the Styles, berries to taste like sounds, berries. And but again, if you're going to share that data for a demo, you really just want to turn up the privacy filter as much as you can. So you have those capabilities before you start training. And then the report will tell you, here's what you should use the data for it, here's what you shouldn't use the data for.

BP So once you learn how to do this, if you saw great results, could you then build me some sort of procedurally generated engine that just creates new fake customers, which are equally believable as the existing customers?

JM Yeah, so that's kind of like on our roadmap for this year is now that we have that ability do we have the ability to go from a cold start, basically, is, can someone come in and be like, I'm a bank, right? And all banks kind of here, their transactions look, the same doesn't matter. If you're like a Visa or MasterCard, right? They're all had the same like, schema. Can you just generate me transactions that are common in the northeast side of the United States around the holidays, we should have that ability, then to basically generate those from scratch, either from a pre-learned model, or some kind of underlying semantics we've learned from dataset. So that's actually a roadmap for this year to be able to kind of do that. So people can come in and just describe data they want, and then they get it back.

BP Oh, great. Well, you should hire me to be on the r&d team because I thought of that just now.

CW Aha! So when it comes to the creation of the data, kind of blows my mind, like, it makes sense from a theoretical perspective, but you've talked about measuring the quality of it, how do you ensure that like, privacy is still maintained now that the data has been created and is so high quality that it could be used the same way? Like, what are the best practices around that?

JM Yeah, so there's a couple of different capabilities we have around that, what we're finding more and more is that customers are actually combining a lot of our capabilities to start with. So you know, previously I mentioned, we have this capability to do PI`I detection, and what I would call like more discrete transforms on the data. And sometimes it's really safe just to go ahead and make take a pass with that before we even train a synthetic model. So for example, if you have like time series data, right, and it's heavily dependent on, you know, some point in time and some measured value, you could go ahead and say, Hey, before we even train a synthetic model, go ahead and apply, we have like a date shift transform, that will apply like some date shift within some window, call it plus or minus five days, go ahead and do that first. Because then when the model learns on that, it's going to learn the kind of semantics of those shifted dates. And so it reduces the risk of it, memorizing, you know, some dates that were in the training model, you know, go ahead and take all the names and replace them with fake names before we, before you train this synthetic model, you basically helps reduce the risk of the model memorizing anything. And meanwhile, all the numerical data, all the correlated that is still held when you do it that way. And there's a whole bunch of hyper parameters on the machine learning model that you can tune as well, on how much it should like learn. And then that will kind of control like how much I'll say like variants gets injected when you start creating it. So there's, like on the underlying machine learning model, which, you know, I'm not the expert in, right, I'm, I'm kind of on the service delivery side, but you can turn a lot of those knobs that kind of control the data output. And then finally, there's another stage that we call privacy filtering that you can set to a bunch of different settings that specifically look for outliers in the data set. So when the synthetic data gets created, there's any outliers you by default, you kind of want to throw them out, right? So if you have outliers in your training set, it's going to learn those and recreate those. And so, you know, it still might create some records for someone who's in Philadelphia, and that's me. Sometimes you just got to remove those records from the dataset because there's no other way to avoid that. You know, there's only one person in this remote location. So all those knobs can be tuned by the customer based off of what like what level of privacy they need, and that kind of maps to like, what level of sharing you're going to kind of do with it, right? Like hey, you just really got to share this inside of your organization. Are you actually going to try to like send this outside of your organization because you're sharing it to a third party or vendor? You know, we've worked with like volunteer organizations that are getting data from companies, if you're going to share it with them, you probably want to make, you know, err on the side of precaution and use like a higher privacy setting there.

CW That makes sense.

RD So, you talked about transforming the data. So does this all happen within the sort of ETL pipeline or happen, kind of, at various stages?

JM We would recommend that would happen like right, as part of like the Gretel stage in your in your pipeline. So Gretel would be kind of like a bump in the line of your data pipeline, or your CI/CD process, where a job based system so that's kind of the way the system works. So it would just be like two Gretel jobs, the first job would do some transform the output from that would go into the synthetic job. And then the output from that would be your synthetic data.

BP And can you tell us a little bit about sort of like what the tech stack is that you work with?

CW I was going to ask that too. I'm so curious.

JM So we run as a service, so a SaaS model. So the entire stack is running on AWS. And it's a job based system. So if there's like an Apache Spark, or something like that, you're you know, you're basically starting a job. And then that job consumes an input dataset, a configuration file, does its thing, and then kicks out a bunch of artifacts, right? The synthetic data, the report that you get all that. Our API and our console, or CLI or like, kind of the interfaces into the system. You know, through our API, it's more of an orchestration API, so you request that a job can get started. And then once you get like authorization to run that job, can start the job either in our cloud or you can start the job yourself. If you start the job yourself can actually just run a Docker container that takes in all your data, that Docker container still sends information to our API, just basically like when the job started, when it stopped, are there any errors that so we can kind of use it as like a billing and metering mechanism? If you choose to run the job in our cloud, we basically do the same thing. We have a managed Kubernetes cluster that runs all the containers for you, our containers don't know, like where they're living, they just know that they're starting up and are getting some metadata for your API, then they start their work. And they're saying, hey, when you're done, drop your output data, either on a volume out, like if I'm running in, like my own VPC, we're upload those artifacts back into Gretl cloud, and a customer can download them. Yeah, so tech stacks on AWS, we pretty much rely on a lot of the serverless capabilities from AWS on the back end to do all that we manage a big Kubernetes cluster, which does our cloud providers for our our customers if they want to run on our compute, which is helpful because on the synthetic side, you have to run with GPU support. And even just getting like a deep learning setup, like configured is a royal pain. So we kind of take care of that for the customers. For customers like enterprises that run on prem, all they really need to do is configure some compute, you know, that can run Docker containers. And you can use our CLI and our CLI to launch the containers for you on your behalf and all that. But our cloud system was really just like this big troll plane that manages the jobs.

RD But how do you protect the data, whether it's in transit or on your servers?

JM Yeah, so I mean, all remote communications through our cloud is TLS, 1.2. and above, any kind of data that gets sent to us by customer to process is encrypted in transit, and then encrypted at rest, and only accessible through that customer via their API key to upload and download. We don't really hold on to much, we're kind of in a femoral system like that. And that out, we expect the users to download the data out of our cloud, we don't, we don't recommend that they store it up there. And that's really just that helps minimize like cost, you know, we're not a remote s3 service for them. So basically, if you upload a dataset to us, we process it if you use a lot of our SDKs. And our tutorials will end up just deleting the data you sent as soon as it's done that we expect or recommend that you download your results right through it. Really, our cloud is just a way for you not to have to like manage your own compute and scale your own compute.

[music]

BP Alright, everybody, it is that time of the show. I'm going to shout out the winner of a lifeboat badge who came on Stack Overflow and helped rescue a question from the dustbin of history. awarded five hours ago to 1983. Same year I was born. 'Why can't I use the new with the arrow function in JavaScript ES6?' Alright, we've got an answer for you. And we have people on Stack Overflow been answering this question in a duplicate for the last five years, so appreciate your help there. I'm Ben Popper, the Director of Content at Stack Overflow. You can always find me on Twitter @BenPopper, email us podcast@stackoverflow.com. And if you'd like to show, leave us a rating and a review, really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. I'm on Twitter @RThorDonovan. And if you have a great idea for a blog post, please email me at pitches@stackoverflow.com

CW I'm Cassidy Williams, I'm head of developer experience in education at remote you can find me @cassidoo on most things.

BP And John, tell us who you are, where you can be found. If you like to be found. If you want to find the synthetic John, that's also okay, the real John or the synthetic John. And then if folks are interested in Gretel, where should they go to check it out?

JM , thanks for having me on the show. And I'm John Meyers, I can be found on Twitter, sometimes, JTM_tech. Other than that, you know, I'm easily reachable at Gretel, anyone can ever always reach out to us check out our homepage at gretel.ai. We are hiring like crazy for all roles imaginable. Engineering, machine learning, marketing, product, sales, you name it. Or you can always reach out to us at hi@gretel.ai. Yes, that's what comes to me or the other co-founders, so chances are one of us will answer you and it will not be a synthetic John who answers. [Ryan laughs]

BP Maybe. Alright everybody, thanks for listening, and we'll talk to you soon.

CW Bye!