A Special 'Fireside' Chat with Gretel's Founders
Video description
A fireside chat with Gretel's founders on the latest in synthetic data generation, LLMs, privacy, and much more.
Transcription
Speaker 1 (00:06):
Is my mic. There's no background noise on my mic, right? I have it set up good.
Speaker 2 (00:11):
No, just like some random screaming.
Speaker 1 (00:13):
Oh yeah, that's normal.
Speaker 2 (00:14):
Yeah. Okay, so it sounds like we are live, so we can get started, man. Just keep going on the same transition. I was going to suggest John and I have a one-on-one sync coming up. I just say we just do it here. Just be very efficient on the time and just get that out of the way.
Speaker 1 (00:32):
Hyper transparency. All the way.
Speaker 2 (00:34):
Hyper transparency. Yeah. Yeah, that'll go over super well right now. All right, so we have a bunch of topics we want to cover. We're going to do a year-end review of all things generative AI, privacy, synthetic data. We'll hopefully make this a recurring thing and we will write about it a little bit. Should we do a quick round of intros just so people know who the hell we are? All right. I'm Ali. I'm one of the co-founders at Gretel. I have known Alex and John for a very, very long time.
Speaker 3 (01:03):
I'm Alex. I'm one of the co-founders here as well. I lead our product and applied science teams.
Speaker 1 (01:08):
Alright, and I'm John, also one of the co-founders and I lead our engineering team.
Speaker 2 (01:12):
And John, when do you get back to civilization?
Speaker 1 (01:16):
Probably in the next three months.
Speaker 2 (01:19):
I just anticipate all those buckets behind you just full of fireworks.
Speaker 1 (01:25):
Yes, as much as I'm un-civilization, kind of, in eastern Pennsylvania here,
Speaker 3 (01:29):
I'm waiting to see some deer walk around back behind you,
Speaker 1 (01:33):
Somebody
Speaker 3 (01:33):
Taking a pot shot at him.
Speaker 2 (01:35):
I feel like it's been quite toned down as far as what John has around him since we were only five people having conversations, and now we're over 50. So it has definitely evolved. All right, so we have an hour blocked, but we'll end sooner if we get through all of our topics. I think we have the first 30, 35 minutes slotted just to talk about FTX, so we can just start at the top there. Just joking, we're not going to go down the crypto FTX rat hole right now, so we're going to keep that one off our conversations. What I would love to do is maybe just frame the conversation for what we want to get across. I think it's been a pretty incredible year this year for AI generally, generative AI, synthetic, sort of everything that's come, and there are a few topics we definitely want to touch on, whether it comes to generative AI for imagery, for example prompting text to image.
(02:32)
I think the big one we all really want to hit on is ChatGPT, because a lot of people talk about an aha moment when it comes to their products, and I think a million people had an aha moment at the same time when it came to ChatGPT. So we have to hit there. But I do think there are some more fundamental things under the hood we want to talk about, which is the different approaches when it comes to privacy-preserving training of data under the hood. And then we'll get into some really fun stuff we came up with without really realizing how much work we're going to put into it, but we signed up for it, so we're going to have to do it. We will get into that at the end, and I hope it will be a lot of fun for folks and hopefully drive up engagement. So maybe we can just start very high level. We'd love to get your thoughts on large language models and where we've come this year. That's been a big topic that has come up recently, so maybe we can start there. Just generally large foundation models, and we'll get into language and image and go from there. Alex, do you want to kick off?
Speaker 3 (03:31):
Yeah, I'm happy to. So I think everyone's seen the news around ChatGPT, and before that InstructGPT, and a lot of the work that the folks at OpenAI have been doing, which is pretty incredible and has really stretched a lot of people's ideas or opinions of where language models can go. If you were to rewind a year, to the beginning of last year, and ask most people, hey, what can a language model do? It could write a tweet for you, it could maybe even write an intro paragraph to a blog, but it wouldn't be distilling knowledge, and you wouldn't be able to use it to accomplish tasks and things like that. And I think what people are seeing here is that you can actually use these models for something a lot more sophisticated than just generating random text. You can ask it to summarize something for you. You can ask it to provide a really nuanced opinion based on a ton of data in a way that might be really difficult even for a human to do themselves. And I think we're just seeing the beginning of it. ChatGPT is incredible as an example. There are some limitations that I think the research community is working through with language models right now, but we've seen steady progression, really getting to the point where you have nearly human-level responses and understanding. Super exciting.
Speaker 2 (04:47):
So, two questions we'd love to understand a little bit more and dig into: what do we see as some of the limitations, and then, related to what we talked about, which is sort of this emergence of insights that models now offer, it would be great if we could touch a little bit on how we see these order-of-magnitude jumps in parameters during training actually influencing that.
Speaker 3 (05:08):
Yeah, yeah, and I think limitations are meant to be overcome. So it might be even better just to call them bumps in the road that we're hitting right now. Under the hood it's still the same type of technology that we've seen with GPT-2, GPT-3, things like that. So under the hood the model is essentially always just predicting the next token in a sequence. But some of the improvements that have been made in training are really about trying to align the model output with what a human would expect given a set of instructions. Fortunately the model's been trained very altruistically, and it's answering what we're asking it to do. It's not trying to plot the world's demise or anything like that. So it is a sign of AI being built really to support what we as humans would want an AI to do.
(05:56)
So that's really cool. Some of the challenges that we see, particularly at Gretel, as you start looking at applying a model like this, a transformer-based model, to structured data and things like that, are the limitations in memory that you have. Even with a model of the same scale as ChatGPT, which has a memory of about 8,000 tokens, you can do some really clever stuff where, at each iteration, when you ask it a question and then ask it to refine the previous data, you can kind of stuff the model back with context that helps it make the next decision for you. But when we get to things like datasets that have, for example, millions of rows, or other challenges like that we see in the industry, people interested in using these models to create music or things like that where you have this really long progression, not just a summary, not just a cool one-page blog but a piece of music, or you want to write a full story, then you have to start doing much more clever things to keep the model in tune with where it is and where it needs to go.
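To make that context-stuffing idea concrete, here is a minimal sketch of keeping a conversation inside a fixed token window. The function names, the whitespace-based token counting, and the 8,000-token budget are illustrative assumptions, not OpenAI's actual interface; a real system would use the model's own tokenizer.

```python
# Illustrative sketch: keep as many recent turns as fit in a fixed token budget,
# dropping the oldest turns first, then append the new message.
MAX_CONTEXT_TOKENS = 8000

def count_tokens(text: str) -> int:
    # Stand-in only: a real system would use the model's tokenizer, not whitespace.
    return len(text.split())

def build_prompt(history: list[str], new_message: str) -> str:
    """Stuff prior exchanges back into the prompt until the budget runs out."""
    budget = MAX_CONTEXT_TOKENS - count_tokens(new_message)
    kept: list[str] = []
    for turn in reversed(history):      # walk back from the most recent turn
        cost = count_tokens(turn)
        if cost > budget:
            break                       # older turns simply fall out of the window
        kept.append(turn)
        budget -= cost
    kept.reverse()                      # restore chronological order
    return "\n".join(kept + [new_message])
```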
(07:00)
So these are all things that are being worked on, ones we put a lot of effort into. I think that's one challenge. I think another big challenge around it is consistency. And I think it's almost an unfair characterization, but when you hear about transformers, or even Stable Diffusion for images, there is a feeling that these are most useful for creative applications. Anyone that's ever played with Stable Diffusion, for example, knows you can ask it to generate an image and one of them is fantastic, and you think, oh, I just want to iterate on it a little bit more, and then it generates something totally different. So I think one of the next big challenges that we'll see across any type of generative AI is really trying to have confidence that the outputs meet the expectations of what you're looking for.
Speaker 2 (07:47):
So
Speaker 1 (07:48):
How do these handle the inputs changing, though? Because a lot of the outputs are not going to be remotely close to what the inputs are. You mentioned writing music. What about saying, hey, finish this measure for me? If you're writing music, that's hard to do right now. You mess with it and it has somewhat of a novelty factor where you can provide text input and prompts, but do you think there's an evolution in how interfaces change, in how you receive those prompts to be contextual to the use case or what you're trying to achieve?
Speaker 3 (08:22):
I think that's so cool. I mean, we've started to see, whereas we all would've thought a couple of years ago that if you want to build a synthetic image you need to start with a set of domain-specific images to point it in the right direction, we've now seen the power of being able to describe what you want with natural language and where that can go. Yeah, so super excited about that too.
Speaker 2 (08:43):
The part of it that gets really interesting, I think, and we've had this conversation a bunch of times, is how does prompting differ when you're doing something for consumers versus businesses? One of the big things we have seen a lot is that a prompt into a model from a business standpoint can have a lot of biases based on what your visibility is. You just don't have the ability to hold all that information in your head. So one of the really cool things that I've taken note of is this notion that training a synthetic model on your data can eventually end up being this highly contextual and dimensional prompt for a large pre-trained model. And where I would love to actually go with this is to talk about two things which we briefly touched on, which are the privacy and the implementation, which I think are very closely tied.
(09:32)
I just want to get some thoughts around the approach from the OpenAI side of things, which is a little bit more locked down, a little bit more behind APIs, which clearly is lending itself to some verticalization from an integration standpoint. And then there's the other extreme, Stability AI. Now they've gone through some iterations: 1.0 was a lot more of a Wild West, 2.0 was overly locked down, and 2.1 is sort of correcting back. But one area that, Alex, maybe you can kick off, and John, I would love to hear your thoughts on, is around privacy. You sent a tweet this morning, Alex, about some of the implications, actually giving props to ChatGPT. Can you expand on that?
Speaker 3 (10:11):
Oh yeah, the tweet this morning. I was curious and I was looking at the privacy policies across OpenAI's website, and I was curious, for all of the million users that have used ChatGPT in the past couple of weeks, or even the paying API users, how does that data get used? And one of the things that was surprising to me was that any of the data you submit to the service can be used for model improvement, and there's not a lot of detail about that, whether researchers are able to see it or how long it's retained for or anything like that. But it is clear that anything you submit to the service can be used, and that was surprising, I think, for a lot of applications. I guess what we've seen at Gretel over the past couple of years is it's very sensitive data that people end up sending in.
(10:59)
That's part of the reason that you're using a model like this, to either create more information that's similar to it for privacy reasons, or, if you're trying to summarize really important domain-specific stuff, it's something you want to have a lot of control over. I did read through there that if you are an organization-level user you can request that your data is not used to train models or is not accessible to researchers, things like that. But on a couple of fronts, I think privacy is a super important part of generative AI, and I think generative AI is actually a really clever and kind of nuanced approach to privacy that people haven't thought of. We've all seen that de-identification of data or scrubbing text doesn't work, but the idea that a generative model could read all of the data across your business, it could read even really private conversations, but then summarize it in a way that isn't individually revealing, that's really exciting. So privacy continues to be front and center in a lot of the discussions around generative AI, and it's interesting to see how it's unfolding.
Speaker 2 (12:15):
Yeah, I also would love to hear your thoughts on the implementation of open versus closed, of OpenAI versus Stability AI, but I would love to hear your privacy take first.
Speaker 1 (12:26):
Yeah, I think on the privacy side, the interesting thing for me is that right now you fire up ChatGPT and it's basically one model that's serving the world, and I feel like there's this opportunity to take those types of large language models and be able to, for lack of a better term, fork them and then build on top of them for very vertical-specific use cases. And I think that also addresses some of the privacy concerns. Now that model can be improved with only the inputs from a certain audience or a certain business, but you don't have this whole collective unified model that can either learn things it shouldn't learn or become inaccurate, and it can still show the world what's in the realm of possible.
Speaker 2 (13:11):
So do you guys think this notion of prompting individuals specifically into these models, does it bear the same potential risks as ads carry as far as targeting and dealing in context go, or do we think this is a much lower-fidelity, lower-signal read on an individual? Does this really have similar long-term implications from a privacy standpoint?
Speaker 3 (13:37):
I think with LLMs, one of the neat things that we've observed with our APIs and how customers use synthetic data is that being able to mix public data in with that private data gives you some really great privacy benefits out of the gate, which is awesome and really good to see. So you get privacy benefits, you actually build on the accuracy, and you can incorporate public knowledge to build new variations and address this big problem that a lot of people have where they just have very limited access to data, they don't have enough examples, or anything like that. So I think that's super exciting. I'd kind of echo John on this one: the power of foundation models, but fine-tuned to a domain-specific thing, which could be your company's data or that unique dataset that you have that covers anything from electronic health records to patient diagnoses, will build a new set of apps, a new set of ecosystems, where people can really quickly build value on the data that they have without selling it, which is pretty cool.
Speaker 2 (14:51):
So that brings us to a pretty important part, which is that I think there are multiple phases to the entire training cycle. We've talked a lot about the quality of data and the privacy, and I want to dig a little bit deeper, because one of the things we've talked a lot about is the implementation of privacy. There are the concepts, we have privacy filters, and then the right implementations of differential privacy, which I think we're very opinionated about. So maybe, John, just starting with you: one of the hurdles, or bumps to borrow from you, Alex, as far as generative AI or large models go, is this notion that if it's a generative model there are hallucinations, and then how do you actually prevent these random hallucinations that the model is potentially creating? Let's talk a little bit about validators and their importance, because we come across so many users and customers for whom the notion of validators is a very new thing, and I think that's because most of the market has been influenced by some of these tools, whether it's Stable Diffusion models, or DALL·E, or others like ChatGPT, where the validator is not as pronounced in the feedback loop.
(16:01)
You miss that concept. But we'd love to get both of you to help unpack why validators are so critical for the generative AI side.
Speaker 1 (16:12):
Yeah, I would say we kind of have our own validation concepts for some of our models, specifically our LSTM model, that can be used to create values that weren't necessarily in the training set. And the important thing we realized there, especially in a world of creating semi-structured data, is that you need this type of post-generation validation to make sure that the data is within some kind of constraint of what you expect. For structured data, that's a little easier to do when you're looking at different types of numerical values, categorical values, things that are in a set, things that have to meet certain types of lexicographic patterns. I think that hurdle is going to get a lot higher when you have these types of LLMs that have basically highly unstructured output, and how do you make sure it stays in bounds. So I think as we progress down the path of looking at how to utilize LLMs within Gretel, that's going to be a huge research area for
Speaker 3 (17:12):
Us. Yeah, yeah, we've seen that with one of our users. Maybe just to go to an example summarizing automatic
Speaker 2 (17:23):
By the way, Alex, can I interject something? Yeah, please. Because as far as the trajectory of a lot of these video sessions we're going to do, we're going to talk a lot about customer learnings, but I think we should just make it explicit: we will never attribute that information to a user or a customer. We're going to talk about general learnings, but I think we're coming across some fascinating stuff that we should definitely make sure the community hears about.
Speaker 3 (17:46):
Yeah, so this goes to a text summarization use case where you've got data that users have left in a raw format that you want to quickly summarize and make available to somebody. We love generative models' ability to summarize and to generate new text, and we also love the different variations that they create, but the challenge is that sometimes those variations are completely unintended, so we use this term of calling it a hallucination, when the network just makes up something that's completely off the rails. As John was mentioning, this is something that we incorporated in the early days of working with tabular data, where we would be generating millions of records to match the distribution of the dataset a model was trained on. And, just to make this up, let's say this is a patient record and you've got different ages of people inside the dataset. You want the network to invent new ages.
(18:41)
You just don't want it to invent someone that's 300 years old. So very early on we started building technologies that would look at the output of the neural networks. It doesn't limit what the network is capable of learning or generating, but it makes sure the outputs match the intent or the distribution of the original data it was trained on, and being able to depend on the data is so important when you start looking at machine learning, downstream use cases, and things like that. So I think over the last couple of years we've gotten really good at doing this over tabular data, and I think the challenge for the entire industry over the next couple of years is going to be how do we do this with natural language text? How do you know that a text summary that you generate, for example, is in line with the original text it was based on? Similarly, when you ask Stable Diffusion to generate an image of a unicorn on a surfboard, how do you know that that unicorn is going to be sitting there on the surfboard in your images, so you can use that to train a downstream model to recognize unicorns on surfboards?
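As a concrete illustration of the validator idea, here is a simplified sketch (not Gretel's actual implementation): learn simple bounds from the training data, then screen synthetic output against them so a generated patient can't be 300 years old.

```python
# Minimal post-generation validator sketch: learn bounds from training data,
# then drop synthetic records that fall outside them. Illustration only.
from dataclasses import dataclass

@dataclass
class NumericBounds:
    low: float
    high: float

    def valid(self, value: float) -> bool:
        return self.low <= value <= self.high

def learn_bounds(training_values: list[float]) -> NumericBounds:
    # In practice you might widen these slightly to allow novel-but-plausible values.
    return NumericBounds(low=min(training_values), high=max(training_values))

def filter_synthetic(synthetic_values: list[float], bounds: NumericBounds) -> list[float]:
    """Reject hallucinated values (e.g. a 300-year-old patient) before delivery."""
    return [v for v in synthetic_values if bounds.valid(v)]

ages_seen = [23, 41, 67, 88, 19, 54]
bounds = learn_bounds(ages_seen)                     # learned range: 19..88
print(filter_synthetic([35, 300, 72, -4], bounds))   # -> [35, 72]
```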
Speaker 2 (19:43):
So one thing I do want to get into, which I think is coming up more and more as a consistent theme, is multimodal data, and I think there are multiple sides to that comment or that question, right? On one side there's the aspect of multimodal data, which is the generation or the training of it. Then there's another question, and I actually think I saw this in the comments somewhere, which is the right balance between, call it, raw data and synthetic data, which gets into seeding. And then there's a part of it which is pre-trained models versus large foundation models versus training on your own data. And I think all of these brought all of us to a single conclusion, which is that there is no one or two or even 10 models that will rule them all; the real value is around the velocity of having as many models at your disposal as possible.
(20:38)
This is actually why I really love the approaches of companies like Hugging Face; it's really helpful to have something like that around. But the other part of it is, if you're a company like ours, commercializing and productizing a lot of these things, whether we're building greenfield technologies or leveraging best-of-breed open source, what we all realized and bet very heavily on was that you need frameworks to be able to actually deploy models very quickly. So one thing that I don't think gets enough attention, and that I think is going to be a huge driver for a lot of these companies, is essentially what we call MIF, our model integration framework. John, I actually wanted to have you talk a little bit about, maybe not specifically that because it's somewhat proprietary to us, but the general aspect of why it is very relevant and what other companies doing this should take away from it if they want to build something of that magnitude themselves.
Speaker 1 (21:29):
So in the early days of Gretel, we had our initial synthetic model that we built on top of the LSTM, and we quickly realized there were a lot of other model types that we needed to build and launch. What we figured out is that you could generalize the inputs and outputs pretty well, and then you really needed to figure out a way to kick off the work in a single architecture and be able to harvest the outputs and stitch all those outputs back together. So you're saying you could take different types of data and split them up for different models based off of the type of data that you have? Yeah, internally we have infrastructure called the model integration framework, which allows our internal applied science and engineering teams to quickly launch new models by basically, for lack of a better term, filling in the blanks in some defined code files, bringing in your own business logic and your own imports, and then our machinery and our backend take over and it all becomes universally deployable through our APIs.
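A hypothetical sketch of what such a fill-in-the-blanks contract could look like; the class and method names below are invented for illustration and are not Gretel's actual model integration framework.

```python
# Hypothetical sketch: every model type implements the same train/generate
# interface, and surrounding machinery handles data loading, scheduling, and
# output harvesting. Names are invented for illustration only.
from abc import ABC, abstractmethod
import numpy as np
import pandas as pd

class SyntheticModel(ABC):
    @abstractmethod
    def train(self, data: pd.DataFrame, config: dict) -> None:
        """Fit the model on source data using a model-specific config."""

    @abstractmethod
    def generate(self, num_records: int) -> pd.DataFrame:
        """Produce synthetic records in the same schema as the training data."""

class GaussianNoiseModel(SyntheticModel):
    """A trivial example 'model': resamples rows and perturbs numeric columns."""
    def train(self, data: pd.DataFrame, config: dict) -> None:
        self.data = data
        self.noise = config.get("noise_scale", 0.1)

    def generate(self, num_records: int) -> pd.DataFrame:
        sample = self.data.sample(num_records, replace=True).reset_index(drop=True)
        numeric = sample.select_dtypes("number")
        sample[numeric.columns] = numeric + np.random.normal(0, self.noise, numeric.shape)
        return sample
```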
(22:33)
And what we're working on now is being able to make this type of deployment model scalable in what we're calling our hybrid cloud mode, so users and large enterprises can run this framework in their own infrastructure, building on top of proven technologies to allow for that scale. So building on top of tools like Kubernetes, working on being able to help automate the provisioning of these clusters, especially with their GPU resources. That's what I think most customers get hung up on: how much infrastructure has to be stood up to be able to run these different types of models, and the resources needed. And so that is, here at Gretel engineering, something that we're trying to tackle from the infrastructure layer to enable that kind of multimodal approach.
Speaker 3 (23:17):
Do you guys see, just to build on that question a little bit, do you see people kind of knowing what type of model they want to go after for a particular use case or is this something that we automate over long term? What are you guys seeing? When we talk to different users,
Speaker 2 (23:37):
I think it's not one or the other. What we have seen is that you will run into users, whether it's a vertical or very specific use cases, that have such a deep expertise and understanding: for example, I need a time series model that can deliver X, Y, and Z, and these are the parameters I need from a standpoint of accuracy, downstream prediction, or accuracy overall. So I do think, I mean, we run into those types of customers all the time, and it's not a conversation about sophistication or being earlier in their process; it tends to be more use-case driven or about familiarity with the outcome. I think what we are seeing at a very high velocity is that, just because of the proliferation of the number of models, their availability, and how quickly they're evolving, it's very difficult to keep track of that. I mean, we have a 10-person applied research team that works on this stuff all the time and it's still sort of difficult to track.
(24:32)
I do think, and this is something obviously we've talked a lot about, there's this notion of automating away complexity, from model selection to configuration to output to testing. That is, to borrow again from you, a bump in the road for most, not because they don't have the technical proficiency, but because it's just such a long learning curve that isn't inherently valuable to a lot of the core work you do. And this, and we can maybe talk about it a little bit more, is the trajectory of a lot of what we're building and releasing in a few weeks in our UI, which is these very prescriptive templates and use-case cards where you're like, I can start from here and I can accomplish this entire thing, automated end to end. So that's my take. John, I don't know, you see a lot more of the infrastructure side of things as well. I'm curious what you think on that one.
Speaker 1 (25:21):
Yeah, I think like you said, there are a lot of users that already have a notion of what type of model they want for every specific set of datasets. As we talk to more customers, what we see is that right now, in terms of the R&D breakthroughs in synthetic data, we're working on single tables, relational databases. But really, I think what the biggest enterprises are looking for is to be able to create almost synthetic ecosystems, where it's not just like, oh, I have this database of sales transactions, let me synthesize that. It's: I have this database of sales transactions, but then in a totally different infrastructure, run by a totally different team, you have a huge corpus of customer support chat logs related to those purchases, and you might be able to synthesize that relational database, but you need to bring in those chat logs and then be able to essentially crawl the entire ecosystem to recreate it, whether it's for training purposes or for internal ML use cases or whatever. And so when you get to that level, automation of model selection and how to do all of that is going to have to be paramount, because you can't expect a single team to prescribe the right models for all of this.
Speaker 2 (26:37):
Sorry, Alex, I'm curious: as far as model selection goes, there's the selection of the model, then there's the operationalization of the model as a whole. Do you guys see an acute point specifically in that overall process? And just to qualify that, we're not at the point of a few years ago where 75% of models couldn't get loaded into production. Now a lot of models actually make it into production; it's about relevancy, consistency, making sure they don't drift from the original intent. So I'm trying to understand, from your perspectives, what do you guys think is the most problematic or acute part of the process? Is it the configuration I need to run for the model? Is it the actual model? Is it automating how many epochs I run, or how I validate the output or downstream predictions? Where in this field do you see the problem being the biggest bottleneck right now? Phone a friend.
Speaker 3 (27:40):
I think, out of the gate, it's that last five meters. When you're going for the best accuracy that you can get, you really care about the result you're getting back. This isn't fake data, it's not mock data; you're actually trying to use it for decision making inside a business. That last five meters that you get through parameter tuning, typically learning rate, batch size, stuff like that, is kind of bread and butter for machine learning scientists, and even though we don't like it, it takes so much work. I think that part is time-consuming and it's a challenge. So we definitely see people spending a lot of time there: the initial results look really good and you get there really fast, but it's that last five meters you need to cover to get the best accuracy possible. Whether you're running parameter sweeps, building that as part of your MLOps infrastructure, or you've got something deployed in production and the data starts to drift, things like that currently take a lot of time, and I would love to see how we talk about solutions to this a year from now.
Speaker 1 (28:42):
Yeah, I think another thing, maybe on the flip side: that's the five-yard line of getting these models created. But then on the operationalization side, how do you treat these models as if they are that data source without having to basically double, triple, quadruple the amount of data you already have? So we talk to customers and they're like, yeah, we've got 500 gigs of data we want to synthesize. We're like, yeah, cool, here's how we could do it with you. And they're like, shit, now we've got to store another 500 gigs of data somewhere. So how do you operationalize that model? And I think what's been demonstrated really well with OpenAI is that real-time feeling of prompt in, prompt out. Obviously different model types aren't as speedy at generating the data, but that is something that's going to have to be solved for, so that if you want to share data with someone, you're not sharing actual data but you're sharing access to an API that is basically doing the real-time generation based off of the inbound prompts. And I think that's going to have to be a really big focus as well.
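A rough sketch of that sharing pattern, where consumers hit a generation endpoint instead of receiving another copy of the data; the endpoint shape and the stub model below are illustrative placeholders, not a Gretel API.

```python
# Illustrative sketch: share access to a generation API instead of shipping
# another 500 GB of synthetic records; generation happens on demand.
from typing import Optional
from fastapi import FastAPI
from pydantic import BaseModel

class _StubModel:
    def generate(self, num_records: int, prompt: Optional[str] = None):
        # Stand-in for whatever synthetic model is loaded at service startup.
        return [{"id": i, "prompt": prompt} for i in range(num_records)]

trained_model = _StubModel()
app = FastAPI()

class GenerationRequest(BaseModel):
    num_records: int = 100
    prompt: Optional[str] = None   # optional conditioning, e.g. "claims from 2022"

@app.post("/synthetic-data")
def generate_records(req: GenerationRequest):
    # Nothing is stored or copied; records are generated per request.
    records = trained_model.generate(num_records=req.num_records, prompt=req.prompt)
    return {"records": records}
```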
Speaker 2 (29:50):
Yeah, makes sense. Okay, so we want to get to questions. We want to leave about 10 minutes to answer some questions, and then we have our own challenge that we want to pose to the audience to give us some feedback, so we can go do some homework and come back and show them. So we'll get to that. But speaking of which, I want to go into the realm of the completely ridiculous. We've been talking about this for 24 hours and it suddenly became this weird project, but this is sort of crossing generative AI into not just science but sci-fi. A question: the number of parameters is obviously really indicative of the amount of total data you can synthesize into one particular model. So if you have massive, massive computational resources and you have plenty of memory, you can essentially, in theory, load the entire web's knowledge and synthesize it into a model and then ask it questions.
(30:46)
And then if you mix in the components of other models, so you go multimodal, you include time series and other types of things, and you build a large enough model, you could essentially get to a point where you could make the argument that you could predict the future, because essentially the future is just the weights and balances in the relationships of all these sparse types of indicators. Is that possible in a model? And the reason I ask is I know we talked about this jokingly and then it suddenly became this weird proposal for some work that we would want to do, to see if you could predict the future. So I'm just kind of curious: sci-fi, science, realistic in 10 years, never? Thoughts on that?
Speaker 3 (31:35):
Well, I think there's a kind of famous saying that history repeats itself, and so looking at the models that we're seeing right now, the ability to combine information across multiple sources and things like that will help build predictive models. Whether you're looking at finance, or whether you're looking at whether it's going to rain tomorrow, weather prediction is a huge
Speaker 2 (31:55):
Thing right now with AI companies, right? That's actually big business that they're investing in.
Speaker 3 (32:01):
I think that we'll be able to do that, and we can do that with the technologies we have right now in a way that's never been possible before. So exciting. Very Minority Report risk there, but hopefully it's done for us.
Speaker 2 (32:14):
That's where the validators and the privacy part come in. I think the piece,
Speaker 3 (32:20):
It looks like you're trying to predict the future.
Speaker 2 (32:22):
Yeah, no, I think it's worth saying we didn't come into this thing, especially in the case of Gretel, saying we want to be part of the generative AI space. We came in saying generative AI is going to be massive and we want to take a privacy angle to it. So privacy has always been at the core of what we want to build into the foundation of this whole industry. Let's move to questions, but before we do that, let's put out our challenge so folks can add comments, and then by the time we're done with the questions we can come back to it. What we thought would be a really fun thing we could do for everyone who is joining us and giving us some time is to ask: what would be a compelling set of debates between historic figures that the audience would like to see?
(33:10)
And then what John, Alex, and I are going to do is go off and train a bunch of large models on any historic data available on those individuals. So next time around we can have a quick debate between Einstein and Elon and Picasso, if we wanted to, and see how that conversation goes. Alex made a good point earlier, which is we need to make sure at least one of them is nice, so we've got to figure out how to seed a sentiment of nice into one of them, and I think only two of them are open to that sentiment, so we should figure that one out. But on that note, we'd love to get some comments as to which figures you would like to see debate each other. With that, we are going to jump into questions, and we'll come back and pick three people and go off and do our homework and see how that debate looks and what it degrades into as far as the chaos goes. Alright, let me start from the top and go through some of the questions. "I'm looking for trajectory data on my research." Oh, here we go.
(34:16)
Alex, do you want to start with that one?
Speaker 3 (34:17):
Yeah, this is a great question. Actually, I think something very similar to this was the very first blog we ever did at Gretel, and it came out of a New York Times article that was making an argument, we'll link to it after this, that it was impossible to actually anonymize precise location data. So the idea being that the kind of location data that your cellular service or your phone provider might have, things like that, if precise enough, wouldn't be possible to anonymize. We showed, and I'll link to some research that we've done as well as some other papers that have come out recently taking a very similar approach, that synthetic data for precise locations like GPS can absolutely be just as good as or even better than the real-world data it was based on, and maybe I'll talk about why it could be better. Often you're limited in the types of data you have, so you might just have location data for a particular geography or region. So, to use your research example with vehicle data, you might just have vehicle data for people driving around downtown San Francisco to start with, and your question is: how would my models, how would my algorithm work if we were in, let's say, Yuma, Arizona, a small city or something like that?
(35:43)
Part of the power of synthetic data is that you could train a model on this location information in San Francisco. You could generate additional patterns of location data in San Francisco, or you could apply it to a whole different part of the world and incorporate map data and different things that the model can automatically extract from maps like OpenStreetMap, and generate realistic locations for other parts of the world. So exciting stuff. If you have any questions, find us on Discord, don't hesitate to reach out.
Speaker 2 (36:15):
Alright, next one. Are we using differentially private SGD (DP-SGD) for privacy?
Speaker 3 (36:24):
So each one of our models has different,
Speaker 2 (36:26):
I thought you were going to leave it there. Yep. Yeah, we're
Speaker 3 (36:29):
Follow up to this question.
Speaker 2 (36:32):
Very concise answer. I think we could just leave it there, but it might be good to expand on the DP-SGD part, when we actually do differential privacy, and why that matters.
Speaker 3 (36:40):
Yeah, so differential privacy, maybe I'll start with a quick example here. Differential privacy is a technique where you insert noise into data to the point where individual attributes, so the effect of any one person in a dataset, could in theory be removed and the dataset would give the exact same answer. So it gives you really great privacy in the sense that if your data is part of US census data or something like that, no one would be able to look at a model or query the model in such a way that they could figure out you were in the dataset, if differential privacy is implemented correctly. There are a couple of different ways to go about differential privacy. The question here is: are we using a technique called DP-SGD, which is differentially private stochastic gradient descent? We are, actually. In our LSTM model we use TensorFlow Privacy's implementation of DP-SGD and worked with that team. Yeah, I was
Speaker 2 (37:42):
Going to say Alex, shout out to that team at Google. They're doing some phenomenal work and they were very, very helpful collaborating
Speaker 3 (37:47):
With us. It is a phenomenal implementation and it was a pleasure to work with both the library there and the team as we built this out.
Speaker 2 (37:54):
By the way, one other thing to point out is I think it gets a lot lost in the message about large companies approaches to privacy. This is one that's worthwhile calling out is once you peel back, there are tons of great people doing amazing work with privacy and Google is one area we've had that visibility, which is pretty phenomenal.
Speaker 3 (38:12):
Yeah, most definitely. Some of the challenges around differential privacy, and why you wouldn't use it every time: typically where you've seen it be really successful, Apple has done some examples, the US Census Bureau has done examples, Google has done examples, is where you're looking at the massive amounts of information they might have coming back from Android or Apple phones, things like that, and they're trying to train models but want to be as private as possible when they do that. Things like differential privacy work really well there because you have millions or hundreds of millions of examples to work with. But when you have fewer examples, you start seeing a really big impact using vanilla differential privacy out of the gate. So there are a bunch of different techniques. We've done a couple of blogs on it, we've worked with the TensorFlow Privacy team, and whether you're interleaving public data into your smaller dataset to increase privacy, or even using, and there are a couple of really neat research papers on this, basic amounts of noise injection or clipping with DP, you get really strong practical privacy guarantees that help you be confident that, for example, your model's not memorizing something it shouldn't.
(39:23)
So exciting technology. Shout out to the Google team there and yeah, we are using it.
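For readers who want to see what this looks like in code, here is a minimal sketch of DP-SGD training with TensorFlow Privacy, where per-example gradients are clipped and Gaussian noise is added before each update. The hyperparameter values are illustrative, and import paths can differ between library versions.

```python
# Sketch of DP-SGD with TensorFlow Privacy; values are illustrative only.
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2),
])

optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,        # clip each example's gradient to this L2 norm
    noise_multiplier=1.1,    # scale of Gaussian noise added to clipped gradients
    num_microbatches=32,     # process the batch as per-example micro-batches
    learning_rate=0.15,
)

# Loss must be computed per example (no reduction) so clipping can be applied per example.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, epochs=5)
# Note: batch_size should be a multiple of num_microbatches.
```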
Speaker 2 (39:31):
Before we move on from that question, one thing I actually hear quite often in the context of differential privacy is: what is the difference between differential privacy and federated learning? What do you get, what do you lose, comparatively? So can you guys maybe quickly touch on the difference? I know it's still early in the market and people see both options, and I think there are very clear indications for one or the other. And if folks are more interested in really understanding the difference, we actually have somebody really phenomenal from Google at our conference in February, Synthesize, giving a talk specifically on federated learning and the work they're doing. But Alex, John, I would just love a quick snippet of how you see the difference between federated learning and differential privacy.
Speaker 3 (40:13):
I would probably call these two out as my two favorite privacy-enhancing technologies. I think they're both... And you
Speaker 2 (40:19):
Also mention which one's your most hated one, just so we have a balance; we're fair and
Speaker 3 (40:22):
Balanced. I think you guys know where I stand on that one. So data clean rooms. No, I could do a whole hour kind of rant on why that's not a good idea.
Speaker 2 (40:34):
What? No safe house.
Speaker 3 (40:36):
No safe house there. So
Speaker 2 (40:40):
That is an hour. When somebody tells you, put all your data here and we'll handle it, I freak out.
Speaker 3 (40:44):
We're going to try to keep things super positive this time
Speaker 2 (40:49):
You have to show up for the midnight showing to get the raw version.
Speaker 3 (40:51):
Yeah, a little scotch involved and we will get right into it. Alright, so quick overview here. As we talked about earlier, with differential privacy, essentially what you're doing is inserting noise into a dataset. In the case of DP-SGD, in our model, essentially what we do is insert noise into the model as it's being trained and clip the gradients on the optimizer, and that prevents the model from being able to learn infrequent patterns that exist in the data that can become identifying. So really cool tech. Essentially, a safe way to think about it is you're just inserting so much noise that any individual record in there no longer becomes identifiable. Federated learning is a complementary technique, and one of the things I'm fascinated by is the application of differential privacy and federated learning together. But with federated learning, essentially what you're doing is you have a bunch of independent models that are kind of training together.
(41:48)
So you could think, imagine trying to train a model with data from a bunch of different mobile devices or phones. You don't want to send all that data back to a central server, because essentially you have a big privacy issue, and you have to think about how you're going to train the model. Federated learning gives you a way to essentially have all of these different endpoints collaborate on model training and update weights without having full visibility into the data. They only see their own data; they see neural network weights that come from the other devices, and they're able to contribute to training a single model based on what they've observed on their own data without ever actually sending the data to the server. So we see this quite a bit across different verticals, a lot of interest in it, and it is really neat when you have issues around data residency and where data's going. It gives you a great way to have the benefits of a model trained on really sensitive, distributed information without ever sending that information to a single location.
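A toy sketch of the federated averaging idea described above: each client computes an update on its own data and only model weights travel to the server, which averages them. This is a bare-bones illustration, not a production federated learning stack.

```python
# Toy federated averaging (FedAvg) sketch: clients train locally on private
# data and send only weight arrays to the server, which averages them.
import numpy as np

def local_update(weights: np.ndarray, local_data: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One local training step; here a gradient step for a simple linear model."""
    X, y = local_data[:, :-1], local_data[:, -1]
    grad = X.T @ (X @ weights - y) / len(y)      # mean-squared-error gradient
    return weights - lr * grad

def federated_round(global_weights: np.ndarray, client_datasets: list) -> np.ndarray:
    # Each client's raw data stays local; only updated weights are returned.
    client_weights = [local_update(global_weights.copy(), data) for data in client_datasets]
    return np.mean(client_weights, axis=0)        # server averages the weights

rng = np.random.default_rng(0)
clients = [rng.normal(size=(100, 6)) for _ in range(5)]  # 5 clients, 5 features + target
weights = np.zeros(5)
for _ in range(10):
    weights = federated_round(weights, clients)
```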
Speaker 2 (42:49):
Okay, maybe one last question and then we will wrap it up there. I know there are a lot of questions we haven't gotten to; for folks who want answers, Alex and John work from morning to evening and then evening to morning and are available on our Discord channel for one-on-one sessions, tutoring, feedback, temporary work placement. So depending on what you need, we'll just turn their Calendly into a publicly available one if you're interested. Joking aside, please do feel free to drop into our Discord channel and post your questions there, and we'll make sure that we get to them in time and end up answering all your questions. We do appreciate all of them; unfortunately we're not going to have time to go through all of them here. One last question, which I think is pretty relevant, especially in the context of what we have seen this year. So maybe to recap, and we would love to hear from you guys, one of the big learning lessons we've all seen this year is, I think, the question of what is the right mix of synthetic plus internally generated or produced data. I think that's a pretty fundamental one and it has multiple dimensions to it, right? There's
Speaker 1 (43:56):
Getting the data,
Speaker 2 (43:57):
Yeah, there's
Speaker 1 (43:59):
Getting the data, and what are you using it for? So we get this a lot, and we see almost like a 50/50 split between a lot of synthetic and a lot of internal data mixed with synthetic. And I think the trends we see are: if you are looking for pure data sharing capability, like hey, I really just want to get this data over to a data science team or some third party, either internal or external to your org, synthetic data all the way, for lack of a better term, is a good approach because you get the maximum benefit. And then on the flip side of that, we see an increasing number of use cases around improving MLOps and ML accuracy, where you're basically using synthetic data as a means to augment your existing internal data. Whether you're boosting underrepresented classes, we see this all the time for medical use cases where they want more samples of medical entries in a database based off of different races or genders and stuff like that, to be able to increase the
Speaker 2 (45:06):
Accuracy on all that. We've seen that pretty heavily with Illumina, obviously, on the genomic side of things. Actually, John, one thing you touched on, whether it comes to augmentation, sharing, or very specific vertical use cases, maybe one thing to anchor around is: can you talk a little bit about, if you want to generate a ton of synthetic data, what is the minimal input you need? Pick a random model; obviously there are different thresholds, but what is the minimum input you need to have to make sure you still have the same statistical distribution, so the data is at least reflective, even if you're not augmenting with traditionally raw data? What is the minimum for some of these table-stakes approaches?
Speaker 1 (45:44):
Yeah, I mean, obviously it depends on the model, but at a minimum, I would say, for lack of a better term, if you could export data that you would feel safe emailing to someone, just in terms of the size of the data, that's a good start for us to be able to look at it and generate data that has all the characteristics. Obviously, if you bring, like, oh, I exported five records, it's going to look and feel a lot like there are five records. But you don't need to get started with gigabytes and gigabytes of data, and we actually always recommend that you start with a smaller amount, just the way you would say, oh, can I get a sample of that? Export it to me, I want to do analytics on it. That same concept is kind of where we live in terms of how to get started, and then you can assess the output, we give you the quality report and the privacy report, and from there you can really turn up the operationalization and start running things more around the clock to provide that constant stream.
Speaker 2 (46:46):
So on that note, maybe we end here. Maybe we each take 10, 15 seconds and just talk about one major learning as a year in review and reflection. The one I personally would love to share is that this year, based on a lot of advancements in how model training and development is being done, feels like really the first time in history we've seen the output of synthetic data be as good as, and in some cases actually better than, what raw, source-of-truth data can yield downstream. And a lot of this obviously has to do with proper labeling, some transformations, better balancing of that data, and boosting the right representation to remove bias. I know we gave a talk about this at the NVIDIA conference demonstrating how five out of six downstream models actually yielded better results. That feels very significant, right? Because it feels like we've now transitioned from the typical association people had with synthetic data, this sort of fake data which is maybe good for lower-level environments or basic ML, to now being able to bet predictions, recommendations, and user experiences on it, which is really, really exciting. So that's the one I'd love to share. Over to you guys: what is one thing that you took away this year?
Speaker 1 (48:03):
Okay, I think the biggest thing that I took away this year was it's not just about model inputs and outputs. There's a lot of pre-processing and post-processing you need to do. We talked a little bit about the data validators, where we build models for validators and then those models get used on the model output. We introduced privacy filters this year, which actually work post model generation. And so it's really this sequence of different capabilities you need orchestrated together. It's not as simple as, I'm going to fire up this model, take the output, and be able to use it in a guaranteed fashion. And that also has implications for how we build. So that's important to me, knowing that we have to be able to support all these pre- and post-processing jobs that are sometimes, surprisingly, more intensive than the actual model operations themselves.
Speaker 2 (48:59):
Alex, final words, you're on mute.
Speaker 3 (49:05):
Sorry. For surprises or learnings this year, I think this idea about data crossing different modalities probably comes to mind first. At the beginning of the year we didn't know about our users: do we have users that are just interested in tabular data, or text data, or time series, or image, or simulation? And the answer, both with users and with the companies that we talk to, is all of the above. Frequently the datasets that people are working with now cross different modalities. Sometimes it's image and it's augmented with text. John had a cool example earlier where he was talking about how a user across a business might want to learn from financial records but then also be able to correlate that with customer service chat logs or things like that, and you have learnings from both text data and tabular. I think synthetic data has a unique ability to help automate that process of extracting knowledge across these different modalities. It is going to be really exciting to see what happens next year.
Speaker 2 (50:12):
Actually, one of my favorite charts internally, for the metrics that we track, is the chart around models run and usage. It was sort of this one big blob last year, and as we've introduced models, and we had this concern, is anybody going to use them?, it's kind of created this equal distribution. Maybe next time we can flash that and show how essentially 10x-ing the number of models we offer has resulted in equally distributed usage, which is sort of good and bad a little bit at the same time. It's very difficult to say what do people actually want; the answer is everything, but it's pretty interesting. Okay, we are at 50 minutes, so we can end there. We've had a lot of input, and thank you everyone for the folks you would like to see synthetically debate. These have just been some really interesting suggestions.
(50:56)
To keep this going, we encourage everyone to either tweet suggestions at us or put them on our Discord. We'll maybe take all this in, just because there's a lot of good stuff, and debate it offline, figure out which one's the easiest with the most data available, and then we'll take this as a homework project and come back next time. But thanks, everyone, for joining. This was a lot of fun. We'll do more of these. If you have any input, please do tweet or join our Discord and give us feedback, ask questions. Otherwise, everyone have a wonderful day.
Speaker 1 (51:26):
Cool, thanks everyone.
Speaker 2 (51:28):
Alright, John, is this your poetry reading portion now?