AI Business Interview: Synthetic Data and Generative AI
Video description
Gretel.ai's Chief Product Officer Alex Watson joined Junior Editor Ben Wodecki at the AI Summit London 2023
Transcription
Ben (00:00): Here we are again at the AI Summit London. I'm Ben and I'm really pleased to be joined by Alex Watson, the Chief Product Officer at Gretel. Alex, how's it going?
Alex (00:08): Excellent, Ben. Thanks for inviting me on.
Ben (00:09): And my pleasure. It's always great to see people talking about the generative stuff, because it's been the bane of our lives for the last 12 months. I'm keen to work out what you guys are doing, because you're working a lot with synthetic data and you've got a platform around that. Tell us about that and what's going on.
Alex (00:24): Sure, yeah, happy to. So I'm a co-founder and chief product officer at Gretel.ai. We are a synthetic data platform used by developers. One of the first questions that we often get is: what is synthetic data, how do I use it in my business, and what opportunities might I be able to find with it? I would say the definition of synthetic data has evolved quite a bit over the past 15, 20 years. The idea of synthetic data has been around since the invention of the computer, but very recently advances in deep learning, the same type of generative models that you see in GPT-3, ChatGPT and GPT-4, are what is powering our ability to understand and work with data. So for developers on our platform, we have a fully managed service, really easy to get started. You can also plug in with a Jupyter notebook.
(01:09): You can plug in with an SDK, you can drag your dataset in. But essentially the idea is that we can train a model, very similar, as I was saying earlier, to a GPT-3-style model, on your dataset. It learns the insights, the dimensions of your dataset, and you can use that to create artificial data. Why might you do that, and what types of things are businesses using this for? There are a couple of really big use cases we see. I would say one is enabling information sharing, data sharing, at a scale that wasn't possible before. To date, most of the data that businesses have, say 85-plus percent of a business's data, is in some sort of structured or unstructured format, whether it's natural language text, tabular data, things like that. And given all that incredible data, very few people inside a business can actually access it. So one of the powers of synthetic data is that you can create an artificial dataset that looks just like the original data it was based on, except you can't point any of those records back to a real person. And because we can solve privacy at that level, it enables information sharing inside a business, enables data exchanges, or even monetization of data in a way that hasn't been done before.
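To make that workflow concrete, here is a minimal notebook-style sketch of training a generative model on a table and then sampling synthetic rows from it. It follows the general shape of Gretel's Python client, but the client class, the method names and the "tabular-actgan" model name are assumptions for illustration rather than a verified API reference; check the gretel-client documentation for the current interface.

```python
# Illustrative sketch only: class/method names below follow the general shape
# of a synthetic-data SDK workflow and are assumptions, not a verified API.
import pandas as pd
from gretel_client import Gretel  # assumed high-level client import

# Real-world training data (e.g. customer records) loaded from a CSV file.
real_df = pd.read_csv("customers.csv")

# Connect to the managed service and train a generative model on the table.
gretel = Gretel(api_key="prompt")                 # assumed constructor
trained = gretel.submit_train(                    # assumed method name
    "tabular-actgan",                             # assumed model/blueprint name
    data_source=real_df,
)

# Sample an artificial dataset that mirrors the statistics of the original
# table but contains no records that point back to a real person.
generated = gretel.submit_generate(trained.model_id, num_records=5_000)  # assumed
synthetic_df = generated.synthetic_data
print(synthetic_df.head())
```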
Ben (02:16): So one thing, Alex, that is essential when developing is access to data. It's really difficult for those using open source systems, that kind of thing. So where does using synthetic data come in for improving AI accuracy, in terms of the training set and getting that access to data? What are you seeing here?
Alex (02:33): Yeah, that's where almost every conversation we have starts: people love the idea of synthetic data, but how do I compare it to real-world data? And I would say a couple of years ago the question was how close can we get the accuracy of synthetic data to the real-world data equivalent. Today, it's actually that we can meet or even exceed the quality of the real-world data, and the question is how is that possible, how does that work? What I can say is we're working with data at scale. We've actually announced a partnership with Illumina recently, the largest gene sequencing company in the world, where they're aiming to build a data exchange based on genomic data, the most complex data you could possibly work with. We've demonstrated the ability to create synthetic genomes using language models and GANs that have the same insights and properties as the original genomic data but can't point back to an individual. So the power of that is information sharing across hospital networks across the entire world. The accuracy typically varies: if you are just working with the real-world data and you're building a synthetic equivalent of it, it is often within two to five percent of the real data's accuracy for a downstream machine learning use case.
(03:41): And there's a really big "but" there. And this kind of fits into, I think, the second really big use case that we see at Gretel, which is the idea that you can improve on machine learning datasets. This is something I'm happy to talk through, a couple of use cases that we have, but fundamentally I think the biggest thing we hear from customers is: I have data, I have a machine learning dataset, and it is limited somehow. There is some bit of information it doesn't do well on, this particular class, things like that. Whether it's ad recommendation, where we're working with one of the largest social media companies right now to improve their ability to predict whether someone's going to click through on ads, or it's trying to find a biomarker in a life sciences dataset. The idea that you could build a better machine learning training set instead of just building a better machine learning model is one that's really resonated.
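One common way to sanity-check that kind of claim is a "train on synthetic, test on real" comparison: fit the same downstream model once on the real training set and once on the synthetic one, then score both on a held-out slice of real data. Below is a minimal sketch with scikit-learn; the file names, the `clicked` label column and the choice of classifier are illustrative assumptions, not details from the interview.

```python
# Compare downstream accuracy: train on real vs. train on synthetic data,
# always evaluating on held-out real data (a "TSTR"-style check).
# Assumes numeric feature columns and a binary label for simplicity.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

real = pd.read_csv("real.csv")            # illustrative file name
synthetic = pd.read_csv("synthetic.csv")  # illustrative file name
target = "clicked"                        # hypothetical label column

real_train, real_test = train_test_split(real, test_size=0.2, random_state=0)

def downstream_accuracy(train_df: pd.DataFrame) -> float:
    """Fit the same simple classifier and score it on held-out real data."""
    model = GradientBoostingClassifier(random_state=0)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    preds = model.predict(real_test.drop(columns=[target]))
    return accuracy_score(real_test[target], preds)

print("trained on real:     ", downstream_accuracy(real_train))
print("trained on synthetic:", downstream_accuracy(synthetic))
```

If the second number lands within a few percent of the first, the synthetic dataset is doing its job for that downstream task.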
Ben (04:31): So Alex, let's bring the G word into this. Let's bring in generative, because again, it's been around for 12 months and it's something that neatly fits into what you guys are doing, especially with your platform. So one thing I wanted to get across is, for those looking at tasking their developers with working on a GPT or diffusion model, where can it come in, in terms of training but also fine-tuning, which is something a lot of companies are doing right now?
Alex (04:53): That's such a good question and one we hear all the time. I would say with generative AI, it's that question of how do I start using it in my enterprise, and what concerns do I have? Maybe I'll start with the concerns and then we can get into how you use it in the enterprise. I think the concern is, pick a company, not a fictitious company, but I'm going to use an example here like Johnson & Johnson, right? They don't want ChatGPT, they want ChatGPT trained on Johnson & Johnson data. And what happens there is often you're training on very sensitive data. To go back to the medical use case, you want this model to learn about a disease, you don't want it to learn about a patient. Privacy is a really important part of enabling you to train or fine-tune these models on your own data.
(05:39): The second question is how do I get started, or where do I get started? The recommendation I've had there is really just start simple. We talk to companies all the time, we talk to developers all the time. You could start with that super big, ambitious project, like I'm going to improve ads recommendation on my major social media platform. But where we see the most success for most companies is starting with something incredibly simple, like, what's a problem we have today? I can give you an example of something we do at Gretel. It was one of the first operationalizations we had of generative AI. We have tons of sales calls as part of our platform. These sales calls generate transcripts, and our wider team wants access to them. We had a simple project that took one of our applied scientists a couple of days, essentially just using generative AI, in this case, I believe we're using Google's PaLM model, to summarize those transcripts into a really digestible format that drops into our Slack messaging. So I'd say start simple, get your feet wet, get comfortable building with it and then expand from there.
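For a sense of how small that first project can be, here is a rough sketch of the pattern: summarize a transcript with whatever hosted model you have access to, then post the result to a Slack channel. The `summarize_with_llm` helper is a hypothetical stand-in, not a real PaLM (or any provider's) API call; the Slack incoming-webhook payload, a JSON body with a `text` field, is the standard shape.

```python
# Sketch of the "summarize sales-call transcripts into Slack" idea.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def summarize_with_llm(prompt: str) -> str:
    """Hypothetical helper: send the prompt to your LLM of choice and
    return the model's text response. Wire up your provider here."""
    raise NotImplementedError

def post_call_summary(transcript: str) -> None:
    summary = summarize_with_llm(
        "Summarize this sales call in five bullet points for the team:\n" + transcript
    )
    # Slack incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(SLACK_WEBHOOK_URL, json={"text": summary}, timeout=10)

# post_call_summary(open("call_transcript.txt").read())  # run once the helper is wired up
```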
Ben (06:40): And that's the great thing about this: it's drag and drop, and you can move it and integrate it in ways that you can then take further in terms of fine-tuning. That's what's so great about AI, am I right? One thing I've spoken to a lot of companies about as well is not just fine-tuning, but also data privacy. A lot of companies are going, well, I want this AI, but at the same time I've got to be safe, I want to keep on top of what's going on. So where can synthetic data come in, in terms of mitigating those bottlenecks around data privacy?
Alex (07:05): Privacy has probably been the biggest bottleneck and the biggest challenge for enabling both internal and external access to the data that exists. It's because it's so hard to get right, and what people have fallen back on are these extremely manual, labor-intensive processes for identifying PII and de-identifying data, and it still doesn't fully de-identify the data. So I think this is an area where synthetic data has promise. When you combine other technologies, we call them privacy-enhancing technologies, with synthetic data, you have the ability to create a dataset that looks and feels just like the normal dataset you started with, but has formal mathematical privacy guarantees. You can load this into your favorite tool, so you have a lot of advantages over a secure enclave approach or homomorphic encryption or things that are really heavyweight. To talk about the really promising technologies we see, one that I'd start with is differential privacy.
(07:58): This is really the only technique that passes that bar of mathematical rigor for enabling protection against current attacks and future attacks as well. A simple way to describe differential privacy is that it inserts enough noise into your data that no one individual's data can be identified from the dataset. And if you can do that, you can use data in ways that you haven't been able to before, right? So you're not subject to the right to deletion as part of GDPR, which is a big issue. Essentially you're guaranteeing that your model, if it sees a secret in the data, won't memorize it the first time. That's incredibly powerful if you're trying to train an LLM. And this is something that's really popular with Gretel actually, people training LLMs on their sensitive enterprise data. Imagine call center records where someone might be talking about something very personal; it enables the model to learn patterns in the data without memorizing a specific conversation.
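The "inserts noise" intuition is easiest to see on a single aggregate query. Below is a minimal sketch of the classic Laplace mechanism; it is a textbook illustration of differential privacy, not Gretel's implementation (privately training a generative model typically relies on techniques such as DP-SGD instead).

```python
# Minimal Laplace-mechanism sketch: add calibrated noise to a count query so
# that any one individual's presence or absence is statistically masked.
import numpy as np

def dp_count(values, threshold, epsilon=1.0):
    """Differentially private count of values above a threshold.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this query.
    """
    true_count = int(np.sum(np.asarray(values) > threshold))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 51, 29, 62, 45, 38, 71]               # toy data
print(dp_count(ages, threshold=40, epsilon=0.5))  # noisy answer; rerun and it differs
```

Smaller epsilon means more noise and a stronger privacy guarantee; the same trade-off shows up when the noise is injected into model training rather than into a single query.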
(08:54): So it's a really powerful technology. Another one that I'm really excited about, similar to the advances that enabled ChatGPT to work the way that it does, is a technique called reinforcement learning from human feedback. The insight there was that instead of the initial versions of GPT, which would just regurgitate whatever text they thought someone was asking for, you could have humans review the output of a model and essentially click on what they think is the best answer or the most aligned answer. That means the model is outputting something that is usable and is aligned with what the human is looking for. Similarly, what we've found, and we've done some research and publishing on this, is that we can train models to be more privacy-conscious, to not talk about things that would affect protected groups, for example. One that I really like as well is, in the creation of synthetic datasets, ensuring that you match the diversity that you're looking for, essentially training the model to want to return more diverse results.
Ben (09:53): And not just the diversity, but actually not compromising enterprise data is important given the regulatory issues going through right now. That's something you guys are massively considering, I guess.
Alex (10:05): Yes, we're right in the middle of it right now. And that's where I think some of those technologies like differential privacy really help you get through the regulatory barriers. One of the really big use cases we're seeing is: we saw during COVID the advancements that can happen when you enable data sharing at scale, so how do we continue that post-COVID, while really thinking about patient privacy and things like that? Differential privacy creates a way that hospitals, for example, can create new monetization streams they didn't have before. They can market data on a global scale, and life sciences researchers, instead of having access to one nation-level dataset like the UK Biobank, for example, could query every hospital in the world and get really fast feedback on their query.
Ben (10:49): So looking forward to the next 12 months, Alex, what are you guys working on, and what are you working on in terms of synthetic data as well?
Alex (10:57): Yeah, the big things I think that we've seen are around scale, that's the obvious one you'd see. As I mentioned earlier, getting into the area where we're handling hundreds of millions of new records going into models per day is a challenge. Multimodality is one that we find really exciting. We've got a couple of examples right now where customers are generating mixed images and text. I can walk you through an actual example we have right now: when you go through the airport and you go through an airport scanner, it goes through an X-ray system, and that X-ray system is searching for different types of contraband. We're working with a major provider of X-ray technology where they're using Gretel to synthetically create new X-ray images of contraband. Imagine a new firearm gets released; they can quickly build a virtual model of it without ever having to have access to the thing.
(11:48): They can extend their object recognition to automatically detect it. So that's a really popular use case that we see. And finally, I'd say the cold start problem. Most synthetic data starts with a dataset, as we've mentioned so far. But one pattern we see with customers constantly is: I just don't have any data, or I only have a couple of records and I want to make the most out of them possible. How can I train a machine learning model on the 80 examples that I have? It's prohibitively expensive for me to hire human labelers to do it, so can I just use a model to do it? Those are things that we're investing in quite a bit. For things coming down the pike, or things to look at: imagine a foundation model that is trained on tabular data. An interesting fact, I think we mentioned this a little bit earlier, is that if you look across businesses, 80, 85% of their data is in some sort of unstructured or structured format.
(12:40): So that's where there's so much potential to impact business operations. Essentially, imagine a foundation model that is trained not just on text or images but on every dataset that we can find on the internet, and what can you do with that? Some of the power is: hey, I want a medical-type dataset matching this FHIR format just to start a project, and the data has to make sense, it has to be correlated. Or, I have a couple of records, I want to build a classifier out of them, and I need to create more records. And this model, having seen tens or hundreds of millions of datasets, has the ability to learn from a couple of records and create a dataset that matches what the user is looking for.
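As a rough sketch of that cold-start pattern, the prompt below conditions a generative model on a couple of seed records and a schema and asks for more rows in the same shape. The `complete_with_model` helper and the seed fields are hypothetical placeholders for illustration, not part of any particular product's API.

```python
# Sketch of the cold-start idea: prompt a generative model with a handful of
# seed records and ask for more synthetic rows matching the same schema.
import json

seed_records = [
    {"age": 54, "diagnosis_code": "E11.9", "visit_type": "outpatient"},
    {"age": 37, "diagnosis_code": "J45.40", "visit_type": "telehealth"},
]  # the "couple of records" a customer might actually have

def complete_with_model(prompt: str) -> str:
    """Hypothetical helper: call your LLM or tabular foundation model of choice
    and return its text completion. Plug in your provider here."""
    raise NotImplementedError

prompt = (
    "You generate realistic but entirely synthetic patient visit records.\n"
    "Follow the schema and statistical feel of these examples, and return "
    "20 new rows as a JSON list:\n" + json.dumps(seed_records, indent=2)
)

# new_rows = json.loads(complete_with_model(prompt))  # run once the helper is wired up
# print(f"generated {len(new_rows)} synthetic records")
```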
Ben (13:22): It sounds like it's going to be another busy 12 months for you guys, by the sounds of things.
Alex (13:25): A lot of compute in our future.
Ben (13:26): Oh hey, compute is why we're here. We're going to have a good time. And the most important thing is we've got to keep going, and I look forward to maybe catching up next year and seeing what the next big trend is. Love it. Alex Watson from Gretel, thank you so much for your time.
Alex (13:38): Thanks, appreciate it.