Synthesize 2023: In Search of Data
Video description
Sridhar Ramaswamy discusses the significance of synthetic data in addressing real data challenges
Transcription
Hey folks, my name is Sridhar Ramaswamy. I'm really excited to be here talking to you at Gretel's Synthesize. It is going to be a great day and session.
(02:14)
I've worked on data systems all my life: first at Brown University, then at Bell Labs, and then at Google, where I ran large portions of the data systems, especially on the ad side, for over a decade. It's a passion topic of mine.
(02:29)
You've often heard this: data is the new oil. So I went on a little quest over the weekend, with lots of help from people, to generate synthetic images to illustrate that. But what this phrase does not really capture is the fact that oil is really hard to find and harness, unless, of course, you're Saudi Arabia, or, in the case of data, you're Google. People just assume that there's tons of money to be made with data, but actually finding it, exploiting it, and being economically efficient with it is a very, very large and challenging problem, and people go bankrupt along the way.
(03:13)
And it's my take that most very successful data companies, like Google Ads or even YouTube video recommendations, kind of got lucky. What do I mean by that? Google's search ads had a problem that was tailor-made for machine learning, for advanced AI to be applied to it. Because we showed a lot of ads, and those ads generated a very direct action, a click, click-through prediction became an early and important problem, one with a very clean feedback loop and literally billions upon billions of examples coming in every single day.
(03:58)
On the other hand, optimizing for conversions, which are real-world actions reported by advertisers, took five-plus years to build. The data was noisy, hard to figure out, and uncalibrated, so it became very tricky. YouTube video recommendation is similarly a somewhat straightforward problem. In my mind, the one company that had the genius to think about this beforehand in how they designed their product is TikTok.
(04:26)
They intentionally make videos pretty short, so you and I as users end up going through a lot of them, expressing a lot of choice about what we like and what we don't like. Most other companies have had to sweat over how to get great data and how to apply it to interesting problems to really be successful with machine learning. And that's the dirty secret of machine learning.
(04:53)
There are more synthetic images, like the impressionist paintings on the right-hand side; these were also a lot of fun to generate. But the dirty secret is that working with real data is super tedious, very difficult, and very hard to get right. And why is that? One of the main reasons is that real data is about actual people, and so the data becomes very personal. You don't want to be revealing your search history to everybody.
(05:29)
I started a whole company, Neeva, around this movement. Similarly, if you are a business, real data is your real revenue information. You have to be careful, for example, if you're a public company, about who has access to that data. So there is an element of brinkmanship, of high stakes, any time you have to deal with real data. And we all know this: we live in a world with incredibly heightened awareness around things like privacy. Mistakes here can finish companies; they can finish careers instantly. So there's just an element of nervousness whenever anyone is dealing with real data. Whether it's on the PII side or on the revenue side, this stuff is just hard to deal with.
(06:25)
So data is hard to share, and traditional methods for producing labeled data are painful. You're like, ah, I bet I can get some people, say in India, in Pakistan, in Vietnam, to label this data. It turns out that this is also incredibly hard and incredibly painful, and some data is not even obtainable. For example, if you want loan history information, that's not something you can produce synthetically from nothing. Some kinds of data you can produce synthetically, but it's just really hard to get right, especially if you're a small company without the expertise of a Google or a Facebook.
(07:05)
I run a search engine called Neeva. We have worked with human evaluation data and with machine learning systems pretty much all our lives, and one of the first things we tried at Neeva was to get our search quality ratings right. Now, what the right results are for a particular query is somewhat subjective, so you end up writing very long guidelines that your raters then have to truly internalize.
(07:32)
Remember, they're busy people; they're doing this for a living, but it is not the only thing they're doing. We have worked with pretty much every major AI data company, and we have pretty much given up on all of them. Even things like two-out-of-three rater agreement, where you get three people to do a task and hope that two of them agree, were just incredibly painful. So even if you're willing to spend the money to get the data labeled, it can become very problematic. And as I said, in many cases that involve customer behavior, there's no such thing as synthetic data.
(08:11)
So how you get your hands on important data, and how you share it thoughtfully, is a really big and really critical problem. This is where synthetic data, and this conference, Synthesize, comes in. Synthetic data is a big deal because it can be safe to share; as I said, this is a big issue for a variety of reasons.
(08:34)
It can be a big deal because it can be accurate: you can make sure that the data you generate actually follows the distributional characteristics you care about. It can be flexible: you can say, I want the data to have these kinds of distributions but not others. And it has the potential to be plentiful, meaning you can generate more data and amplify sections of your data where perhaps you don't have as much.
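That accuracy check can be sketched in a few lines of Python. This is a toy illustration, not a real synthetic-data pipeline: the "real" dataset is a hypothetical Gaussian-shaped set of transaction amounts, and the generator is a naive Gaussian fit.

```python
import random
import statistics

random.seed(0)

# "Real" data we want to mimic (hypothetical transaction amounts).
real = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# A naive synthetic generator: fit a Gaussian to the real data and
# sample fresh records from it.
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

# Validation step: check that the synthetic data preserves the
# distributional characteristics we care about (here, mean and spread).
mean_gap = abs(statistics.mean(synthetic) - statistics.mean(real))
stdev_gap = abs(statistics.stdev(synthetic) - statistics.stdev(real))
```

A production system would compare many more properties, such as correlations between columns and tail behavior, but the shape of the check is the same: generate, then measure against the source distribution.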
(09:07)
One of the things I'm sure you have often heard is how machine learning systems, AI systems, are biased against minorities. That is partly because there's not a lot of data about them. But you can elect to amplify data from particular segments so that your machine learning systems do a better job of learning on that data and can predict accurately on it.
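The amplification idea can be sketched like this. It is a toy Python illustration: the segment names and counts are hypothetical, and a real synthetic-data system would generate new, statistically similar records rather than duplicate existing ones.

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical labeled rows: (segment, feature) pairs with a heavily
# underrepresented minority segment.
rows = [("majority", random.random()) for _ in range(950)]
rows += [("minority", random.random()) for _ in range(50)]

def amplify(data, segment, factor):
    """Oversample one segment by repeating its examples; a synthetic-data
    system would synthesize fresh records for the segment instead."""
    extra = [r for r in data if r[0] == segment] * (factor - 1)
    return data + extra

balanced = amplify(rows, "minority", factor=10)

# The minority segment goes from 50 rows to 500, so a model trained on
# `balanced` sees a far less skewed class distribution.
counts = Counter(segment for segment, _ in balanced)
```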
(09:35)
So synthetic data can solve a variety of problems if it is done right. And why is there so much excitement in the past 12 months around machine learning, AI, and synthetic data? Well, it's the foundation models. I am sure all of you have heard by now about things like ChatGPT, GPT-3, or OpenAI, but the underlying revolution is a few years in the making. What has really happened over the last five years is that there's been an explosion in the size and capabilities of, especially, language models.
(10:15)
These models have gone through 20 years of Moore's law in five. Think of where we were in 2003, say, with respect to the quality of our cell phones; I used to carry a brick of a cell phone for many years. These models have doubled 13 times, from 77 million parameters being state of the art to close to a trillion parameters being state of the art. And they have exhibited many, many wondrous properties across text, across images, and soon across videos as well.
(10:51)
And part of what companies like Neeva and companies like Gretel are doing is using the power of these models to generate amazing new experiences. A practical example, again from an area that's near and dear to me, is one we call Neeva AI Answers. Early this year, Neeva, which is a search engine, went from generating what we call [inaudible 00:11:21] links as the response to a query to generating fluid answers in response to a query.
(11:29)
Now, the traditional way of evaluating a change like this would have us send a ton of these results to external human raters, get the rating data back, and use it to train models that would then tell us what to do. Just the feedback loop for one study could take multiple weeks.
(11:54)
On the other hand, we adopted a radically new approach based on foundation models. We took our serving stack, used foundation models to generate summaries, and then turned around and distilled all of that into much smaller models that could generate amazing summaries at a page level. Then we put together a system that combines all of these into a fluid answer.
(12:26)
So a state-of-the-art system doing this thing called retrieval-augmented generation was trained in a matter of weeks, with very little involvement from external folks, because we were able to use synthetic data generated by foundation models as the input for training other models. As I said, the normal data loop in situations like this takes weeks; we can turn it around in two days.
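The distillation loop can be sketched as a toy: a "teacher" function stands in for the expensive foundation model, and the "student" is a deliberately tiny model trained only on the teacher's synthetic labels. All names and the 0.6 decision boundary here are illustrative.

```python
import random

random.seed(2)

def teacher(x):
    # Stand-in for an expensive foundation-model call that labels input x.
    return 1 if x > 0.6 else 0

# Step 1: use the teacher to produce synthetic labeled training data.
inputs = [random.random() for _ in range(2_000)]
synthetic_data = [(x, teacher(x)) for x in inputs]

# Step 2: "train" a tiny student on that data alone. Here the student
# is just a threshold chosen to best reproduce the teacher's labels.
def fit_student(data):
    best_t, best_acc = 0.0, 0.0
    for t in (i / 100 for i in range(101)):
        acc = sum((x > t) == bool(y) for x, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

student_threshold = fit_student(synthetic_data)
```

The point of the sketch is the shape of the pipeline: the student never sees human labels, only teacher outputs, which is why the feedback loop can shrink from weeks to days.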
(12:55)
We can generate the training data, distill it into a smaller model, and use techniques like reinforcement learning from human feedback for further improvements. The speed and scale at which we are able to operate is truly astounding. I know what you're thinking: can everyone do this? No. Neeva is a little special in that it has a crack team of modeling experts, people who know search quality deeply, people who had to learn prompt engineering in a pretty big way, and infrastructure gurus.
(13:27)
Most companies don't have the luxury of having these people. Just on the infrastructure side, we went from taking something like six seconds to produce a summary to being able to do it in 400 milliseconds. This happened over the space of a month, but the progress the team made was truly remarkable. This is where Gretel Synthetics comes in.
(13:53)
Now, generating synthetic data, and integrating it deeply into the life cycle of how data operates within your company, is a complicated task. So the Gretel team helps you with everything from picking the right models: as I said earlier, there's an explosion of new models across different modalities, whether it's images, tabular data, or textual data, and a lot of advances have happened.
(14:22)
You also need to pick the right privacy filters depending on the data that you want to generate. And of course, you want to go through a process of validating, really making sure that the data follows the distributional characteristics you want for your particular use case. Optimization of these models, whether for cost or for latency, is another important thing the team can help you with.
(14:46)
And finally, deployment. These models don't run in a vacuum. Some people like cloud services, while others want everything to run within their VPC, within their Kubernetes clusters, and the team can help with that as well. So this is a team that looks at the end-to-end life cycle of how synthetic data needs to work within your environment.
(15:08)
And the world of AI is evolving very rapidly, and the Gretel team has shown a pace of innovation and delivery that's been truly amazing and fun to watch. I didn't really think they would roll out image diffusion models, but they did, and they are also able to use them in very practical use cases that they're going to tell you more about.
(15:35)
Overall, I think what ML/AI is going to do broadly, but also in the space of synthetic data, in 2023 is going to be pretty amazing. There are going to be multiple foundation model companies: Anthropic is ready to go out with a launch soon, and obviously OpenAI is the big juggernaut here. Google is going to announce a foundation model for external use sooner or later, and there are also going to be multiple other companies sitting one layer above them, producing state-of-the-art experiences for things like chatbots.
(16:11)
I fully expect [inaudible 00:16:13] AI or even Inflection to release amazing chatbots. Neeva is going to be revolutionizing search. And when it comes to synthetic data, Gretel is the clear leader; they're going to deliver incredible amounts of value to their customers, making data truly accessible and democratizing who has access to it, both within companies and more broadly. I cannot think of a more exciting time to be working in machine learning.
(16:48)
Well, that's my talk. I look forward to questions from all of you and look forward to our conversation. Thank you folks.