Cognitive Revolution: The Promise of Synthetic Data for Training AI Models

In this episode, Nathan interviews Alex Watson, founder and CPO of Gretel AI, about the company's work in synthetic data. They discuss why we need synthetic data, Gretel's new pre-trained tabular LLM that creates synthetic data on a zero-shot basis, privacy techniques to prevent LLM memorization, and more. If you need an ecommerce platform, check out our sponsor Shopify: https://shopify.com/cognitive for a $1/month trial period.

X/SOCIAL:

@labenz (Nathan)

@AlexWatson405 (Alex)

@Gretel_AI

@CogRev_Podcast

TIMESTAMPS:

(00:00:00) - Intro

(00:03:02) - Origins of the company name Gretel AI and initial vision around enabling data sharing while protecting privacy

(00:05:16) - Alex's background in data privacy and compliance from his previous startup Harvest AI, acquired by AWS

(00:06:37) - Early experimentation with language models in 2020

(00:07:24) - Using synthetic data to create additional examples and improve detection of rare disease

(00:12:50) - Why use synthetic data?

(00:17:02) - Sponsors: Shopify | Omneky

(00:19:00) - Training models to recreate real-world data distributions and using validators to detect unrealistic outputs

(00:21:30) - Generating tabular data row-by-row

(00:24:40) - Pre-training the Gretel tabular LLM on diverse internet data sets

(00:30:40) - Sponsors: Oracle | Netsuite

(00:34:00) - Using an agent planning architecture

(00:37:40) - Having the agent determine when to use code vs the LLM for different parts of the data

(00:39:41) - Example use case of adapting models with synthetic data samples for long-tail cases

(00:43:00) - Using reinforcement learning to intentionally generate more diverse and representative synthetic data

(00:48:20) - The importance of alignment checks and controls while still providing model openness and flexibility

(00:51:16) - The potential of efficient, lightweight models compared to massive LLMs like GPT-4

(00:56:00) - Analogizing model specialization to specialized parts of the brain

(01:06:04) - Using differential privacy techniques to prevent memorization and exposure of private data

(01:14:37) - Adding noise during training to blur memorization

(01:18:42) - Optimism that synthetic data quality issues reflect details not fully understood yet vs inherent problems

Alex (00:00):

Your data's messy. It has gaps in it. I can't create new additional examples. It's too expensive, or there's no way to go back to it. So we really focused our efforts on, first and foremost, helping you build better data. That's been the guiding light. That's what we're really aiming for. No LLM today can generate a hundred-thousand- or million-row dataset. So the first purpose of the agent was interpreting that user query that's coming in and then figuring out how to divide it up into a set of smaller problems that the LLM can work on, one problem at a time. The promise of a really lightweight, really fast model shows the power you can have of taking a domain-specific dataset or task you have and doing something meaningful without having to do something at the GPT-4 scale.

Nathan (00:47):

Hello and welcome to the Cognitive Revolution, where we interview visionary researchers, entrepreneurs, and builders working on the frontier of artificial intelligence. Each week we'll explore their revolutionary ideas, and together we'll build a picture of how AI technology will transform work, life, and society in the coming years. I'm Nathan Labenz, joined by my co-host, Erik Torenberg.

Nathan (01:10):

Hello and welcome back to the Cognitive Revolution. Today my guest is Alex Watson, founder and chief product officer at Gretel AI, the synthetic data platform for developers. Synthetic data is a fascinating topic. Since the early days of deep learning, it's been well known that training computer vision models on a mix of original and programmatically altered and degraded images ultimately improves model performance. It seems that learning the concepts through the noise boosts robustness to the random, unseen oddities that models inevitably encounter in the wild. And more recently, dozens, maybe even hundreds of papers have explored how LLM-generated data can be used to improve training sets and ultimately model performance on a wide range of problems. Yet at the same time, some research results, and many observers of the evolution of the internet in general, have cast doubt on just how much synthetic data the system can absorb before models begin to lose touch with their real-world origins or otherwise degrade.

(02:11)
With these questions in mind, I reached out to Alex, who's been building a business on synthetic tabular data generation since 2020, and who proved to be an amazing guide to this domain. While synthetic data might sound like a niche topic, I think this conversation will be of general interest. We started with a discussion of why we need synthetic data, how Gretel has trained specialist models to maintain realism while also preserving privacy in creating it, and how we can be confident that we can trust this data for analysis, testing, and, yes, AI model training. Along the way, we also explored the trade-offs between statistical realism and social manners, the impact of LLMs on Gretel's business, and the new pre-trained tabular LLM that they've recently introduced to help create synthetic data on a zero-shot basis for a wide range of data types and scenarios. We even took a detour into AI regulation in the wake of the recent Biden White House executive order and the UK AI Safety Summit. This episode is a great example of why I love making this show. I learned a ton in the preparation and had a lot of fun with the conversation, and I think you will too. If so, I always appreciate it when listeners share the show with their friends, and of course, we invite your feedback via email at tcr@turpentine.co or via your favorite social network. For now, I hope you enjoy this conversation with Alex Watson of synthetic data company Gretel AI. Alex Watson, welcome to the Cognitive Revolution.

Alex (03:38):

Appreciate it. Thanks, Nathan. Excited to be here.

Nathan (03:40):

Yeah, this is going to be great. So you are the founder and now chief product officer at this company, Gretel AI. I'd love to hear how you came up with that name, by the way. But what you guys do is synthetic data, and I'm just so interested to learn so much more about it. It's been really kind of eye-opening to explore the product a little bit. You do some of the best live product demos that I've seen. Your recent 12-minute YouTube short I thought was really good.

Alex (04:08):

I appreciate that. Thanks.

Nathan (04:09):

Yeah, I think this is going to be a ton of fun. So tell me, where'd Gretel come from? Give me just a quick backstory, and then let's talk about why we need synthetic data.

Alex (04:16):

The original vision for Gretel was around a better way to make data that we can't make accessible, accessible, and it's evolved quite a bit. Synthetic data has so many more capabilities and promises that we've discovered over the past roughly three years of running our business, but it was a reference to the digital breadcrumbs that we leave behind, and really an effort from our company, using synthetic data, to enable data sharing at a scale that hasn't been possible before. Imagine hospitals sharing medical records, research institutions, financial companies sharing data in a way that doesn't compromise consumer privacy. That's really where we started. So you'll see that as we go through the technology we build and talk about differentially private training and things like that, you'll see some of that come through in our product. We've expanded that vision and that scope quite a bit, but we really started around privacy, around the idea of protecting individual privacy while enabling learning and data sharing at scale.

Nathan (05:13):

Yeah, that's awesome. So the big two value drivers today: obviously the founding premise is privacy, and now there's this kind of massive takeoff in AI, with so many people training stuff and trying to figure it out. And so the other big use case that we're seeing is improving the data that people are feeding into their training processes. So tell us a little bit about that one as well.

Alex (05:39):

Yeah, maybe I'll start with the history of how it happened, and it actually happened incredibly early in our company. So for a brief history: prior to Gretel, I was a co-founder of a security startup called Harvest AI. We built products that helped customers scan and detect important data in the cloud. We ended up getting acquired by AWS in 2016; we were going out for a Series A raise and got approached about launching that service as an AWS service. So I was a GM there for about four years for Amazon Macie, which people used to scan the AWS cloud for important data, and I saw that even the biggest, most cloud-native, incredible data companies struggled with enabling internal access to data. The Pinterests of the world, the Airbnbs of the world, and things like that. So you saw what a problem this was at scale, and also the power when you can make data accessible, like at AWS.

(06:31)
We had, at the time, a 500-person compliance team that could work wonders for making data accessible. So we started out with that privacy thing, and our first open-source example we released in 2020, actually I think about a week before the pandemic hit. It was an open-source ability to, essentially... we used a language model. This is 2020, so we weren't using transformers at the time. We were using an LSTM, and we had started to partner with the Google TensorFlow team around a technology called DP-SGD, which enables you to train models with differential privacy so you can make sure they won't memorize secrets. But one of the early features that we had was the ability, just like we all do today, to prompt the machine learning model and ask it to create something new. So our first real experiment was saying, can a language model like an LSTM, instead of learning a language in text, learn to recreate the distributions inside of a dataset?
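
For listeners who want to see what DP-SGD training looks like in practice, here is a minimal sketch using the open-source TensorFlow Privacy optimizer Alex mentions. The model, vocabulary size, clipping norm, and noise multiplier are illustrative assumptions, not Gretel's actual configuration.

```python
# Minimal DP-SGD sketch with TensorFlow Privacy (illustrative; not Gretel's training code).
# Hyperparameters (vocab size, clip norm, noise multiplier) are placeholder assumptions.
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

VOCAB_SIZE, SEQ_LEN, BATCH = 5000, 64, 256  # assumed values for the sketch

# A small LSTM language model that predicts the next token of a table row rendered as text.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(VOCAB_SIZE),
])

# DP-SGD: per-example gradients are clipped to l2_norm_clip, then Gaussian noise scaled by
# noise_multiplier is added before each update, bounding what any single record can
# contribute (and therefore what the model can memorize).
optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,
    noise_multiplier=1.1,
    num_microbatches=BATCH,
    learning_rate=0.05,
)

# Loss must be computed per example (no reduction) so gradients can be clipped per record.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss)
# model.fit(train_windows, train_next_tokens, batch_size=BATCH, epochs=3)
```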

(07:24)
So we really started focusing on tabular data around 2020, and that can be mixed numeric, categorical, text data, anything in between. Then we had the ability to prompt the model, where you could give it a subset of those features, like given a zip code, an ethnicity, and a date, generate the rest of this record for me. Very early in our journey, I think the first time we had this was working with some researchers at UCI, the University of California, Irvine, and they were working with a rare disease dataset that was highly imbalanced. So you have thousands of patients, but the only people inside that dataset that had that really rare disease were in the tens to twenties. So the question was, can we address some of the representation bias here, first of all? Essentially boost that minority class, and if we do that, can we improve the detection for this disease?

(08:18)
So essentially the idea is using synthetic data to create additional labeled examples when they weren't able to go back and recreate their experiment or their collection, and asking whether that dataset can be used to improve downstream machine learning training. The idea is it introduces new examples to learn from in the training data, and that'll help the machine learning model. And we had a lot of success there, and since that point, I think we've seen more and more focus. Fast-forwarding to today, and happy to talk about where Gretel is today and what we're seeing, I'd say it's about 50/50. In one corner we have as a value driver safe sharing of synthetic data, where we can create data that has up to mathematically provable privacy guarantees. And the other area is where we're saying, hey, how do we improve machine learning datasets? This can be tabular data for fraud detection, for ad recommendation systems; it can even be text data, and there's such cool research coming out recently to support that, where we're essentially using an LLM to create additional diverse examples, like what was mentioned in the Microsoft phi-1.5 paper.
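
As a concrete illustration of the boosting idea Alex describes, here is a hedged sketch of upsampling a rare class by conditionally prompting a trained tabular generator. The `generate_remaining_fields` call is a hypothetical stand-in for whatever conditional-generation API is available, not a specific Gretel function.

```python
# Sketch: boost a rare positive class with conditionally generated synthetic records.
# `generator.generate_remaining_fields` is a hypothetical stand-in for a conditional tabular generator.
import pandas as pd

def boost_minority_class(df: pd.DataFrame, label_col: str, target_ratio: float, generator) -> pd.DataFrame:
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    needed = int(target_ratio * len(majority)) - len(minority)
    if needed <= 0:
        return df

    synthetic_rows = []
    for _ in range(needed):
        # Seed the generator with the fields we want to hold fixed (here, just the rare label;
        # it could also include partial context like a zip code or demographics), and let the
        # model fill in the rest of the record from its learned distributions.
        seed = {label_col: 1}
        synthetic_rows.append(generator.generate_remaining_fields(seed))

    boosted = pd.concat([df, pd.DataFrame(synthetic_rows)], ignore_index=True)
    return boosted.sample(frac=1.0, random_state=0)  # shuffle before downstream training

# Usage (assumed names): boosted = boost_minority_class(patients, "has_rare_disease", 0.2, tabular_model)
```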

Nathan (09:24):

Yeah, there are a lot of connections here. I mean, right off the bat, I'm just thinking curriculum learning; that's such a huge theme in my mind these days. The ability to get smarter in terms of what data you feed into even the pre-training stage: filtering, enhancing, curating, boosting, so many different manipulations there. But this one is probably one of the most intuitively obvious, where especially you think rare diseases, it's just not in there that much, and that makes it hard for the gradient descent to reinforce, to reward the learning of it. So boost it up a little bit and next thing you know you're getting better performance. There are so many opportunities like that, and those were the light bulbs going off in my mind as just an application builder today too. I was like, boy, I see just so many quick patches in my future of rare cases that I want to handle better.

(10:24)
So I think that's super interesting. Just going back to the Amazon thing as well for a second, because I do love to contrast recent history approaches: you're scanning for important data. Does that mean, I imagine, the sort of five-years-ago version of that was just a whole Swiss army knife of different explicit techniques, like regular expressions and a handful of classifiers? What did that thing look like? And now today I'm kind of like, maybe I'd use Claude Instant and clean out a lot of that old code. Is that a reasonable intuition?

Alex (11:05):

Yeah, that's one of the reasons I'm so immensely grateful for the LLM technologies and transformers that are out: there is a light at the end of the tunnel for people doing traditional NLP and NER, of a better, more general way to do it. So really excited about that. But you're exactly right. Going back to Macie, it used a combination of traditional named entity recognition technologies as well as, as you were saying, regular expressions and things like that, to help identify any type of personal data that might exist in the cloud and label it so you knew where it was. And it would really take a look at it and say, is this exposed to the internet? Is this shared with outside organizations? Things like that, to give you the visibility that you needed across your organization. The real goal was to enable developers to make decisions about what tools to use and use the best available tools, but also give the enterprise the visibility necessary for that to happen.

(12:01)
I think the big challenge we faced doing this at Amazon scale was we went from a startup that had a couple of amazing customers to, the first week that we launched Macie, 6,000 customers. We were doing named entity recognition up to petabyte scale, so a lot of time was focused on how you make even traditional ML technologies work at that scale. Part of the reason I'm so excited about technologies today is just the amount of specialization, or tuning anytime your data characteristics changed, that was required back then. And now, as much as all of us probably get annoyed with the need to prompt-tune and do things like that, the promise of an LLM that can understand your natural language question and make that change for you automatically is really, really cool.

Nathan (12:45):

It's certainly a game changer in so many different respects. Coming back to the present and synthetic data as it's unfolding today, there are a number of use cases that you guys highlight in your product and your demos. I'd love to hear you talk through a few more beyond the boosting of the underrepresented set. One that jumped out to me, and I think really highlights the challenge, is insights. I can just imagine; I've done a lot of data analysis in my time, and it's like, okay, I certainly hear why at the corporate level you don't want to be passing around the crown-jewel dataset. I did some work with Rocket Mortgage, for example, and the care with which they maintain their customer data, access to it, all that stuff, is a serious, serious effort. So you can't just be passing stuff around like crazy.

(13:41)
That makes total sense. But then you say, okay, well, and it's a super creative concept, instead of having to deal with all that, we'll just make fake stuff and use that instead. But insights? I was like, okay, boy, insights. I'm going to need some real theory to start to trust that you can make fake data that is enough like the real thing. And that's obviously something that probably most people are going to struggle to wrap their heads around: how do you define that, prove that, whatever, such that I can actually do my sort of pivot tables on this and trust that what I'm getting is making any sense? I've been thinking about that a lot, and I've got some guesses, but I'm really interested to hear more about the provability of how this stuff works.

Alex (14:24):

Yeah, our approach, and I think the one that seems to be gathering a fair amount of steam in the synthetic data world, is to train a model. And of course we're minimizing the loss function as we're training it and doing the best we can, but that doesn't tell you how that model's going to capture the real-world distributions that you care about and the ability to replay them. So for us, regardless of the modality, whether it's text, tabular, or time series, it really starts with having the model master the ability to recreate data matching the same distribution as the real-world data it was trained on. And if you can have confidence in that, you can start to alter the distribution for whatever your task is. So how do we do that? We train the model, and at each iteration, and really at the end, we sample a bunch of data from the model, about a one-to-one equivalent of the real-world data, and then, from a statistical perspective, we essentially throw the kitchen sink at it.

(15:20)
We have two ways of measuring. One, I would say, is meant to be as objective as possible, and the other is meant to be task-specific. So we have something called our Synthetic Quality Score. What it's doing is easy to walk through from a tabular perspective; we actually have similar scores for text and time series as well. We sample a bunch of data from the model. We look at pairwise correlations, and that creates part of a composite score. We look at the per-field distribution; we even do PCA distributions for each field and then compute a distance metric between the real-world data and the synthetic data. And the idea is to give you a 1-through-100 score that you can look at and reason about and say, if this is above 80, we expect it to work well for the types of machine learning use cases that most people use synthetic data for.

(16:06)
If it's below that, maybe it still works for your use case; maybe your use case is just testing or something like that. But as you were saying earlier, you don't want to create pivot tables on that. So really we start with trying to give you that sense of confidence. We've added in the ability, really just after seeing a lot of customers do this, to automatically test a downstream task for your data as well. So after the model's done training, we can run a regression or a classification task or things like that automatically within our platform. We have a lot of customers that use Vertex or SageMaker or things like that to run this as well; we just built it into the product so not everyone has to write code. But I think a mixture of that somewhat completely objective, not task-specific score, which is a good general indicator, and then also that understanding of your task, what you want to do with the data, and making sure it conforms to those expectations, feels like the way to get that sense of confidence you need.
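
To make the composite score idea concrete, here is a rough sketch of the kinds of statistics Alex lists (pairwise correlations, per-field distributions, a PCA-space distance) rolled into a single 0-to-100 number. The distance choices and weights are assumptions for illustration; Gretel's actual Synthetic Quality Score is more involved.

```python
# Rough sketch of a composite synthetic-data quality score.
# Metrics and weights are illustrative placeholders, not Gretel's actual SQS implementation.
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import PCA

def quality_score(real: pd.DataFrame, synth: pd.DataFrame, bins: int = 20) -> float:
    num_cols = real.select_dtypes(include=np.number).columns

    # 1. How well are pairwise correlations preserved?
    corr_gap = float(np.nanmean(np.abs((real[num_cols].corr() - synth[num_cols].corr()).to_numpy())))

    # 2. How close is each field's marginal distribution? (Jensen-Shannon distance per column)
    js = []
    for col in num_cols:
        edges = np.histogram_bin_edges(real[col].dropna(), bins=bins)
        p, _ = np.histogram(real[col].dropna(), bins=edges)
        q, _ = np.histogram(synth[col].dropna(), bins=edges)
        js.append(jensenshannon(p + 1e-9, q + 1e-9))
    field_gap = float(np.mean(js))

    # 3. Distance between the two datasets' centroids in a shared PCA space.
    pca = PCA(n_components=min(3, len(num_cols))).fit(real[num_cols].dropna())
    pca_gap = float(np.abs(pca.transform(real[num_cols].dropna()).mean(axis=0)
                           - pca.transform(synth[num_cols].dropna()).mean(axis=0)).mean())

    # Fold the gaps into one 0-100 score; the weights here are arbitrary placeholders.
    penalty = 100 * (0.4 * corr_gap + 0.4 * field_gap + 0.2 * min(pca_gap, 1.0))
    return float(np.clip(100 - penalty, 0, 100))
```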

Speaker 4 (17:01):

Hey, we'll continue our interview in a moment after a word from our sponsors. Real quick, what's the easiest choice you can make? Taking the window instead of the middle seat? Outsourcing business tasks that you absolutely hate? What about selling with Shopify? Shopify is the global commerce platform that helps you sell at every stage of your business. Shopify powers 10% of all e-commerce in the US, and Shopify is the global force behind Allbirds, Rothy's, and Brooklinen, and millions of other entrepreneurs of every size across 175 countries. Whether you're selling security systems or marketing memory modules, Shopify helps you sell everywhere, from their all-in-one e-commerce platform to their in-person POS system. Wherever and whatever you're selling, Shopify's got you covered. I've used it in the past at the companies I've founded, and when we launch merch here at Turpentine, Shopify will be our go-to. Shopify helps turn browsers into buyers with the internet's best-converting checkout, up to 36% better compared to other leading commerce platforms. Shopify helps you sell more with less effort thanks to Shopify Magic, your AI-powered all-star. With Shopify Magic, whip up captivating content that converts, from blog posts to product descriptions, generate instant FAQ answers, pick the perfect email send time, plus Shopify Magic is free for every Shopify seller. Businesses that grow, grow with Shopify. Sign up for a $1-per-month trial period at shopify.com/cognitive. Go to shopify.com/cognitive now to grow your business, no matter what stage you're in. shopify.com/cognitive

(18:37)
Omneky uses generative AI to enable you to launch hundreds of thousands of ad iterations that actually work, customized across all platforms, with a click of a button. I believe in Omneky so much that I invested in it, and I recommend you use it too. Use code COGREV to get a 10% discount.

Nathan (18:55):

Yeah, interesting. So can we unpack the loss function a little bit more? Because I'm kind of wondering about the relationship. That's all pretty quantitative stuff, right? It's a code base, ultimately a sort of test suite that you can execute on any dataset that comes through and say, we're going to characterize what you gave us, and then we're going to characterize what we generated, and show you that hopefully it lines up distribution-wise on the generation side. And it's probably important to keep in mind for folks, because we're also used to one-token-at-a-time language models. I'm very much thinking of your latest Tab LLM that you demoed, but it might be worth distinguishing between that latest thing and the set of more purpose-specific models that you have. But I'm kind of imagining, for the new big one, it seems like there is a really natural and kind of insightful thing here, maybe, for people, where there's a decoupling of the prediction, which is the distribution, and then sampling from that distribution.

(19:58)
I think this is something that people maybe don't conceptualize super rigorously, but the task that you have helps me, I think, at least crystallize it a little bit. So most people have a general sense: at the end of the language model, you're putting a prediction onto every token, and then, with your temperature setting, and this is the experience that people are most familiar with if they're an AI engineer or whatever, you can turn that temperature down to zero and get the most likely token, or you can turn that temperature up and randomly select from those probability distributions. But in the practical experience of it, we really only see one token. And in the training also, there is a ground-truth text document that is firing one on the actual token and zero on all the other tokens. So it strikes me that your situation is a bit different here, where you can potentially define the target as the distribution and directly optimize and conform the distribution of the predictions to the distribution that you've characterized from the data, and then the sampling from that becomes something you can kind of understand.

(21:12)
And I mean, it's the same fundamental thing, but the difference between that one token being right and wanting to generate the actual distribution seems like something that was really helpful for me to wrap my head around over the last couple of days.
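
A quick sketch of the decoupling Nathan is describing: the model's output is a distribution over possible values, and temperature only changes how we sample from it. The numbers below are made up for illustration.

```python
# Sketch: the model's job is the distribution; temperature only reshapes how we sample from it.
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng=np.random.default_rng(0)) -> int:
    if temperature == 0:
        return int(np.argmax(logits))             # greedy: always the modal prediction
    scaled = logits / temperature                 # t < 1 sharpens, t > 1 flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))  # draw from the (re-shaped) distribution

# Made-up next-token logits over, say, four candidate field values.
logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample_with_temperature(logits, 0.0))   # modal value every time
print(sample_with_temperature(logits, 1.0))   # roughly matches the learned distribution
print(sample_with_temperature(logits, 2.0))   # rarer values drawn more often
```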

Alex (21:29):

Yeah, yeah. That's probably a byproduct of starting out working with tabular data, where we look at a row being generated at a time. We were using a variety of models when we started: we started with LSTMs, we used GANs, we used diffusion models, and now, as you mentioned, with our Tab LLM model we use transformers. A byproduct of how we built our product is we end up looking at the row level. So every time a row of data is generated, we examine everything. Similarly, if you are generating a sequence of new LLM instructions, for example, rather than looking at the per-token distribution, what we're going to look at is the per-line or per-record distribution. So essentially we let the model generate everything. The first step during training is we're sampling and we're looking at it, but the same applies when you're using the model for inference.

(22:19)
So when you're asking the model for data to come out, there is the risk that the model's going to hallucinate or invent something new that no one wants to have happen. So we have a secondary level of validation; we call them, not very creatively, validators. Essentially what a validator is doing is looking at all the outputs of the model and asking how different each one is versus the original data the model was trained on, and you have the ability to filter out things that are too far outside of the distribution. The idea there with tabular data was to make sure that we didn't invent anybody's age as 135 inside of a dataset, but it works really well for text data as well. When the model goes off on a rant or invents something that's way outside of what it should be working with, you have the ability to filter that type of data out, and it helps you have more confidence that a generative model is going to give a usable response.
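
Here is a minimal sketch of the validator idea: profile the training data once, then reject generated rows whose values fall outside what the real data supports (an age of 135, for example). This illustrates the concept only; it is not Gretel's validator code.

```python
# Sketch of a row-level validator: filter generated records that fall outside the ranges
# and categories observed in the training data (illustrative, not Gretel's implementation).
import pandas as pd

def build_profile(train: pd.DataFrame, padding: float = 0.05) -> dict:
    profile = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            lo, hi = train[col].min(), train[col].max()
            slack = padding * (hi - lo)
            profile[col] = ("range", lo - slack, hi + slack)
        else:
            profile[col] = ("categories", set(train[col].dropna().unique()))
    return profile

def is_valid_row(row: pd.Series, profile: dict) -> bool:
    for col, spec in profile.items():
        if spec[0] == "range":
            _, lo, hi = spec
            if not (lo <= row[col] <= hi):   # e.g. rejects an invented age of 135
                return False
        elif row[col] not in spec[1]:        # unseen category: flag for review or drop
            return False
    return True

def filter_synthetic(train: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    profile = build_profile(train)
    mask = synth.apply(lambda r: is_valid_row(r, profile), axis=1)
    return synth[mask]
```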

(23:10)
Another kind of cool thing is that, with so much of the focus for synthetic data really being on creating machine learning training sets, you can't have someone looking at every record in a row and saying, yes, this is fine, this isn't fine. So we've really focused on making sure that when we generate data at a thousand or a hundred thousand records, or even a million records, you have confidence those records match your expectation. I think that's another really neat thing that I see happening. Go back six months and there were so many questions about, I want to use this model, this LLM, for summarizing content on Reddit or things like that, and the risk was that it would summarize something that was off base and would be an inaccurate summarization. And I think technologies like what we built for text scoring, and a few open-source metrics that have been released recently, really help you quickly check and reason about a generative model's output in a way that would allow you to serve the results to customers without necessarily having a human look at it. So a nice, quick AI check on data makes these models so much more usable.

Nathan (24:18):

How do you do the pre-training, and how big of a foundation model is this? Again, I'm so fascinated with the Tab LLM. I'm kind of imagining that you've gone out and assembled every public dataset you can and, in a sense, taught a statistical world model to this thing, so it's supposed to have all the right priors, basically. How do you go about creating and validating that strong baseline?

Alex (24:47):

Yeah, so for some background for listeners here, we are about to release a model we're calling Tab LLM, for tabular LLM. What it is is an agent planning and execution architecture built to help people work with tabular data using natural language queries. At the core of that is the agent that is making a decision about whether to use an LLM to generate data or whether to use one of our tools and write code to generate data to serve your response. What we're referring to here is the actual LLM that we have fine-tuned on datasets from across the internet. So it's one of the first examples you'll see of an LLM that's meant to work with tabular data. Tabular data can be text, time series, numerical, categorical, any combination of those. The approach we took, and I think this will be a constant evolution for us, the initial approach that we took was exactly like you mentioned, Nathan.

(25:46)
It was: crawl the internet, specifically crawl GitHub, find any accessible datasets there, Kaggle, things like that, anything with an open-source license. One area we were particularly lucky with was that I was noticing a lot of times machine learning papers will reference datasets in the README. So there's really great data linked inside READMEs, and we could pull down the license and really understand if it was usable or not. But the idea was to train an LLM that would be used for a data generation task on what good data looks like. And something interesting is that while we all kind of feel that LLMs today are trained (and it's mostly accurate) on almost all of the content that's on the internet, whether you're working with an OpenAI model or PaLM or even a LLaMA model, these models really aren't trained largely yet on tabular data.

(26:37)
And tabular data also introduces some interesting challenges, in the sense that when you look at the context windows that are available to LLMs today, which on a great LLM is, let's say, 16K tokens, it doesn't translate into a lot of rows in a typical tabular dataset, right? So 16,000 tokens, assuming 50 tokens per row, is going to give you about 320 rows. I think most of us regularly work with datasets much bigger than that. So one of the things that we noticed as we started working with LLMs and asking them to generate tabular data: the power of asking an LLM to generate tabular data is, one, they are, just by a byproduct of how they work, really good with time-series-type data; there's been some cool research about that recently. Second, it allows you to apply a global level of knowledge to your dataset.
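
The back-of-the-envelope math here is worth writing down, since it motivates the agent design discussed later: the token budget, not the model's capability, caps how many rows fit in one generation call. The numbers are the rough ones from the conversation.

```python
# Back-of-the-envelope: how many table rows fit in one LLM context window?
context_tokens = 16_000   # a generous context window, per the conversation
tokens_per_row = 50       # rough figure for a typical mixed-type row
rows_per_call = context_tokens // tokens_per_row
print(rows_per_call)      # ~320 rows, far short of a 100k- or 1M-row dataset
```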

(27:31)
So one thing that I think has really resonated with our users on the platform is realizing that your dataset is awesome, everybody's dataset is unique and really cool, but it's also in some way limited, right? You don't have enough data, you never have enough of the examples, anything you were mentioning, the kind of long tail of data that you deal with, and finding a more systematic way to work with it. So the idea of applying a model that has seen most of the datasets on the internet to that problem and saying, can you help me create some new, meaningful variations in the data to help a downstream model generalize, is really powerful. So that's where we started with the initial model. For the Tab LLM model, we haven't done anything super clever with how to encode or model numeric distributions; rather, we just treat everything as text and it goes through there. As I mentioned earlier, our first approach was to crawl the internet and train it on everything.

(28:24)
And I think, very similar to other research and academic work we see right now, a more curated, highly diverse set of high-quality examples is the way to go. So we're seeing our team really work on that. And one of the opportunities here is that even the GPT-4s of the world, when they've seen a tabular dataset, it's usually a table on Wikipedia or something like that, so it's a couple hundred rows at most. The LLMs just have not learned that the relationships across the dataset are sometimes important, and that the dataset might be thousands of rows or hundreds of thousands of rows. So that's a really neat application we're looking at right now: what if we train LLMs on a much larger context length and much more data? How good of a job do they do learning the subtle insights and distributions of the data that'll help improve ML generation when you're using the model?

Nathan (29:17):

Yeah, I bet quite a bit. I'd love to hear a little bit more about the agent kind of structure, because I'm imagining, you said you generate one row at a time, and for one thing the order really kind of matters there. I wonder if you have a sort of systematic approach to reordering fields, because there's been some interesting research lately that A implies B does not imply that B implies A from the language model's perspective. And then I guess there's a sort of sequential probabilistic evaluation, where you'd be saying, okay, once at least some amount of pre-training has been done, if I were to give you the zip code as the first field, then you would expect to see reasonable demographics back just based on that zip code. But then, depending on the first variable that you predicted, you would have a very different conditional distribution for the subsequent variables, correlated in all sorts of varying ways. So you're kind of doing a little Markov, almost, process, randomly down the Plinko board of possibility, and then going back and evaluating each token for its kind of conditional accuracy or conditional real-world representation. Is that conceptually right?

Speaker 4 (30:38):

Hey, we'll continue our interview in a moment after a word from our sponsors. AI might be the most important new computer technology ever. It's storming every industry, and literally billions of dollars are being invested. So buckle up. The problem is that AI needs a lot of speed and processing power. So how do you compete without costs spiraling out of control? It's time to upgrade to the next generation of the cloud: Oracle Cloud Infrastructure, or OCI. OCI is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds, offers one consistent price instead of variable regional pricing, and of course, nobody does data better than Oracle. So now you can train your AI models at twice the speed and less than half the cost of other clouds. If you want to do more and spend less, like Uber, 8x8, and Databricks Mosaic, take a free test drive of OCI at oracle.com/cognitive.

(31:33)
That's oracle.com/cognitive. oracle.com/cognitive. If you're a startup founder or executive running a growing business, you know that as you scale, your systems break down and the cracks start to show. If this resonates with you, there are three numbers you need to know: 36,000, 25, and one. 36,000: that's the number of businesses which have upgraded to NetSuite by Oracle. NetSuite is the number one cloud financial system, streamlining accounting, financial management, inventory, HR, and more. 25: NetSuite turns 25 this year. That's 25 years of helping businesses do more with less, close their books in days, not weeks, and drive down costs. One: because your business is one of a kind, you get a customized solution for all your KPIs in one efficient system with one source of truth. Manage risk, get reliable forecasts, and improve margins. Everything you need, all in one place. Right now, download NetSuite's popular KPI checklist, designed to give you consistently excellent performance, absolutely free, at netsuite.com/cognitive. That's netsuite.com/cognitive to get your own KPI checklist.

Nathan (32:38):

Netsuite.com/cognitive.

Alex (32:40):

One small modification I would make to that: we have found you want to sample as much data as you can at a time from a transformer-LLM-based model for tabular data, up to the level that the LLM is capable of working with. Let me give an example there. Let's say you're working with LLaMA 2, or you're working with OpenAI's 16K context window model, right? It might be capable of generating all that data, but if it's never learned that more than a couple thousand tokens are relevant to a dataset, you're going to start to lose some efficiency as it generates more and more data. So what we do is we sample from our trained model up to as many tokens as we can at a time, and then we evaluate it row by row. And the purpose of the agent comes from realizing that with current LLM technologies, there are a couple of purposes for the agent, but the first one and the most obvious one is that no LLM today can generate a hundred-thousand- or million-row dataset, or go in and edit your dataset, which is a really popular use case for us right now, right?

(33:43)
If I want to add new fields, if I want to summarize product reviews, if I even just want to search for anomalies across my data, we've got to be able to process data way bigger than what an LLM can handle in a single batch. So the first purpose of the agent really is to take a complex user query, for example, "create a demo dataset with a spike in sales activity in November, I want a million rows," or, if you're editing data, "convert this unit from Celsius to Fahrenheit across my entire data warehouse," things like that. The agent's first goal is interpreting that user query that's coming in and then figuring out how to divide it up into a set of smaller problems that the LLM can work on, one problem at a time. A good analogy: in the NLP world, if you asked GPT-4 to write a book for you, you would probably get a really short book, and you want a novel that's got several hundred pages.

(34:38)
If you were able to divide that up, take that problem, what someone's asking for, and divide it up into smaller problems, like write a paragraph or a chapter at a time, you could see how an agent planning and execution architecture would say, okay, first step, I need to write an introduction; next step, I've got to have character growth and start to work on the character arc; and finally, I need the conclusion, and so on, dividing those up into smaller problems. That's the approach we're taking with dataset editing or dataset creation, where we've got something that is breaking it down into a step-by-step problem that a smaller model, in this case our data generation LLM, can work with and start generating high-quality data for that particular window.
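
Here is a hedged sketch of the planning loop Alex describes: take a large request, split it into context-sized chunks, have the generator fill each chunk, and validate as you go. The `planner`, `generator`, and `validator` objects and their methods are hypothetical placeholders for the intent-parsing LLM, the data-generation LLM, and the row validator, not actual Gretel APIs.

```python
# Sketch of the agent's divide-and-conquer loop for a large generation request.
# `planner`, `generator`, and `validator` are hypothetical stand-ins; this is not
# Gretel's actual architecture code.
import pandas as pd

ROWS_PER_BATCH = 300   # bounded by the generator's context window (see the earlier estimate)

def fulfill_request(user_query: str, total_rows: int, planner, generator, validator) -> pd.DataFrame:
    # 1. The planner turns the natural-language request into a schema plus per-batch instructions,
    #    e.g. "sales spike in November" becomes a date range with an elevated sales distribution.
    plan = planner.plan_steps(user_query, total_rows=total_rows, batch_size=ROWS_PER_BATCH)

    batches = []
    for step in plan:                          # one small, self-contained problem at a time
        raw = generator.generate_batch(step)   # a DataFrame of roughly ROWS_PER_BATCH rows
        batches.append(validator.filter(raw))  # row-level validation before accepting

    result = pd.concat(batches, ignore_index=True)
    return result.head(total_rows)

# Usage (assumed names):
# df = fulfill_request("Create a demo sales dataset with a spike in November",
#                      1_000_000, planner, generator, validator)
```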

Nathan (35:22):

So is it more of an instruct-type model that is creating kind of code-as-policy outputs, and then a dedicated, actual data point generation model that is receiving those commands and doing the cell-by-cell work?

Alex (35:41):

Exactly.

Nathan (35:42):

That allows you to put language in there too, right? I mean, I saw one of the demos was reviews of a product, and obviously that's a pretty different situation from the tabular data. I assume that's kind of a little more random somehow, or it seems like it would be harder to give a representativeness guarantee on customer reviews.

Alex (36:04):

We've got some research, which I'll link over to you, on how we assess the quality of the text based on what you're looking for. But so often datasets are mixed; imagine EHR data where you've got doctor's notes mixed with initial observations from patients as they come in. I think that happens quite a bit, so we try to learn across all of them. And one of the interesting things is that you don't necessarily want your LLM to do everything, and that's maybe the other part of the agent-planning-based architecture. If you were asking for an incrementing ID or a Fahrenheit conversion, and we've got a neat example where we're doing maybe a high-school-physics-level problem, the LLM will approximate, but you don't want it to actually approximate your answers. You want the real answer. So the other part of making synthetic data using this Tab LLM work at scale is having the LLM just recognize which areas are best to calculate or compute and doing that for you automatically.

(37:01)
I think we all see that a little bit: if you're experimenting with GPT-4 or ChatGPT and you ask it to help you work on a dataset, sometimes it'll give you a dataset back, sometimes it will give you back code that you could use to solve the problem. And really that's the type of stuff that we are trying to streamline. We're essentially applying the agent to realize when something should be a Fahrenheit-to-Celsius conversion, where you just apply the formula and get the right answer; you don't have to have an LLM figure that out. So the first step is: look at that user query, figure out, given the available tools that I have, can I solve this problem with code? If so, execute that code and get that into the dataset, so you have high confidence in the answer. But for other things that require the kind of level of knowledge or intuition that an LLM would have, summarize whether this review is positive or negative, things like that, that require you to look across fields and understand natural language text, then we use the LLM to fill in that data.
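
To make the code-versus-LLM routing concrete, here is a small sketch: deterministic transformations get exact code (note the full Fahrenheit-to-Celsius formula), while judgment calls like review sentiment go to the language model. The routing table, column names, and the `llm` helper are illustrative assumptions, not Gretel's implementation.

```python
# Sketch: route each requested column fill to exact code or to the LLM.
# The routing table, column names, and `llm` helper are illustrative assumptions.
import pandas as pd

def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5.0 / 9.0          # exact formula; no need to ask an LLM to approximate

def add_incrementing_id(df: pd.DataFrame) -> pd.Series:
    return pd.Series(range(1, len(df) + 1), index=df.index)

DETERMINISTIC_TOOLS = {
    "temp_c": lambda df: df["temp_f"].map(fahrenheit_to_celsius),
    "row_id": add_incrementing_id,
}

def fill_column(df: pd.DataFrame, column: str, instruction: str, llm) -> pd.Series:
    if column in DETERMINISTIC_TOOLS:
        # Computable exactly: run code so the answer is trustworthy.
        return DETERMINISTIC_TOOLS[column](df)
    # Requires reading natural language across fields: defer to the LLM,
    # e.g. "label this product review as positive or negative".
    return df.apply(lambda row: llm.complete(instruction, row.to_dict()), axis=1)
```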

Nathan (37:58):

It comes together very nicely, I think, in the product demo. I'm definitely excited to spend a little more time with it. I do think it'll be really helpful, and it's probably a good idea also to just contrast this, as you started to do a little bit, with trying to use GPT-4 or certainly any of the RLHF'd models. I think you have some very fundamental problems there, and that's kind of where, even for a project like mine... So where I'm thinking of applying this immediately: we have this script-writing model, and its job is to write a video script for a given user who comes to us, often naively, and we grab some content off their website or whatever and figure out who they are. So it's extremely, extremely diverse, and you might say extremely sparse, right? We have a healthy usage, but we're not that big.

(38:48)
It's a big world out there. So especially internationally now, different languages, just all sorts of long-tail stuff that we have not previously put into our dataset but that can at any time come our way. And then I think, okay, if I want to do some patching of my fine-tuned 3.5, which is currently the state-of-the-art best thing that can nail our task, then one example is probably not quite enough to get it to really learn the pattern I want it to learn; five to ten in my broader mix of a few hundred samples probably is. But I want to create something, and first of all, especially if it's an unfamiliar area, it's very hard to even know what to do. You show some examples where it's like France, and I'm like, oh God, excuse my language, excuse my French, I don't even know the structure of the postal system in France, let alone how to make semi-realistic examples that I would want to throw into 3.5 fine-tuning.

(39:54)
So if I'm making this up totally on my own, I just seem destined for underperformance, for just garbage-in problems. And then if I ask GPT-4, it's going to be so RLHF'd to mode-collapse on things like this; so often it just answers 42 or whatever, or 97, way too large a percentage of the time when you ask for a random number. I just don't trust it at all for that sort of representativeness, and I think OpenAI would readily agree that, yeah, you should not use it for that. Obviously it's been trained for a very different purpose. So that to me is exciting, and I think it gives you guys a real different position in the market that is so distinct from the kind of mainline AI assistants. I think that's pretty cool.

Alex (40:50):

The flip side of RLHF, which I think is so interesting: we've done some initial work that we published on using RLHF to reduce the model's propensity to talk about stuff that it shouldn't, initially, particularly don't return PII and data that occurs in the model, which is a big enterprise use case. But I think what you talked about also, sometimes you want that, sometimes you need it. We have customers, for example, that are generating, no joke here, spear-phishing emails to test their own spear-phishing detection system. So there are times that you need to go against what the model has been trained to do. And another interesting piece that came to mind when you were talking about RLHF: I think the enterprise use cases are a little different than consumer, where in some cases you need the ability to turn off some of the guardrails because you need to create a level of diversity, or talk about things that an RLHF'd model probably just doesn't want to talk about.

(41:44)
A big use case for us, I'll give you another example, would be healthcare companies that are looking to create synthetic versions of patient medical records. And a lot of models, when you're trying to augment your examples, will just refuse to do it because they think you're talking about something that could be potentially harmful, but it's really being used for a good use case. So I think there is definitely a case for times where you want to take off some of the guardrails, or the set of expectations that organizations or teams or developers might have are just much different than what a consumer might have if they're using ChatGPT or something like that. One that I think is so cool, and that you really start to notice with tabular data, a big application and the most basic thing that we see with synthetic data: people always start by generating that mock dataset they've had to generate for a demo or for their UI or something like that.

(42:36)
And inside that dataset you're going to have names, you're going to have addresses, genders, things like that, a lot of protected-class-type stuff. And what you'll notice is the models have a tendency to return one type of data. So you've got names that seem very consistent, probably representative of the training set demographics across the United States, or things like that. One cool application of RLHF that we've been experimenting with is actually training the model to be more diverse in the results that it gives back. So if I ask for a set of demographics for a particular zip code or city or things like that, having the model return a more diverse and aligned set of demographics than what a model might do off the shelf is pretty powerful. But you want the ability to control that. Sometimes you want real data, sometimes you want ethically aligned data; those are both really important. I think the irony of the whole thing is RLHF can be really good for both of them. It's a tool, and it's just what direction you point the algorithm and the loss function it's solving.

Nathan (43:44):

This goes back to my kind of decoupling of the distribution prediction and then the sampling from it. It seems like you could achieve that largely with just temperature. If you said, we're going to make our core model and its logits or percentage-weight outputs as true to real-world data as we can, then you could slide your temperature slider from zero and be like, at zero you get the modal prediction, at one or whatever you get the sort of normally represented, real-world distribution, and then at two you get the minorities-overrepresented-on-any-and-all-dimensions version. But it sounds like you're approaching that in a different way. So is what I'm saying not workable for some reason? And why is there more complexity?

Alex (44:33):

Even with temperature, when you turn it up relatively high you start to get into even kind of crazy data, and for a really imbalanced dataset, even then the temperature isn't going to introduce something that is 1% of the data very often, or at least not to the level that you're wanting it to. So this is a technique we can use to force that to happen.

Nathan (44:55):

So this kind of reinforcement learning is really a tool for saying, we really want you to create a more diverse, representative dataset. This is the stock photography thing, where it's like, we're going to be intentional about this, but to do that you have to apply an even stronger technique.

Alex (45:11):

I love the Pinterest engineering blog, and they had a really neat example on how the search results, for example if you're searching for pictures of wedding rings, would bring back pictures of wedding rings with very diverse skin colors represented, which I think is a nice feature to have. Once again, you don't always want that. Often what I've found, particularly in the machine learning or tabular data space, is that the classes are incredibly imbalanced. Let me give you another example, even for things at scale. We work with a major social media company; they were impacted, just like everybody else, by the changes to third-party ad tracking and things like that. So really they're trying to make the best possible use out of the data that they have. And when you look at the ad recommendation problem, it's massively imbalanced: for every thousand people you present an ad to, maybe one or two click, hopefully better than that.

(46:02)
But often that's, I think, the kind of case that you're looking for. So you're trying to make the absolute most out of that data, but that data represents 0.01%, something in that range; I'm making these numbers up, but it's meant to be illustrative. So you need the ability to tell the model, this is really important, I want to learn from these particular features, this very imbalanced class, and then generate meaningful new variations to improve detection. And so that's where I'm coming from with the different, very strong techniques outside of altering temperature, where you want very explicit control over the model output to make sure it meets your expectation for the task you've got at hand.

Nathan (46:40):

Well, control over models is definitely something we should all be striving for, in all aspects of AI, and definitely in the big-picture kind of worries about whether we're going to keep this whole AI technology wave under control, in any number of ways. I think your situation here is one of the more compelling cases I've ever heard for the need for the raw model that has the more accurate world model, even if it is sometimes not so pretty to look at. And I'm not super bent out of shape about these issues, but I guess a lot of times I kind of view the model developer's ability to control things as a canary in the coal mine: if they can't prevent it from being offensive today, are they going to prevent it from following a "build a bioweapon" command tomorrow? That's kind of the most credibly alarming scenario, I think. But I certainly see your point about, hey, we want to have all these different dimensions of control, and even just to build stuff to test our detection systems, we've got to have data that's going to set off alarms. So that is all super interesting. I wonder, what role does conventional AI ethics have in the company, given all these use cases that you want to enable?

Alex (48:09):

I have a very optimistic view of AI, and I think a lot of times I view synthetic data as a tool that could be used to improve the alignment or the ethics by someone that wanted to do it. I think, to your point, it could also be used to do stuff that's harmful. So I think that's a real question that we're going to be wrestling with as a community over the next couple of years. I'm a big fan of having the alignment checks and the warning flags everywhere, but, to the extent possible, giving people control over what the model does. And generally speaking, curious to hear your opinion on this one as well, I think I'm more of a fan of the open model, where it can be adapted for whatever particular use case you might have, versus moving to a more closed space where a very small group of very powerful companies have control over what the models can and can't return.

(49:06)
So there is no perfect answer there, but I would say I want to believe that people want to do the right thing. AI is advancing so incredibly fast that, from what we're seeing, I think that, for example, the White House executive order that came out this week is a sign that people are paying attention to the right things. I'm also, and maybe this is a byproduct of being one of the smaller companies out there, we've got about 65 people at Gretel right now, curious to see how regulation will play into this world where smaller companies that are innovating may not have a hundred people on a regulatory or compliance team to help work on this yet. So I love the direction. I think something really important as we move forward is thinking about enabling competition and enabling innovation, while protecting people's privacy and protecting the use of AI across our ecosystem.

Nathan (50:02):

Yeah, no simple answers on all that. And the hundred-plus-page order that seems to mostly be ordering another 10,000-plus pages of reports is certainly reflective of that. I mean, I also thought it was a pretty good first step, and at the highest level, I've been saying a lot lately that, as someone who does take kind of big-picture AI risks pretty seriously, it's hard for me to imagine a much better situation for the overall game board to be in today than the one we actually have. At a minimum, we can say all the people at the big companies that are developing the most powerful systems are pretty serious-minded, and the most kind of rogue one is Meta, and they're still more responsible than you could easily imagine people being if they just didn't care or thought the whole thing was totally ridiculous.

(50:56)
So I think that's all a pretty good start. I like the FLOP threshold pretty well. I mean, I'm imagining, like you had said, 500 billion tokens as the pre-training base. So if my intel is correct, GPT-4 is 10 trillion tokens, so that's 20 times as many tokens, and however many more parameters. It feels like you probably have three orders of magnitude between what your compute budget was for this and where even just the reporting threshold would kick in. So it seems like you have plenty of room to run as a small company before you would hit any onerous regulation.

Alex (51:42):

And that's one of the neat things: we don't have to, and we're not trying to, compete with GPT-4 to create a tabular data model at that scale. I think that for other people building AI-powered applications right now, the promise of a really lightweight, really fast model, like the Microsoft "Textbooks Are All You Need" paper using a roughly 1 billion parameter model that is super fast on inference, super low on training costs, trained on a relatively diverse but small set of examples, shows the power that you can have of taking a domain-specific dataset or task you have and doing something meaningful without having to do something at the GPT-4 scale. So I'm personally so excited about that. I think it's going to enable innovation from the life sciences companies, from fintech companies, you name it, AI video content creation companies, things like that, that can create small, efficient, fast models that do something really cool and really unique that the big models haven't done or can't do at the same level. And then the combination of those two: still leverage that big model where you need it. We use it for the intent parsing, really understanding what type of query a user wants; then we use our small model for speed. I'm excited about that. I think it enables people to experiment without needing, as you were saying, to train on 10 trillion tokens or something just so big that it becomes a barrier to entry.

Nathan (53:19):

Yeah, I think if we give it a little time, there are some really positive natural trends, because there are some ways in which everybody's interests can be aligned. Generally speaking, the systems that worry me the most are the super general ones. Things that are designed and engineered for a narrow purpose seem inherently just a lot easier to keep under control, because AlphaFold may be a world-changer, but up until at least this week, I think it does a couple of things now, but it previously did one main thing, and that's predicting protein structure, and you've got to fit that into a broader system. Lots of awesome examples like that. AlphaGo can play Go better than any human, but that's all it does. So I think that's all really good, and there is a vision for long-term AI safety that's like the ecology of small models; I think Eric Drexler has a manuscript on this.

(54:21)
He calls it comprehensive AI services. That is a good early articulation of it, pretty prescient actually, given it's like five years old already, I think. And his idea is just to have narrow superhuman AI in everything, and then we don't necessarily need superhuman general AI, which might be hard for us to control. But right now we're just still figuring out all these techniques and how to make things work and what the curriculum is supposed to be and what the learning dynamics are. And the one thing that is working without question, I mean a lot of things are working without question, but it's so tempting in the meantime to be like, well, why don't we go, what happens at 10 to the 27, 10 to the 28, 10 to the 29? And there I'm like, I actually would like to see us be a little bit more cautious before we just race through however many more orders of magnitude. I have no idea really what comes out the other end of 10 to the 30 at this point. All bets feel off. Does that feel safe to you? I mean, if somebody were to come today and be like, hey, great news everybody, all my H100s just warmed up and we're going to 10 to the 30 right now, we'll see you in a hundred days with my 50,000 H100 cluster, it should take a hundred days or probably a little more than that to get to 10 to the 30, but whatever,

Alex (55:42):

I don't think there's any way to stop someone from doing that.

Nathan (55:45):

Well, you are working at small-city-size electricity consumption at that point. So I mean, that is the kind of thing that the state can currently intervene on. Now there may be algorithmic breakthroughs in the future that make that sort of impossible to stop, but

Alex (55:59):

Well, getting a little bit meta here, maybe just kind of talking a little bit of theory, it feels like these advancements we made have really been kind of modeling how the human brain works, and that's neural networks, right? And at some point nature stopped saying we should have a bigger and bigger brain and started to say, we're going to start having parts of your brain that are specialized for certain things. So I don't think there's anything we can do to stop someone from training something on every token that can be found across the entire internet. But I actually think there's this idea that there is a quantity of smart enough, where for general reasoning you've got good stuff there, and then it's the task-specific models, the code generation LLMs or the synthetic data generation LLMs or generation like what AlphaFold is doing, a task-specific model that is really great at what it does, as a tool that's available to the others.

(56:56)
That both, I think, kind of helps us reason a little bit more about what's going on in a way we couldn't do if the model got bigger, but I also kind of believe it is actually probably a smarter and more efficient way to build out that AGI. So I would personally see the future as hopefully, because I like the idea of the auditability and the understandability of these small kind of expert models, a world where you've got a lot of models that are trained on small amounts of domain-specific, really special data, and then orchestrated by a larger, smart-enough LLM, without creating the uber-intelligence that no one understands how it works. Curious how you've thought about this as well.

Nathan (57:39):

Yeah, I think largely similarly, with maybe just a little more tinge of fear in my affect, but yeah, safety and narrowness again I think is super important. I guess if I were to try to summarize the case there, it would be that beyond a certain point, scaling isn't necessarily economical anymore, because you're good enough to do a good job at the tasks, and that's kind of what needs to exist there. Now I kind of want to revise maybe my earlier statement on being pumped about the state of the game board. I do think when we look at some of the leading developers, there is in some of them, one maybe in particular, a sort of borderline ideological sense that we're going to keep scaling and we're going to make something that's the most powerful thing we can make, and we're going to try to do it safely, but we are going to make the most powerful thing we can make.

(58:43)
It seems to be kind of a prevailing notion, and I'm like, that is the part that doesn't seem super wise to me, and it does seem like the kind of thing that the state can do at least something to control for a while. So I would again like to see a little caution there. But it's funny, I just did an episode today with my friend, the CEO of Lindy, and we were kind of running down all the places and ways in which we are both e/acc, which are many, and then there's just this one little corner of the world where we're like, yeah, maybe let's not rush to 10 to the 30, not knowing what kind of alien pops out the other end of that. But I really appreciate your perspective. It is so interesting. You have just such a different kind of angle on so many of these kind of core fundamentals. I'd love to hear how that plays out for you in terms of your sense of understanding on the part of the language models. You've got the stochastic parrot paradigm, obviously you've got the reasoning engine characterization. What do you make of that as somebody who's focused much more on the representativeness side of the challenge?

Alex (59:57):

Our number one focus as we've built out our service, and I think it's helped keep us grounded, is helping data scientists and developers with the problems they have with data today: your data's messy, it has gaps in it, I can't create new additional examples, it's too expensive, or there's no way to go back to it. So we really focused our efforts on first and foremost helping you build better data. That better data is either more accurate or more private than the existing data. That's been the guiding light. That's what we're really aiming for out of the gate, and learning as we go. One of the areas is we're about to release a very early version of our service, and to see and really learn from users, from what they're able to share and feed back to us about how they use it, and use that to guide development.

(01:00:51)
So instead of starting by making a set of assumptions that might prove incorrect, I think one of the areas where we've been successful as a startup is getting code out there really fast, getting examples out there people can iterate with, asking for feedback and iterating on that feedback. So I'm super curious to see where this use of generative AI for working with, our big focus here, tabular or mixed-modality tabular, text and time series data goes. We'll use that to drive our own investments: how much time do we spend working on a better agent, for example, so better agent tooling if you wanted to create a time series or something like that at the scale of a million rows? How do you take our knowledge around building time series that works and then combine it with the other technologies where we see it being successful?

(01:01:37)
That's where we're going to double down. One of the areas I feel is so kind of neat and de-risking in this space right now is there are so many potential tools you can bring to the problem, whether it's retrieval-augmented generation, so bringing example data sets into the LLM's memory just to help it, whether it's ReAct or the agent approaches for breaking things down into smaller problems, and the LLM training itself. So we've got a bunch of different dials we can use to solve the problem. We're hoping to learn from how people use the service and see which areas we really need to double down on. But I'm psyched for that, in the sense that I have some ideas on where it's going to go, but I don't really know where things are going to go. But we think about tabular data as a resource, and how much of the data we work with every day as organizations is in some sort of tabular format. It's a pretty unique space to be in. I'd say for most organizations, 80 to 85% of data at some level is in some sort of structured or semi-structured format. So being able to work with that and leverage it is kind of a niche but really cool space to be playing in right now.
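A minimal sketch of the chunked-generation pattern being described, since no single LLM call can return a million rows: a planner loop splits the request into batches the model can handle and validates each one before accepting it. `generate_rows` and `rows_look_valid` are hypothetical placeholders for whatever model call and validator are in use, not Gretel APIs.

```python
# Sketch: build a large table by asking the model for small chunks and
# validating each chunk before appending it.

from typing import Callable

def generate_large_table(
    total_rows: int,
    rows_per_call: int,
    generate_rows: Callable[[int], list[dict]],
    rows_look_valid: Callable[[list[dict]], bool],
    max_retries: int = 3,
) -> list[dict]:
    table: list[dict] = []
    while len(table) < total_rows:
        n = min(rows_per_call, total_rows - len(table))
        for _attempt in range(max_retries):
            chunk = generate_rows(n)          # one small, model-sized request
            if rows_look_valid(chunk):        # only keep chunks that pass checks
                table.extend(chunk)
                break
        else:
            raise RuntimeError("chunk repeatedly failed validation")
    return table
```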

(01:02:48)
To your point earlier, I'm sure it'll be just a matter of time until we've got competition from the OpenAIs of the world, the Anthropics of the world, and things like that. But right now we've got a great set of users we've been building with, and I think this kind of combined approach, advancing LLMs to the point we need them to be at without trying to build the uber-LLM, and then also combining other cool technologies that are happening in our space to solve a problem, is working out pretty well.

Nathan (01:03:16):

Yeah, it kind of strikes me that there's maybe another Pareto curve between these two modalities. I'm trying to find a synthesis for the stochastic parrot reasoning engine debate and in your architecture I'm sort of seeing maybe they can be both. You very much are training the core LLM here that generates the data to be like a stochastic parrot in a sort of highly principled way, but nevertheless, you want that randomness, right? That's kind of a big part of the value driver and then you also need this planner that has to be much more reliable and probably a lot of the models we use today are kind of at some sort of outer part of the curve on the production possibility frontier, but maybe there's a bifurcation that happens there too where you're really pioneering the sort of high integrity stochastic parrot side and then other people are really pushing the reasoning side.

Alex (01:04:21):

That's such an interesting idea, and it kind of mimics something we saw: we went to a conference, there's a major health standards organization called HL7, and they have a format called FHIR. It's the most popular medical data record format in existence today, and they ran a whole conference on synthetic data, and the feedback we heard at that conference was the exact opposite of probably every customer conversation I've ever had, but it was so interesting. In the synthetic data world for healthcare, there are a few projects. There's an open source project from MITRE called Synthea that allows you to generate medical record data that you can use for testing systems and things like that in the healthcare space. It's been under development for four or five years, a purely statistics-based approach, and what they called out was that for many of the use cases they want, particularly around AI or machine learning, the data from Synthea is just too clean, which I had never heard in my entire professional career up until this point. But what they were saying is you do want a little bit of that variability.

(01:05:22)
You want that slight variation, that stochasticity that gets introduced, but you don't want crazy. So it really is about finding that balance, just enough within the scope that you need, and then also thinking about it at scale. You can't evaluate these things one at a time. You need to be able to reason about the 50,000 examples you create for an LLM training set, or a million examples you create to boost an ad recommendation data set, something like that. So you really have to think about it at scale, and that's just, starting with tabular data, where it's so easy to look at it and say this is right or this is wrong, I think has maybe had us thinking about this as a company a little bit earlier than the rest of the industry, which is now saying, wow, we can generate really amazing text or images that we can use to train a machine learning pipeline, but how do we know that all thousand images that I created kind of meet my expectations? So I personally really like this idea of letting an LLM be an LLM, let a machine learning neural network generate whatever it wants to, but then examine the outputs at each step and build some controls so that if it goes too far off the rails, you turned the temperature up too high and something doesn't make sense anymore as real-world data, you can detect it and kind of filter it out.
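A small sketch of the kind of post-generation control being described: let the generator produce whatever it wants, then drop rows whose values fall outside ranges you trust from real-world data. The column names and bounds below are invented examples, not Gretel's actual validators.

```python
# Sketch: filter synthetic rows that drift outside plausible real-world ranges.

import math

REFERENCE_RANGES = {           # illustrative bounds, e.g. learned from real data
    "age": (0, 110),
    "heart_rate_bpm": (30, 220),
    "height_cm": (40, 230),
}

def row_is_plausible(row: dict) -> bool:
    for column, (low, high) in REFERENCE_RANGES.items():
        value = row.get(column)
        if not isinstance(value, (int, float)) or math.isnan(value):
            return False
        if not (low <= value <= high):
            return False
    return True

synthetic_rows = [
    {"age": 42, "heart_rate_bpm": 71, "height_cm": 168},
    {"age": 42, "heart_rate_bpm": 710, "height_cm": 168},  # temperature got too high somewhere
]
kept = [r for r in synthetic_rows if row_is_plausible(r)]
print(f"kept {len(kept)} of {len(synthetic_rows)} rows")
```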

Nathan (01:06:35):

That kind of recalls how in a lot of image tasks there's training on systematic corruptions of the image as well. You want to make your stuff robust, so you add a little noise here, distort this way, change the aspect ratio, and if it can work across all those different things, then you're going to be much better off in a real-world situation. And there's kind of a similar problem, I'm sure, for a lot of medical things, where stuff is anything from illegible to incomplete to contradictory. I just saw a funny story about a person who had the same name and birthdate in the same hospital as another person and spent their whole life trying to be disambiguated, and struggles with it. So yeah, there are just so many kind of crazy things out there, and we don't have too much time left, and I've really enjoyed the digressions in this conversation.

(01:07:29)
I did want to ask a little bit more about how you go about training for privacy protection specifically. We've talked a lot about how you train for representation and super-representation, but I understand there's probably a whole different technique for just making sure that you don't spit out somebody's real email address or whatever. One of the experiments I ran when I fine-tuned might help me understand this a little bit better, and this has been a topic of discussion lately too. I ran an experiment on OpenAI fine-tuning with a bunch of my writing and kind of my resume, my data, whatever, and I was like, I need to do it again with 3.5. This was a little while ago, it was the 002 generation of fine-tuning, which they never launched publicly, but I had an opportunity to test a version of it, and anyway, it kind of moved in my direction.

(01:08:19)
It did not know who I was, though. I was trying to turn up the epochs a bunch, and it still never learned that it was supposed to answer as "I am Nathan Labenz." It was like Nathan-something-else sometimes, and it was kind of vaguely similar to me, but definitely not memorizing those facts. So I'm very confused about memorization in general. Jeremy Howard recently had a thing where it's like, LLMs can memorize from one example. It definitely hasn't been my experience. So maybe for background, what do you observe about this sort of LLM memorization? In my case I was trying to get it to do it, you're trying to prevent it, but what is happening there, and then what's the technique that you're using to really make sure that it's not happening for your product?

Alex (01:09:04):

Where we started was training language models from random weights on a dataset from a customer, right? In that sense, the model learns from the data that it sees, and it has a very high propensity to memorize and replay secrets in the data. There was a great paper that came out, and this is towards the beginning of our company in 2020, it came out of UC Berkeley, called the Secret Sharer paper, Dawn Song's team and several others working on it, and what they were highlighting was, when training a language model on data, how quickly it starts to memorize even rare occurrences in the data and the chance it'll play them back. It's an interesting example you gave, where you're fine-tuning GPT-3.5 on your examples, because I haven't seen written up exactly how their fine-tuning works, whether it's actually updating all the model weights or using an adapter-based approach or something like that.

(01:09:58)
Just adapting a model on top of it. But it gets a little harder to detect when you have this massive pre-trained corpus and you're making very small changes to only a percentage of the model weights across the entire model, but it still happens. One of the things we see customers doing a lot is fine-tuning a model and then just running a series of tests, we call them canaries, essentially trying to get the model to autocomplete a credit card number or things like that. Here's what I've seen work, starting with the removal of PII or personal data. That's the first thing: you can use an LLM to do it, you can use NER to do it, whatever you want. The first step is really removing the data you never want to have show up inside of your model.
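A minimal sketch of the canary-style check described above: prompt the fine-tuned model with the prefix of a secret that appeared in (or was deliberately planted into) the training data, and see whether sampled completions continue it. `complete` is a placeholder for whatever inference call you use, and the canary strings are fabricated examples.

```python
# Sketch: probe a fine-tuned model for memorized secrets using canary prompts.

import re

CANARIES = [
    # (prompt prefix, regex the leaked continuation would match)
    ("My credit card number is 4111 1111 1111 ", r"\d{4}"),
    ("Patient SSN: 123-45-", r"\d{4}"),
]

def model_leaks_canary(complete, prompt: str, secret_pattern: str, samples: int = 20) -> bool:
    """Return True if any sampled completion immediately continues the secret."""
    for _ in range(samples):
        completion = complete(prompt, max_tokens=8, temperature=1.0)
        if re.match(secret_pattern, completion.strip()):
            return True
    return False
```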

(01:10:39)
The second risk, and this is really the risk that trips people up, particularly when you're working with patient medical data or things like that, is that some combination of attributes becomes identifying. It's really easy to imagine a tabular world example, right? You might get rid of a name, but you have a height and you've got a zip code and you've got some sort of disease or something like that, and just that combination of attributes can very quickly become identifying. So none of them identify by themselves, but put a few attributes together and you have a real problem from a privacy perspective. Same thing with text: the styles that people have for writing and things like that, as well as the data that you're training on, that combination of attributes can become identifying. I suspect that with the OpenAI approach, as you train on more and more data, it would become more likely to have things like that, where the combination of attributes, writing styles, anything like that, can become identifying.
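A quick sketch of that combination-of-attributes risk, in the spirit of a k-anonymity check: count how many records share each quasi-identifier combination, and flag combinations shared by fewer than k people. The columns and threshold are assumptions for illustration.

```python
# Sketch: no single column is identifying, but rare combinations single people out.

from collections import Counter

def risky_combinations(records: list[dict], quasi_identifiers: list[str], k: int = 5) -> dict:
    """Return combinations of quasi-identifier values shared by fewer than k records."""
    combos = Counter(tuple(r.get(col) for col in quasi_identifiers) for r in records)
    return {combo: count for combo, count in combos.items() if count < k}

records = [
    {"zip": "94107", "height_cm": 183, "diagnosis": "rare_disease_x"},
    {"zip": "94107", "height_cm": 165, "diagnosis": "flu"},
]
print(risky_combinations(records, ["zip", "height_cm", "diagnosis"]))
```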

(01:11:34)
The answer to that, across both tabular and text, is actually the same type of approach. There's an approach called differential privacy. Everyone's heard of it, no one really seems to know how it works, and I always try to find a simpler and simpler way to describe it. What differential privacy does is insert a quantitative level of noise into your data. And so when you're training a machine learning model, if you're training an LLM in this case and you're training it with differentially private fine-tuning for example, it's inserting noise into the optimizer, into the gradients, along the way. And what that's making sure is that some rare combination of words inside of your data, like, hey, my name's Alex, I'm six feet, I live in whatever, southern California, something like that, doesn't become memorized and replayed by the model. So essentially you can guarantee it at either the training-set example level, right?

(01:12:28)
So per example, or per entity inside of the data, which could be a set of examples about an individual user. You can guarantee that none of the tokens inside of that dataset will be replayed directly by the model. And that's so important when you are training on compliance-controlled data. We've got so many different engagements going with different healthcare organizations that are trying to train on doctors' notes or customer support records or things like that, where you need to make sure the model has not memorized a customer name, or won't replay it, or a combination of attributes there. So things like differential privacy give you a tool where it's no longer "I think the model didn't memorize it and I haven't been able to extract it." You can actually say, with a level of confidence, that given the way we look at an individual training example or record, I can guarantee that the model will not be able to replay that, or will not have memorized that example in a way that would let it replay it.

(01:13:25)
So in the tabular world, this has really opened the doors up for us. We've got a couple of national-level healthcare organizations that have been able to get approvals to share data between hospitals by training not just on de-identified data, so not just removing names from patient medical records, but in this case creating a synthetic version of those patient medical records, where you know that the model did not memorize my combination of zip code and height and gender or things like that that could become identifying. With those kind of actual mathematical guarantees, it becomes possible. So I'm super excited about differentially private fine-tuning, particularly in the LLM space. When you look at small companies that are trying to train those models on their domain-specific data but hit compliance or privacy issues, it gives you a tool where it's not just a best guess or "we think that it's going to be fine." You can actually convince yourself that the model's not going to return something that it shouldn't.
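A minimal, hand-rolled sketch of the DP-SGD idea being described: clip each example's gradient to a fixed norm, sum, add Gaussian noise calibrated to that norm, then average and update. The hyperparameters are illustrative, and a production setup would typically use a library such as Opacus together with a privacy accountant to track the actual (epsilon, delta) guarantee rather than this loop.

```python
# Sketch of one DP-SGD step:
#  1. compute each example's gradient separately and clip it to max_grad_norm,
#  2. sum the clipped gradients and add Gaussian noise with
#     std = noise_multiplier * max_grad_norm,
#  3. average over the batch and take a plain SGD update.

import torch

def dp_sgd_step(model, loss_fn, batch_inputs, batch_targets,
                lr=1e-3, max_grad_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed_grads = [torch.zeros_like(p) for p in params]

    # Per-example gradients, each clipped to max_grad_norm.
    for x, y in zip(batch_inputs, batch_targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip_coef = min(1.0, max_grad_norm / (float(total_norm) + 1e-6))
        for acc, g in zip(summed_grads, grads):
            acc += g * clip_coef

    # Noise calibrated to the clipping norm, then average and update.
    batch_size = len(batch_inputs)
    with torch.no_grad():
        for p, g in zip(params, summed_grads):
            noise = torch.randn_like(g) * (noise_multiplier * max_grad_norm)
            p -= lr * (g + noise) / batch_size
```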

Nathan (01:14:21):

As you do the training, you've taken the gradients and you're working your way through back propagation. You are literally adding a noise factor to the updates to the weights

Alex (01:14:37):

For each subsequent token generation

Nathan (01:14:40):

That basically allows you to say we've essentially blurred the picture in aggregate. There's probably a trade-off there, I guess, where the model converges more slowly, almost by definition, but without learning the stuff you don't want it to.

Alex (01:14:56):

Yeah, that's such an interesting thing, because of research that's been coming out recently. We had a conference we ran on synthetic data, and we had some folks from Google come in and talk about some of the research they're doing, and that's exactly right. When you're introducing a level of noise into the data, it requires more training time to get down to the same level of accuracy. One of the things that's really interesting with this approach, and this increases compute requirements, is that there's a theory that by really increasing the batch size, bigger and bigger and bigger, that you're sending into the model at any given point, so it's going to increase your computational complexity here, you can use differentially private techniques with increasingly larger batch sizes and approach the same level of accuracy as real-world data. So in this sense you're kind of getting privacy without a real hit on the utility of the data; essentially, with more compute budget and more data, it becomes possible to reach the same level of accuracy that you would if you just trained on the data itself, which is pretty exciting.

(01:15:58)
So that's pretty new. From what we've seen using Gretel today, you're going to have a utility hit using differential privacy, especially on small data sets. It makes a lot more sense when you get into a data set where you've got a hundred thousand or more examples; essentially, the level of noise that has to be introduced to blur out somebody from a particular zip code is a lot lower. That's why you've seen differential privacy used at scale: for example, the US Census Bureau uses it, and Google and Apple use it for next-word prediction or emoji prediction when you're typing a text. At that scale, differential privacy really starts to work. But I am personally really excited about this: public LLMs trained on public data, you're fine-tuning on a private data set and you're introducing differential privacy as you do that. Large batch sizes, plus being able to interleave public examples, will help a model converge really quickly. And I think in a lot of cases, we got into the weeds there a little bit, but it is like the key to unlocking AI for regulated industries that are going to have to convince a regulator that there's no way that this patient who is part of this data set is going to have their identity compromised, right? I always love this framing: you want the model to learn about a disease, but not about a patient. This is a really great technique to make sure that you have that separation.
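To put a rough number on the batch-size intuition: the Gaussian noise added each step is calibrated to noise_multiplier × max_grad_norm, so its contribution to the averaged gradient shrinks as the batch grows. The values below are illustrative, and privacy accounting also depends on the sampling rate, so larger batches are not free; this only shows the utility side of the trade-off.

```python
# Illustration: at a fixed noise multiplier and clipping norm, the noise's share
# of the *averaged* gradient falls off as 1 / batch_size, which is why very large
# batches let DP training approach non-private accuracy.

noise_multiplier = 1.0
max_grad_norm = 1.0

for batch_size in (64, 1024, 16384):
    noise_std_on_mean_grad = noise_multiplier * max_grad_norm / batch_size
    print(f"batch {batch_size:>6}: noise std on averaged gradient ~ {noise_std_on_mean_grad:.6f}")
```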

Nathan (01:17:19):

That's cool. I've learned a lot by going down this rabbit hole, so I'm always excited for a journey into the weeds. One last thing I wanted to get your take on: there's obviously a ton of activity going on in synthetic data, and I would kind of flag Anthropic's constitutional AI as an interesting version of this, where they're constantly iterating on this HHH basis to make things more helpful, honest, and harmless. And that seems to work. Claude is really good, so that's great. And then you see, I think you even mentioned earlier, the synthetic textbooks project out of Microsoft, which also seems to be a great proof point for the value of synthetic data. Then you see these kind of weird stories like Self-Consuming Generative Models Go MAD, which I think most people, if they listen to this show, probably at least saw that blurb whenever it came out not too long ago. And there they sort of say, if you do this generation after generation, things get weird. Do you think there's anything inherent about synthetic data that is kind of a long-term problem, or do you think these kind of weirdnesses are just reflections of not having figured out some of the details yet?

Alex (01:18:42):

I'm pretty strongly in the not-having-figured-out-the-details category.

Nathan (01:18:46):

I could have guessed that.

Alex (01:18:47):

Another story that I've heard as well is, well, if GPT-4 and Claude and other LLMs are creating so much content on the internet, is the next cycle of LLMs going to have a regression because it's just operating on data that was already created by previous-generation LLMs? I think that's an interesting question, and we're going to see how that plays out over time, but I would maybe posit that in a lot of cases, LLMs, where we are today, can generate, and often do generate, which is why we do it, a higher quality version of the data than what they were originally fed. I think so many people use this today. We use Grammarly to improve our text. Sometimes we run an email through an LLM and ask it to help us make some improvements, or things like that.

(01:19:41)
So I think the signal in there, and that's kind of what came out of that Textbooks Are All You Need paper, is a very promising thing. I don't think this is fully understood yet, but the idea that synthetic data can be kind of a cleaner, more diverse version of the limited data you might be starting with is a really powerful idea that I think we're going to see more of. So I'm optimistic about these models, and I would say that maybe the MAD example is just an example of an opportunity to configure things or work with them better, and that we aren't moving towards some sort of mode collapse or anything like that with synthetic data feeding synthetic data. As long as the data that we're generating is high quality and ideally improving on the data that you have, then I think we'll be in a good spot. That's going to be playing out, so I'm really curious to see how it works out.

Nathan (01:20:29):

The dynamics of the future of the internet and a changing mix of content being published are certainly going to be another just fascinating society-scale story. So, anything else that you wanted to touch on that we didn't get to?

Alex (01:20:45):

No, I think it's been an awesome conversation. I was just kind of laughing about that last topic: as long as every LLM generation doesn't start with "I'm a helpful AI assistant, how can I help you?" or "Let me explain this for you," the things that we see coming out of LLMs all the time, I think we're moving in the right direction. So I definitely enjoyed the conversation today, and thanks for inviting me on.

Nathan (01:21:06):

Alex Watson, founder and chief product officer at Gretel AI, thank you for being part of the Cognitive Revolution. It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr@turpentine.co, or you can DM me on the social media platform of your choice.