Synthesize 2023: Opening Notes
Opening remarks of the Synthesize 2023 Conference
Ali Golshan (00:02):
Hi everyone. I'm Ali Golshan, I'm one of the co-founders at Gretel.ai and the CEO. We want to thank you for joining us for our first ever developer conference for synthetic data. We're excited to bring a great lineup of speakers, talks, very deep dives into what we are seeing to be really moving and innovative work that we are doing on the generative AI, large language model, and foundation model side, and really root a lot of this in what synthetic data can mean for all of us. That really is the core of what we want to talk about today, synthetic data, and why at Gretel, we are betting really the future of generative AI on what we believe to be the unlocking of its potential through synthetic data.
Maybe at the beginning, we should talk a little bit about what synthetic data is, even though we're going to have a lot of speakers who are going to talk to it in a much better way than I am, including my co-founder Alex, who's going to come on after the keynote. Synthetic data is ultimately the rooted-in-privacy approach to generative AI, and that's the reason why we're so excited about it. We believe synthetic data has the ability to unlock and enable workflows, functions, and types of experience that didn't exist before because it is rooted in privacy and safety. The data is cleaner, balanced, and easy to use. It doesn't create the same risks when it's actually shared, collaborated on, or potentially really used as democratizing work for all data. What we're really excited about is that we think synthetic data can really break those traditional barriers, those walled gardens, those competitive advantages of companies or organizations that would hoard data, keep data close to themselves, and we think synthetic data is one that can actually proliferate that and create a level playing field for everyone going forward.
Now, there are three really core pillars when it comes to synthetic data that we are excited about. The first one we want to talk about is a great learning lesson we've had from a lot of our customers over the last two years. The analogy I want to start with is if you think about autonomous cars, the real value and the challenge is in the last 10 to 15%. That last 10 to 15% is really that 90% of the value and the work that needs to be done. We see the same thing with generative AI. Generative AI, and what we have seen in the industry currently, especially with large language models, has enormous capabilities and potential. The problem is to go that last mile, that last 20% where the experience is truly customized, where businesses can truly deduce models that have insights about their business, not just the general world, that requires training on sensitive data, personal data, data that should not be joined together or exposed or even shared beyond a certain level because of its privacy implications or just the difficulty to actually train on it.
This is where we think synthetic data is enormously valuable. Synthetic data can help generative AI go that last mile to be able to unlock data that otherwise would not be exposed to these models or for companies to be able to leverage more of the public infrastructure, the tooling, open source infrastructure and models or even the solutions that are available outside their organizations without exposing their users, their privacy or their safety. We think synthetic data can really help take data that is locked, learn statistical distributions, build models from it, and then be used to train that last mile.
Wouldn't it be great if your company XYZ had an XYZ ChatGPT, customized to your users, their experiences, their historic context, really understanding what their use cases and their questions are, specific to your industry but, more importantly, to your company? We think this is really one of the core values of synthetic data, which is to be able to move generative AI beyond training on public data and public domain data to that last mile where there's true insight and value locked up. We're really excited about that and we'll talk a little bit more on some of our panels about the work we're doing there with a few customers.
The second part of it is we don't think that, as part of generative AI, only companies that have massive amounts of resources, from economic to technological to human, should be the ones able to take advantage of this. We see synthetic data as a way for every company with valuable data to take part in the economics of the generative AI ecosystem. Now, what do we mean by that? You might be a company where you are not interested in building a team of hundreds and hundreds of AI engineers or building massive infrastructure or working with large partners to build that infrastructure so you can train on your data, but what might be valuable is if your organization has historic data, enormously valuable data.
What we are seeing is there are companies in manufacturing, transportation, logistics, and health that have enormously valuable data. Their data is actually key to being able to solve some of the most challenging problems we are seeing in these various verticals. However, their data cannot just be shared openly with others. This is an area where we are actually seeing customers work with us to be able to take part in the economics. What that means is training synthetic models on top of their data and then exposing that trained model to the rest of the ecosystem or publicly, and being able to actually charge for the data that is produced from those models, so other foundation or large language models can actually be trained on that particular data set.
There's really the value of the data that is generated from models, but actually generating and building models themselves can be enormously valuable. That's a tangible way to put value on data that organizations have potentially been putting together for years, but now are looking to be able to leverage for the generative AI space. While it may not be, for example, direct companies in that particular vertical, there are an enormous amount of companies from startup to large companies that are looking for more unique data to use for their training purposes. This is where we think it's key, is in the future, it's not really going to be the model or the infrastructure or the people you have, but the differentiator and the unique value and the edge is going to be the insight, the uniqueness of the data you have. If companies with that data can actually take part in this economics in a safe way, that is really a way for the entire ecosystem to benefit.
Then, finally, the analogy I would also like to draw here is we believe synthetic data, relative to raw data, can do for AI what GPUs did for AI that CPUs were not able to do. Now, what that means is synthetic data has a lot of unique capabilities. One is being able to learn on underrepresented demographics or variables or any type of underlying indicator and boost that indicator. We've written a lot about the work that we've done with Illumina and other health companies on boosting underrepresented data sets so that detection, whether during drug trials or eventual diagnosis, can improve for demographics that are traditionally underrepresented in health.
This underlying technology is something we call autocomplete or seeding for data. This is the power of generative models, which means you can learn on a portion of data and statistically boost that infinitely so it creates an even distribution across all your demographics. The reason this is very important is the same model can be applied so you can create a variety of versions of your data. If your data has a very static view and it can change over time, it's not incorporating black swan events, it doesn't have these real world events that might happen very rarely.
But with synthetic data, you can essentially simulate different scenarios. You can actually seed these conditions in the data and say, what if X or Y or Z were 100X or 10X, or this demographic was suddenly introduced into this particular decision? The reason that becomes very important is now you can do the same thing for model training that GPUs can do, which is synthetic data can parallel train models with a variety of different conditions so you can see in near real time what downstream predictions and recommendations look like, what the accuracy of that is, what the privacy of that is, and, ultimately, which model is really the best approach for being able to determine the user experience and optimizing as much as possible.
Then, finally, to wrap all this together, we fundamentally believe that synthetic data really has to mirror real world data, and that means a few different things. One, it means it has to be multimodal. This is actually a really exciting area for us. One of our own researchers, Andrew, is going to talk about our first announcement and preview of our image synthetics, but on top of that, one of the things we want to talk about is when we started the company and we actually launched our first set of models, we only had two models.
Eight months later, we're actually at 10-plus models that we make available, from time series to structured to unstructured to free text to image, and we have a very aggressive roadmap that my co-founder, Alex, is going to talk a little bit more about after our keynote today. We believe a single platform for multimodal data that can take prompts in any type of mode and generate a different modality is going to be very important to include a larger breadth of researchers or potentially folks who just do not want to ramp up to learn different tools, so a one-stop shop for developers, researchers, AI engineers, and overall software developers to be able to use a single platform.
That's another thing we're really excited about. But ultimately, this day is really about talking about what synthetic data, with privacy at its roots, can do for generative AI and what that really means for the larger ecosystem of training foundation models and large language models, and, as we progress from text to image to audio to video to simulation, what are all the possibilities that we're going to end up seeing here? Some of the really exciting work that we're going to hear about later on today is from Danny and Jonathan from Unity and NVIDIA, who talk about simulation and what it means to be able to synthesize data and create a variety of different simulations before you introduce a particular physical object with training models into the real world. As an example, being able to generate millions of miles of driving in a simulated environment before you actually put something on the road. These are some of the really exciting things that synthetic data can bring. Ultimately, we believe that, in the future, synthetic data is actually the right replacement for raw data.
Then, the last part, which is a really exciting discovery that we've made and that we've talked about previously at a few more conferences as well as in blogs and presentations, is that synthetic data done the right way can actually be as good as, if not better than, raw data. This misperception that synthetic data is potentially just junk data or fake data, that it's not as good as your real data and should only be used in pre-production or QA or lower-level environments, is no longer the case. This is actually one of the real great things out of this generative AI space and large language models. This is why we've decided to merge these two things together, which is the concept of synthetic data has existed for a long time, but the actual proof of synthetic data being better than real data and being able to map that to accurate measurements downstream for predictions and recommendations has never been done before. We are really excited to show you what we can do with that and how you can do that.
The good news is we believe in open learning. A lot of our work is in open source, in free product tiers on our SaaS platform, which you can access for free from our website, and in a lot of the blogs that we've talked about, including some of the most challenging problems we've solved, for example, free text prompts into images and vice versa, and ultimately relational data, which is an enormously complex problem to solve for enterprises.
This is all to say we are really excited to have you here. We are really excited to share some of the stories from our customers, from our users, and from some of our partners in the ecosystem as to what generative AI and synthetic data can do together. We look forward to having you here for the whole day. If you have any questions, please jump on Twitter or on our Discord channel and ask us questions. The speakers and the presenters are going to be available. They'll be there to answer your questions. But going forward, we're always available for you at Gretel.ai. Please reach out to us if you have any questions or, better yet, jump on, use our product, and provide us some feedback. Look forward to talking to you all very soon and I'll see you at the end of the day's conference. Thank you.