Synthesize 2023: The Relationship Between Generative and Physics-based Synthetic Data
Learn about the relationship of two fundamental techniques for generating synthetic data
Nathan Kundtz (00:03):
Hello and welcome. Thanks for taking some time today. My name's Nathan Kundtz, I'm the CEO of Rendered AI. We provide a platform for physics-based synthetic data generation, typically focused on imagery, and some examples of the work that we do are below. But today I really want to talk about different ways that people are starting to produce synthetic data, with a focus on what is often called generative synthetic data versus physics-based synthetic data, and in particular, how those two are starting to harmonize in the industry.
So as we get through this presentation, there's a few things I'm hoping that I can cover in a relatively short time here. One is, what do we mean by physics-based synthetic data? How is that different from generative, and should we care about physics-based synthetic data? With everything happening in Stable Diffusion and ChatGPT, does physics-based data generation have a role? And maybe do we really pick one or the other? I've actually embedded the answers here on the slide. And then finally, hopefully, what about this is practical? What can we actually do with this and how is it really starting to show up in industry usage?
I would like to start with just a few terms here. We're here at a conference on synthetic data, but I've actually found in talking to people, there's not a collective, specific definition for that. So let me start with, what do we mean when we say synthetic data? When I say synthetic data, I really mean engineered data. So not just a data set, but a data set that's been engineered for use in artificial intelligence and that was generated by a computer, as opposed to collected from the real world. And when we talk about synthetic data, we're usually thinking about the quality of that data along three different lines: the scale of the data that we can generate, its accuracy, meaning how well it represents the real world, and the diversity of that data, which is particularly important when we get into wanting to train algorithms. We need data that is both representative of reality and captures the diversity of reality.
So when we produce that data using generative methods, we're really leaning on our previous experiences, so we're typically using artificial intelligence algorithms to generate more data that is similar to things that we've seen in the past. Now that can include new things like prompts for text or image-based generation, but fundamentally, when you think about where the information comes from that's being used by the computer to produce new imagery, it's coming from its previous experiences, from the images that were provided for training. And that's different, almost philosophically different, from what we mean when we talk about physics-based, or what we might call rules-based, synthetic data. That's where the information's actually coming, not from our past experiences directly, but from calculating what you might call the laws of physics.
So there's lots of different types of physics, optical, radar, x-ray, et cetera, and all of those result in different rules. And in many cases, in most cases, we have really wonderful simulation tools that can help us to estimate what we would expect to see in an image produced with that set of rules. And so, the difference in how we get information for generative versus physics-based synthetic data really hearkens back to a fundamental difference in what we might even call science. One of the famous quotes around physics from the early 20th century is from a gentleman named Ernest Rutherford, who refers to all science as either physics or stamp collecting. And really what he meant is, look, if you're not trying to understand the fundamental rules, then all you're doing is collecting a bunch of things that you might find out there in nature.
And at the time, this was a pejorative statement. He was looking at biologists, and they were just picking out, oh, that's this type of bird, that type of bird, and just listing them, just labeling them. And I think it's really interesting. Now that we're in the 21st century and we're starting to see what's happening with AI, and I think we're just at the beginning of that, I feel like the question has turned: fine, all science is either physics or stamp collecting, but what if you had a hundred billion stamps? What if you collected stamps at the scale of the entire planet? What would that mean? Here, I'm attributing that to Sam Altman. To be clear, I don't know him or know whether or not he's ever said this, but I'm trying to make a point. These are really two different ways of building knowledge.
One coming from rules, and where we really try to understand the underlying rules of nature, and another coming simply from our experiences very directly, and labeling those experiences. So now that we're starting to see all sorts of data being generated really from our experiences and be able to create awesome new experiences based on those, do we need the rules of physics anymore? Is that important?
I think, I mean, at first blush, and I've had people really ask this, it's unclear. Maybe we have enough data, maybe we have so much data about the world and the way that it works that we're done, or at least, all that we need to do is collect more, and we don't need to have this foundational understanding of the rules, the laws of nature. And it turns out, for very practical reasons, that we're not done, and we see this in our own business, and I think you guys may see this in yours as well. There are some very practical reasons why we still need, and likely always will need, to be able to introduce physics-based synthetic data as well.
The most obvious of which is, hey, a lot of the things that we want to incorporate into artificial intelligence, we simply don't see very often. We see failures all the time in AI from rare events and what are called edge cases. These are situations where you don't really have enough examples to provide to a generative algorithm for it to start creating more like that.
A great example might be a deeply endangered species. You may not have good examples of that, in particular if you start to talk about the combination of that and then imaging it in ways other than cell-phone-camera-style images. The next reason, hearkening back to that question of how it is that we're producing this imagery: we often naturally jump into an assumption that we're talking about RGB cameras that are really good for humans to view. And it's actually pretty easy for us to generate data sets based on that phenomenology, because we can send an image to somebody and they draw a box around what's in it and say, "Hey, that's a cat, that's a dog." But as you start to move towards other types of sensors, and sensors that are very important industrially, that becomes a lot harder.
So as you start to look at, say, thermal infrared imagery or radar imagery, you can look at it for a long time and really not be able to tell what's in there, and so data labeling becomes incredibly difficult. And then finally, and I think maybe most profoundly, the assumption that we will always build AI on data that exists creates a foundational systems engineering problem for anything new that we want to build. Many of our own customers are building entirely new platforms, and by definition, the data that those platforms will produce doesn't exist yet. So they need to be able to generate large amounts of data, because their business models are predicated upon the use of artificial intelligence along with those platforms. And so, if they're unable to train and understand the performance of those algorithms before actually building and deploying the systems, well, they have quite a problem.
And specifically, we deal with a lot of satellite constellations, for instance, that would otherwise require hundreds of millions of dollars of investment and many years of development before deploying, before you could ever get your first image. Obviously, that's not an effective way to do engineering. So we might say, "Hey, physics is going to be really important, so maybe physics-based synthetic data is better, maybe we should just focus on physics-based synthetic data." But that's not really right either. In fact, what we have found works best is when you start to combine these things, and the simplest and maybe easiest example of that is what's called domain adaptation.
So this is where we can use a physics-based simulation to approximate what an image might look like, but then we actually use AI to try to map that to a particular sensor's characteristics. And if you look at, I'll try not to use too much jargon here, but if you look at what's called the latent space of these types of images, that's a good way to assess whether or not, to a computer, you're getting good alignment between the real world and your simulations.
The use of these GANs gives you much, much better overlap, and in turn gives you much, much better training and higher quality synthetic data. And one of the tools we use on that, for instance, is called UMAP analysis. I'm showing an example here on the right where we've simulated imagery from a satellite, but then mapped that to the output of a specific sensor, which tended to pick up more of this reddish color and have other artifacts in the imagery. Those get picked up by the GANs, and then we end up with this improved overlap in UMAP clustering. This is not where the world is stopping. In fact, what we're increasingly seeing, and what I wanted to drive to in this talk, is that more and more we are seeing not just the use of physics-based and then generative algorithms, but that we're going to move in between these two worlds constantly.
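To make that overlap check concrete, here is a minimal sketch of the idea. This is illustrative only and not the Rendered AI pipeline: random vectors stand in for image features (a real pipeline would embed images with a pretrained CNN), scikit-learn's PCA stands in for UMAP (same principle: project real and synthetic features into a shared low-dimensional space and compare clusters), and a simple mean shift stands in for the GAN-based domain adaptation.

```python
# Sketch: assess real-vs-synthetic overlap in a shared latent space.
# Assumptions (not from the talk): random vectors stand in for CNN
# features, PCA stands in for UMAP, and a mean shift stands in for
# GAN-based domain adaptation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in feature vectors (e.g., penultimate-layer CNN activations).
real = rng.normal(loc=0.0, scale=1.0, size=(200, 64))
synthetic = rng.normal(loc=3.0, scale=1.0, size=(200, 64))  # biased sim

def centroid_gap(a, b):
    """Distance between the two clusters' centroids in a shared 2-D projection."""
    proj = PCA(n_components=2).fit_transform(np.vstack([a, b]))
    pa, pb = proj[: len(a)], proj[len(a):]
    return float(np.linalg.norm(pa.mean(axis=0) - pb.mean(axis=0)))

# Toy "domain adaptation": shift synthetic features toward the real
# distribution (a GAN would instead learn this mapping from data).
adapted = synthetic - synthetic.mean(axis=0) + real.mean(axis=0)

# After adaptation, the clusters should overlap far more closely.
print(centroid_gap(real, synthetic) > centroid_gap(real, adapted))  # → True
```

In practice the visual version of this, a UMAP scatter plot of real versus simulated embeddings before and after adaptation, is what makes the improved overlap obvious at a glance.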
We're going to be using the best tools for the best purposes as we generate synthetic data, and this just gives a flavor of what that can look like for imagery. At this point, it's very easy to have ChatGPT produce thousands of prompts. If you ask ChatGPT to give you 10,000 five-word prompts that describe cars, you can get those, and you can pipe that directly into Stable Diffusion or your favorite image generator, and then use the generated images to actually create 3D models, effectively getting text-to-3D-model generation with tremendous diversity. Well, that's a powerful tool, because now those models can be integrated into environments and we can actually simulate what they would look like using physics-based rules in order to assess, "Hey, what would that look like if you saw it through radar, or what would that look like if it was actually a LIDAR system that we were sensing with?"
And so, you jump from this generative environment, then maybe use a rules-based approach to place those objects, use generative again to apply different textures to them, go back to rules-based to sense them, then finally, generative again for domain adaptation. And I think that a lot of what we're seeing, certainly at Rendered AI, and a lot of what I think the industry broadly is going to be seeing in coming years, is that we will pull from the best tool for the best purpose as we build out these data generation pipelines. And so, this is just one example, but there's a lot of ways to put these together and get interesting outputs. And so, I think what we'll find is that the future is really much more turbulent. We'll have both physics-based and generative-based tools being used side by side, each doing what they do best.
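That alternation can be sketched as a simple pipeline. Every function below is a hypothetical stub standing in for an entire real tool (an LLM for prompts, a diffusion model plus reconstruction for 3D assets, a physics renderer, a GAN for adaptation); the names are illustrative, not real APIs.

```python
# Sketch of a hybrid pipeline alternating between generative and
# physics-based (rules-based) stages. Every stage is a stub standing
# in for a real tool; names are illustrative, not real APIs.

def generate_prompts(n):                 # generative (LLM)
    return [f"a weathered car, variant {i}" for i in range(n)]

def prompt_to_3d_model(prompt):          # generative (diffusion -> 3D)
    return {"prompt": prompt, "mesh": "stub.obj"}

def place_in_scene(models):              # rules-based (scene layout)
    return {"objects": models, "layout": "grid"}

def apply_textures(scene):               # generative (texture synthesis)
    scene["textured"] = True
    return scene

def simulate_sensor(scene, modality):    # rules-based (physics renderer)
    return {"modality": modality,
            "image": f"render of {len(scene['objects'])} objects"}

def domain_adapt(capture):               # generative (GAN-style adaptation)
    capture["adapted"] = True
    return capture

# The alternation described above:
# generative -> rules -> generative -> rules -> generative.
models = [prompt_to_3d_model(p) for p in generate_prompts(3)]
scene = apply_textures(place_in_scene(models))
sample = domain_adapt(simulate_sensor(scene, modality="radar"))
print(sample["modality"], sample["adapted"])  # → radar True
```

The point of the sketch is the shape of the dataflow, not any one stage: each hand-off is a place where you can swap in whichever generative or physics-based tool is best for that step.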
I hope I've given you a couple of examples of exactly how that can be done, some practical things, maybe some thoughts that you'll actually use in your own workflow generation. And I do want to emphasize and leave off here with this: bringing these things together is what's giving us the ability to engineer the data that's used for AI training. And engineering the data, as you probably know, is really the core of building out AI algorithms. So what's possible now by combining these generative and physics-based techniques is really getting a handle on tremendous diversity, tremendous scale, and tremendous accuracy of synthetic data, and that's ultimately our best option for engineering AI. Hey, I appreciate you spending some time with me today. I'll look forward to hearing any questions and chatting with you after this. Take care.