Alex Watson - Synthetic data could change everything

There’s a website called thispersondoesnotexist.com. When you visit it, you’re confronted by a high-resolution, photorealistic AI-generated picture of a human face. As the website’s name suggests, there’s no human being on the face of the earth who looks quite like the person staring back at you on the page.

Each of those generated pictures is a piece of data that captures much of the essence of what it means to look like a human being. And yet it does so without telling you anything whatsoever about any particular person. In that sense, it’s fully anonymous human face data.

That’s impressive enough, and it speaks to how far generative image models have come over the last decade. But what if we could do the same for any kind of data?

What if I could generate an anonymized set of medical records or financial transaction data that captures all of the latent relationships buried in a private dataset, without the risk of leaking sensitive information about real people? That’s the mission of Alex Watson, the Chief Product Officer and co-founder of Gretel AI, where he works on unlocking value hidden in sensitive datasets in ways that preserve privacy.

What I realized talking to Alex was that synthetic data is about much more than ensuring privacy. As you’ll see over the course of the conversation, we may well be heading for a world where most data can benefit from augmentation via data synthesis — where synthetic data brings privacy value almost as a side-effect of enriching ground truth data with context imported from the wider world.

Alex joined me to talk about data privacy, data synthesis, and what could be the very strange future of the data lifecycle on this episode of the TDS podcast. Here are some of my favourite takeaways from the conversation:

  • Data synthesis is the creation of new samples of data. A good synthetic dataset maintains all the relationships between features that existed in the original “ground truth” dataset — and may even be more information-dense (more on that in a minute).
  • One old-school method of generating synthetic samples is interpolation. This strategy involves generating new samples so that they fall “between” real samples in feature space. But in recent months, a new trend has emerged: companies are using large language models (LLMs) to generate new samples. This works by fine-tuning LLMs to generate new rows in a CSV file, or new JSON objects, based on a user-provided prompt.
  • This LLM strategy has a major advantage. Language models that are trained on huge volumes of text end up learning robust representations of the world. That’s because they’re essentially trained to do autocomplete, and doing autocomplete really well means being able to finish sentences like, “The main factor that will define the trajectory of U.S. economic growth in 2022 is…”. Doing that requires a great deal of world knowledge and a robust understanding of logical relationships. When these LLMs are then fine-tuned on a task like generating new samples of healthcare data, they bring their rich world models to bear on that new task, infusing the generated data with knowledge gained from reading billions of words of high-quality text.
  • The net result is that synthetic data can be not only more private, but also more informative and valuable. It combines the information contained in the initial ground truth data with the vast world knowledge of the LLM that’s used to generate it.
  • One challenge that can arise with synthetic data is memorization. LLMs can overfit from time to time, and that can cause them to memorize sensitive information contained in their training data: things like names, credit card numbers and so on. Because this is an overfitting problem, it’s often best addressed with regularization. Techniques like gradient clipping, tuning the learning rate, and adding noise to the stochastic gradient descent process can all help.
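
To make the interpolation idea from the list above concrete, here is a minimal sketch in plain Python. It is an illustration only, not Gretel's method: the function name `interpolate_samples`, the toy columns (age and systolic blood pressure), and the choice to mix two randomly chosen rows are all assumptions made for this example. Each synthetic row is placed at a random point on the straight line between two real rows in feature space.

```python
import random


def interpolate_samples(real_rows, n_new, seed=0):
    """Return n_new synthetic rows, each lying between two real rows.

    Hypothetical helper for illustration: real_rows is a list of
    equal-length lists of numeric features.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(real_rows, 2)  # pick two distinct real rows
        t = rng.random()                 # mixing weight in [0, 1)
        # Move a fraction t of the way from row a towards row b.
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Made-up toy data: [age, systolic blood pressure]
real_rows = [[34.0, 118.0], [51.0, 131.0], [67.0, 142.0]]
new_rows = interpolate_samples(real_rows, n_new=5)
for row in new_rows:
    print([round(v, 1) for v in row])
```

Every generated value stays within the range spanned by the real rows, which is why this approach preserves simple relationships between features but cannot import any outside knowledge the way an LLM-based generator might.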
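
The regularization techniques mentioned in the last bullet, gradient clipping and adding noise to stochastic gradient descent, can be sketched as a single update step. This is a simplified illustration in the spirit of differentially private SGD, which is an assumption on my part rather than something the conversation names. The function `noisy_clipped_step` and its default values for `lr`, `clip_norm` and `noise_std` are hypothetical, and the sketch makes no formal privacy guarantee.

```python
import math
import random


def noisy_clipped_step(weights, per_example_grads, lr=0.1,
                       clip_norm=1.0, noise_std=0.5, seed=0):
    """One SGD step with per-example gradient clipping and Gaussian noise.

    Hypothetical helper: weights is a list of floats, per_example_grads is
    a list of gradients (one list of floats per training example).
    """
    rng = random.Random(seed)
    dim = len(weights)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += grad[i] * scale
    n = len(per_example_grads)
    # Add Gaussian noise to the clipped sum, then average over examples.
    noisy_mean = [(s + rng.gauss(0.0, noise_std * clip_norm)) / n
                  for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


# One example has an unusually large gradient, e.g. a rare, memorable record.
grads = [[3.0, 4.0], [0.3, 0.4]]
print(noisy_clipped_step([0.0, 0.0], grads))
```

Clipping limits how far any single training example can pull the weights, and the noise blurs whatever influence remains, which is the intuition for why these techniques can reduce memorization of individual records.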

You can follow Alex on Twitter here, or me here. If you found the conversation interesting, you can also find some links below that you might like to check out.

Chapters:

  • 0:00 Intro
  • 2:40 What is synthetic data?
  • 6:45 Large language models
  • 11:30 Preventing data leakage
  • 18:00 Generative versus downstream models
  • 24:10 De-biasing and fairness
  • 30:45 Using synthetic data
  • 35:00 People consuming the data
  • 41:00 Spotting correlations in the data
  • 47:45 Generalization of different ML algorithms
  • 51:15 Wrap-up