Gretel + Illumina - Using AI to create safe, synthetic datasets for genomics

Promising evidence that state-of-the-art synthetic data models can produce artificial versions of even high-dimensional, complex genomic and phenotypic data.
Copyright (c) 2022

The biggest challenge, and opportunity, for the life sciences community is learning how to share patient health data safely while protecting patient privacy.

In particular, genomic data (literally, the instructions to build and grow an organism) is among the most complex kinds of data that exist today. The ability to share genomic data safely will undoubtedly fuel medical breakthroughs and encourage better medical care.

The complexity and enormous potential of genomic data are exactly why the researchers here at Gretel are thrilled to work with Illumina’s Emerging Solutions to explore whether it is possible to create synthetic versions of real-world genomic datasets.

The synthetic datasets created by Gretel are based on real-world data and offer enhanced privacy guarantees, which can enable life sciences researchers to collaborate better and test ideas quickly through open access to data, without compromising patient privacy.

In our joint case study, we use state-of-the-art generative neural networks to create artificial versions of the highly complex genomic sequences used by life sciences researchers. Synthetic data created by Gretel has the same size and shape as the dataset it was trained on, so researchers can explore it with their preferred data science tools and run queries and statistics on the overall dataset. However, because the data is artificial, no record is based on any single genome.
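To make the "same size and shape" idea concrete, here is a minimal sketch of the kind of sanity check a researcher might run on a real table and its synthetic counterpart. It is not the code from the case study: the column names (`weight_g`, `glucose`), the helper functions, and the toy data are all hypothetical, and the "synthetic" table is just a second random draw standing in for the output of a generative model.

```python
import random
import statistics


def shape(rows):
    """Return (n_rows, n_columns) for a list of dict records."""
    return (len(rows), len(rows[0]) if rows else 0)


def column_summary(rows, column):
    """Mean and sample standard deviation of one numeric column."""
    values = [row[column] for row in rows]
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}


def compare(real, synthetic):
    """Absolute gap in mean and stdev for every column of two same-shape tables."""
    if shape(real) != shape(synthetic):
        raise ValueError("tables differ in shape")
    gaps = {}
    for col in real[0]:
        r = column_summary(real, col)
        s = column_summary(synthetic, col)
        gaps[col] = {key: abs(r[key] - s[key]) for key in r}
    return gaps


# Toy stand-in data with hypothetical phenotype columns (NOT case-study data).
rng = random.Random(0)
real = [
    {"weight_g": rng.gauss(25.0, 3.0), "glucose": rng.gauss(9.0, 1.5)}
    for _ in range(200)
]
# Placeholder for model output: independent draws from the same distributions.
synthetic = [
    {"weight_g": rng.gauss(25.0, 3.0), "glucose": rng.gauss(9.0, 1.5)}
    for _ in range(200)
]

gaps = compare(real, synthetic)
print(shape(real), shape(synthetic))
for col, gap in gaps.items():
    print(col, gap)
```

Matching shapes and close summary statistics are only a first check on utility. They say nothing by themselves about privacy, which has to be evaluated separately.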

Our research offers encouraging evidence that state-of-the-art synthetic data models can produce artificial versions of even high-dimensional, complex genomic and phenotypic data.

While the initial case study results are based on a relatively small sample set (1,220 mice), we’re confident that, with continued experiments in scale, accuracy, and privacy, synthetic data has the potential to enable sharing and collaboration on synthetic genomics datasets at a scale orders of magnitude larger than what is possible today. If you’re interested in exploring further, we’ve shared the code to synthesize genomic data and recreate all our experiments on GitHub.

You can read the entire case study here.

What’s next

We are working together to enable future genomics research and safe, private data sharing between life sciences organizations and hospitals. In our next posts, we will explore the scale and privacy guarantees that can be achieved working with synthetic data on genomic datasets. To get in on the discussions, please join our Slack community.

If you have any questions or would like to discuss use cases around synthetic data, we would love to talk to you. Feel free to reach out to us at