Synthesize 2023: Genomics Innovation in the Age of Generative AI
Video description
How synthesizing complex genomic data is opening up new opportunities in life science research and development
Transcription
Vinayak Kulkarni (00:03):
Hello everyone. My name is Vinayak, and I'm an associate principal bioinformatics engineer at Illumina. In today's talk, I'm going to be talking about Genomics Innovation in the Age of Generative AI.
(00:18)
So my team at Illumina is called the Emerging Solutions Team. What we do is look at technologies about three to five years out from today and try to figure out how they could change the landscape of what we are working on now. We prototype new technologies, and one of the technologies everybody's excited about in the ChatGPT era is generative AI. So that's what I'll be talking about with you for the next half hour.
(00:46)
So our mission at Illumina is to improve human health by unlocking the power of the genome. We think the next decade is going to be the genome era, and by that what we mean is most of us will have our genome sequenced and we'll know what to do with that data, how to follow up based on that data.
(01:08)
A little bit of history about Illumina, to set the stage. We are the global leader in DNA sequencing and array-based technologies. Illumina is a life sciences technology company, founded in 1998 and headquartered in San Diego, with about 9,000 employees. As for who we serve, here's a quick overview of the markets Illumina is engaged in: we go right from oncology to basic research in molecular and cell biology. Do you have cancer? What kind of cancer do you have? What kind of treatment can you use for that cancer? Those are questions in oncology. In reproductive health, we have assays and technologies for non-invasive prenatal testing, which is the test you would do between 8 and 16 weeks of pregnancy to figure out whether the child you're going to have will have certain genetic abnormalities.
(02:14)
We also have a big initiative internally at the company for rare diseases: not only the genetic diseases that are commonly found in the population, which have solutions today, but also rare genetic disorders. We provide the tools and infrastructure to sequence those genomes, measure what's there, write a report, and understand the genetic basis of the disease.
(02:44)
Our customers range from big universities and academic research centers doing basic science research on these disorders, to pharmaceutical companies, and finally to government agencies. The famous example being: when the Covid outbreak hit in 2020, Illumina was at the forefront, providing the technology to sequence the Covid genome, to give you a test to figure out whether you had Covid or not, and sometimes even how your treatment could be followed up.
(03:18)
So we have a global footprint across a wide range of customers and almost all areas of disease in the life sciences market. The main products Illumina has, as I said, are the instruments. We have instruments where you put in your sample and they sequence your DNA. Along with DNA, we also do other things like RNA, and we also do proteins. And we have a broad spectrum of instruments, fit for purpose. At one end we have the MiSeq, which is low-throughput, for smaller applications, smaller genomes, smaller numbers of patients. That goes right up to the NovaSeq 6000 and the NovaSeq X, which we launched just in October last year. These machines, the X especially, are built for population-scale studies: they can do tens or hundreds of genomes in a week, and in population-scale initiatives by large pharma, by companies, or even by countries, you'll have tens of these working in parallel day and night to produce genome sequences. So we span the spectrum from the lowest throughput to the highest and anywhere in between, depending on your application.
(04:53)
If there's one thing you need to know about the genome era, it's that it's already here. With Illumina's reach and the amount of data that's generated... This is very important because applications like AI need a lot of data. If we just look at Illumina's global footprint, the installed base of more than 20,000 instruments across the globe, in the last year alone around 280 petabytes of data were generated on Illumina instruments. That's a lot of data, and it's growing every year. The data that comes out of one genome is roughly 50 to 70 gigabytes in compressed form. Once you have that raw data, you go and do your analysis on it depending on the application and the disease area. So a lot of data has already been produced on Illumina instruments, and our hope as a team is that this data will be tapped into to build data applications. I'll be going through some of these in the talk today.
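To put those two numbers side by side, here's a quick back-of-envelope calculation (an illustrative sketch using only the figures quoted above, not an official statistic):

```python
# Rough arithmetic: how many whole-genome-equivalents is ~280 PB per year,
# at 50-70 GB of compressed output per genome? Figures are the ones quoted
# in the talk; units are decimal (1 PB ~ 1,000,000 GB).
petabytes_per_year = 280
total_gb = petabytes_per_year * 1_000_000

for gb_per_genome in (70, 50):
    genomes = total_gb / gb_per_genome
    print(f"at {gb_per_genome} GB/genome: ~{genomes / 1e6:.1f} million genome-equivalents/year")
```

That works out to something on the order of four to five and a half million genome-equivalents of output per year.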
(06:14)
About genomics: one of the major drivers of the company is driving down the cost of sequencing. If you remember, in 2001, when Bill Clinton had famously announced that the Human Genome Project was complete, it was one individual's genome for about 100 million US dollars. That certainly wasn't feasible for the population-wide scale I'm talking about right now. But by 2022, through innovations in sequencing, in the instruments, in reagents, and in the technology and algorithms, Illumina had brought the cost of sequencing down to about $600 a genome. In fact, we announced last year that our latest instrument will bring that down to roughly $200 a genome.
(07:01)
Now that is certainly a price point we think will unlock a whole new set of applications. At $200, think of it as roughly a fifth of the price of an iPhone. That's something where individuals can get their own data and run these algorithms, or healthcare institutions can do better triaging of patients based on that data. So we think this data will soon be part of your record in a hospital, for example.
(07:37)
So one of the things my team tries to think about is: okay, if you've driven down the cost of sequencing and performance is no longer a bottleneck, what are the application areas that are ripe for disruption? We call this cradle to grave, meaning you can use genomics, sequencing, and the data you get from it to do assessment from someone's birth to their death.
(08:08)
Newborn assessment was recently in the news; Rady Children's in San Diego, for example, has championed that effort, and we know it's now routinely used for quickly diagnosing illnesses as soon as a baby is born. You have various methodologies for disease screening, you have therapy selection in cancer, and once you get a cancer treatment, you have effectiveness monitoring: you can take someone's blood sample from time to time and see whether the treatment is working. You can also do a better job of finding out if there's a recurrence of a cancer, whether the same cancer has come back or a different marker of cancer has appeared. And then, again, you go back to therapy selection. So the application area that genomics unlocks at a $200 genome is, in our opinion, practically limitless.
(09:05)
If you do have a genome that's that cheap and that delivers that much value, and if everybody has one, the critical question is what it will lead to in the future. We think that, at that price point, getting the genome sequenced is a no-brainer. But where the real insights will come from is being able to collect many genomes, do population-wide analysis of them, and figure out how a certain marker affects a population, or how certain markers affect disease biology. And that's where we stumble into the problem that is of most interest to our team: the data. Why is this data not being used, and what are the challenges of sharing it? Because we believe this data, if shared responsibly, can lead to a lot of science and a lot of therapies.
(10:10)
So there are three big challenges we see in this area. Genomic data is highly protected, as it rightfully should be. There are a lot of regulations, and every country has its own: the US has HIPAA, in Europe we have GDPR, but every country has regulations on this data, which is very sensitive. To give you a key finding: you need only about 96 variants in anybody's genome to uniquely characterize that individual. The whole genome has about 3 billion base pairs, 3 billion locations to look at, yet if I know the values at those 96 specific loci, I can say exactly, "Oh, that's you in the dataset" or, "You're not in the dataset." So this data has to be very well protected.
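To give a feel for why so few positions suffice, here's a back-of-envelope sketch (my illustrative arithmetic, not a figure from the talk): each independent bi-allelic variant carries roughly one to one and a half bits of identifying information, and only about 33 bits are needed to single out one person among eight billion. Correlation between nearby variants and the need for high confidence push the practical number higher, into the range quoted above.

```python
import math

# Back-of-envelope: how many independent bi-allelic variants are needed to
# single out one individual among ~8 billion people? (Illustrative sketch;
# linkage between variants and error tolerance raise the practical number.)
world_population = 8e9
bits_needed = math.log2(world_population)  # ~33 bits

def bits_per_variant(p):
    """Shannon entropy of a genotype at a bi-allelic locus with allele frequency p."""
    genotype_freqs = [p * p, 2 * p * (1 - p), (1 - p) ** 2]  # AA, Aa, aa under HWE
    return -sum(f * math.log2(f) for f in genotype_freqs if f > 0)

for p in (0.5, 0.2, 0.05):
    n = math.ceil(bits_needed / bits_per_variant(p))
    print(f"allele frequency {p}: {bits_per_variant(p):.2f} bits/variant -> ~{n} variants")
```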
(10:56)
There are unique challenges in sharing genomic data, as I just mentioned. There is inherent identifiability in the data. There's also a growing lack of trust around data in general, with all the security breaches that have been happening recently. So the concern my team is trying to address is: here is data that is very rich, and there are so many things that can come out of it, but the isolation and separation of these datasets makes it very hard to do any good science on it. How do we cross that barrier?
(11:30)
And to that end, one of the ideas my team is trying to implement is a concept called the data exchange. What a data exchange basically means is a network of hospitals that are already connected. Illumina does the sequencing for these hospitals, and the data stays in their systems; all Illumina does is connect these different datasets together. So if a researcher has an idea or a hypothesis, they can use this data exchange network to power their analysis.
(12:02)
I want to make a very important distinction here, which is Illumina never looks at any of your data. We will never look at any of your data. We don't own any of your data, we don't even intend to own any of your data. We are just connecting peers that might be interested in doing research on your data, giving them the right framework and the right consent structure and analytics to be able to do that work.
(12:28)
But the key requirement, the key problem with such a technology, if we were to build such a data network, or call it a data marketplace, on top of our technologies, is that it has to be privacy preserving. That's one of its most important characteristics, because you cannot expect, in today's era, data from different sources to be pooled into one data lake and then have different hospitals operate on that data. That doesn't happen, which is why these collaborations take so long.
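As a rough illustration of what "privacy preserving" can mean in such a network, here's a minimal sketch of a federated query pattern, where each hospital answers with aggregate counts and raw records never leave the site. The node interface and field names are hypothetical, not Illumina's actual design:

```python
# Hypothetical sketch: instead of pooling raw records in a central data lake,
# each hospital node answers a query with aggregate statistics only, and the
# network combines them. Names and interfaces here are illustrative.
from dataclasses import dataclass

@dataclass
class SiteSummary:
    n: int          # number of consented samples matching the query
    carriers: int   # samples carrying the variant of interest

def run_local_query(records, variant_id):
    """Runs inside the hospital's own environment; raw records never leave."""
    matching = [r for r in records if r["consented"]]
    carriers = sum(1 for r in matching if variant_id in r["variants"])
    return SiteSummary(n=len(matching), carriers=carriers)

def combine(summaries, min_count=20):
    """The aggregator only ever sees counts; small cells are suppressed."""
    usable = [s for s in summaries if s.n >= min_count]
    total_n = sum(s.n for s in usable)
    total_carriers = sum(s.carriers for s in usable)
    return total_carriers / total_n if total_n else None

# Example with toy data from three sites:
site_a = [{"consented": True, "variants": {"rs123"}} for _ in range(30)]
site_b = [{"consented": True, "variants": set()} for _ in range(40)]
site_c = [{"consented": True, "variants": {"rs123"}} for _ in range(25)]
freq = combine([run_local_query(s, "rs123") for s in (site_a, site_b, site_c)])
print(f"pooled carrier frequency: {freq:.2f}")
```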
(13:03)
So there are three key questions we are asking ourselves at Illumina. Is there enough data to build highly complex models that mimic human biology? We think the answer is yes. How do we derive actionable insights from all that data? What are some analytical techniques? What are some modules that we can provide to be able to get insights from the data? And the third question is how can we balance protecting privacy but still get the benefits of data sharing?
(13:33)
Now, one way to share data would be to strip off all the sensitive information and share the bare minimum, but that's one extreme. What we want is to be able to share data for discovery while still protecting privacy. To that end, I want to showcase some of the work we have been doing as a team in generative AI, framed as a set of questions. Some of these are questions we are actively working on with partners or internally within Illumina, whereas others are simply questions we are excited about and hope to learn about from partners in the future.
(14:28)
So I'm going to switch gears now and talk about these areas of interest to us and some of the problems we think privacy-preserving AI can help solve. The first one is creating synthetic data. A huge list of problems would be solved if, instead of sharing data directly, there were a way for this highly protected data to be synthesized and then [inaudible 00:15:03] for analysis.
(15:05)
The other application area we are interested in is data augmentation, and I'll get to this in one of my later slides. You already have so much data; can you use generative AI to enable data augmentation? You have certain samples that are underrepresented in the population but that you want to learn more from. Can you generate more of those samples synthetically? This is especially relevant for studying rare diseases.
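As a minimal sketch of the idea, the snippet below fits a very simple generative model to an underrepresented subgroup and samples synthetic examples to rebalance a training set. The Gaussian mixture is just a stand-in for whatever generative model would actually be used, and the data is toy data:

```python
# Toy data augmentation sketch: fit a simple generative model to the rare
# subgroup only, then sample synthetic examples to rebalance the training set.
# A GaussianMixture is a stand-in for a real generative model (VAE, GAN, etc.).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
common = rng.normal(0.0, 1.0, size=(500, 10))   # well-represented samples
rare = rng.normal(2.0, 0.5, size=(20, 10))      # underrepresented subgroup

generator = GaussianMixture(n_components=1, random_state=0).fit(rare)
synthetic_rare, _ = generator.sample(200)

augmented = np.vstack([common, rare, synthetic_rare])
print(augmented.shape)  # (720, 10): the rare subgroup is no longer a rounding error
```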
(15:32)
If we were to build such a network, we want to unlock data sharing across participants that don't know each other but can trust the algorithms and the privacy guarantees of the network. In short, we want to accelerate data sharing.
(15:47)
And finally, the holy grail is clinical trials. Can we enable data sharing and leverage AI to build digital twins for patients, predicting how a patient is likely to evolve over the course of a trial, or whether they can be assigned to a placebo or control arm? Those are the high-level areas of interest for us at Illumina.
(16:16)
So with that, I'll get into some of the nitty-gritty of the applications we are interested in. The first one for generative AI is clinical assays for rare disease samples. This is near and dear to my heart because, at my previous company, I worked on creating clinical diagnostic assays for measuring rare disorders, and the picture I have here is of CNV detection, which is copy number variant detection for non-invasive prenatal testing.
(16:49)
To give you some background on what that is: in non-invasive prenatal testing, a pregnant woman between 8 and 16 weeks can have her blood drawn, and that blood also carries DNA from the fetus. From it, you can figure out whether the fetus has any large chromosomal abnormalities. These come in all different shapes and sizes, and detecting them depends heavily on something called the fetal fraction. The fetal fraction is the percentage of the DNA in the blood drawn from the mother that actually comes from the fetus, which is what we want to measure.
(17:39)
This fetal fraction is essential because a higher fetal fraction, meaning more molecules from the fetus, gives the assay more sensitivity: you can detect smaller abnormalities in the genome. Whereas with a very low fetal fraction, it becomes hard to detect even large abnormalities. And there was a lot of simulation we had to do, because we did not have samples spanning that whole range of fetal fraction versus abnormality size. We couldn't possibly recruit hundreds of patients for our study. So you rely on certain samples, synthetically create those variations in the genome, and then run your algorithm to see whether you are able to detect them.
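To show why fetal fraction drives sensitivity, here's a simplified counting-model sketch (my illustration, not Illumina's actual assay math; real assays have extra variance from GC bias and mapping noise, so these numbers are optimistic):

```python
# Simplified NIPT model: for a trisomy, reads from the affected chromosome are
# over-represented by roughly a factor of (1 + fetal_fraction / 2). The z-score
# of that excess grows with fetal fraction and with sequencing depth.
import math

def expected_z(fetal_fraction, total_reads, chrom_proportion=0.015):
    """Approximate z-score of the excess read count on a trisomic chromosome."""
    expected_reads = total_reads * chrom_proportion
    excess = expected_reads * fetal_fraction / 2
    sd = math.sqrt(expected_reads * (1 - chrom_proportion))
    return excess / sd

for ff in (0.02, 0.05, 0.10, 0.20):
    print(f"fetal fraction {ff:.0%}: z ~ {expected_z(ff, 10_000_000):.1f}")
```

Even in this idealized model, a 2% fetal fraction only just clears a typical detection threshold, while 10% gives a comfortable margin, which is exactly why low-fetal-fraction cases are the ones you most need to simulate or synthesize.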
(18:27)
Now, what if you could instead use generative AI: feed it thousands of samples and then have it create synthetic genomes carrying the variant you want? What that gives you is a model built from many, many genomes, so it captures the underlying biology of the sample. You don't just get the variant you are interested in; the model also tries to mimic the biology around it. And that's a more realistic sample than the [inaudible 00:19:12] approach that is currently used in the absence of these samples.
(19:15)
All of which is to say, if you're trying to design a clinical assay for rare diseases, this approach can certainly help you find that needle in a haystack and develop better algorithms and better cures. It's true for non-invasive prenatal testing, it's true for hard-to-find cancer samples, and also for rare gene disorders involving genes like SMN and PMS2.
(19:45)
Another, similar example I'd like to emphasize is generating clinical datasets that are synthetic in nature, to do exactly what I mentioned on the previous slide. A lot of the time when we do analysis, we need genomic data generated on an Illumina instrument, but we combine it with clinical data that comes from a hospital or a big research network to figure out how the two interact: how does your phenotype, what shows up in the hospital record, combine with your genotype to produce the disease?
(20:23)
It's very hard to share this data, so another way to do it via our platform could be: you have the data on the platform, and we give you ways to synthesize it and make synthetic copies of it. And if the synthetic copy is well constructed, meaning it has privacy built in and you can't mount membership attacks against it, you could develop your algorithms in isolation on that data. That data can travel more easily, and it's easier to look at the distributions in it.
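One simple sanity check for "can't mount membership attacks against it" is a distance-to-closest-record comparison; here's a minimal sketch with toy arrays standing in for real records and real generator output (an illustration, not a formal privacy guarantee):

```python
# Distance-to-closest-record test: if synthetic rows sit much closer to the
# generator's training rows than held-out rows do, the generator may be
# memorizing individuals. Toy arrays stand in for real records here.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 8))      # records used to fit the generator
holdout = rng.normal(size=(200, 8))    # records the generator never saw
synthetic = rng.normal(size=(200, 8))  # stand-in for the generator's output

def min_dists(queries, reference):
    """Distance from each query row to its nearest row in the reference set."""
    d = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1)

dcr_train = np.median(min_dists(synthetic, train))
dcr_holdout = np.median(min_dists(synthetic, holdout))
print(f"median distance to train: {dcr_train:.2f}, to holdout: {dcr_holdout:.2f}")
# Roughly equal distances are reassuring; a much smaller distance to the
# training set is a red flag for memorization.
```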
(21:06)
A shout-out to NVIDIA, which recently released such a model called SynGatorTron, where they took data, I think from the University of Florida, millions of records, to create patient profiles and mimic real health records. I know Gretel has products in that space, and there's a recent partnership between Syntegra and Datavant doing something similar. We think this will really change the game for sharing data, where you can actually look at someone's data, ask certain questions, and understand the distributions before you even buy the data. So the time to agreement between biobanks that have data and pharma companies that can derive value from it can be significantly reduced.
(21:57)
Yet another application of generative AI: I'm sure many of you have heard about AlphaFold, the software that DeepMind recently open sourced. It solved a very important class of problems: you give it a protein sequence, and it predicts the 3D structure of that protein. This is a very hard problem, it took years to solve, and DeepMind, by training its algorithms on years of data, was able to get to this model.
(22:32)
We think you can go a step further, and here there's a shout-out to a company called Cradle: you can flip the problem on its head and make it the generative sibling of the original problem. You know a disease, you know what kind of protein structure that disease involves, which means you know what kind of protein you need to design to target that disease. So a researcher could come in and say, "This is the kind of protein I need, this kind of docking," and all the other parameters, and starting from there, the model works back to the query sequence and says, "This is the sequence you probably need to produce that kind of protein." Which is the opposite direction from AlphaFold.
(23:21)
Why this is important: once you have a sequence, you can generate a whole set of candidate sequences at scale and test them, in silico and in the lab, to figure out whether they really solve the problem, whether they really help us diagnose or cure the disease. That's a huge possibility that is exciting and that we envision will come in the future.
(23:47)
Yet another example for generative AI: there is so much data already out there in public repositories. This is public gene expression data across a bunch of different samples, time courses, knockout experiments, and so on, and there are models that have been built on top of it, that have learned from this data. Now you can ask questions like: instead of actually doing experiments where you knock out a gene, or introduce a particular SNP and see what its effect is, the generative part can tell you, from what it has seen so far in the data, that this kind of variation is going to lead to this kind of effect.
(24:38)
Another way to think about this problem is that you don't have to do every experiment in the world. You can form a hypothesis and ask the model: if I knock out this gene or that gene, under these circumstances, what do you think the effect will be? There are publications that talk about this, and there are models that do this today. It's an ingenious way to work, because you don't have to run all the experiments that would get you to the solution; many of them, or pieces of them, have already been carried out, and you can infer the rest using generative AI.
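The query pattern being described looks roughly like the toy sketch below: train on perturbation experiments that have already been run, then ask about a combination nobody has tried. The gene names and the ridge-regression model are placeholders for illustration; real systems use far richer models over these public repositories:

```python
# Toy in-silico perturbation sketch: learn from measured knockout experiments,
# then query an experiment that hasn't been run. Gene names are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

def encode(knockouts):
    """One-hot encode which genes were knocked out in an experiment."""
    return np.array([1.0 if g in knockouts else 0.0 for g in genes])

# Measured experiments: (set of knocked-out genes) -> observed shift in a readout.
experiments = [({"GENE_A"}, -1.2), ({"GENE_B"}, 0.4),
               ({"GENE_C"}, -0.3), ({"GENE_A", "GENE_C"}, -1.6)]
X = np.stack([encode(k) for k, _ in experiments])
y = np.array([effect for _, effect in experiments])

model = Ridge(alpha=0.1).fit(X, y)

# Ask about a combination nobody has tried yet:
print(model.predict(encode({"GENE_A", "GENE_B"})[None, :]))
```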
(25:23)
I'm going to quickly spend the next few minutes on the generative AI pilots we conducted recently and can share, because some of this is public news. We did some work with Gretel, focused on using Gretel's APIs to synthesize sensitive data, run our algorithms on the synthetic data, and see whether those algorithms give the same answer you would get from the original dataset. Here are three papers and three different things we tried: a logistic regression algorithm, a polygenic risk score algorithm, and a linear regression algorithm on these datasets. To summarize, the key take-home was that the simpler the architecture and implementation of the algorithm, the better Gretel's synthetic data reproduced the answer we got from the original dataset. Whereas when the algorithm was very complex, then, given how biology works and given how much say rare events have in the final answer, some of these algorithms did not work as well.
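The shape of that comparison is roughly the sketch below: fit the same simple model on the original data and on a synthetic stand-in, then compare performance and coefficients. The "synthetic" set here is just noisy resampling standing in for a real generator like Gretel's, and the dataset is toy data:

```python
# Compare a simple model trained on original vs synthetic data. The synthetic
# set is noisy resampling of the original rows, a crude stand-in for a real
# synthetic-data generator; the dataset itself is synthetic toy data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
idx = rng.integers(0, len(X_train), size=len(X_train))
X_syn = X_train[idx] + rng.normal(0, 0.3, X_train.shape)
y_syn = y_train[idx]

real_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
syn_model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

print("AUC, trained on original: ", roc_auc_score(y_test, real_model.predict_proba(X_test)[:, 1]))
print("AUC, trained on synthetic:", roc_auc_score(y_test, syn_model.predict_proba(X_test)[:, 1]))
print("largest coefficient gap:  ", np.abs(real_model.coef_ - syn_model.coef_).max())
```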
(26:47)
This is the second example, and we also have a paper on this one, linked at the bottom, where we worked with Gretel on a GWAS, a genome-wide association study. Essentially, we took a paper that had certain findings, certain key variants that led to those findings, synthesized that data just like in the previous example, ran the same algorithm, and asked: do we get the same answers? In the ideal scenario, you would get exactly the same looking distributions on top and bottom; the bottom is the synthetic one, the top is the real-world one. Our conclusion was that it's reasonably close. Again, I'd recommend you read the paper if you're interested in more detail on what we did in the synthetic-data GWAS study.
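Mechanically, that comparison looks something like the sketch below: run the same per-variant association test on the original and the synthetic cohort and see how the p-values line up. Everything here is toy data; a real GWAS would adjust for covariates, population structure, and multiple testing:

```python
# Toy GWAS comparison: the same allelic chi-square test is run per variant on
# an "original" cohort and on a noisy synthetic copy, then the -log10(p)
# values are compared.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_variants, n_samples = 50, 2000
phenotype = rng.integers(0, 2, n_samples)

original = rng.binomial(2, 0.3, size=(n_variants, n_samples))
# Make the first five variants truly associated with the phenotype.
for i in range(5):
    original[i] = rng.binomial(2, 0.2 + 0.2 * phenotype)
# Synthetic copy: same cohort with ~15% of genotypes re-drawn at random.
mask = rng.random(original.shape) < 0.15
synthetic = np.where(mask, rng.binomial(2, 0.3, original.shape), original)

def association_p(genotypes):
    """Allelic chi-square test of alt-allele counts in cases vs controls."""
    table = [[genotypes[phenotype == c].sum(), (2 - genotypes[phenotype == c]).sum()]
             for c in (0, 1)]
    return chi2_contingency(table)[1]

p_orig = np.array([association_p(v) for v in original])
p_syn = np.array([association_p(v) for v in synthetic])
corr = np.corrcoef(-np.log10(p_orig), -np.log10(p_syn))[0, 1]
print(f"correlation of -log10(p), original vs synthetic: {corr:.2f}")
```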
(27:45)
So, to conclude: we have a long road ahead of us to usher in what we at Illumina call the genome era. We think there is a lot of data already out there that generative AI can learn from to make all these inferences and predictions, but there are no good ways to access or share that data. In our opinion, for niche assays, for sharing data, and for places where data is hard to access or create, generative AI and synthetic data can make a huge difference.
(28:25)
We are still not completely sure about the privacy guarantees some of these technologies provide. So we're looking deeper: is the synthetic data that mimics the original dataset absolutely safe? Is a membership-inference attack possible against it, for example? Those are the things we are still evaluating with our partners, including Gretel. But in the long run, all sorts of datasets can play a role in the data ecosystem we are trying to build. And then the play will be: whoever generates the more specific datasets, the more specialized datasets, the more specialized experiments, that will be the key to differentiation.
(29:10)
With that, I will open it up for questions. Thank you very much.