Synthesize 2023: Synthetic Data Revolutionizing Clinical Trial Data Collaboration and Research
Video description
Explore the impact that synthetic data is having on clinical trial collaborations and research.
Transcription
Afrah Shafquat (00:32):
Hello, my name is Afrah Shafquat. I'm a senior data scientist at Medidata AI. And today I will talk about synthetic data and its impact on clinical trial data collaboration and research.
(00:45)
Synthetic data is revolutionizing AI across industries, where it has enhanced training data to build better machine learning models, solving problems such as under-representation, lack of diversity, and limited sample sizes. It has been predicted that by 2030, synthetic data will completely overshadow real data, a prediction with real implications for how we build current and future machine learning technology.
(01:18)
At Medidata AI, we are in a unique position: we have access to over 27,000 clinical trials comprising 8 million subjects and over 110 clinical variables. These clinical trials span rare indications, including those in oncology, central nervous system disorders, and other therapeutic areas. As you can imagine, clinical trial data is extremely valuable for research and drug development, but it remains siloed due to privacy concerns around protecting patient information and HIPAA regulations.
(01:54)
In the context of clinical trials, generating synthetic clinical trial data provides a means of sharing these datasets, encouraging innovation and furthering drug development research. It also enables upsampling of under-represented populations and augmentation of training datasets when data is limited, improving the sample size and statistical power available to the machine learning models that are produced.
(02:22)
Though advances in deep neural networks have enabled synthetic data generation for large-scale datasets, the performance of these models suffers when the initial training data is limited, as is the case for clinical trial datasets, where sample sizes may be as low as around 100 subjects per trial.
(02:43)
Alternative techniques that can accommodate synthetic data generation for small-scale datasets are needed when generating synthetic clinical trial data. Our proposed algorithm, Simulants, loads a source dataset, embeds the records in a low-dimensional space, selects a seed record, and identifies its nearest neighbors. The algorithm then generates a new record by randomly selecting features from these neighbors. This process can be repeated N times to generate a dataset at any scale. This methodology has been published at AMIA 2022 and NeurIPS 2022.
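To make the generation loop concrete, here is a minimal sketch of the seed-and-neighbors idea described above. This is not the published Simulants implementation: the use of PCA for the embedding, the neighborhood size k, and the assumption of numeric features are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def synthesize(source: pd.DataFrame, n_records: int, k: int = 5,
               random_state: int = 0) -> pd.DataFrame:
    """Sketch of seed-and-neighbors synthesis; assumes numeric features."""
    rng = np.random.default_rng(random_state)
    # Embed records in a low-dimensional space (PCA as a stand-in).
    embedding = PCA(n_components=2).fit_transform(source.values)
    # Index the embedded records for nearest-neighbor lookup.
    nn = NearestNeighbors(n_neighbors=k).fit(embedding)
    records = []
    for _ in range(n_records):
        # Select a seed record and identify its nearest neighbors
        # (the seed itself is among them, at distance zero).
        seed = rng.integers(len(source))
        _, idx = nn.kneighbors(embedding[seed:seed + 1])
        neighbors = source.iloc[idx[0]]
        # Build a new record by drawing each feature from a random neighbor.
        records.append({col: neighbors[col].iloc[rng.integers(k)]
                        for col in source.columns})
    return pd.DataFrame(records)
```

Because each synthetic value is drawn from a small local neighborhood rather than a fitted global model, the approach needs no large training set, which is what makes it suitable for trials with on the order of 100 subjects.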
(03:26)
When comparing the fidelity of the synthetic data to the source data, we use a number of different tests. The quantitative tests include looking at the overlap of the clinical trial datasets in a low-dimensional space. Here is an example using t-SNE: Simulants synthetic data is shown in orange, the source data in blue. Given the substantial overlap between the two datasets, the synthetic data does seem to represent a similar population.
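As an illustration of this overlap test, the sketch below projects source and synthetic records into two dimensions together with scikit-learn's t-SNE and plots them in the talk's color scheme (source in blue, synthetic in orange). The plotting details are assumptions, not the exact figure shown.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne_overlap(source, synthetic):
    """Embed both datasets jointly and plot; expects numeric arrays."""
    combined = np.vstack([np.asarray(source), np.asarray(synthetic)])
    coords = TSNE(n_components=2, random_state=0).fit_transform(combined)
    n = len(source)
    plt.scatter(coords[:n, 0], coords[:n, 1], c="tab:blue", s=8, label="source")
    plt.scatter(coords[n:, 0], coords[n:, 1], c="tab:orange", s=8, label="synthetic")
    plt.legend()
    plt.title("t-SNE overlap of source vs. synthetic records")
    plt.show()
```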
(04:02)
We also compare the means of the different features included in the synthetic data against the template, which is the source data. Comparing means, we can see that most properties are preserved across the different features.
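A per-feature mean comparison like this takes only a few lines of pandas; this sketch tabulates source versus synthetic means along with the absolute gap, which should be small when feature-level properties are preserved.

```python
import pandas as pd

def compare_means(source: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Per-feature mean comparison between source and synthetic data."""
    out = pd.DataFrame({
        "source_mean": source.mean(numeric_only=True),
        "synthetic_mean": synthetic.mean(numeric_only=True),
    })
    out["abs_diff"] = (out["source_mean"] - out["synthetic_mean"]).abs()
    # Largest discrepancies first, to flag poorly preserved features.
    return out.sort_values("abs_diff", ascending=False)
```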
(04:24)
Because these are clinical trial datasets, what matters to us goes beyond the numerous univariate and bivariate comparisons, such as comparisons of the mean and median and the pairwise correlations across features shown in these heat maps. What is really important to us are clinical trial-specific tests: what does the survival probability of the patients look like in the original source data versus the synthetic data?
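The heat-map comparison just mentioned can be sketched as follows: compute the pairwise correlation matrix for each dataset and display them side by side. Column handling and styling are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

def correlation_heatmaps(source: pd.DataFrame, synthetic: pd.DataFrame):
    """Side-by-side heat maps of pairwise feature correlations."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, (name, df) in zip(axes, [("source", source),
                                     ("synthetic", synthetic)]):
        im = ax.imshow(df.corr(numeric_only=True), vmin=-1, vmax=1,
                       cmap="coolwarm")
        ax.set_title(f"{name} correlations")
    fig.colorbar(im, ax=axes.ravel().tolist())
    plt.show()
```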
(05:04)
As you can see here in the top visual, which is a Kaplan-Meier curve, Simulants preserves survival properties: the overall survival rate in the source dataset, indicated in gray, aligns very well with the synthetic dataset, as does the progression-free survival rate. This algorithm was compared to other algorithms provided by MIT's Synthetic Data Vault, such as the Gaussian Copula, the Tabular Variational Autoencoder (TVAE), the CopulaGAN, and the Conditional Tabular GAN (CTGAN).
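The overall-survival comparison can be reproduced in spirit with the lifelines package: fit a Kaplan-Meier estimator to each dataset and overlay the curves. The column names ("time", "event") are assumed, and this is a sketch rather than the talk's actual figure.

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

def compare_survival(source, synthetic, time_col="time", event_col="event"):
    """Overlay Kaplan-Meier curves for source (gray) and synthetic (orange)."""
    ax = plt.gca()
    for name, df, color in [("source", source, "gray"),
                            ("synthetic", synthetic, "tab:orange")]:
        km = KaplanMeierFitter()
        km.fit(df[time_col], event_observed=df[event_col], label=name)
        km.plot_survival_function(ax=ax, color=color)
    ax.set_title("Overall survival: source vs. synthetic")
    plt.show()
```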
(05:51)
And as you can see, many of these deep neural networks that perform really well on large-scale datasets are not able to preserve the clinical trial-specific properties of the original data. In the Kaplan-Meier curves, even though the progression-free survival rate for the first two shows some overlap, they do not perform as well on overall survival.
(06:20)
In addition to fidelity, and in the context of clinical trial datasets, what is most important is preserving the privacy of the patients included in the dataset. Beyond standard privacy tests like membership disclosure and attribute disclosure, one of the tests we have come up with quantifies differential privacy in a different way. For each predictor in our dataset, this test asks: when training on the entire synthetic dataset, what is the performance of predicting that particular feature?
(07:07)
Here we are looking at accuracy, where the feature being predicted is denoted P. How does that accuracy compare when you predict the same feature but remove a small chunk of the dataset? That held-out chunk is R′. In an ideal scenario, if a block of patients has no impact on the prediction, depending on the size of the block, of course, then y would equal x: the prediction would do just as well whether those patients are included or excluded. However, if a certain set of patients has a really big impact, then the reduction in prediction accuracy would be greater.
(08:12)
What this test is really trying to see is what the accuracy looks like when that chunk of patients, R′, is included versus excluded, which in a way quantifies the differential privacy risk for those patients.
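A sketch of this include/exclude test, under assumed choices of model (a random forest) and block size: predict a chosen feature P from the remaining synthetic features, once training on all records and once after excluding a random block R′, then compare the two accuracies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def include_exclude_accuracies(synthetic: pd.DataFrame, target: str,
                               block_frac: float = 0.1, seed: int = 0):
    """Accuracy on feature `target` with and without a random block R'.
    Assumes features are already numerically encoded."""
    X = synthetic.drop(columns=[target])
    y = synthetic[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=seed)
    # x-axis: accuracy when training on all records.
    full = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    acc_full = full.score(X_te, y_te)
    # y-axis: accuracy after excluding a random block R' from training.
    rng = np.random.default_rng(seed)
    keep = rng.random(len(X_tr)) > block_frac
    reduced = RandomForestClassifier(random_state=seed).fit(X_tr[keep],
                                                            y_tr[keep])
    acc_reduced = reduced.score(X_te, y_te)
    return acc_full, acc_reduced  # points near y = x indicate low risk
```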
(08:34)
Here, what you can see is the result for just the de-identified dataset. We do this across different prediction scenarios, where P rotates across all the features in the dataset; it is an exhaustive run. The summary metric we quantify is A0, the y-intercept across all the prediction scenarios, which captures what that privacy risk actually is.
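Under this reading, A0 can be sketched as the intercept of a straight-line fit to the (full-data accuracy, reduced-data accuracy) points collected across all prediction scenarios; the exact fitting procedure used in the talk is not specified, so this is an assumption.

```python
import numpy as np

def a0_intercept(acc_full: np.ndarray, acc_reduced: np.ndarray) -> float:
    """y-intercept of a linear fit across all prediction scenarios."""
    slope, intercept = np.polyfit(acc_full, acc_reduced, deg=1)
    # Values near zero are consistent with the ideal y = x trend line.
    return intercept
```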
(09:09)
As we can see for the de-identified-only, anonymized dataset, this risk is quite high, closer to one. However, for our Simulants scenarios, both high-fidelity and optimized, this A0 is quite low, almost on the y = x trend line. When quantifying this privacy risk across all synthesizers, we observe that most of them, even the deep neural network ones, do a great job of preserving the privacy of the dataset. But when considering both the fidelity Simulants provides in generating synthetic clinical trial data and its privacy preservation, we think Simulants does a great job at generating synthetic clinical trial data.
(10:04)
In summary, Simulants provides a fast and robust way to generate synthetic data that reflects the properties of real data. It reduces computational cost, does not depend on domain knowledge, and allows preservation and scaling of sample size. The product offers tunable parameters to adjust the fidelity and privacy of the clinical trial datasets that are produced. This algorithm is applicable to datasets with limited sample sizes, regardless of domain, where generative methods may perform poorly.
(10:48)
Thank you for your time. And if you want to reach me, here's my information. We would love to hear from you. Thank you.