Community Insights: Overcoming Medical Class Imbalance with Synthetic Data
In this case study, the Gretel team sat down with Reetam Ganguli, a Brown University medical candidate and leader of a bioincubator, to learn why medical practitioners turn to Gretel-generated synthetic data to overcome challenges with clinical data. Reetam’s team uses synthetic data to predict postpartum hemorrhages for expectant mothers in Cameroon and Nigeria.
Biased data, often in the form of class imbalance, is a major problem when dealing with medical data. Bias can be introduced through a number of factors: limited medical data collection from underrepresented demographics, historically low mortality rates for commonly treated diseases, or gender biases stemming from both societal and clinical factors. Today, combating this problem is costly and resource-intensive, requiring practitioners to collect and annotate more data. In some cases, collecting this data is impossible, such as when experimental conditions cannot be recreated in a lab or clinical setting.
Even with these challenges, the stakes couldn’t be higher. Clinical applications built on biased data lead to unreliable results in settings where real human lives are at stake. Reetam Ganguli is committed to overcoming this critical data challenge. Read below (or watch the full interview) to learn more about his work and how Brown University is building the next generation of data-driven clinical applications with synthetic data.
Interview with Reetam Ganguli
Can you introduce yourself for our readers and explain your role at Brown University?
I’m a medical candidate at Brown, aiming to be an OB-GYN physician. My research focuses on the intersection of deep learning and inferential statistics as they relate to predicting clinically salient outcomes. I also lead the Medical Bioincubator at the Alpert Medical School, where we pair physicians with interdisciplinary groups of students who develop data-science solutions to address clinical needs.
What is your technical background?
I have completed one introductory computer science course on data structures and algorithms. I mainly taught myself machine learning through YouTube tutorials and publicly available websites. I found Towards Data Science, Analytics Vidhya, and Machine Learning Mastery to be especially helpful resources.
Can you explain the difficulties faced when working with medical data?
Almost all medical data will face the class imbalance problem. The primary reason is that, as we as a society become more technologically advanced, our care becomes better. This means we will have fewer adverse outcomes, that is, fewer patients with life-threatening symptoms and a serious threat of mortality. The lack of this type of data means machine learning classifiers now have less training data for positive samples. This becomes a problem if you are trying to base your clinical decisions on classifiers that are biased toward predicting “no,” because the high false negative rates that most such models produce mean you won't be able to intervene adequately when a patient is about to have an adverse outcome. If you extrapolate this issue 50 years into the future, when our medical care is even better, we will see even fewer positives within our medical data. This is where I believe synthetic data will be one of our most powerful tools.
In more data-science terms, I ultimately want better classifier accuracy, a higher area under the receiver operating characteristic curve (AUROC), better F1 scores, and better positive predictive value. It is my hope that platforms like Gretel will help generate diverse synthetic datasets that enable better research outcomes.
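For readers less familiar with these metrics, the following is a minimal sketch of how they can be computed for a binary classifier. The function names, toy labels, and scores are illustrative assumptions and are not taken from Reetam's project; a real analysis would more likely rely on an established library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, positive predictive value (precision), recall, and F1."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) else 0.0
    return {"accuracy": accuracy, "ppv": ppv, "recall": recall, "f1": f1}


def auroc(y_true, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data for illustration only: four negatives and two positives.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.7, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

metrics = classification_metrics(y_true, y_pred)
roc_area = auroc(y_true, scores)
print(metrics, roc_area)
```

On this toy data the positive predictive value and F1 are both 0.5 and the AUROC is 0.875. Accuracy alone hides how the positives were handled, which is why the other metrics matter for rare outcomes.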
What alternatives to synthetic data did you try?
Some other class imbalance rectification methods I used were SMOTE, data undersampling, and data oversampling, but these did not improve my classifier accuracy and ultimately ended up introducing unwanted bias into the data. This is why I turned to synthetic data.
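For reference, the three baseline techniques mentioned here can be sketched roughly as follows. This is a simplified illustration of the general ideas, not the code used in the project and not the implementation found in any particular library; the helper names, the toy feature rows, and the choice of `k=2` neighbors are all assumptions made for the example.

```python
import random


def random_oversample(majority, minority, rng):
    """Duplicate minority rows (with replacement) until the classes match in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, rng):
    """Keep only as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)), minority


def smote_like(minority, n_new, rng, k=2):
    """SMOTE-style sketch: interpolate between a minority row and a near neighbor.

    Requires at least two minority rows. Each new point lies on the line
    segment between a randomly chosen minority row and one of its k nearest
    minority neighbors.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbors = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical two-feature rows: 20 majority (negative) and 3 minority (positive).
rng = random.Random(0)
majority = [(float(i), float(i % 3)) for i in range(20)]
minority = [(1.0, 5.0), (2.0, 6.0), (3.0, 5.5)]

_, oversampled = random_oversample(majority, minority, rng)
undersampled, _ = random_undersample(majority, minority, rng)
synthetic = smote_like(minority, len(majority) - len(minority), rng)
print(len(oversampled), len(undersampled), len(synthetic))
```

Oversampling only repeats existing positives, undersampling discards most negatives, and SMOTE-style interpolation can only produce points between positives that already exist. With very few positive cases, each of these limits how much new information the classifier receives.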
How did you find out about Gretel?
One dataset I was using for a particular project involved detecting postpartum hemorrhage in pregnant mothers. One salient issue when dealing with medical data is that there is often a significant class imbalance within your data. For me, over 99.5% of my patients were negative for postpartum hemorrhage and less than 0.5% were positive. This leads to classifiers that are significantly biased toward predicting “no” and are ultimately clinically useless. When looking for solutions, I came across Gretel through a simple Google search and was amazed by how accessible your platform was for someone who is not very technical. Gretel's console provided a very easy way for me to generate synthetic data to augment my existing dataset and combat the class imbalance.
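To see why such a classifier is clinically useless despite looking accurate, consider a minimal sketch with a hypothetical cohort of 1,000 patients in which 0.5% are positive. The cohort size and the exact split are assumptions chosen to mirror the proportions described above.

```python
# Hypothetical cohort: 1,000 patients, 5 of whom (0.5%) had the adverse outcome.
n_patients = 1000
n_positive = 5
y_true = [1] * n_positive + [0] * (n_patients - n_positive)

# A degenerate classifier that always predicts "no" (0).
y_pred = [0] * n_patients

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = correct / n_patients
recall = true_positives / n_positive  # share of real positives that were caught
false_negative_rate = 1 - recall      # share of real positives that were missed

print(f"accuracy={accuracy:.3f} recall={recall:.3f} fnr={false_negative_rate:.3f}")
```

The always-"no" classifier reaches 99.5% accuracy while missing every one of the five positive patients, so its recall is 0 and its false negative rate is 1.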
I really like that the console is very user-friendly and doesn't require any coding experience. I think this kind of plug-and-play platform would be accessible to most healthcare providers who don't have a technical background or the skills to code generative recurrent neural networks of their own. I also really like how simple the API is to use: people who are not technical can get started within a matter of minutes.
Tell us more about the postpartum hemorrhage and maternal transfusion use case in Cameroon?
We are prospectively testing our models on a cohort of Cameroonian patients as well as a cohort of Nigerian patients, even though our original models were trained on American patients. We want to see how well our models generalize and can be applied in low-resource settings.
If we are successful, our machine learning models will be able to predict adverse outcomes like postpartum hemorrhage at the point of care. This means that we will be able to make this prediction for women as early as the first week of pregnancy and give their doctors and medical staff plenty of time to prepare for any adverse outcomes.
The primary challenge is that data from Cameroonian and Nigerian patients follows a very different distribution than data from American patients. Some reasons for this may include the fact that these patients grew up in different backgrounds and have different physiological characteristics, as well as different social determinants of health. This becomes a challenge when testing our machine learning models because they have been optimized to fit an American patient distribution. We are hoping that platforms like Gretel can add synthetic data and balance out the distribution in favor of Cameroonian and Nigerian patients to combat any overfitting our models might have to American patients.
What are potential next steps that excite you both on the clinical and research sides for your use of synthetic data?
If we successfully train our models without the overfitting problem, then we will be able to make our platform accessible to pregnant mothers and doctors all across the world, particularly those in low-resource settings and developing countries who may greatly benefit from it. In these countries, blood banks are either poorly stocked or entirely non-existent. Having a donor lined up beforehand for a blood transfusion can literally save a life. In Cameroon, there are no extra blood reserves, and any time a patient needs blood they need a donor. Blood also cannot be stored for a long time, so predictive applications are essential to clinical preparedness.
Synthetic data has tremendous potential to help users overcome critical data challenges such as class imbalance in clinical data. Try generating your own data by uploading a medical dataset in the Gretel Platform today. If you need more support, feel free to join our Discord community, email us at email@example.com, or reach out to Reetam to learn more about his work with postpartum hemorrhages.