Predicting Patient Stay Durations in the ER with Safe Synthetic Data

Here's how a hospital uses Gretel to help forecast staffing and resource needs for their emergency care unit, and to identify emerging trends in outbreaks.

In today's data-driven world, collaborative data sharing has proven its value in initiatives such as COVID-19 and cancer research. That success has primed life sciences institutions, research organizations, and hospitals to work with data at a scale that was never before possible. These organizations generate and work with vast amounts of data that could reshape healthcare and advance research in ways we are only beginning to experience.

A key to unlocking this potential lies in the power of AI to learn from vast amounts of medical data: models can be trained to predict disease outcomes, enhance clinical decision-making, and support drug discovery and development. However, because patient privacy must be protected, granting access to medical data across organizations, between teams, and even to machine learning models for training is difficult, and it often never happens at all.

In this post, we will explore a real-life case study of how a major hospital uses Gretel to help predict the duration of stay for new emergency department patients, identify gaps in hospital resources or coverage, and spot emerging trends in outbreaks.

Why synthetic data for healthcare?

While traditional approaches to maintaining data privacy have their merits, they also come with significant drawbacks. Secure enclaves, although effective at preserving privacy, are complex to set up and manage and do not allow data to be exported. Homomorphic encryption limits exposure of patient data, but it is computationally intensive and prevents direct interaction with and exploration of the data.

The creation of synthetic data using LLMs and privacy-enhancing technologies such as differential privacy offers several unique advantages:

  1. Provable anonymity: Differential privacy offers mathematical guarantees of privacy. By introducing calibrated noise into the data, it ensures individual data points are effectively anonymized, mitigating the risk of re-identification. At scale, this enables the creation of synthetic data containing the same insights as the real-world version, but that cannot be linked to any single patient (the formal guarantee is spelled out just after this list).
  2. Portability and shareability: Synthetic data can be shared freely across teams and organizations, as no combination of identifiable attributes (such as patient age, height, and zip code) could be linked to a real patient. This fosters collaboration among researchers and allows data to be utilized to its fullest extent without privacy concerns. Furthermore, models trained on synthetic data can also be shared without compromising the privacy of the original data sources.
  3. Flexibility for exploratory data analysis: With synthetic data, researchers are not restricted in their choice of tools or platforms for data analysis. This is crucial for the machine learning life cycle, where exploratory data analysis plays a significant role. Researchers can interact directly with synthetic data and use whatever tools they prefer, fostering creativity and efficiency.
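
For readers who want the guarantee behind point 1 made precise, here is the standard definition (a reference statement, not specific to Gretel's implementation): a randomized mechanism M, such as a model training procedure, is (ε, δ)-differentially private if, for any two datasets D and D′ that differ in a single patient's record and for any set of possible outputs S,

\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,] + \delta

Intuitively, adding or removing any one patient changes the probability of any outcome by at most a factor of e^ε, plus a small slack δ. These are exactly the epsilon and delta parameters that appear in the Tabular DP configuration later in this post.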

Synthetic data allows for the safe and streamlined sharing of data, empowering life sciences researchers to deepen their understanding of diseases and quickly test their ideas without compromising patient privacy.

Case study: Predicting patient ER stay durations

A major hospital system in the midwestern US reached out to Gretel as they were planning to start a project using machine learning to predict the duration of stay for a patient given an initial diagnosis in the emergency department. Being able to accurately forecast how long each patient will stay can help hospitals provide better care — the hospital network could quickly respond to trends in public health such as outbreaks, and ensure optimal allocation of limited resources.

The initial challenge that the hospital encountered was in how to share access to the sensitive patient records that form the training data with the development team. While the records could easily be exported from the Epic EHR system, they needed to be anonymized to meet the following requirements:

  1. Remove all Personally Identifiable Information (PII) in accordance with GDPR, HIPAA and other privacy regulations.
  2. Ensure that no combination of patient attributes (e.g., height, weight, zip code) in the training set could be used to identify a patient.
  3. Maintain maximum accuracy for downstream machine learning prediction.

First, let’s take a look at the medical records and their format. Note, these are real but anonymized records that we have been given permission to use for this post.

Figure 1: Anonymized medical records exported from the Epic EHR system

From initial exploration, we can see that the dataset contains 20 columns and approximately 30,000 medical records from the past year. The dataset is a mix of categorical, numeric, and short natural language text descriptions. 
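
As a quick sanity check, the same exploration can be reproduced with a few lines of pandas. The file name below is a placeholder for the Epic export shown in Figure 1.

import pandas as pd

# Load the exported records (placeholder file name).
df = pd.read_csv("er_records.csv")

print(df.shape)    # roughly (30000, 20): ~30,000 encounters across 20 columns
print(df.dtypes)   # a mix of categorical, numeric, and free-text fields
df.head()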

Removing PII

From the dataset above, we can see two columns with potentially identifiable information, specifically:

  • MRN - The “Medical Record Number” in Epic is used as a link between a patient and their medical records.
  • CSN - The “Contact Serial Number” field in Epic is defined as a unique identifier for each hospital visit (encounter).

Using Gretel’s Transform API, we can define a simple transformation to encrypt or replace each MRN and CSN, protecting against linkage attacks while still preserving the distribution and referential consistency of these fields in the dataset. Below is an example transform policy that we will run as a preprocessing step.

schema_version: "1.0"
name: "gdpr-transforms"
models:
  - transforms:
      data_source: "_"
      policies:
        - name: encrypt_ids
          rules:
            - name: encrypt_mrn_csn
              conditions:
                field_name:
                  - mrn
                  - csn
              transforms:
                - type: hash
                  attrs:
                    secret: abacabb-idkfa # update with your own secret
                    length: 8
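
As a rough sketch of how this preprocessing step could be run, the policy above can be submitted with the Gretel Python client's standard project/model workflow. The project name and file names here are placeholders, and the exact artifact handling may differ slightly for your environment.

import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

# Authenticate against Gretel Cloud (prompts for an API key if not cached).
configure_session(api_key="prompt", validate=True)

project = create_or_get_unique_project(name="er-stay-durations")

# "transform_config.yaml" is the policy shown above; "er_records.csv" is the
# raw Epic export (both placeholder file names).
model = project.create_model_obj(
    model_config="transform_config.yaml",
    data_source="er_records.csv",
)
model.submit_cloud()
poll(model)

# Fetch the de-identified output produced by the transform job.
deidentified = pd.read_csv(
    model.get_artifact_link("data_preview"), compression="gzip"
)
deidentified.to_csv("er_records_deidentified.csv", index=False)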

Creating private synthetic data

The next step is to use our de-identified dataset to train synthetic data models using Gretel. These models can then be used to create artificial, synthetic versions of the training data that retain the same statistical properties while adding configurable levels of privacy protections and guarantees. For example, we can ensure that no combination of non-PII patient attributes (such as height + weight + age + zip code) could be used to uniquely identify a patient.

We will experiment with training two of Gretel’s core models on this dataset, using the configurations below. First, Gretel ACTGAN is a Generative Adversarial Network (GAN)-based model that creates highly accurate synthetic datasets, utilizing privacy-enhancing technologies to offer the state-of-the-art accuracy that is critical for downstream machine learning. Second, we will train Gretel’s Tabular DP model, a generative model built on probabilistic graphical models that offers rigorous mathematical guarantees of privacy.

Since we know the model will be used to create training data to predict patient stay durations (e.g. 0–3 hrs, 3–6 hrs, 6–12 hrs, or 12+ hours), we can use Gretel’s Evaluate API to automatically test the downstream machine learning accuracy of synthetic data versus real world data, as shown in the config examples below.
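
If the export contains only a raw length-of-stay value rather than the bucketed classes, the target column could be derived along these lines. The stay_minutes column name is an assumption for illustration, not a field from the export above.

import pandas as pd

df = pd.read_csv("er_records_deidentified.csv")  # placeholder file name

# Bucket raw length of stay (converted to hours) into the classes used by the
# evaluate sections below: 0-3, 3-6, 6-12, and 12+ hours.
df["duration_stay"] = pd.cut(
    df["stay_minutes"] / 60.0,
    bins=[0, 3, 6, 12, float("inf")],
    labels=["0-3 hrs", "3-6 hrs", "6-12 hrs", "12+ hrs"],
    right=False,
)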

Gretel ACTGAN

schema_version: "1.0"
name: "tabular-actgan"
models:
  - actgan:
      data_source: __tmp__
      params:
        epochs: auto
        generator_dim: [1024, 1024]
        discriminator_dim: [1024, 1024]
        generator_lr: 0.0001
        discriminator_lr: 0.00033
        batch_size: auto
        auto_transform_datetime: True
      generate:
        num_records: 5000
      privacy_filters:
        outliers: auto
        similarity: auto
      evaluate:
        task: classify
        target: duration_stay

Gretel Tabular DP

schema_version: "1.0"
name: "tabular-differential-privacy"
models:
  - tabular_dp:
      data_source: __tmp__
      params:
        epsilon: 1
        delta: auto
        infer_domain: True
      evaluate:
        task: classify
        target: duration_stay
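
Training either model follows the same pattern as the transform step. Below is a minimal sketch, again assuming the Gretel Python client's project/model workflow, with placeholder file names; swap in the Tabular DP config to generate the differentially private variant.

import pandas as pd
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project

project = create_or_get_unique_project(name="er-stay-durations")

# "tabular-actgan.yaml" is the ACTGAN config above; use "tabular-dp.yaml"
# for the Tabular DP run (placeholder file names).
model = project.create_model_obj(
    model_config="tabular-actgan.yaml",
    data_source="er_records_deidentified.csv",
)
model.submit_cloud()
poll(model)

# Synthetic records generated by the trained model.
synthetic = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
synthetic.to_csv("er_records_synthetic.csv", index=False)

# Summary of the Synthetic Data Quality Score (SQS) and the downstream
# classification metrics requested in the evaluate section of the config.
print(model.peek_report())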

Comparing results

Below are the results of comparing the synthetic data generated by the Gretel ACTGAN and Gretel Tabular DP models against the real-world data. The Synthetic Data Quality Score (SQS) provides an objective, statistical measure of dataset quality. The classification task shows how synthetic data stacks up against the actual data when trying to predict the length of a patient's emergency room stay based on their initial diagnosis.

Figure 2: Comparing performance of a model trained on synthetic data vs. real data.

In this example, the synthetic data created by Gretel ACTGAN outperformed the real-world data on mean F1 score across the top three classifiers measured (Gradient Boosting, AdaBoost, and Random Forest). Tabular DP’s stronger privacy guarantees came with a slight reduction in accuracy versus the real-world data. When using synthetic data, the goal is to find the right balance between utility (in this case, classification accuracy) and privacy protection.
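
For readers who want to reproduce this kind of comparison outside of Gretel Evaluate, a rough sketch is below: train the same classifier once on real records and once on synthetic records, then score both against a held-out slice of real data. File names are placeholders, the schema is simplified (every non-target column is one-hot encoded, which glosses over the free-text fields), and the numbers will not match Figure 2 exactly.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

TARGET = "duration_stay"

real = pd.read_csv("er_records_deidentified.csv")    # placeholder file name
synthetic = pd.read_csv("er_records_synthetic.csv")  # placeholder file name

train_real, test_real = train_test_split(real, test_size=0.2, random_state=42)

def features(df, columns=None):
    # One-hot encode every non-target column, aligning to a reference column set.
    X = pd.get_dummies(df.drop(columns=[TARGET]))
    return X if columns is None else X.reindex(columns=columns, fill_value=0)

X_train_real = features(train_real)
X_test = features(test_real, X_train_real.columns)
X_train_syn = features(synthetic, X_train_real.columns)

def score(X_train, y_train):
    # Train on the given records, score on the held-out real test set.
    clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    return f1_score(test_real[TARGET], clf.predict(X_test), average="macro")

print("trained on real data:      f1 =", score(X_train_real, train_real[TARGET]))
print("trained on synthetic data: f1 =", score(X_train_syn, synthetic[TARGET]))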

Since this data was being used internally, the hospital opted to use the ACTGAN configuration, which offers real-world accuracy with strong privacy protections. For cases where the data would be shared outside the hospital, differential privacy offers formal privacy guarantees that meet the strictest compliance regulations and definitions of anonymity.

We would love to hear from you

We would be delighted to discuss your specific use case for synthetic data in healthcare further. Feel free to reach out to us at hi@gretel.ai or join our Discord channel to start a conversation. We look forward to exploring how Gretel can assist you in maximizing the benefits of synthetic data for your healthcare use cases.