Introducing Gretel MLOps

Use Gretel's synthetic data platform to replace, augment, or balance training datasets within MLOps pipelines like Vertex AI, Azure ML, and Amazon SageMaker.

Introduction

Generating high-quality, labeled datasets in sufficient quantities is a significant challenge in machine learning (ML). Factors such as cost, time constraints, and privacy concerns often hinder the availability of necessary data. In this blog post, we introduce how the Gretel synthetic data platform can be utilized to replace, augment, or balance training datasets within MLOps pipelines like Vertex AI, Azure Machine Learning, and SageMaker Pipelines. This integration is designed to enhance the performance of ML models while addressing several critical issues.

We are excited to announce the release of the open-source gretel-mlops GitHub repository. This repository provides a comprehensive guide and code library to integrate Gretel's synthetic data capabilities with the major MLOps platforms.

Advantages of Synthetic Datasets for MLOps

The use of synthetic datasets has numerous advantages for MLOps:

  1. Privacy Protection: Synthetic data offers a powerful solution to privacy concerns by generating new datasets that mimic the statistical properties of original data without exposing individual records. This approach helps ensure compliance with stringent data privacy regulations such as GDPR and HIPAA.
  2. Data Availability: The ability to quickly generate large volumes of synthetic data alleviates the problems of data scarcity and dataset incompleteness. This is particularly valuable in scenarios where data collection is impractical or too slow.
  3. Bias Mitigation: Synthetic data can be carefully crafted to reduce the biases inherent in real-world data. By adjusting parameters during the data generation process, synthetic datasets can provide a more balanced and diverse dataset for training ML models.
  4. Cost Efficiency: Generating synthetic data is often more cost-effective than the traditional methods of collecting and annotating real-world data. It reduces the need for extensive data gathering campaigns and manual labeling, thereby saving both time and resources.

In the upcoming sections, we will explore how to construct such ML pipelines using the open-source gretel-mlops GitHub repository. We'll demonstrate how to integrate Gretel's synthetic data models with various MLOps platforms and show how synthetic data can be used to improve the performance of ML models on various tasks.

Overview of gretel-mlops

Gretel-MLOps is a code library designed to facilitate the integration of Gretel's synthetic data capabilities with major MLOps platforms such as Vertex AI, Azure Machine Learning, and Amazon SageMaker Pipelines. The library lets you choose between the two deployment modes offered by Gretel: Gretel Cloud and Gretel Hybrid.

  • Gretel Cloud: provides a fully managed service that takes care of all aspects of compute, automation, and scalability within Gretel's cloud infrastructure. It offers a hassle-free way to generate synthetic data without the need for extensive setup or management.
  • Gretel Hybrid: allows for deployment within your own cloud environment, using Kubernetes to keep your data secure. This mode is ideal for handling sensitive or regulated data that cannot leave your cloud environment, offering a blend of control and convenience. For more information about Gretel Hybrid, check out our documentation.

Getting Started with Gretel-MLOps

To begin using the Gretel-MLOps library, you'll first need to set up your environment and install the necessary components. This section provides a step-by-step guide to get you started, whether you're using Azure, Vertex AI, or Amazon SageMaker Pipelines.

Here are some prerequisites:

  • A Gretel account: Sign in to the Gretel console and obtain an API key.
  • Access to an MLOps platform (Azure Machine Learning, Vertex AI, or Amazon SageMaker).
  • A dataset to train an ML model on, or alternatively the examples provided with the repo.
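With the prerequisites in place, you can install the Gretel Python client and authenticate against your account. Below is a minimal sketch using the gretel-client package's configure_session helper; the interactive API key prompt and caching options shown here are one possible configuration.

# Install the client first: pip install gretel-client
from gretel_client import configure_session

# Authenticate against Gretel; "prompt" asks for your API key
# interactively and caches it for subsequent runs.
configure_session(api_key="prompt", cache="yes", validate=True)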

Configuring the Gretel-MLOps Pipeline

Configuring your MLOps pipeline is crucial for tailoring the integration of Gretel's synthetic data capabilities to your specific needs. The YAML config file provides a simple way to define how the pipeline operates, interacts with datasets, applies machine learning models, and integrates synthetic data generation.

The dataset section of the YAML file specifies where your data is stored and how it should be handled within the pipeline:

dataset:
 name: healthcare-stroke-data
 train_path: 's3://your-bucket/train.csv'
 validation_path: 's3://your-bucket/validate.csv'
 test_path: 's3://your-bucket/test.csv'
 target_column: stroke
 drop_columns: 
  - id
  - timestamp

The fields are:

  • name: Identifies the dataset for reference in logs and outputs.
  • train_path, validation_path, test_path: Specify the locations of your training, validation, and test datasets, respectively. If no validation and/or test set is defined, held-out sets are created from the training data (see the sketch after this list).
  • target_column: Defines the column in your dataset that the model should predict.
  • drop_columns: Lists any columns to be excluded from the model training to avoid leakage or irrelevance.
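To make the held-out behavior concrete, here is a minimal sketch of how such splits can be derived from the training data alone. The 80/10/10 proportions and the stratified split are illustrative assumptions, not the library's exact defaults.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the training data and apply drop_columns from the config.
df = pd.read_csv("s3://your-bucket/train.csv")
df = df.drop(columns=["id", "timestamp"])

# Carve out 20% of the data, then split it evenly into validation
# and test sets, stratifying on the target column throughout.
train_df, holdout_df = train_test_split(
    df, test_size=0.2, stratify=df["stroke"], random_state=42
)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["stroke"], random_state=42
)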

The ML section tailors the machine learning aspects of the pipeline, defining the type of task, the objective function, and the evaluation metric:

ML:
 ml_task: classification
 objective: 'binary:logistic'
 ml_eval_metric: auc
 ml_deployment_threshold: 0.7

  • ml_task: Specifies whether the task is classification or regression.
  • objective: Sets the loss function or objective that the training algorithm should minimize or maximize.
  • ml_eval_metric: Determines the metric used to evaluate the model performance.
  • ml_deployment_threshold: Establishes a performance threshold that models must meet to be deployed.
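As a concrete illustration of the deployment gate, the sketch below scores a trained classifier on the held-out test set and only signals deployment when the configured metric clears ml_deployment_threshold. This is a conceptual example using scikit-learn, not the library's internal implementation.

from sklearn.metrics import roc_auc_score

def deployment_gate(model, X_test, y_test, threshold=0.7):
    """Return True when the test-set AUC clears the configured threshold."""
    y_scores = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_scores)
    print(f"test AUC = {auc:.3f} (deployment threshold = {threshold})")
    return auc >= threshold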

The gretel section is pivotal to integrating synthetic data into your pipeline. This component leverages Gretel Tuner, an optional module of Gretel's Python SDK, to tune synthetic model parameters and find the optimal configuration. The tuning is tailored to the downstream ML task, using an ML model similar to the final application.

gretel:
 strategy: augment
 generate_factor: 1.5
 mode: hybrid
 sink_bucket: 's3://your-synthetic-data-bucket'

  • strategy: Chooses how synthetic data is used. The options replace, augment, and balance control whether synthetic data replaces the original data, augments it, or balances the target class distribution (see the sketch after this list).
  • generate_factor: Specifies the amount of synthetic data to generate relative to the original dataset size.
  • mode: Determines whether the synthetic data generation occurs in the cloud (Gretel Cloud) or within a specific infrastructure (Gretel Hybrid).
  • sink_bucket: For the hybrid mode, this is where the generated synthetic data is stored.
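Conceptually, the three strategies combine real and synthetic records as in the sketch below. The generate_synthetic helper stands in for Gretel's model training and generation steps; it is a hypothetical placeholder, not the library's actual interface, and the balance branch assumes a binary target.

import pandas as pd

def apply_strategy(real_df, target, strategy, generate_factor=1.0):
    """Combine real and synthetic data according to the configured strategy."""
    n_synth = int(len(real_df) * generate_factor)
    synth_df = generate_synthetic(real_df, num_records=n_synth)  # hypothetical

    if strategy == "replace":
        return synth_df  # train on synthetic data only
    if strategy == "augment":
        return pd.concat([real_df, synth_df], ignore_index=True)
    if strategy == "balance":
        # Add only synthetic records of the minority class, enough to
        # even out a binary target distribution.
        minority = real_df[target].value_counts().idxmin()
        deficit = (real_df[target] != minority).sum() - (real_df[target] == minority).sum()
        extra = synth_df[synth_df[target] == minority].head(deficit)
        return pd.concat([real_df, extra], ignore_index=True)
    raise ValueError(f"unknown strategy: {strategy}")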

By adjusting these configurations, you can experiment with different strategies and parameters, making the pipeline highly adaptable to various scenarios and optimizing outcomes for the specific characteristics of your data and the requirements of the task. These configurations also help ensure that your MLOps pipeline not only performs well but also aligns with broader business objectives and compliance requirements, making configuration a critical step in deploying effective and responsible AI solutions.

Running Gretel MLOps pipelines on GCP, Azure, and AWS

We provide notebooks to demonstrate how to set up and utilize Gretel MLOps pipelines on Google Cloud Platform (GCP), Microsoft Azure, and Amazon Web Services (AWS). Each notebook provides a step-by-step guide to configure the necessary environment, authenticate the user, and execute the pipelines using Gretel's synthetic data capabilities. By following these instructions, users can seamlessly integrate Gretel's MLOps pipelines into their respective cloud services.

Azure Machine Learning

The Azure notebook connects to the Azure ML Workspace and creates an ML client. It sets up an environment using a Conda dependencies file and a base Docker image. The notebook creates a data asset in the Azure ML workspace and defines the pipeline, which includes preprocessing, synthetic data generation, training, evaluation, and model registration steps. The pipeline job is executed, and an ML model evaluation report is downloaded and printed.

Figure 1. How Gretel integrates into an Azure MLOps pipeline.
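A minimal sketch of the connection and submission steps with the Azure ML Python SDK (v2) might look like the following. The subscription, resource group, and workspace names are placeholders, and pipeline_job is assumed to come from the repository's pipeline definition.

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Connect to the Azure ML workspace (placeholder identifiers).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<your-subscription-id>",
    resource_group_name="<your-resource-group>",
    workspace_name="<your-workspace>",
)

# Submit a pipeline job built by the gretel-mlops pipeline definition
# (assumed to be available here as `pipeline_job`).
submitted = ml_client.jobs.create_or_update(pipeline_job, experiment_name="gretel-mlops")
print(submitted.studio_url)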

GCP

The GCP notebook sets up the environment for Vertex AI, authenticates the Google Cloud account, and configures a storage bucket for pipeline artifacts. It includes steps to retrieve and configure a service account with appropriate permissions and integrates Gretel MLOps configurations. The notebook builds and submits the pipeline execution job, and finally provides an ML model evaluation report.

Figure 2. How Gretel integrates into a GCP MLOps pipeline.
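For orientation, a sketch of the submission step with the Vertex AI SDK follows. The project, bucket, and service account values are placeholders, and the JSON template is assumed to have been produced by compiling the gretel-mlops pipeline definition.

from google.cloud import aiplatform

# Initialize Vertex AI with placeholder project settings.
aiplatform.init(
    project="<your-project-id>",
    location="us-central1",
    staging_bucket="gs://<your-pipeline-bucket>",
)

# Run a compiled pipeline spec; the template is assumed to come from
# compiling the gretel-mlops pipeline definition.
job = aiplatform.PipelineJob(
    display_name="gretel-mlops-pipeline",
    template_path="pipeline.json",
    pipeline_root="gs://<your-pipeline-bucket>/pipeline-root",
)
job.submit(service_account="<your-service-account>")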

Amazon SageMaker

The Amazon SageMaker notebook configures the environment by cloning the Gretel MLOps repository and setting up AWS-specific configurations such as the region, role, and default bucket. The pipeline includes steps for preprocessing, training, and evaluation. The results are fetched from an S3 bucket, and the evaluation report is printed.

Figure 3. How Gretel integrates into an Amazon SageMaker pipeline.
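A sketch of the execution step with the SageMaker Python SDK follows. Here build_pipeline is a hypothetical stand-in for the pipeline definition in the gretel-mlops repository.

import sagemaker
from sagemaker.workflow.pipeline import Pipeline

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# `build_pipeline` is a hypothetical stand-in for the pipeline
# definition provided by the gretel-mlops repository.
pipeline: Pipeline = build_pipeline(session=session, default_bucket=session.default_bucket())

# Register (or update) the pipeline definition, then start an execution.
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()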

Example results

Introduction to Datasets

We used several publicly available datasets from Kaggle to evaluate the effectiveness of Gretel MLOps in training binary classifiers. Below are brief descriptions of the datasets used:

  • Phishing Dataset for Machine Learning: This dataset contains 48 features extracted from 5000 phishing webpages and 5000 legitimate webpages. The data was collected using Selenium WebDriver to ensure precise and robust feature extraction, useful for anti-phishing research and classification model benchmarking.
  • Company Bankruptcy Prediction: This dataset includes 6819 observations with 96 attributes, aimed at predicting company bankruptcy based on data collected from the Taiwan Economic Journal from 1999 to 2009. The dataset is highly imbalanced, with 'Y' as the class label indicating bankruptcy.
  • Heart Failure Prediction: Containing 918 observations with 12 attributes, this dataset is used to predict heart disease. It focuses on cardiovascular diseases, which are the leading cause of death globally.
  • Stroke Prediction Dataset: With 5110 observations and 12 attributes, this dataset aims to predict the likelihood of a stroke based on various input parameters such as age, gender, and medical history. The dataset is imbalanced with 'stroke' as the class label.
  • Telco Customer Churn: This dataset includes customer information to predict churn, encompassing details about services signed up for, account information, and demographic data. Each row represents a customer and their respective attributes.

Training and Evaluation with Gretel MLOps

Using Gretel MLOps, we trained binary classifiers on these datasets. As described above, the process involved:

  1. Data Splitting: Each dataset was split into training, validation, and test sets. The validation set was used for hyperparameter tuning of the Gretel synthetic data model and the downstream classifier model. The final evaluation was performed on a held-out test set.
  2. Synthetic Data Generation: The synthetic data model was trained without using any examples from the validation or test sets to ensure no bias was introduced. This step is crucial to maintain the integrity and generalizability of the synthetic data.
  3. Model Training: The binary classifiers were trained on both real and synthetic data, following the strategies discussed above: real only, synthetic only, and a combination of real and synthetic data (see the sketch after this list).
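A compact sketch of this comparison trains the same classifier on each strategy's training set and scores it on the held-out test set. XGBoost and the metric choices mirror the example config; get_training_data is a hypothetical helper returning the strategy-specific training frame, and X_test / y_test are the held-out test set from the earlier split.

from xgboost import XGBClassifier
from sklearn.metrics import f1_score, average_precision_score

for strategy in ["real only", "synthetic only", "real + synthetic"]:
    X_train, y_train = get_training_data(strategy)  # hypothetical helper

    model = XGBClassifier(objective="binary:logistic", eval_metric="auc")
    model.fit(X_train, y_train)

    # Score on the held-out test set (X_test, y_test) from the earlier split.
    y_pred = model.predict(X_test)
    y_scores = model.predict_proba(X_test)[:, 1]
    print(
        f"{strategy}: F1 = {f1_score(y_test, y_pred):.3f}, "
        f"AUCPR = {average_precision_score(y_test, y_scores):.3f}"
    )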

Results

Below is a heatmap showing the F1 and AUCPR scores for each dataset across the different data strategies:

Figure 4. Results for various data strategies.

Key Observations:

  • The combination of real and synthetic data generally provided the best balance between F1 and AUCPR scores, highlighting the effectiveness of augmenting real data with synthetic data.
  • The synthetic-only strategy also performed competitively, demonstrating the potential of synthetic data to replace real data in certain scenarios.

These results illustrate the robustness and flexibility of Gretel MLOps in handling diverse datasets and improving model performance through synthetic data integration.

We encourage you to explore the following resources to delve deeper: