Synthetic Data and the Data-centric Machine Learning Life Cycle
In this series of posts, we’ll cover how Gretel’s synthetic data platform helps you overcome challenges across the data-centric machine learning life cycle so that you can successfully build, deploy, maintain, and realize value from your AI projects.
The data-centric machine learning life cycle
The life cycle outlined below is a common framework or workflow process for building machine learning and AI solutions. It’s focused on streamlining the stages necessary to develop machine learning models, deploy them to production, and maintain and monitor them. These steps are a collaborative process, often involving data scientists and DevOps engineers. The process below was inspired by the value chains created by The Sequence, Databricks, Google Cloud, and Microsoft.
How can Gretel help at each stage of the machine learning life cycle?
Below, we dive into specific examples of how synthetic data is increasingly playing a key role in different stages of the machine learning life cycle. In future posts in this series, we'll create and walk through code examples for each use case and stage.
1. Data collection
The machine learning life cycle begins with obtaining raw and/or unstructured data. Kaggle’s data science survey of over 13,000 data scientists shows that just gathering or getting access to data can take up 50% of an overall AI project's time. In practice, privacy and regulatory concerns about sensitive training data often cause this bottleneck, though a shortage of example data to build with can cause it, too. Use Gretel Synthetics to:
- Generate anonymized training datasets that maintain the accuracy of real-world data while meeting or exceeding the anonymization standards of GDPR, CCPA, and other compliance protocols.
2. Exploratory data analysis
During this stage, data practitioners may filter, model, and visualize data, while searching for insights through a range of tools and methodologies. These insights, such as data distributions and relationships, inform the life cycle, but particularly help data practitioners as they prepare data for labeling, training, and testing machine learning models. Use Gretel’s Evaluate API and Console interface to:
- Measure dataset quality and discover correlations in data.
- Share and collaborate with secure data.
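As a rough illustration of what "discovering correlations" and "measuring dataset quality" can mean in practice, the sketch below compares the Pearson correlation between two columns in a real dataset and in a synthetic one. It uses only the Python standard library and is not Gretel's Evaluate API; the function names (`pearson`, `correlation_gap`), the column names, and the toy values are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_gap(real, synthetic, col_a, col_b):
    """Absolute difference between the real and synthetic correlation
    of the same pair of columns. Smaller means the relationship between
    the two columns is better preserved."""
    r_real = pearson(real[col_a], real[col_b])
    r_synthetic = pearson(synthetic[col_a], synthetic[col_b])
    return abs(r_real - r_synthetic)


# Hypothetical toy data: each dataset maps column name -> list of values.
real = {"age": [23, 35, 47, 52, 61], "income": [31, 48, 62, 70, 85]}
synthetic = {"age": [25, 33, 45, 55, 60], "income": [30, 50, 60, 72, 83]}

gap = correlation_gap(real, synthetic, "age", "income")
print(f"age/income correlation gap: {gap:.3f}")
```

A small gap suggests the synthetic data keeps the relationship between the two columns; a full quality report would look at many column pairs and at the per-column distributions as well.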
3. Data preparation and annotation
Data preparation often includes cleaning, de-duplicating, aggregating, and transforming data so that it can be used effectively for a specific machine learning task and framework. For many tasks, data is annotated, or labeled, to capture the outcomes that you want your machine learning model to recognize and predict. When the data is imbalanced or too scarce, the preparation process may entail generating specific annotated data, such as examples from under-represented classes, so that models generalize better to previously unseen data and become more accurate. Use Gretel Synthetics to:
- Augment training data that has gone “stale” due to changing usage in a production environment to help models generalize to new data and adapt to new usage patterns.
- Generate additional labeled training data with conditional generation to address class imbalances to improve downstream accuracy at a fraction of the cost of traditional data collection and labeling techniques.
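To make the class-imbalance case concrete, here is a minimal sketch that works out how many additional records each class would need before conditional generation is run. It only does the counting; it does not call Gretel or any other generator. The function name `records_needed_per_class`, the `target_ratio` parameter, and the fraud/legit labels are assumptions made for illustration.

```python
import math
from collections import Counter


def records_needed_per_class(labels, target_ratio=1.0):
    """Return how many extra records each class needs so that every class
    reaches `target_ratio` times the size of the largest class.

    target_ratio=1.0 means fully balanced; 0.5 means every class should be
    at least half the size of the majority class.
    """
    if not 0 < target_ratio <= 1:
        raise ValueError("target_ratio must be in (0, 1]")
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    target = math.ceil(max(counts.values()) * target_ratio)
    return {label: max(0, target - n) for label, n in counts.items()}


# Hypothetical imbalanced dataset: 95 legitimate transactions, 5 fraudulent.
labels = ["legit"] * 95 + ["fraud"] * 5

print(records_needed_per_class(labels))
print(records_needed_per_class(labels, target_ratio=0.5))
```

The resulting per-class counts could then be passed as the number of records to request for each label from whatever conditional generation tool you use.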
4. Model training and evaluation
Model training teaches your model to perform a task. Usually, model training fits into one of four categories: supervised, unsupervised, semi-supervised, and reinforcement learning. Check out NVIDIA’s blog post for an excellent overview of different training methods. Use Gretel Synthetics and a parameter optimization framework such as Weights & Biases, PyCaret, or Optuna to:
- Create optimized synthetic training datasets that maximize accuracy on downstream machine learning tasks.
5. Model deployment
Model deployment places a finished machine learning model into a live environment where it can be used for its intended task as part of an application, whether customer-facing or internal. Models are typically developed to work with carefully prepared test, training, and validation datasets. Few models pass those tests, and typically only a small number of models are deployed to a production environment. Using synthetic data can drastically improve these outcomes. Use Gretel Synthetics to:
- Create synthetic demo datasets to demonstrate or evaluate new ML-based tools.
- Generate synthetic data to test a machine learning model’s response to new inputs.
6. Model monitoring
Keeping a close eye on your ML model enables data practitioners and developers to detect problems with model performance. These problems can include data drift (where the data the production model sees begins to vary, for example due to a change in the collection process) and concept drift (where behavior in production shifts due to seasonality or rare events, such as the Covid pandemic and the subsequent shift to remote work and social distancing).
Use Gretel Synthetics and Gretel Amplify to:
- Test model outcomes using synthetically-generated simulations of scenarios.
- Stress test your prediction pipeline with millions (or billions) of examples.
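One common way to flag data drift on a single numeric feature is the Population Stability Index (PSI), which compares how training-time and production values fall into the same set of bins. The sketch below is a plain-Python illustration of that idea, not a Gretel feature; the function name `population_stability_index`, the equal-width binning, the `eps` floor for empty bins, and the toy data are assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a baseline sample (`expected`) and a newer sample (`actual`).

    Bins are equal-width over the range of the baseline; values outside that
    range are counted in the first or last bin. Empty bins are floored at
    `eps` so the logarithm stays defined.
    """
    lo, hi = min(expected), max(expected)
    if lo == hi:
        raise ValueError("baseline sample must contain more than one distinct value")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature values: training-time baseline vs. shifted production data.
baseline = [i / 100 for i in range(100)]
production = [v + 0.5 for v in baseline]

print(f"PSI, no drift:   {population_stability_index(baseline, baseline):.3f}")
print(f"PSI, with drift: {population_stability_index(baseline, production):.3f}")
```

Identical samples score 0, and larger values mean a larger shift. A widely quoted rule of thumb treats values above roughly 0.25 as significant drift, but the right threshold depends on the feature and the application.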
7. Re-training
When a model no longer performs at peak levels, the life cycle can be reset. Data practitioners and developers can determine the cause of the drop in performance and retrain the model on an updated training dataset that reflects current data.
Use Gretel Synthetics to:
- Generate training data with unique edge cases to account for data and concept drift in production.
- Minimize downtime by automatically generating new training datasets to account for drift in underlying data.
What’s next
Gretel is a synthetic data platform that simplifies generating, anonymizing, and sharing access to data for thousands of engineers and data science teams around the world. Even if you have no data or limited data, Gretel can help you start building machine learning and AI solutions. Stay tuned: in the next post in this series, we'll dive in with code examples and case studies for stages in the machine learning life cycle.