Automate Synthetic Data Pipelines with Gretel Workflows

Gretel Workflows orchestrate synthetic data generation, ensuring users have accurate, up-to-date data for software development, analytics, and ML/AI.

Synthetic data projects typically consist of two phases: choosing a synthetic model that provides optimal accuracy and utility, and deploying the selected model(s) in a pipeline so the results are up-to-date and accessible throughout your organization. This isn’t a one-time process. It has to be repeated regularly so that you and your team always have updated, relevant data as you build software, analyze business data, and train machine learning or AI models.

A key factor in the success of your synthetic data strategy is the ability to quickly integrate data generation tasks into your existing processes. To help achieve this goal, we’re thrilled to announce the general availability of Gretel Workflows, a powerful orchestration tool now available to the Gretel community. Use Gretel Workflows to speed up deployment by 30%, eliminate the hassle of writing scheduled batch jobs, and move your data strategy forward efficiently.

With Workflows you can:

  • Maintain control over your data: Gretel connectors read input data from, and write output data to, your own data sources, so the data never has to leave your environment. Multiple files and tables can be processed in parallel and the generated data written to a destination, where it can be accessed within the company or shared with partners. Our AWS S3 connector is live, with support for GCS, Azure, Snowflake, and others coming over the next few weeks.
  • Set it and forget it: Use schedules to run workflows automatically. The resulting synthetic data will be available wherever you need it — no additional steps required. 
  • Create complex workflows with multiple models: Chain models together with a user-friendly YAML configuration file and unlock end-to-end use cases such as making sensitive data GDPR-compliant.

What are Gretel Workflows?

Gretel Workflows provide a config-driven API for automating and operationalizing Gretel. Easy-to-read YAML configuration files are used to create a sequence of instructions for the Gretel platform to execute. Not so keen on editing YAML? Use the Gretel Console to build a Workflow using a step-by-step interface.
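
To make the "sequence of instructions" idea concrete, here is a minimal Python sketch. It is not Gretel's actual schema or SDK. It models a workflow as a list of named steps, each optionally naming the step whose output it consumes, and works out the order in which the steps would have to run. The field names (`name`, `type`, `input`), the step types, and the `execution_order` helper are all hypothetical and chosen only for illustration; the real configuration format is described in the Gretel docs.

```python
from typing import List

# Hypothetical workflow definition: field names and step types are
# illustrative assumptions, not Gretel's real configuration schema.
WORKFLOW = {
    "name": "nightly-synthetics",
    "steps": [
        {"name": "write-results", "type": "destination", "input": "synthesize"},
        {"name": "read-source", "type": "source", "input": None},
        {"name": "synthesize", "type": "model", "input": "transform-pii"},
        {"name": "transform-pii", "type": "model", "input": "read-source"},
    ],
}


def execution_order(workflow: dict) -> List[str]:
    """Return step names ordered so each step runs after its input step."""
    steps = {step["name"]: step for step in workflow["steps"]}
    ordered: List[str] = []
    visiting = set()  # steps on the current dependency path, for cycle checks

    def visit(name: str) -> None:
        if name in ordered:
            return
        if name not in steps:
            raise ValueError(f"unknown step referenced: {name}")
        if name in visiting:
            raise ValueError(f"cycle detected at step: {name}")
        visiting.add(name)
        upstream = steps[name]["input"]
        if upstream is not None:
            visit(upstream)  # schedule the dependency first
        visiting.discard(name)
        ordered.append(name)

    for name in steps:
        visit(name)
    return ordered


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

The steps are declared out of order on purpose: because each step names its input, the run order can be derived from the configuration rather than from the order in which the steps happen to be written.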

Figure 1. Create connections to remote data sources
Figure 2. Run immediately or create a schedule

With Workflows, you can connect to various data sources such as AWS S3 or MySQL (coming soon), train and run one or more models, and schedule recurring jobs to automate your synthetic data pipeline. The resulting synthetic data can be accessed from within your data services, thus staying within your control, while also being shareable with authorized teams in your organization.  

Here are some ways in which Workflows can be used:

  • Continuous PII obfuscation: Detect and transform PII in a production database and write the sanitized data to one or more development environments.
  • Single-step synthesis and evaluation: Generate data using multiple models and evaluate them at once.
  • Automated pre-production and demo data generation: Train a time series model and run a generate-and-transform workflow every week to create a dataset that’s fresh and current.
  • Post-processing and validation: Add one or more transformation steps to modify or drop synthetic data fields that don’t match your business logic, for example constraining numeric values to a certain range.
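
As a rough illustration of the post-processing use case, the following Python sketch clamps a numeric field to an allowed range and drops fields that should not reach downstream users. It is plain Python rather than a Gretel transform configuration; the `post_process` function, the sample records, and the field names are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, object]


def post_process(
    records: List[Record],
    field: str,
    low: float,
    high: float,
    drop_fields: Optional[List[str]] = None,
) -> List[Record]:
    """Clamp `field` to [low, high] and remove any fields in `drop_fields`."""
    drop = set(drop_fields or [])
    cleaned = []
    for record in records:
        # Copy the record without the unwanted fields.
        row = {k: v for k, v in record.items() if k not in drop}
        value = row.get(field)
        # Only clamp real numbers; leave missing or non-numeric values alone.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            row[field] = min(max(value, low), high)
        cleaned.append(row)
    return cleaned


# Hypothetical synthetic records with out-of-range ages and an internal field.
synthetic = [
    {"age": 134, "plan": "pro", "internal_id": "x1"},
    {"age": -2, "plan": "free", "internal_id": "x2"},
    {"age": 41, "plan": "pro", "internal_id": "x3"},
]

cleaned = post_process(
    synthetic, field="age", low=0, high=120, drop_fields=["internal_id"]
)
```

Clamping keeps every record and only adjusts the offending value. If your business logic calls for discarding out-of-range records instead, the same loop could skip them rather than adjust them.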

Workflows are available now in the Gretel Console as well as the SDK/CLI. Read the docs or visit the Gretel Console to start building today. 

Frequently Asked Questions (FAQ)

How are workflows different from models?

Workflows can train and run one or more models. Typically, you’d train a single model to get the optimal accuracy and privacy for your use case, and then create a scheduled workflow. The workflow could connect to your S3 bucket or other data store, train one or more models, run them, and output data back to the remote destination. 

Are workflows available for hybrid environments?

We’re actively working on supporting workflows in hybrid environments and the functionality will be live shortly.

Which connectors do you support?

We’re launching with AWS S3 support, and Google Cloud Storage (GCS) and Azure Blob Storage will both be available soon. We’re also building connectors for relational databases and data warehouses, such as MySQL, PostgreSQL, and Snowflake. 

How much will Workflows cost?

At this time, there's no additional cost to using Workflows. Note that on the free tier, processing multiple files at once or creating lots of scheduled workflows could increase consumption because of the number of models being created. However, we're currently offering 60 free credits per month instead of 15, so you have the flexibility to try out all these new features without running out of credits.

I already use a workflow management tool. Will Gretel Workflows replace it?

Gretel Workflows aren't intended to replace a workflow tool like Airflow. They’re a step in your data ops orchestration, alongside ETL and other processing tasks. 

Questions? Feedback? Hop on our Discord. We’d love to hear how you use Gretel’s workflows and connectors in your organization.