Gretel's Workflow Builder Streamlines Multi-Step Synthetic Data Generation for Financial Services

Gretel's Workflow Builder streamlines multi-step synthetic data generation for financial services, boosting privacy and efficiency with no coding required.

Overview

We're excited to announce the launch of Gretel's new Workflow Builder, a powerful feature designed to simplify the process of creating complex, multi-step synthetic data workflows. Over the past few months, we have seen some of our most avid users chain multiple synthetic data models and pre-processing steps in sequence to create synthetic data that meets their exact needs. This new addition to the Gretel Console enables users to visually chain together multiple synthetic data models in a few clicks, without the need to manually edit code-based YAML configurations.

For those new to Gretel, Workflows makes it easy to create pipelines that automate and operationalize synthetic data generation. Its real strength lies in its flexibility and integration options, allowing customers to build pipelines that connect to their preferred data sources and run on a set schedule.

In this blog post, we'll explore how the new Workflow Builder can be used to tackle a common challenge in the financial sector: creating privacy-preserving synthetic data for analytics and better AI model training. We'll walk through a two-step process on a financial transaction dataset stored in Amazon S3: first transforming the data (detecting, then de-identifying, redacting, or removing all sensitive entities), and then synthesizing a statistically similar version of it. Along the way, we'll also double the dataset in size while maintaining data utility and privacy.

You can also follow along with this demo via video.

The Dataset: Financial Transactions

For this demonstration, we'll be working with a financial transactions dataset stored in an Amazon S3 bucket named "customer-financial-transactions". The specific file we'll be using is called "customer_transactions.csv". This dataset contains approximately 9,000 records with sensitive information such as transaction amounts, account numbers, and customer details. Our goal is to create a synthetic version of this dataset that:

  1. Protects customer privacy by anonymizing sensitive information
  2. Doubles the size of the original dataset, providing more training data for AI models
  3. Maintains the statistical properties and relationships within the data
Figure 1. This dataset contains PII including the customer's first and last name and their International Bank Account Number (IBAN). It contains numeric, categorical, text, and free-text data.
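Based on the fields called out above, the column layout looks roughly like the sketch below. The column names here are hypothetical, inferred from Figure 1; the actual headers in "customer_transactions.csv" may differ.

```yaml
# Hypothetical column layout for customer_transactions.csv,
# inferred from Figure 1; actual column names may differ.
columns:
  - customer_first_name    # PII -> to be faked or redacted
  - customer_last_name     # PII -> to be faked or redacted
  - iban                   # PII -> International Bank Account Number
  - transaction_amount     # numeric
  - transaction_category   # categorical
  - description            # free text
```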

Creating the Workflow: A Step-by-Step Guide

Let's walk through the process of creating our workflow using the new Workflow Builder:

After logging into the Gretel Console and clicking on the "Workflows" option in the left sidebar, we’ll click the "New Workflow" button to begin.

To start, we’ll select a project and give our workflow a descriptive name, such as "Financial Transactions Synthesis". Then, click "Create Workflow" to enter the new Workflow Builder interface.

Define Input Data:

First, we'll define our input data. We can connect directly to our source data using one of Gretel's Connectors. For this demo, we'll use an Amazon S3 connection, choose the "customer-financial-transactions" bucket, and specify "customer_transactions.csv" as our input file.
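Under the hood, this step becomes a source action in the workflow's YAML configuration (viewable later in the Advanced tab). A minimal sketch follows; the action type, field names, and connection ID are illustrative and may not match the exact schema the builder emits:

```yaml
actions:
  - name: s3-read
    type: s3_source                      # illustrative action type
    connection: c_1abc_placeholder       # placeholder ID for the Amazon S3 connection
    config:
      bucket: customer-financial-transactions
      glob_filter: customer_transactions.csv
```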

Select Models:

Now that we’ve defined our input data, we’ll move on to model selection. Transforming data before synthesizing it adds an extra layer of protection against common privacy attacks [link to blog].

Edit the model step and select "Transform V2" as your first model. This will handle the initial data transformation and anonymization.

Transform V2 is our latest data transformation model, designed to offer enhanced flexibility and power in data anonymization and modification. It features custom transformation logic, an expanded library of detectable and fake-able entities, and improved PII detection capabilities. This makes it easy to create domain or format-specific transformation rules. Transform V2 is particularly useful for de-identifying datasets while maintaining their utility for downstream analysis.
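As a rough illustration, a Transform V2 configuration for this dataset might look like the sketch below: auto-detect PII entities, then replace each with a realistic fake value. The entity names and fake() calls are assumptions for illustration; consult the Transform V2 documentation for the exact syntax.

```yaml
schema_version: "1.0"
models:
  - transform_v2:
      globals:
        classify:
          enable: true                   # auto-detect PII entities in each column
      steps:
        - rows:
            update:
              # Replace detected PII with realistic fake values
              - entity: first_name
                value: fake.first_name()
              - entity: last_name
                value: fake.last_name()
              - entity: iban
                value: fake.iban()
```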

Next, we’ll add a second model and select "Navigator Fine Tuning". This model will generate our synthetic data.

  • Configure the Navigator Fine Tuning model to generate exactly 18,000 records. This will double the size of our original dataset, which contains approximately 9,000 records.

Navigator Fine Tuning is our latest flagship AI model, which supports multiple tabular data modalities within a single model, including numerical, categorical, free text, and sequential (time-series). It also excels at capturing complex relationships and distributions within data, making it the ideal choice for our current task. 
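A minimal model configuration for this step might look like the following sketch, assuming a navigator_ft model key and a num_records generation parameter; the exact keys may differ from what the builder emits:

```yaml
schema_version: "1.0"
models:
  - navigator_ft:
      data_source: __tmp__     # filled in by the workflow with the transformed data
      generate:
        num_records: 18000     # roughly double the ~9,000 input records
```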

By chaining these two models together, we ensure that our financial data is first properly anonymized and then synthesized with high fidelity, providing a privacy-preserving dataset that mimics the characteristics of the original data.

Configure Output: 

Now that we have our models chained, let’s specify where to write the generated data. Data can be stored in the Gretel Cloud or written out to an object storage bucket using a supported Gretel Connector.

We’ll edit the output step to write the generated dataset back to the same Amazon S3 bucket, into a folder called “gretel.” 
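In the workflow YAML, this output step corresponds to a destination action along the lines of the sketch below; the action type, step name, and field names are illustrative:

```yaml
actions:
  - name: s3-write
    type: s3_destination                 # illustrative action type
    connection: c_1abc_placeholder       # same S3 connection as the source (placeholder)
    input: synthesize                    # hypothetical name of the Navigator Fine Tuning step
    config:
      bucket: customer-financial-transactions
      path: gretel/
      filename: synthetic_customer_transactions.csv
```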

Set up Workflow Schedule:

Next, we’ll edit the trigger to choose a schedule for our workflow. You can run a workflow manually, or you can schedule it to run on a recurring basis: daily, weekly, monthly, or on a custom schedule defined by a cron string. For this example, we’ll schedule the workflow to run weekly. This weekly cadence aligns with the update frequency of the input "customer_transactions.csv" file, ensuring that the synthetic data reflects the most recent trends and patterns in the actual financial data.
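Cron strings follow the standard five-field format (minute, hour, day-of-month, month, day-of-week). For example, a weekly Monday-morning run could be expressed as below; the trigger key shown is an assumption about how the builder serializes the schedule:

```yaml
# Hypothetical trigger block; the exact key the builder emits may differ.
trigger:
  cron: "0 6 * * 1"   # every Monday at 06:00 UTC
```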

This automation is crucial for maintaining up-to-date synthetic data. By setting the Workflow to run automatically once per week, you ensure that you always have fresh synthetic data where you need it, when you need it.  

Regular updates to your synthetic dataset mean that data scientists, analysts, and other stakeholders always have access to current, privacy-compliant data for their models, reports, and applications. Automated Workflow runs reduce manual overhead and help mitigate the risks of model or concept drift by keeping data pipelines fresh, improving overall AI efficiency and powering continuous improvement.

This automated approach to synthetic data generation exemplifies how the Workflow Builder can streamline data operations and keep data current with minimal human intervention.

We can also switch over to the Advanced tab to see the YAML configuration that the Workflow Builder has generated based on our selections. This gives us the flexibility to make further customizations if needed.

To recap, here's the full workflow:
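(The sketch below is an approximation of the generated configuration: source, transform, synthesize, and write. Action types, template references, and IDs are illustrative and may differ from the exact YAML shown in the Advanced tab.)

```yaml
name: financial-transactions-synthesis
actions:
  - name: s3-read
    type: s3_source
    connection: c_1abc_placeholder         # placeholder connection ID
    config:
      bucket: customer-financial-transactions
      glob_filter: customer_transactions.csv

  - name: transform
    type: gretel_model                     # illustrative action type
    input: s3-read
    config:
      model: transform_v2                  # de-identify PII first
      training_data: "{outputs.s3-read.dataset.files.data}"

  - name: synthesize
    type: gretel_model
    input: transform
    config:
      model: navigator_ft                  # then generate synthetic records
      run_params:
        num_records: 18000

  - name: s3-write
    type: s3_destination
    connection: c_1abc_placeholder
    input: synthesize
    config:
      bucket: customer-financial-transactions
      path: gretel/
      filename: synthetic_customer_transactions.csv
```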

Finally, we’ll click "Create Workflow" to finalize our configuration. Since we chose to schedule the workflow, it will not run immediately, but we can select the "Also run the workflow" checkbox to execute it right away and see it in action.

Results: An Anonymized and Expanded Dataset

Once the workflow completes, the generated data can be found in the Amazon S3 bucket. The output file, located in the "gretel" folder, is a CSV containing:

  • 18,000 records, double the original dataset's approximately 9,000 records
  • Fully anonymized sensitive information
  • Preserved statistical properties and relationships from the original data

Let's compare a sample of the original and synthetic data:

Figure 2. A sample of the original input data.
Figure 3. A sample of the synthesized output data.

As you can see, the synthetic data looks a lot like the input data. The names and IBANs were transformed to redact PII, and then Navigator Fine Tuning generated a synthetic dataset that mimics the input data but is twice as large and carries none of the privacy risks. It’s especially worth noting how well Navigator Fine Tuning handles free-text values, like those in the "description" column.

Figure 4. Synthetic data quality and privacy scores of the synthesized dataset.

Looking at our synthetic data quality and data privacy scores, we can see that both are excellent. This dataset can now be used for analytics, machine learning model training, or sharing with partners without risking customer privacy.

Conclusion

Here at Gretel, we believe that better data makes for better models, which is why we help you generate synthetic datasets with the same characteristics as real data, so you can improve your AI models without compromising on privacy or data quality. Workflow Builder is just our latest feature to make generating quality data that much easier.

The new Gretel Workflow Builder makes it easier than ever to visually create a Workflow that chains multiple models to integrate the power of synthetics into your data pipeline, no code needed.  

We encourage you to try out the new Workflow Builder in the Gretel Console today. Experience firsthand how it can streamline your data operations, enhance privacy, and unlock new possibilities for your data-driven projects.

Ready to get started? Log in to the Gretel Console and create your first workflow with the new Workflow Builder!