How to Generate Best-in-Class Synthetic Time Series Data
Generating synthetic time series data that accurately mirrors complex business rules and sequences poses a significant challenge. This post explores leveraging Gretel's DGAN and Gretel Tuner to produce time series data. Our goal is to create synthetic data that not only statistically resembles the original data, but also upholds logical consistency and the correct sequence of events. This approach enables us to generate data compliant with intricate business rules, providing a robust solution for simulations, testing, and enhancing data privacy.
You can run this end-to-end example yourself with this Colab notebook or GitHub.
Let's dive in.
Dataset Description
Our dataset captures the series of events in a typical project management lifecycle, from initiation to completion, as shown in Figure 1. Crafted with Gretel Navigator, it reflects the essential stages of project management:
- A - Initiation: Establishing the project's base.
- B - Planning: Detailing execution strategies.
- C - Execution: Creating the project's outputs.
- D - Monitoring and Controlling: Ensuring the project meets its goals and quality standards.
- E - Closure: Officially completing the project.
This dataset categorizes events into two types:
- Mandatory Events: important for the project’s forward movement, like the Project Kick-off, Requirements Gathering, and Quality Assurance Testing.
- Optional Events: add flexibility and detail to the project, such as Risk Assessments and Client Feedback Sessions.
The workflow diagram in Figure 1 distinguishes the project phases and events, with mandatory events marked in darker shades and optional ones in lighter shades, giving a clear picture of a project's progression and structure.
An example sequence is provided below:
Leveraging DGAN and Gretel Tuner
To start, we need to install the Gretel client, which includes the Tuner package. This package is essential for optimizing our DGAN model to better fit our specific data generation needs.
Creating a custom metric
The goal is to generate a dataset containing synthetic time-series reflecting the statistical properties of the events in the source data. To this end, we will be using Gretel Tuner to fine-tune the DGAN model hyperparameters to optimize a custom metric capturing these properties.
Let’s first analyze our source dataset. Figure 3 shows the normalized histogram of event types in the source data. Figure 4 shows the event transition matrix of the source data. The metric we are interested in needs to ensure that the synthetic data closely matches both the distribution of event types and the event transition probabilities of the original dataset. Therefore, we define the metric as a weighted combination of (a) the Euclidean distance between the normalized event histograms of the original and synthetic datasets, and (b) the Frobenius norm of the difference between the transition matrices of the original and synthetic sequences.
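As a rough illustration of the metric described above, here is a minimal standard-library sketch. The function names, the `hist_weight` parameter, and its default of 0.5 are assumptions made for this example, not the notebook's actual code; it also assumes every event in the sequences appears in `events`.

```python
import math
from collections import Counter


def event_histogram(sequences, events):
    """Normalized frequency of each event type across all sequences."""
    counts = Counter(e for seq in sequences for e in seq)
    total = sum(counts[e] for e in events)
    return [counts[e] / total if total else 0.0 for e in events]


def transition_matrix(sequences, events):
    """Row-normalized matrix of P(next event | current event)."""
    index = {e: i for i, e in enumerate(events)}
    n = len(events)
    counts = [[0] * n for _ in range(n)]
    for seq in sequences:
        # Count each consecutive (current, next) pair within a sequence.
        for cur, nxt in zip(seq, seq[1:]):
            counts[index[cur]][index[nxt]] += 1
    matrix = []
    for row in counts:
        row_sum = sum(row)
        matrix.append([c / row_sum if row_sum else 0.0 for c in row])
    return matrix


def sequence_distance(real, synthetic, events, hist_weight=0.5):
    """Weighted sum of histogram distance and transition-matrix distance.

    Lower is better; 0.0 means both statistics match exactly.
    hist_weight is a hypothetical knob for balancing the two terms.
    """
    h_real = event_histogram(real, events)
    h_syn = event_histogram(synthetic, events)
    # (a) Euclidean distance between the normalized histograms.
    hist_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_real, h_syn)))

    t_real = transition_matrix(real, events)
    t_syn = transition_matrix(synthetic, events)
    # (b) Frobenius norm of the difference between the transition matrices.
    frob = math.sqrt(
        sum((a - b) ** 2 for ra, rb in zip(t_real, t_syn) for a, b in zip(ra, rb))
    )
    return hist_weight * hist_dist + (1 - hist_weight) * frob
```

Because both terms are distances, a tuner using a metric like this would minimize it.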
Preparation for DGAN model training
To prepare the time-series data for the DGAN model, we first identify the maximum sequence length across our dataset and then pad every time series to that length. This uniformity is important for the model to effectively learn and generate synthetic sequences that accurately reflect the original dataset's temporal patterns. These parameters can then be specified in the DGAN config as follows:
The example_id_column parameter tells the model to group the data into separate training sequences, each identified by its value in the PROJECT_ID column.
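The padding step described above might look like the following sketch. The `pad_sequences` helper, the `"PAD"` token, and the example project IDs and event labels are all hypothetical; the post does not show which padding value its notebook uses.

```python
def pad_sequences(sequences, pad_value="PAD"):
    """Right-pad every sequence to the length of the longest one.

    `sequences` maps a sequence ID (e.g. a PROJECT_ID value) to its
    ordered list of events. Returns the padded mapping and the maximum
    sequence length.
    """
    max_len = max((len(seq) for seq in sequences.values()), default=0)
    padded = {
        key: list(seq) + [pad_value] * (max_len - len(seq))
        for key, seq in sequences.items()
    }
    return padded, max_len


# Hypothetical toy data: two projects with different numbers of events.
projects = {
    "P1": ["Project Kick-off", "Requirements Gathering", "Quality Assurance Testing"],
    "P2": ["Project Kick-off", "Quality Assurance Testing"],
}
padded, max_len = pad_sequences(projects)
```

The resulting `max_len` is the value one would carry into the model configuration as the maximum sequence length.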
Given the above defined base config, we now define the Gretel Tuner config for the DGAN model as follows:
Given the custom metric and the config, we now run the tuner:
The tuner launches several training jobs in parallel and automatically starts a new one as soon as a previous one finishes. In this example, we set the number of trials to n_trials=24 and the number of concurrent jobs to n_jobs=4.
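To make the scheduling pattern concrete, here is a toy sketch of running 24 trials with at most 4 in flight at a time. It is not Gretel Tuner's implementation: it substitutes plain random search and a made-up scoring function (`run_trial`) for real model training, and the hyperparameter names and ranges are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_trial(params):
    """Stand-in for one training job; returns a toy score (lower is better)."""
    return (params["learning_rate"] - 1e-3) ** 2 + 1e-6 * params["batch_size"]


def random_search(n_trials=24, n_jobs=4, seed=0):
    """Run n_trials random configurations with at most n_jobs concurrently."""
    rng = random.Random(seed)
    candidates = [
        {
            "learning_rate": 10 ** rng.uniform(-5, -2),
            "batch_size": rng.choice([32, 64, 128]),
        }
        for _ in range(n_trials)
    ]
    results = []
    # The pool keeps up to n_jobs trials running and picks up the next
    # queued trial as soon as a worker becomes free.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = {pool.submit(run_trial, p): p for p in candidates}
        for fut in as_completed(futures):
            results.append((fut.result(), futures[fut]))
    best_score, best_params = min(results, key=lambda r: r[0])
    return best_score, best_params, results
```

A real tuner would typically use earlier results to choose later configurations; random search is used here only to keep the sketch short.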
Validating Synthetic Data
Once we have found our optimal DGAN model settings using the Gretel Tuner, we can select the best model and use it to generate synthetic time-series data.
To assess the accuracy and quality of the synthetic data produced, let’s examine how the generated sequences align with the real data. Figure 5 shows sample event sequences generated by the DGAN model alongside sample sequences from the source dataset. Figure 6 compares the distribution of events in the synthetic dataset with the original, showing how well the model captures the overall frequency of the non-mandatory events. Lastly, Figure 7 shows the transition probabilities between events in the synthetic data and how closely they match those of the original dataset.
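Beyond the statistical comparisons above, one could also check business rules on the generated sequences directly. The sketch below assumes a hypothetical rule: the three mandatory events named earlier must each occur exactly once, in the order listed. That ordering and the helper names are assumptions for illustration and are not taken from the dataset definition.

```python
# Assumed ordering of the mandatory events (hypothetical rule).
MANDATORY_ORDER = [
    "Project Kick-off",
    "Requirements Gathering",
    "Quality Assurance Testing",
]


def follows_mandatory_order(sequence, mandatory=MANDATORY_ORDER):
    """True if each mandatory event occurs exactly once, in the given order."""
    seen = [event for event in sequence if event in mandatory]
    return seen == list(mandatory)


def rule_compliance_rate(sequences, mandatory=MANDATORY_ORDER):
    """Fraction of sequences that satisfy the mandatory-order rule."""
    if not sequences:
        return 0.0
    passed = sum(follows_mandatory_order(seq, mandatory) for seq in sequences)
    return passed / len(sequences)
```

Computing this rate for both the source and the synthetic sequences gives a simple side-by-side measure of logical consistency.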
Ready to Synthesize Your Own Time Series Dataset?
Give Gretel DGAN and Gretel Tuner a spin next time you're working with sensitive time series data. Whether you're aiming to enhance data privacy, fill data gaps, or simulate complex business processes, our tools are designed to support your goals. Sign up to start generating accurate, compliant, and logically consistent synthetic datasets today, and unlock new possibilities for your projects.