The Evolution of Gretel's Developer Stack for Synthetic Data
When we launched Gretel almost three years ago, it was to solve a simple problem: people need data and they have a hard time getting it. Getting access to production (and usually sensitive) data is difficult. Products and technologies that claim to accelerate this process through governance, access control, and auditing just don't cut it. In fact, these products weren't even built with the consumer of data in mind: developers, scientists, analysts, and researchers.
Gretel set out to create the developer stack for data. In this post, we introduce some of our newest product and technology initiatives, which will help our platform grow with the needs of data consumers for years to come.
Here's an overview of what we'll be discussing today:
- The toolkit: A quick recap of Gretel's core toolkit and the three functions of synthesizing, classifying, and transforming data.
- Mobilize all the models: Complexity in data drives the need to support a variety of synthetic models that offer state-of-the-art performance for different tasks, from generating time series to generating images. We'll review our current infrastructure and introduce the Model Integration Framework (MIF).
- Automate it all: A sneak peek at the improvements we are making to drastically lower the barrier to entry for synthetic data generation.
The Toolkit
When we first launched Gretel, early discussions with customers led us to orient our beta releases around abstract workload types. We found that users' needs generally fell into one of three core workloads:
- Classification: The identification and tracking of where sensitive information exists within datasets for a variety of use cases, such as alerting, access control, etc.
- Transforms: The application of discrete mutations to data for the purposes of de-identification, tokenization, and anonymization.
- Synthetics: The creation of new datasets that look and feel like the originals. Gretel's flagship gretel-synthetics model is a neural network-based language model with advanced features including differentially-private training and conditional data generation. It supports numeric, categorical, and text-based datasets.
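To make the first two workloads concrete, here is a minimal Python sketch. It is not Gretel's implementation: the `PATTERNS` table and the `classify`, `tokenize`, and `transform` functions are hypothetical names invented for this example, the detector is a toy built on two regular expressions, and salted hashing stands in for tokenization as one simple option among many.

```python
import hashlib
import re

# Hypothetical detectors for illustration only; a real classifier
# would cover far more entity types than these two patterns.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def classify(record: dict) -> dict:
    """Return {field: [entity labels]} for every field containing a match."""
    findings = {}
    for field, value in record.items():
        labels = [name for name, rx in PATTERNS.items() if rx.search(str(value))]
        if labels:
            findings[field] = labels
    return findings


def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a deterministic token (salted SHA-256, truncated)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"tok_{digest[:12]}"


def transform(record: dict, findings: dict) -> dict:
    """Tokenize each matched substring in flagged fields; copy other fields as-is."""
    out = dict(record)
    for field, labels in findings.items():
        text = str(out[field])
        for label in labels:
            text = PATTERNS[label].sub(lambda m: tokenize(m.group(0)), text)
        out[field] = text
    return out


record = {"name": "Ada", "contact": "ada@example.com", "note": "call 555-123-4567"}
findings = classify(record)          # which fields hold sensitive values
safe = transform(record, findings)   # same record with those values tokenized
```

Because the token is deterministic, the same input always maps to the same token, so joins across tables still work after de-identification. That is a design choice of this sketch, not a statement about how Gretel's transforms behave.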
As we worked with more users and customers, the complexity of data and use cases grew, and we learned a couple of key lessons:
- For synthetic data generation, we realized that we would need a variety of machine learning model types, depending on the kind of data being processed (including text, structured data, images, and, in the future, even video).
- The classify and transform capabilities are often more powerful when combined with synthetics, which we demonstrated in our financial time-series data case study with a top global financial institution.
With the realization that synthetic data is a non-zero-sum game, Gretel engineering set out to build the next evolution of our platform.
Mobilize All The Models
The variety of machine learning models and resources for synthetic data purposes is massive! To take advantage of everything out there, we knew we needed a way to make any model generally available through Gretel's APIs and interfaces. So earlier this year, we internally debuted our Model Integration Framework (MIF), which enables the rapid development and deployment of arbitrarily complex machine learning tasks.
The goals of Gretel's MIF are:
- Allow scientists and engineers to plug their own model code into existing scaffolding, similar to other serverless frameworks (like AWS Lambda, but with GPUs).
- Power continuous integration and deployment (CI/CD) architecture that automatically builds and releases containers for each unique model.
- When Gretel Cloud APIs are called and receive a Gretel Configuration, let our event-driven job scheduler provision and launch the appropriate container into a managed Kubernetes cluster as a job. Depending on the model’s requirements, provision appropriate CPU or GPU resources automatically!
- When jobs complete, let users access the output artifacts such as their synthetic data and synthetic quality reports.
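The four goals above can be pictured with a small, purely illustrative Python sketch. This is not Gretel's MIF code: `ModelSpec`, `register_model`, `schedule_job`, `run_job`, the configuration shape, the image naming scheme, and the resource numbers are all assumptions made for the example. The Kubernetes launch is replaced by a plain job description that runs in-process.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass(frozen=True)
class ModelSpec:
    """What the scaffolding needs to know about a plugged-in model."""
    name: str
    needs_gpu: bool
    run: Callable[[Rows], Rows]


REGISTRY: Dict[str, ModelSpec] = {}


def register_model(name: str, needs_gpu: bool = False):
    """Decorator that plugs model code into the shared scaffolding."""
    def decorator(fn: Callable[[Rows], Rows]):
        REGISTRY[name] = ModelSpec(name=name, needs_gpu=needs_gpu, run=fn)
        return fn
    return decorator


@register_model("reverse-demo")
def reverse_demo(rows: Rows) -> Rows:
    """Placeholder 'model' that only reverses row order; no learning happens."""
    return list(reversed(rows))


@register_model("gpu-demo", needs_gpu=True)
def gpu_demo(rows: Rows) -> Rows:
    """Placeholder for a GPU-backed model; here it just copies its input."""
    return [dict(r) for r in rows]


def schedule_job(config: dict) -> dict:
    """Turn a configuration into a job description with matching resources."""
    spec = REGISTRY[config["model"]]  # raises KeyError for an unknown model
    resources = {"cpu": 2, "gpu": 1 if spec.needs_gpu else 0}
    return {"model": spec.name, "image": f"models/{spec.name}:latest", "resources": resources}


def run_job(job: dict, rows: Rows) -> dict:
    """Run the job in-process and return its output artifacts."""
    output = REGISTRY[job["model"]].run(rows)
    return {"data": output, "report": {"rows_in": len(rows), "rows_out": len(output)}}


job = schedule_job({"model": "gpu-demo"})
artifacts = run_job(job, [{"x": 1}, {"x": 2}])
```

The point of the registry is that model authors only write the decorated function. Packaging, resource selection, and artifact handling stay in the shared scaffolding, so adding a model does not require touching the scheduler.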
With the MIF deployed, users can easily choose the model that best fits the type of data they have, which is illustrated below.
We have started to release blog posts and documentation for these models, and will continue to do so as new models become available. The models include:
- Gretel Synthetics: our flagship LSTM-based, general-purpose synthetic data model, which offers state-of-the-art accuracy and flexibility.
- Gretel DGAN: a modern implementation of DoppelGANger that was also open sourced in our gretel-synthetics Python package. This model will be launching soon in Gretel's Platform!
- Gretel ACTGAN: a general-purpose Generative Adversarial Network that works well for datasets containing numerical data and high column counts.
- Gretel GPT: a generative pre-trained transformer that can create synthetic free-text data based on an open-source implementation of the GPT-3 architecture. Check out our GPT blog and tutorial video.
- Gretel Evaluate: this model can compare any two datasets and evaluate the usability and quality of a synthetic dataset. It also lets users run the Synthetic Quality Score (SQS) report as a standalone job.
- Gretel Amplify: a general-purpose copula model that can take existing synthetic data and drastically increase its size in a short amount of time. This can be useful for high-volume use cases such as data warehousing and analytics.
Now, with our ability to rapidly launch different models and jobs, we've set our sights on the next evolution: synthetic data automation.
Automate It All
So what’s coming next?
Gretelers are hard at work upgrading our entire job-running architecture. We are doubling down on cloud-native core components, such as Kubernetes, and adding in the ability to construct complex pipelines to bring a level of automation to our product that has yet to be seen in the synthetic data industry.
While our architecture will evolve to support more complex workloads, we will be adding a level of automation into the product that actually reduces complexity drastically for our users. Some of the key tenets of this next evolution are:
- Seamlessly integrate with modern data sources and sinks such as object stores, data warehouses, and other cloud-native data vendors.
- Remove the need to make explicit choices on which models are needed for which types of data.
- Automatically apply multiple model types to partitions of data through a divide-and-conquer strategy.
- Automatically utilize classification and transform policies as privacy-preserving pre-processing steps before synthetic data model creation.
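As a rough illustration of the last three tenets, the sketch below builds a pipeline plan from a small table. Everything in it is an assumption made for the example rather than a description of Gretel's product: the `infer_kind` heuristic, the `ROUTES` table, the generic model labels `tabular-model` and `text-model`, and the `build_pipeline` function are hypothetical, and the list of sensitive columns is passed in as if a classification step had already produced it.

```python
from typing import Dict, List

Table = Dict[str, List[object]]

# Hypothetical routing from column kind to a generic model family.
ROUTES = {
    "numeric": "tabular-model",
    "categorical": "tabular-model",
    "free_text": "text-model",
}


def infer_kind(values: List[object]) -> str:
    """Very rough column typing: numeric, free text, or categorical."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    if any(isinstance(v, str) and len(v.split()) > 3 for v in values):
        return "free_text"
    return "categorical"


def partition_columns(table: Table) -> Dict[str, List[str]]:
    """Divide and conquer: group column names by the model family that handles them."""
    plan: Dict[str, List[str]] = {}
    for column, values in table.items():
        plan.setdefault(ROUTES[infer_kind(values)], []).append(column)
    return plan


def build_pipeline(table: Table, sensitive_columns: List[str]) -> List[dict]:
    """Order the steps: privacy transforms first, then one training step per partition."""
    steps: List[dict] = []
    flagged = sorted(set(sensitive_columns) & set(table))
    if flagged:
        steps.append({"step": "transform", "columns": flagged})
    for model, columns in partition_columns(table).items():
        steps.append({"step": "train", "model": model, "columns": columns})
    return steps


table = {
    "age": [34, 51, 29],
    "plan": ["basic", "pro", "basic"],
    "ticket": ["Cannot log in after the latest update", "Billing page shows an error", "n/a"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
}
pipeline = build_pipeline(table, sensitive_columns=["email"])
```

The caller never names a model. The plan is derived from the data itself, which is the kind of choice the tenets above aim to remove from the user.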
The illustration below shows what these automatic pipelines will look like.
Our goal is to make Gretel the easiest and most intuitive place to generate synthetic data. To that end, our next evolution will provide a thin, opinionated set of interfaces that hide the complexity of privacy engineering, machine learning model selection, data partitioning, and synthetic data generation from the user. While we firmly believe in allowing access to complexity as needed, our guiding light will always be user delight and simplicity in usage.
We are super excited to build the next revolution of the synthetic data stack for everyone. We can't do it without the support of our customers and user base. If this is exciting to you or you have feedback, questions, or just want to chat about synthetic data, feel free to reach out at firstname.lastname@example.org or join and share your ideas in our Discord community.