Build differentially private synthetic datasets in Python.
What is synthetic data?
While synthetic data can mean many things, Gretel’s definition of synthetic data is artificial data created by training a machine learning model to re-create a source dataset. The synthetic model outputs artificial data that contains many of the insights and correlations of the original data without memorizing any records from the original data.
How do I get started?
How does Gretel synthetics create artificial data?
Gretel-synthetics uses a sequence-to-sequence architecture: it trains on a text dataset and learns to predict the next characters in a sequence. The underlying model is a Long Short-Term Memory (LSTM) artificial neural network, which learns from any kind of text or structured data and creates new synthetic examples of it.
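To make "learning to predict the next characters" concrete, here is a minimal sketch of how a line of text can be turned into (context, next character) training pairs. The helper name next_char_pairs and the window size are assumptions made for illustration; this is not the library's actual tokenization or windowing code.

```python
def next_char_pairs(text, window=3):
    """Slide a fixed-size window over text.

    Each window of characters is paired with the single character
    that follows it, which is what a next-character model predicts.
    """
    pairs = []
    for i in range(len(text) - window):
        context = text[i:i + window]
        target = text[i + window]
        pairs.append((context, target))
    return pairs


# A tiny CSV-like record, including its line delimiter.
record = "id,age\n"
pairs = next_char_pairs(record, window=3)
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

A model trained on many such pairs picks up structure such as where delimiters and line breaks tend to occur, which is why consistent, pattern-rich formats are easier to learn.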
Is there an architecture diagram?
This flow diagram walks through the process of loading source data, training a generative machine learning model, and using the model to create a synthetic dataset using gretel-synthetics.
What kinds of data can I send to Gretel-synthetics?
You can create synthetic data from any kind of text data, whether structured or unstructured. The simpler the data format, the better: we recommend CSV, Pandas DataFrames, or unstructured text delimited by line. Gretel-synthetics thrives on patterns. For example, it does particularly well on machine learning datasets, as it is able to replicate both within-field distributions and cross-field correlations. Conversely, if all the field values in your dataset are highly unique, the model will struggle to find exploitable patterns.
What are the outputs from Gretel-synthetics?
The outputs include a synthetic data model that can be used to generate synthetic data records, an initial set of synthetic data records, and an available premium reporting module that assesses the quality and correlations between the synthetic data and the original dataset.
Can I run gretel-synthetics on premises?
Some customers need to train synthetic data models within a compliance-approved environment. Gretel-synthetics is open source and can be deployed as a Python package or Docker container. However, some premium features require an API key and connection to Gretel’s APIs.
What are gretel-synthetics premium features?
With Gretel API access, developers can access premium features such as automated data format validation to ensure that all data generated by the neural network matches the structure and distributions of the source data, field correlation and clustering to improve insights captured by the synthetic data model, and a reporting module that generates an HTML report assessing the quality and correlations between the synthetic data and the original dataset.
Do I still need to de-identify sensitive data when using gretel-synthetics?
Gretel-synthetics will memorize and replay repeated data that it observes in the training set. For sensitive identifying data types, such as fields containing Social Security or credit card numbers, it is best to anonymize the data as thoroughly as possible before training the synthetic model. Gretel helps you automate this process with our data labeling APIs and transformation SDKs.
What kinds of privacy protections can Gretel Synthetics help with?
Gretel-synthetics is designed to help developers and data scientists create safe, artificial datasets with many of the same insights as the original dataset, but with greater guarantees around protecting personal data or secrets in the source data. Gretel’s implementation of differential privacy helps guarantee that individual secrets or small groups of secrets, such as a credit card number inside structured or unstructured data fields, will not be memorized or repeated in the synthetic dataset. Gretel’s synthetic data library also helps to defend against re-identification and joinability attacks, where traditionally anonymized data can be joined with another dataset, even one that has not been created yet, to re-identify users.
How is Gretel-synthetics differential privacy different from traditional implementations?
Several companies, including Uber, have built libraries that help apply differential privacy to SQL queries by injecting noise into the results of a query aggregation. This approach is powerful, but it requires you to know in advance which questions you want to ask of the data, and it does not let you see or inspect the sensitive data directly. Gretel-synthetics is a sequence-to-sequence model that trains on a source dataset, injects noise during the learning process rather than at query time, and creates a secondary dataset that can be shared and viewed directly by data scientists or developers, or queried using any database technology.
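For contrast, the query-time approach can be sketched as follows. This example assumes the classic Laplace mechanism applied to a count query; it is an illustration of the general idea, not the method used by any particular SQL library. The function name noisy_count and the sample data are hypothetical.

```python
import random


def noisy_count(values, predicate, epsilon, rng):
    """Answer a count query with Laplace noise added to the result.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38]  # hypothetical sensitive column
result = noisy_count(ages, lambda age: age >= 35, epsilon=1.0, rng=rng)
print(result)
```

Every new question needs its own noisy query against the private data. A synthetic dataset, by contrast, is generated once by a model whose training was noised and can then be explored freely.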
How is synthetic data different from the original source data it was trained on?
Gretel-synthetics creates artificial data by training a machine learning model to create data just like the input data it was trained on. For example, if you train on a CSV dataset the output will be CSV.
How many lines of input data do I need to train a synthetic model?
We generally recommend 5000+ examples. If you have a smaller dataset of only a few hundred lines, try training for 100+ epochs to learn the structure. If you are working with a high-dimensional dataset (e.g. 15+ columns), we recommend 15000+ examples.
How many columns of training data can I have?
Gretel works best when learning models for densely packed datasets with 50 or fewer columns of data. However, there is no limit on the number of columns (dimensionality) in your dataset. Gretel-synthetics clusters highly correlated columns into batches to be trained independently and then joins the results. We have tested datasets with up to 1,500 columns of sparse data.
How many epochs should I train my model with?
The right number of epochs depends on the inherent perplexity (or complexity) of your dataset. A good rule of thumb is to start with a value that is 3 times the number of columns in your data. If the model is still improving when the final epoch completes, try again with a higher value. If the model stopped improving well before the final epoch, try again with a lower value, as you may be overtraining. If you have only a small number of records in your dataset, or a large number of records are failing validation, you may need to increase the number of epochs significantly to help the neural network learn the structure of the data.
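The rule of thumb above can be sketched as two small helpers. The function names and the example loss values are hypothetical; they only illustrate the heuristic and are not part of gretel-synthetics.

```python
def starting_epochs(num_columns):
    """Rule-of-thumb starting point: 3 times the number of columns."""
    return 3 * num_columns


def last_improving_epoch(losses, min_delta=0.0):
    """Return the 1-based epoch at which the loss last improved.

    If this is the final epoch, the model was still improving and more
    epochs may help. If it is much earlier, fewer epochs may be enough.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
    return best_epoch


epochs = starting_epochs(12)                  # a 12-column dataset
losses = [2.0, 1.5, 1.2, 1.1, 1.1, 1.15]      # made-up per-epoch losses
stalled_at = last_improving_epoch(losses)
print(epochs, stalled_at)
```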
Does training a synthetic model require a GPU?
A GPU is highly recommended but not required to get started with gretel-synthetics. As a rule of thumb, you can expect training the synthetic model to be 10x faster or more on a GPU. However, inference is not nearly as parallelizable as training, so for text generation we recommend CPUs and Gretel’s available multi-processing support.
What is differential privacy?
Differential privacy is a framework for measuring the privacy guarantees provided by an algorithm. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data. Learning with differential privacy provides provable guarantees of privacy, mitigating the risk of exposing sensitive training data in the synthetic data model or its output. Intuitively, a model trained with differential privacy should not be affected by any single training example, or small set of training examples in its data set.
How does Gretel-synthetics leverage differential privacy?
Gretel-synthetics uses differential privacy to defend against memorization while learning on a private dataset. Imprecisely speaking, the output of a synthetic model trained on a dataset D that contains one occurrence of a secret training record X should be nearly identical to the output of a model trained on a dataset D1 that does not contain X. Thus, we have mathematical assurances that our model did not memorize the secret.
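Stated a little more formally, "nearly identical" corresponds to the standard (ε, δ) definition of differential privacy. The symbols M (the randomized training procedure) and S (any set of possible outputs) are notation introduced here for illustration; D and D1 are the two datasets described above, which differ only in the record X.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D_1) \in S] + \delta
```

The smaller ε and δ are, the harder it is to tell from the output whether X was part of the training data.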
How does Gretel-synthetics implement differential privacy?
The TensorFlow team has done much of the heavy lifting by implementing and releasing TensorFlow Privacy, an extension to TensorFlow that allows differentially private learning. Gretel-synthetics uses the open source DP-SGD code in the TensorFlow Privacy library with slight modifications to adapt it to recurrent neural networks, and improves the baseline performance by replacing the plain SGD optimizer with an RMSProp optimizer, which often gives higher accuracy than vanilla SGD (Tijmen Tieleman and Geoffrey Hinton, COURSERA: Neural networks for machine learning, 4(2):26–31, 2012).
If my model trained in batches using differential privacy, what is my final epsilon (privacy guarantee)?
When differentially private models are trained on disjoint subsets of a private database, their combined use has an epsilon value equal to the maximum across all models.
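As a small illustration of this rule, assuming each batch model reports its own epsilon and the batches really are disjoint, the combined value is simply the largest one. The helper name combined_epsilon and the sample values are hypothetical.

```python
def combined_epsilon(batch_epsilons):
    """Combined epsilon for models trained on disjoint subsets of the data."""
    if not batch_epsilons:
        raise ValueError("at least one batch epsilon is required")
    return max(batch_epsilons)


# Made-up epsilons reported by three batch models.
final_epsilon = combined_epsilon([0.8, 1.3, 1.1])
print(final_epsilon)
```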
What are good epsilon (ε) and delta (δ) values in differential privacy?
Epsilon is your quantitative privacy guarantee. It gives a ceiling on how much the probability of a particular output can increase if you were to add or remove a single training example. Stringent privacy needs usually require an epsilon value of less than one. However, in some domains it’s not uncommon to see epsilons of up to 10 being used. Delta is a bound on the external risk that won’t be restricted by epsilon. External risk is that which inherently exists no matter what you do with your dataset. By default, Gretel initializes delta to 1 / (number of training samples). Delta values such as 1e-05 or less should not compromise utility.
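To get a feel for these numbers, the sketch below computes the default delta described above and the multiplicative ceiling e^ε that epsilon places on how much an output's probability can grow under the standard definition of differential privacy. Both function names are hypothetical helpers, not Gretel APIs.

```python
import math


def default_delta(num_training_samples):
    """Default delta described above: 1 / (number of training samples)."""
    if num_training_samples <= 0:
        raise ValueError("num_training_samples must be positive")
    return 1.0 / num_training_samples


def max_probability_ratio(epsilon):
    """Largest factor by which an output's probability can grow (ignoring delta)."""
    return math.exp(epsilon)


delta = default_delta(5000)
ratio_strict = max_probability_ratio(1.0)   # stringent setting
ratio_loose = max_probability_ratio(10.0)   # upper end mentioned above
print(delta, ratio_strict, ratio_loose)
```

The ceiling grows exponentially with epsilon, which is why an epsilon below one is a much stronger guarantee than an epsilon of 10.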
How is Stochastic Gradient Descent (SGD) modified to be differentially private?
SGD works by stochastically sampling a set of training examples, computing the loss (the difference between the predicted value and the real value), computing the gradient of the loss, scaling these gradients by the learning rate, and then using the resulting values to update the model parameters. Iterating this process is what’s meant by descent. There are a few main changes to this process to make it differentially private. First, the gradients are clipped so that no single training example can unduly impact the model. Second, random noise is added to the clipped gradients to make it impossible to deduce which examples were included in the training. Additionally, instead of clipping gradients at the batch level, they are clipped in micro-batches. The more clipping, noise, and micro-batching you apply, the more differentially private your model will be. As there is often a trade-off between privacy and utility, Gretel-synthetics exposes each of these elements as modifiable parameters in the training.
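The steps above can be sketched in plain Python for a toy linear model with squared-error loss. This is a minimal illustration under stated assumptions, not Gretel's or TensorFlow Privacy's implementation: it uses plain SGD rather than RMSProp, treats every example as its own micro-batch, and the names dp_sgd_step, l2_norm_clip, and noise_multiplier are chosen for illustration.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in vec]


def dp_sgd_step(weights, batch, lr, l2_norm_clip, noise_multiplier, rng):
    """One differentially private SGD update for a linear model."""
    summed = [0.0] * len(weights)
    for features, target in batch:  # each example is one micro-batch
        pred = sum(w * x for w, x in zip(weights, features))
        grad = [2.0 * (pred - target) * x for x in features]
        # 1) Clip so no single example can dominate the update.
        clipped = clip(grad, l2_norm_clip)
        summed = [s + g for s, g in zip(summed, clipped)]
    # 2) Add Gaussian noise calibrated to the clipping norm.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    # 3) Average, scale by the learning rate, and update the parameters.
    averaged = [g / len(batch) for g in noisy]
    return [w - lr * g for w, g in zip(weights, averaged)]


rng = random.Random(0)
batch = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([0.0, 1.0], 2.0)]
weights = dp_sgd_step([0.0, 0.0], batch, lr=0.1,
                      l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng)
print(weights)
```

A lower clipping norm or a higher noise multiplier strengthens privacy but makes each update less informative, which is the privacy and utility trade-off described above.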
What does RDP order mean?
Gretel uses a variation of differential privacy referred to as Rényi differential privacy (RDP). RDP uses the Rényi divergence to measure the distance between distributions. Rényi divergence is a generalization of Kullback-Leibler divergence that is parameterized by a value referred to as its “order”. In RDP, the idea is to search for the order that optimizes epsilon (i.e. your privacy guarantee). When running Gretel-synthetics, the “optimal RDP order” will be printed along with epsilon and delta once training completes.
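The search over orders can be sketched as follows. This example assumes a plain Gaussian mechanism with sensitivity 1, composed over a fixed number of steps, and the standard conversion from an RDP guarantee to an (epsilon, delta) guarantee. It ignores the privacy amplification from batch sampling that production accountants take into account, so its epsilons are looser than what a real training run would report. The function names and the list of candidate orders are illustrative assumptions.

```python
import math


def rdp_gaussian(order, noise_multiplier, steps):
    """RDP of `steps` composed Gaussian mechanisms (sensitivity 1) at `order`."""
    return steps * order / (2.0 * noise_multiplier ** 2)


def best_epsilon(orders, noise_multiplier, steps, delta):
    """Convert RDP to (epsilon, delta)-DP at each order and keep the smallest epsilon."""
    best = None
    for order in orders:
        rdp = rdp_gaussian(order, noise_multiplier, steps)
        eps = rdp + math.log(1.0 / delta) / (order - 1.0)
        if best is None or eps < best[0]:
            best = (eps, order)
    return best


orders = [1.25, 1.5, 2, 3, 4, 6, 8, 16, 32, 64]
eps, optimal_order = best_epsilon(orders, noise_multiplier=4.0, steps=10, delta=1e-5)
print("epsilon:", eps, "optimal RDP order:", optimal_order)
```

Low orders pay a large penalty from the log(1/δ) term and high orders pay a large RDP cost, so the best epsilon sits at an intermediate order. That order is what the "optimal RDP order" in the training output refers to.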