Advanced Data Privacy: Gretel Privacy Filters and ML Accuracy

A look at how using Gretel’s Privacy Filters to immunize synthetic datasets against adversarial attacks can impact machine learning accuracy.
Copyright (c) 2021 Gretel.ai

Last year, we explored the efficacy of using synthetic data in machine learning by creating synthetic versions of eight popular Kaggle datasets and testing them on a variety of popular ML algorithms. We were excited to find that the use of synthetic data had minimal impact on model accuracy. Back in September, we introduced Gretel’s new Privacy Filters, one of several advanced data privacy mechanisms whose use is reflected in the Privacy Protection Level (PPL) score now shown in the Synthetic Performance Report. Now, on top of the privacy inherent in the use of synthetic data, users can add further protection to ensure their synthetic data is safe from adversarial attacks. In this article, we explore the impact of these new Privacy Filters on machine learning accuracy.

Gretel Privacy Filters

The Gretel Privacy Filters are the culmination of extensive research into the nature of adversarial attacks on synthetic data. They prevent the creation of synthetic data with weaknesses commonly exploited by adversaries. There are two Privacy Filters: the Similarity Filter and the Outlier Filter. The Similarity Filter prevents the creation of synthetic records that are overly similar to a training record; such records are prime targets for adversaries seeking to gain insights into the original data. The Outlier Filter prevents the creation of synthetic records that would be deemed outliers in the space defined by the training data. Outliers revealed in a synthetic dataset can be exploited by Membership Inference Attacks, Attribute Inference Attacks, and a wide variety of other adversarial attacks; they are a serious privacy risk.
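To make the mechanics concrete, below is a toy sketch of what similarity and outlier filtering can look like in principle. It is not Gretel’s implementation: the nearest-neighbor similarity score, the IsolationForest-based outlier test, and the thresholds are all illustrative assumptions, and the sketch assumes purely numeric data.

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

def toy_similarity_filter(train: pd.DataFrame, synth: pd.DataFrame,
                          max_similarity: float = 0.99) -> pd.DataFrame:
    # Drop synthetic rows that are nearly identical to some training row.
    # Similarity here is an illustrative 1 / (1 + nearest-neighbor distance).
    nn = NearestNeighbors(n_neighbors=1).fit(train.values)
    distances, _ = nn.kneighbors(synth.values)
    similarity = 1.0 / (1.0 + distances.ravel())
    return synth[similarity < max_similarity]

def toy_outlier_filter(train: pd.DataFrame, synth: pd.DataFrame,
                       contamination: float = 0.05) -> pd.DataFrame:
    # Drop synthetic rows that look like outliers relative to the training data.
    iso = IsolationForest(contamination=contamination, random_state=0).fit(train.values)
    return synth[iso.predict(synth.values) == 1]  # 1 = inlier, -1 = outlier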

Each Gretel Privacy Filter can be set to `null` (off), `med`, or `high`. These values correspond to the thresholds beyond which a synthetic record is deemed too similar to a training record or too much of an outlier. Choosing the right setting depends heavily on the intended use of the synthetic data. If you intend to share the data publicly, we recommend setting both filters to `high`. If you intend to share it only within your organization, the default setting of `med` for both filters is fine. If the whole point of your analysis is to study outliers, we recommend turning the Outlier Filter off. Typically, we find that a setting of `med` has little to no impact on synthetic data utility; in fact, it sometimes even helps. A setting of `high` sometimes, but not always, causes a modest hit to utility.
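As a quick reference for the guidance above, here is a small, purely hypothetical helper that maps an intended sharing scope to the settings described in this section. It is not part of the Gretel SDK or config schema.

def recommended_filter_settings(sharing_scope: str) -> dict:
    # Hypothetical helper summarizing the guidance above; not a Gretel API.
    if sharing_scope == "public":
        return {"similarity": "high", "outliers": "high"}
    if sharing_scope == "internal":
        return {"similarity": "med", "outliers": "med"}   # the defaults
    if sharing_scope == "outlier_analysis":
        return {"similarity": "med", "outliers": "null"}  # keep outliers for study
    raise ValueError(f"unknown sharing scope: {sharing_scope}")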

Our Experiments

We use the same datasets that were used in our earlier machine learning accuracy blog:

Public Kaggle Datasets used in our experiments

When training the synthetic model, all rows and columns were used with no modifications. This time, however, we got a little fancy and used our publicly available Optuna notebook to find the ideal set of hyperparameters for each dataset. You can find more information on the notebook in our recent blog: “Optuna Your Model Hyperparameters”. Below are the improvements Optuna gave us over the synthetic default configuration. A score of 80 or above is considered excellent. As you can see, the models were pretty good to begin with, but Optuna gave them each a nice lift.

Figure 1: Optuna hyperparameter tuning improvements in our datasets
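For readers who haven’t seen the notebook, the shape of the tuning loop is roughly as follows. This is a generic Optuna skeleton, not the notebook’s actual code: the hyperparameter names and the train_and_score stand-in are placeholder assumptions.

import optuna

def train_and_score(params: dict) -> float:
    # Stand-in for training a synthetic model with `params` and returning its
    # quality score (0-100). A toy surrogate keeps the sketch runnable.
    return 80.0 - 1000.0 * (params["learning_rate"] - 0.003) ** 2 + 10.0 * params["dropout_rate"]

def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "rnn_units": trial.suggest_int("rnn_units", 64, 1024, step=64),
        "dropout_rate": trial.suggest_float("dropout_rate", 0.1, 0.5),
    }
    return train_and_score(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)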

We then tuned a Gretel Synthetics model and created synthetic data three times: first with the Privacy Filters turned off, then with them set to “med”, and finally with them set to “high”. The original training data and all synthetic data can be found in our research repository on GitHub.

We start by building the list of machine learning algorithms we’ll be testing, as shown below. The complete code used in these experiments can be found here.

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = []
models.append(('Logistic Regression', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('Naive Bayes', GaussianNB()))
models.append(('SVM', LinearSVC()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('XGBoost', XGBClassifier()))

We next read in both the original and the synthetic data, and for each dataset/model combination, we'll run a 5-fold stratified cross-validation five times. As shown below, we set up a pipeline to handle missing fields, encode categorical fields, standardize all fields and then run the model.

import numpy as np
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def score_model(model, X, y):

    # Split the columns into numeric and nominal (categorical) sets
    numeric_columns = list(X.select_dtypes(include=['int64', 'float64']).columns)
    nominal_columns = list(X.select_dtypes(include=['object']).columns)

    # Impute missing values and standardize numeric fields
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
        ('scaler', StandardScaler())])

    # Impute, target-encode, and standardize categorical fields
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('encoder', TargetEncoder(smoothing=.2)),
        ('scaler', StandardScaler())])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_columns),
            ('cat', categorical_transformer, nominal_columns)])

    # Chain the preprocessing and the model into a single pipeline
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])

    cv = StratifiedKFold(n_splits=5, shuffle=True)

    # Run 5-fold stratified cross-validation five times and average the accuracy
    all_scores = []
    for i in range(5):
        all_scores += list(cross_val_score(pipe, X, y, scoring='accuracy', cv=cv, n_jobs=-1))

    mean_score = round((np.mean(all_scores) * 100), 2)

    return mean_score
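To tie things together, here is a sketch of how score_model can be driven across the original and filtered synthetic versions of a dataset. The file names and target column below are placeholders, not the actual paths in our repository.

import pandas as pd

# Placeholder file names and target column -- substitute the real dataset paths.
datasets = {
    "original": "original.csv",
    "synthetic (no filters)": "synthetic_none.csv",
    "synthetic (med)": "synthetic_med.csv",
    "synthetic (high)": "synthetic_high.csv",
}
target = "label"

for label, path in datasets.items():
    df = pd.read_csv(path)
    X, y = df.drop(columns=[target]), df[target]
    for name, model in models:
        print(f"{label:>24} | {name:<20} | accuracy: {score_model(model, X, y)}")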

Results

Let’s first cut to the chase and look at how the average machine learning accuracy varied across tests:

Figure 2: ML accuracy in our datasets when using original vs synthetic data

Not bad! You can see that the synthetic data average accuracy with no Privacy Filters is usually not far from the original data’s average accuracy. When we turn the Privacy Filters to “med” the accuracy remains similar and sometimes even exceeds the original data accuracy. When we turn the Privacy Filters to “high”, the results are variable but still quite good. Sometimes there’s a modest hit on accuracy, sometimes there’s no impact and sometimes it actually helps.

Digging Deeper

Let’s take a closer look at one example where there’s a modest degradation in accuracy, as well as one where accuracy actually improves. We’ll start with the UCI Credit Card Default dataset, where we consistently see a modest hit to accuracy when the Privacy Filters are set to high.

Figure 3: ML accuracy results for UCI_Credit_Card_Default dataset

If we visualize the similarity of synthetic records to training records, we can see that nothing gets filtered. If you’re testing this yourself, you can click on the “Filtered Records Only” button to verify this. Note that we show both training-to-training and synthetic-to-training similarity in the graph for context, but only the synthetic set gets filtered, never the training set.

Figure 4: Similarity Filtering in UCI_Credit_Card_Default dataset

If we visualize the outlier scores for our synthetic records, however, we can see that quite a few do get filtered when our Privacy Filters are set to high. Click on the “Filtered Records Only” button to verify.

Figure 5: Outlier Filtering in UCI_Credit_Card_Default dataset

The classification task in the UCI Credit Card Default dataset is to determine who will and who won’t default on their payment next month. You could imagine that someone with unusual billing and payment characteristics might be more prone to defaulting. In fact, among the outlier records, 62.5% default on their credit card, compared to only 20.7% in the entire dataset. Clearly, the outliers are helping the model achieve strong classification accuracy.
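That comparison is straightforward to reproduce. The sketch below assumes you have the training data and the records flagged by the Outlier Filter as two DataFrames; the variable and column names are hypothetical.

import pandas as pd

def compare_default_rates(train_df: pd.DataFrame, outlier_df: pd.DataFrame,
                          target: str = "default.payment.next.month") -> None:
    # Print the positive-class rate overall vs. among the filtered outlier records.
    overall = train_df[target].mean() * 100
    among_outliers = outlier_df[target].mean() * 100
    print(f"default rate: {overall:.1f}% overall vs. {among_outliers:.1f}% among outliers")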

Now let’s look at a dataset where the opposite is true.

Figure 6: ML Accuracy in Data_Scientist_Job_Candidates dataset

If we visualize the similarity of synthetic records to training records, we can see that several records get filtered. You can click on the “Filtered Records Only” button to verify this.

Figure 7: Similarity Filtering in Data_Scientist_Job_Candidates dataset

And when we visualize the outlier scores for this data, we also can see that several get filtered.

Figure 8: Outlier Filtering in Data_Scientist_Job_Candidates dataset

The classification task in the Data Scientist Job Candidates dataset is to determine who will get hired. In this case, the outlier candidates (and possibly the overly similar candidates) actually hinder the model’s ability to make an accurate prediction. Outliers are often problematic in statistical analysis, and it is not uncommon to remove them in machine learning. Thus, filtering these records improves both the data’s privacy and the ML accuracy.

Conclusion

One of the many insightful comments in a recent blog by OpenMined is that the tradeoff between privacy and accuracy is more complicated than you might expect. If the data weren’t made private, fewer people would be inclined to share their data. There’s also the risk that people would be less honest about their data. Both of these factors also have a direct impact on accuracy. At Gretel, we’re passionate about making data private while at the same time preserving data utility.