Gretel’s New Data Privacy Score

Gretel releases an industry-standard privacy evaluation and risk-based scoring system for synthetic tabular data.

Introduction

In this blog, we introduce Gretel's new Data Privacy Score. The score combines two simulated adversarial attacks: a membership inference attack (MIA) simulation and an attribute inference attack (AIA) simulation. These metrics complement the existing Privacy Configuration Score, which is based purely on the model and its privacy mechanisms: our Privacy Filters, Overfitting Protection, and Differential Privacy. Because the new metrics simulate adversarial attacks, they measure privacy risk in a tangible, practical scenario.

There is very little standardization when it comes to privacy evaluation metrics for synthetic tabular data. MIAs and AIAs are common in the literature but less common in industry. In a synthetic data context, an MIA means an attacker tries to determine whether specific real records were used to train the synthetic data generation model. In an AIA, the attacker uses a known portion of a real record's attributes, together with the synthetic data, to guess the value of an unknown attribute of that record. Together, Gretel's new membership inference protection and attribute inference protection scores make up our new Data Privacy Score.

Recent research highlights that traditional record-level de-identification techniques often fail to balance privacy and utility effectively. Instead, combining formal methods like differential privacy and data synthesis with empirical evaluations of robustness against attacks proves to be the best approach. This method ensures that data remains useful while being protected from privacy breaches. At Gretel, we align with this approach, leveraging both formal defenses and empirical evaluations in our privacy metrics to help businesses comply with GDPR and similar regulations.

The European Union’s General Data Protection Regulation (GDPR) and other privacy laws mandate stringent protection of personal data. Gretel’s Data Privacy Score, part of our industry-standard AI privacy risk management system for synthetic tabular data, aids companies in adhering to these regulations. By integrating privacy-enhancing technologies such as de-identification, overfitting prevention, differential privacy, similarity filters, outlier filters, and continuous monitoring, Gretel supports data governance and privacy by design. Our tools help companies minimize privacy risks and ensure robust compliance with GDPR, CCPA, and other global privacy standards.

Background

When designing privacy metrics, it's important to take into account the specific threat environment. In a white-box threat environment, the attacker has full knowledge of all components of the model, including its architecture, parameters, and gradients. In a black-box threat environment, the attacker has no knowledge of the model's internals but can query it and typically gets access to record probabilities or loss values. In a no-box threat environment, the attacker only has access to the synthetic data; they can't query the model or access record probabilities. Gretel is fortunate to be in the no-box category, which rules out the majority of adversarial attacks in the current literature.

In a no-box threat environment, attacks typically focus on the level of similarity between the training and synthetic datasets. In the ideal privacy/utility tradeoff scenario, the goal is to have the generated data be in the ’Goldilocks’ zone: not too similar to the training data, but also not too dissimilar. When generative models are overfit, this can result in synthetic data that's overly similar to the training data.  The connection between overfitting and black-box membership inference was initially investigated by Shokri et al. [1]. The authors showed experimentally that overfitting can lead to privacy leakage but also noted that it is not the only condition, since some models that were not overfit were more prone to membership leaks. There is research [2] that shows that not all memorization is due to overfitting. Most datasets have what's called a "long tail" of infrequent values which can require some memorization in order to achieve maximal generalization of the full dataset.

Membership Inference Protection

Membership inference risk exists when an attacker is able to deduce that a record/person exists in the original training set. This is a privacy risk since, for example, if the synthetic dataset is about breast cancer, then the attacker can deduce the person they found has breast cancer.

The general methodology behind our particular membership inference attack has significant support in the literature [3-8]. To start, 5% of the training data is put aside as a test dataset and is not used in model training. The premise here is that the training data should be no closer to the synthetic data than the test data. When the attack is simulated, a random sample is first drawn from the training and test data. Each record in this attack dataset is compared to the synthetic data. One difference in our approach is that we use Euclidean distance instead of Hamming distance so we can scale more effectively using Faiss. If an attack record's nearest synthetic record is less than a certain distance away, we call it a match. If the matching attack record was from the training dataset, we call that a true positive; if it was from the test dataset, we call that a false positive. From here we can compute precision and accuracy. Hundreds of attacks are simulated, each time varying the records sampled, the size of the sample, and the distance threshold used.
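To make the procedure concrete, below is a minimal sketch of a single simulated attack. It uses scikit-learn's NearestNeighbors rather than Faiss for brevity, and the sample size, threshold, and preprocessing are illustrative assumptions rather than Gretel's exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def simulate_mia(train, test, synthetic, sample_size=500, threshold=0.1, rng=None):
    """One simulated membership inference attack.

    train, test, synthetic: 2-D numeric arrays (records already encoded and scaled).
    Returns the simulated attacker's precision and accuracy.
    """
    rng = rng or np.random.default_rng()

    # Build the attack dataset: a random sample of training (member) records
    # and held-out test (non-member) records.
    members = train[rng.choice(len(train), sample_size, replace=False)]
    non_members = test[rng.choice(len(test), sample_size, replace=False)]
    attack = np.vstack([members, non_members])
    is_member = np.array([True] * sample_size + [False] * sample_size)

    # For every attack record, find the distance to its nearest synthetic
    # record (Euclidean, as in the post; Faiss would be used at scale).
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    dist, _ = nn.kneighbors(attack)
    predicted_member = dist[:, 0] < threshold  # "match" = attacker claims membership

    tp = np.sum(predicted_member & is_member)    # matched, truly from training data
    fp = np.sum(predicted_member & ~is_member)   # matched, but from the test set
    tn = np.sum(~predicted_member & ~is_member)
    fn = np.sum(~predicted_member & is_member)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, accuracy
```

In the full metric this simulation is repeated hundreds of times with varying samples, sample sizes, and thresholds. An attacker with no real advantage should land near 0.5 precision and accuracy, i.e., random guessing.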

Each attack yields a precision and an accuracy, which we then translate into a 0 to 100 scale (100 indicating the strongest protection) with grades ranging from Poor to Excellent protection. Generally speaking, both precision and accuracy should be below 0.5 to earn a grade of Excellent, while both being 0.8 or above results in a grade of Poor. Scores that fall between 0.5 and 0.8 translate into Moderate, Good, or Very Good, depending on where they fall in the range.
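As a rough illustration of that mapping, the sketch below converts an attack's precision and accuracy into a protection score and grade. The exact breakpoints inside the 0.5 to 0.8 band are assumptions for illustration, not Gretel's published formula.

```python
def protection_grade(precision: float, accuracy: float):
    """Map a single attack's precision and accuracy onto a 0-100 protection
    score and a grade. Breakpoints inside the 0.5-0.8 band are illustrative."""
    # Stronger attacks mean weaker protection, so protection is driven by the
    # worse (higher) of the two attack measures.
    attack_strength = max(precision, accuracy)
    score = round(100 * (1 - attack_strength), 1)

    if precision < 0.5 and accuracy < 0.5:
        grade = "Excellent"
    elif precision >= 0.8 and accuracy >= 0.8:
        grade = "Poor"
    elif attack_strength < 0.6:
        grade = "Very Good"
    elif attack_strength < 0.7:
        grade = "Good"
    else:
        grade = "Moderate"
    return score, grade
```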

Below is an example Membership Inference Protection graph. Typically hundreds of attacks are simulated and the pie chart shows the proportion of attacks falling into each protection level. The overall protection score is the average of all the individual attack protection scores.

Attribute Inference Protection

In an attribute inference attack (AIA), the adversary tries to use knowledge of some of the fields (the quasi-identifiers) to determine the values of other sensitive fields. Published research shows that the combination of the quasi-identifiers ‘gender’, ‘birth date’, and ‘postal code’ re-identifies between 63% and 87% of the U.S. population [9].

Our main attribute inference algorithm follows several research papers that each randomly sample which fields to use as quasi-identifiers [10-13]. This removes the burden on customers of having to pick them out, which can be very difficult. We vary a few elements of these algorithms so they scale to datasets with millions of records; in particular, we use Euclidean distance so we can scale with Faiss. We also use the entropy weighting scheme from Yan et al. [4].

To begin the AIA simulation, an attack dataset is created by randomly sampling 1% of the training data. Each record is then processed one at a time. For each record we randomly pick X quasi-identifier values; the default is X = 3, but this can be changed in the configuration. Using just the quasi-identifiers, we find the K nearest neighbors in the synthetic data. We varied K in our research and settled on a value of 5, as it results in the strongest attack simulations. From these K nearest neighbors we compute a mean record by taking the mode for categorical fields and the mean for numeric fields. Then, all the fields that are not quasi-identifiers are compared between the attack record and this mean nearest-neighbor synthetic record. If a field matches, it is accumulated as a correct prediction for that column; if it doesn't match, it's accumulated as an incorrect prediction. To be a match, categorical fields must be exactly the same; numeric fields must be within 1%.
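Here is a minimal sketch of that loop using pandas and scikit-learn instead of Faiss. It assumes every column is already numerically encoded so Euclidean distance applies; the per-record refitting and the encoding choices are simplifications, not Gretel's exact implementation.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def simulate_aia(train: pd.DataFrame, synthetic: pd.DataFrame,
                 categorical_cols: set, n_qi: int = 3, k: int = 5, seed: int = 0):
    """Sketch of one attribute inference simulation.

    Assumes every column is already numerically encoded (categoricals
    label-encoded). Returns the simulated attacker's per-column accuracy.
    """
    rng = np.random.default_rng(seed)
    attack = train.sample(frac=0.01, random_state=seed)   # 1% attack dataset
    correct = {c: 0 for c in train.columns}
    total = {c: 0 for c in train.columns}

    for _, record in attack.iterrows():
        # Randomly choose the fields the attacker is assumed to already know.
        qi = list(rng.choice(train.columns, size=n_qi, replace=False))

        # K nearest synthetic neighbors using only the quasi-identifiers.
        # (Refit per record here for clarity; a real implementation would batch.)
        nn = NearestNeighbors(n_neighbors=k).fit(synthetic[qi].to_numpy())
        _, idx = nn.kneighbors(record[qi].to_numpy().reshape(1, -1))
        neighbors = synthetic.iloc[idx[0]]

        # The attacker's guess: mode for categorical fields, mean for numeric.
        for col in train.columns:
            if col in qi:
                continue
            if col in categorical_cols:
                guess = neighbors[col].mode().iloc[0]
                hit = guess == record[col]
            else:
                guess = neighbors[col].mean()
                hit = abs(guess - record[col]) <= 0.01 * abs(record[col])
            correct[col] += int(hit)
            total[col] += 1

    return {c: correct[c] / total[c] for c in train.columns if total[c]}
```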

After processing all records, we compute an accuracy for each column from the accumulated correct and incorrect predictions. This accuracy is then normalized by the entropy of the column to produce both a per-column and an overall risk level. You can think of entropy as the opposite of predictability. For example, if a column holds a constant value, its entropy is low and its inherent predictability is high; there is little privacy loss in this scenario, so the entropy weighting downplays the risk. On the other hand, if every value in a column is distinct, its entropy is high and its inherent predictability is low; there is more privacy loss in this scenario, so the entropy weighting increases the risk score.
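For intuition, the snippet below computes a column's Shannon entropy from its empirical value distribution. How the attack accuracy is then weighted is Gretel's internal choice, so the simple multiplicative weighting shown here is only an illustrative assumption.

```python
import numpy as np
import pandas as pd

def normalized_entropy(column: pd.Series) -> float:
    """Shannon entropy of a column's value distribution, scaled to [0, 1].

    A constant column scores 0 (trivially predictable); a column where every
    value is distinct scores 1 (hard to predict without leakage).
    """
    probs = column.value_counts(normalize=True).to_numpy()
    entropy = -np.sum(probs * np.log2(probs))
    max_entropy = np.log2(len(column)) if len(column) > 1 else 1.0
    return float(entropy / max_entropy)

# Illustrative weighting only -- the exact formula is Gretel's internal choice.
# The idea: discount attack accuracy on columns that were easy to predict anyway.
def column_risk(attack_accuracy: float, column: pd.Series) -> float:
    return attack_accuracy * normalized_entropy(column)
```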

AIA scores range from 0 to 100, with 100 being the greatest protection, and the corresponding grades range from Poor to Excellent. Below is an example AIA graph resulting from a simulation. The x-axis is the 0 to 100 protection score and the y-axis lists the columns. A customer would look at this graph and see whether any columns offer only Moderate or Poor protection. If so, they need to decide how sensitive the data in those columns is. If they'd like to improve their protection, there are a variety of ways to achieve this, which we'll review in the next section.

Recourse

There are several recourse options if your Data Privacy Score is lower than you'd like. The first is to combat overfitting by training with fewer epochs. The second is to set the Privacy Filters to high, which removes synthetic records that are particularly vulnerable to adversarial attacks. Third, if you are running with Differential Privacy, you can try reducing your epsilon value; if you are not, you could try enabling it. And finally, if you're able to, increase the size of your training set: larger training sets offer more protection because the contribution of each individual sample lessens. Generally speaking, each of these recourse options has the potential to lower your Synthetic Quality Score, so care should be taken to strike an appropriate balance between privacy and utility.
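As a purely hypothetical sketch of how these knobs relate (the keys below are illustrative placeholders, not Gretel's actual configuration schema):

```python
# Hypothetical knobs only -- these names are illustrative placeholders, not
# Gretel's actual configuration schema. Each change trades some synthetic
# quality (utility) for stronger privacy, so adjust one at a time and
# re-check both the Data Privacy Score and the Synthetic Quality Score.
recourse_options = {
    "epochs": 50,                  # fewer epochs -> less overfitting
    "privacy_filters": "high",     # drop synthetic records most vulnerable to attack
    "differential_privacy": {
        "enabled": True,
        "epsilon": 5.0,            # lower epsilon -> stronger formal guarantee
    },
    "training_rows": 1_000_000,    # larger training sets dilute each record's influence
}
```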

Conclusion

At Gretel, privacy is a top priority. With our new Data Privacy Score customers can get immediate feedback as to the privacy of the synthetic data they create. We'd love to hear from you! Please send questions and comments to amy@gretel.ai. 

References

[1] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In IEEE S&P, 2017. [link]

[2] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891, 2020. [link]

[3] Hernandez, M., et al., Synthetic tabular data evaluation in the health domain covering resemblance, utility, and privacy dimensions. Methods of Information in Medicine, 2023.

[4] Yan, C., et al., A multifaceted benchmarking of synthetic electronic health record generation models. Nature communications, 2022. 13(1): p. 7609 [link]

[5] Goncalves, A., et al., Generation and evaluation of synthetic patient data. BMC medical research methodology, 2020. 20(1): p. 1-40. page 30 [link]

[6] El Emam, K., L. Mosquera, and X. Fang, Validating a membership disclosure metric for synthetic health data. JAMIA Open, 2022. 5(4): p. ooac083. [link]

[7] Li, J., et al., Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications. NPJ Digital Medicine, 2023 [link]

[8] Yoon, J., et al., EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. NPJ Digital Medicine, 2023 [link]

[9] P. Golle, Revisiting the uniqueness of simple demographics in the us population, in 5th ACM Workshop on Privacy in Electronic Society. ACM, 2006, pp. 77–80. [link]

[10] Kaur, D., et al., Application of Bayesian networks to generate synthetic health data. Journal of the American Medical Informatics Association, 2021. 28(4): p. 801-811. [link]

[11] Choi E, Biswal S, Malin B, et al. Generating multi-label discrete patient records using generative adversarial networks. In: proceedings of the 2nd Machine Learning for Healthcare Conference, Vol. 68; 2017. [link]

[12] Ghadeer O. Ghosheh, Jin Li, and Tingting Zhu. 2024. A Survey of Generative Adversarial Networks for Synthesizing Structured Electronic Health Records. ACM Comput. Surv. 56, 6, Article 147 (January 2024), 34 pages. [link]

[13] Ghosheh, G. O., J. Li, and T. Zhu, A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources. arXiv, 2022. [link]