Common misconceptions about differential privacy
This article clarifies some common misconceptions about differential privacy and what it guarantees.
There is a plethora of content on differential privacy (DP), ranging from academic works, such as The Privacy Book written by the founders of DP, to blogs explaining its core principles, interpretation and application, such as this series. While these are well-written resources, they require deep study to fully grasp. So if you have a basic understanding of differential privacy but haven’t had the chance to delve into such resources, this post provides clarification on some common misconceptions.
Differential Privacy is not an algorithm.
Rather, DP is a standard that algorithms must meet. The standard is simply that the output of the algorithm, such as a count, should not depend too much on any single record. Algorithms typically meet the differential privacy standard by adding calibrated random noise that masks the presence of any one record. For example, a small amount of noise can be added to an exact count to make it differentially private.
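To make the noisy count concrete, here is a minimal sketch. It assumes the Laplace mechanism, which is one common way to meet the standard (this post does not prescribe a specific mechanism), and the function names and the epsilon value are hypothetical. It is meant for intuition only, not for production use, where a vetted DP library is the safer choice.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, epsilon, rng):
    # Adding or removing one record changes an exact count by at most 1,
    # so the sensitivity of a count is 1 (assumption: add/remove neighbors).
    sensitivity = 1.0
    return len(records) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)           # seeded only so the sketch is repeatable
records = ["record"] * 1000      # stand-in for 1,000 sensitive records
print(dp_count(records, epsilon=1.0, rng=rng))  # close to 1000, but not exact
```

A smaller epsilon means a larger noise scale, so the released count depends even less on any single record, at the cost of accuracy.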
Differential Privacy doesn’t provide blanket protection for all sensitive information.
To borrow from McSherry, “differential privacy is a formal distinction between your secrets and secrets about you.” If you choose to contribute your secret, DP will protect your secret in an equivalent way to the scenario where you don’t contribute your secret. But it does not provide guarantees about other ways your secrets become public.
Consider this example. You’re a data scientist making $100k a year at Rainforest Corp, a fictional company that practices pay equity. The talent team at Rainforest wants to publish the average income of data scientists in an effort to attract more applicants, and they have decided to use differential privacy to share this aggregate information safely. You know that your peers’ incomes are within $5k of yours, but you’ve never shared your income with anyone; you consider it your secret. You want to continue safeguarding your secret, so you opt out of being included in this average. A few weeks pass and you see the data scientist job posting has been updated with “average salary of ~$101k”. Anyone reading this job posting would infer that you make roughly $101k, which is true. So even though you withheld your data from the calculation, a secret about you has become public.
For another example, check out this lecture where Kamath talks through why a widely circulated story of targeted marketing by Target is not a privacy breach.
Differential privacy is not a panacea.
DP is not suitable for every single analysis. It is a way of quantifying that an algorithm is going to tell you more about the large-scale trends in a dataset than it is about any specific individual. In other words, it’s designed to help individuals cleverly hide in a crowd.
If outlier analysis is of interest, DP is not the appropriate tool for privacy protection. Further, DP is not appropriate for studying small populations. At its core, DP is intended to allow aggregate information about large populations to be shared safely.
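A rough back-of-the-envelope sketch shows why small populations are a poor fit. It assumes a count released with Laplace noise of scale 1/epsilon (an assumption for illustration, not something this post specifies); the function name and the epsilon value are hypothetical. The noise scale does not shrink when the group is small, so the relative error grows.

```python
def expected_relative_error(true_count, epsilon, sensitivity=1.0):
    # The mean absolute value of Laplace noise with scale b is b itself,
    # and for a count b = sensitivity / epsilon, regardless of group size.
    expected_abs_noise = sensitivity / epsilon
    return expected_abs_noise / true_count


for group_size in (10, 1_000, 100_000):
    error = expected_relative_error(group_size, epsilon=0.1)
    print(f"group of {group_size}: expected relative error {error:.2%}")
```

With these illustrative numbers the expected noise is about 10 records in every case, which overwhelms a group of 10 but barely registers for a group of 100,000.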
“Differentially private data” is not a clearly defined term.
“Differentially private data” is an ambiguous term that’s thrown around quite often. It can mean a few different things, leaving it open to misinterpretation. Let’s break it down.
If you’ve read some introductory materials, you likely think of “differentially private” as an adjective for algorithms that compute something in aggregate (a sum, a median, or even a neural network) and for their outputs. Typically, the privacy protection stems from some form of calibrated, probabilistic noise addition. So how does noise addition work for a process that produces data records instead of an aggregate as the output?
Some possible interpretations of “differentially private data” are below.
- “Differentially private data” can refer to the output of local DP algorithms. If you are unfamiliar with central vs. local DP, scroll down to the appendix. Local DP algorithms such as randomized response produce a noisy database of answers to a sensitive question (e.g. do you smoke?). The algorithm is provably differentially private, and the noisy database that the algorithm produces can also be described as differentially private.
- “Differentially private data” can also refer to outputs of generative models that are trained with techniques to meet the standard of differential privacy.
Generative modeling is described by Brownlee as “a task in machine learning that involves automatically discovering and learning the regularities or patterns in input data in such a way that the model can be used to generate or output new examples that plausibly could have been drawn from the original dataset.” Generative models most often use neural networks, which can be designed to meet the standard of DP.
The outputs of differentially private generative models are most commonly called differentially private synthetic data. These models can produce higher fidelity synthetic data than randomized response can for complex multivariate analyses, such as predictive modeling. There is a huge field of research dedicated to developing novel techniques that balance the need for high quality synthetic data with the need for privacy. Differential privacy is just one of these techniques, though a very popular one.
So when using differential privacy as an adjective to describe data, I’d encourage you to clarify the broad class of algorithm that was used, to be mindful of the intended use of data produced, and to correctly term the data as synthetic when appropriate.
I hope this article helped clear up some common misconceptions about differential privacy. Drop me a line at firstname.lastname@example.org if you have any questions or would like to chat about my work in applied privacy at Gretel!
Central vs. Local Differential Privacy
There are two different models of differential privacy: central and local. The standard of privacy remains the same; the distinction is in where the data is stored and when the noise is added.
You are probably familiar with the central model, where all the real, sensitive data is available in some central location. For example, you and I sign up for rewards at an iconic Seattle-based coffeehouse chain. We trust them with our names, phone numbers and birth dates, which they now store in some central database. If they wanted to share the median age of customers who place orders for caffeinated beverages after 7pm, they could do so with the guarantee of differential privacy by calculating the true value and adding some calibrated noise to it. The real, sensitive data is still stored without modification and is unrelated to differential privacy. It’s the process of sharing the median age (i.e. an algorithm, see the first misconception) that is differentially private, not the central database being queried.
In contrast, the local model was conceived for cases where there is a lack of trust or no central location for storing the raw data. For example, my curious friend, George, is conducting a study about whether aspiring data scientists take longer than the allotted time to complete take-home assignments, and asks me to contribute my data. While I am also curious to know aggregate information about my peers, I don’t trust George enough to reveal that it took me a whole day to complete my last take-home, even though the time limit was 4 hours. So George tells me to flip a coin: if it lands on heads, I respond truthfully. Otherwise, I flip the coin again, and I say yes if it lands on heads and no if it lands on tails. If George asks all survey participants to do the same, he collects a noisy version of the dataset. This algorithm is called randomized response. The purpose of collecting this noisy dataset is to aggregate it, which George does by accounting for the bias introduced by the second coin flip. Note that randomized response, as described above, is differentially private (see section 3.2 for a proof). So, the algorithm is differentially private, and the output of the algorithm, this noisy database of answers, can also be described as differentially private.
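George’s coin-flip procedure and his bias correction can be sketched as a small simulation. The function names, the 30% true rate, and the 10,000 participants are assumptions made up for illustration. With fair coins, a participant reports yes with probability 0.5 × (true rate) + 0.25, so George can invert that relationship to estimate the true rate.

```python
import random


def randomized_response(truth, rng):
    # First flip: heads means the participant answers truthfully.
    if rng.random() < 0.5:
        return truth
    # Second flip: heads means "yes", tails means "no", regardless of truth.
    return rng.random() < 0.5


def estimate_true_rate(responses):
    # P(report yes) = 0.5 * true_rate + 0.25, so solve for true_rate.
    observed_rate = sum(responses) / len(responses)
    return 2.0 * (observed_rate - 0.25)


rng = random.Random(42)                            # seeded for repeatability
true_answers = [i < 3000 for i in range(10_000)]   # hypothetical: 30% went over time
noisy_answers = [randomized_response(t, rng) for t in true_answers]
print(estimate_true_rate(noisy_answers))           # near 0.3, up to sampling noise
```

George never sees anyone’s true answer, only the noisy ones, yet the aggregate estimate gets closer to the true rate as more people participate. Under this scheme a truthful “yes” is reported as yes with probability 0.75 and a truthful “no” with probability 0.25, a ratio of 3, which is where the commonly quoted privacy level of ln 3 for this version of randomized response comes from.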
For more on the flavors of DP and boundaries between them, read this post.