What is Data Anonymization?
Everything you need to know about anonymizing data and the techniques for mitigating privacy risks.
Data anonymization is the process of mitigating direct and indirect privacy risks within data, such that there is a measurable way to ensure records cannot be attributed to a specific individual or entity. With an estimated 2.5 quintillion bytes of data being generated every day and an increasing reliance on data to power new applications, machine learning models and AI technologies, implementing effective anonymization techniques and removing bottlenecks is crucial to accelerating future developments and innovations.
This post is a general introduction to anonymization, and the tools and techniques for providing sufficient privacy protections, so that personally identifiable information (PII) is safe from exposure and exploitation.
Data Anonymization: It's a Process
There’s no silver bullet for creating safe, shareable versions of real-world data. Data anonymization should be considered a continuous process, one that can require rapid iteration: applying various privacy engineering techniques and then measuring the privacy outcomes until a desired end state is reached.
When implementing data anonymization methods there are three main tenets to consider:
- Policy-based de-identification – what are we required to do?
- Direct privacy risks – how can we mitigate the exposure of specific PII?
- Evaluation – how can we address other indirect or re-identification risks that exist?
With this framework in mind, we can construct a high-level process for data anonymization:
In the following sections, we'll dive deeper into our core tenets of the data anonymization process, and then walk through how you might apply them to a notional dataset. Finally, we’ll explore various anonymization techniques you can use.
Beyond best practices and organizational policies, there are several country-level and international regulations for data uses and consumer privacy protections that define de-identification measures we must take. In the U.S., some prominent privacy regulations include the Health Insurance Portability and Accountability Act (HIPAA), which governs how protected health information (PHI) is handled, and the California Consumer Privacy Act (CCPA), which has inspired a growing number of state-level privacy laws. The CCPA is loosely modeled on the European Union’s General Data Protection Regulation (GDPR), which we’ll focus on for this post.
The GDPR doesn’t require a specific anonymization technique be used, but it does set a high standard for what ‘anonymized data’ means; namely, it’s “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” In other words, anonymized data must be irreversibly altered, and neither directly nor indirectly traceable back to an individual.
And even though the GDPR is an EU regulation, it places strict controls on data transfers to non-EU countries and international organizations, so it has an outsized impact on anyone working with or moving data across and beyond its borders.
Let’s now look at some practical first steps to applying these standards in our de-identification processes:
- Remove or obfuscate first and last names. Depending on the use case, such as powering a development environment, removing a name from the dataset might be undesirable as it does not produce data that is similar to real-world data for development teams to test with. In this situation, we may opt to replace each name with a fictional name, so the development team still has realistic data to work with.
- Reduce the granularity of addresses to city and state only. This would remove the specific street and house number from any given address.
- Reduce the specificity of GPS coordinates. Some organizations may consider GPS coordinates that are specific to a certain radius to be PII. This can be mitigated by reducing the precision of a coordinate, for example by masking 37.38238 to 37.382XX.
Once these policy- or regulation-based de-identification techniques are applied, additional steps should be taken to detect and mitigate other direct privacy risks.
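To make the three steps listed above more concrete, here is a minimal Python sketch. The function names, the pool of fictional names, and the sample record are all made up for illustration, and the address helper assumes a simple "street, city, state" string, which real addresses often won't follow.

```python
import random

# Hypothetical pool of fictional names; a real pipeline would use a much larger source.
FAKE_NAMES = ["Jordan Avery", "Casey Morgan", "Riley Quinn"]


def replace_name(_real_name: str, rng: random.Random) -> str:
    """Swap a real name for a fictional one so test data still looks realistic."""
    return rng.choice(FAKE_NAMES)


def reduce_address(address: str) -> str:
    """Keep only the city and state from a 'street, city, state' address string."""
    parts = [part.strip() for part in address.split(",")]
    return ", ".join(parts[-2:])


def mask_coordinate(value: float, keep_decimals: int = 3, total_decimals: int = 5) -> str:
    """Mask the trailing decimal digits of a coordinate, e.g. 37.38238 -> 37.382XX."""
    whole, frac = f"{value:.{total_decimals}f}".split(".")
    return f"{whole}.{frac[:keep_decimals]}{'X' * (total_decimals - keep_decimals)}"


# Made-up record used only to exercise the helpers.
record = {"name": "Rocky Balboa", "address": "123 Main St, Springfield, IL", "lat": 37.38238}

deidentified = {
    "name": replace_name(record["name"], random.Random(0)),
    "address": reduce_address(record["address"]),
    "lat": mask_coordinate(record["lat"]),
}
print(deidentified["address"], deidentified["lat"])  # Springfield, IL 37.382XX
```

Note that the masked coordinate becomes a string, so a downstream schema expecting a number would need to account for that.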
Direct Risk Mitigation
A direct privacy risk would be any piece of information that can specifically identify an individual or entity. This might include phone numbers, email addresses or physical addresses.
First, you must identify these risks. This can be done in one of two ways:
- Knowledge and introspection of data schemas. In this case, you may have knowledge of certain fields or columns containing direct privacy risks. To address this, you might choose to apply techniques directly to every value in that field.
- Automated detection. When dealing with more complex datasets, where the schema itself may not reveal where direct risks lie, automated scanning measures should be used. This might include the use of regular expressions, named entity recognition (NER), or other information extraction heuristics. Once identified, the data can be transformed or obfuscated, as needed.
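As a rough illustration of the automated detection option, the sketch below scans free text with regular expressions. The two patterns and the helper names are assumptions made for this example; they are deliberately simple and would miss many real-world formats, which is where NER and other heuristics come in.

```python
import re

# Illustrative patterns only; real emails and phone numbers vary far more than this.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_direct_risks(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        hits.extend((label, match.group()) for match in pattern.finditer(text))
    return hits


def redact(text: str) -> str:
    """Replace each detected identifier with a placeholder naming its type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


note = "Call 555-867-5309 or write to rocky@example.com about the order."
print(find_direct_risks(note))
print(redact(note))  # Call <us_phone> or write to <email> about the order.
```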
Once additional direct risks (identifiers) are located, depending on the desired use cases, a variety of privacy techniques may then be applied. These options will be discussed later on. First, let’s consider the third and final tenet of our anonymization process.
Evaluation and Indirect Risk Mitigation
The evaluation step of the anonymization process is critical. It includes analyzing the additional combinatorial elements in the data for any possible re-identification risks, also known as indirect privacy risks. These risks are best illustrated by Benitez and Malin’s study, where they used a notional combination of medical records and voter records to show how an indirect risk could be exploited to re-identify an individual, even if direct identifiers are removed from the dataset.
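One simple way to start measuring this kind of combinatorial risk is to count how many records share each combination of indirectly identifying fields, and flag combinations that appear fewer than k times. This is a k-anonymity-style check, which is our choice of illustration rather than something prescribed above; the records, field names, and threshold below are made up.

```python
from collections import Counter

# Made-up records with direct identifiers already removed.
records = [
    {"zip": "90210", "birth_year": 1975, "gender": "F"},
    {"zip": "90210", "birth_year": 1975, "gender": "F"},
    {"zip": "90210", "birth_year": 1982, "gender": "M"},
    {"zip": "10001", "birth_year": 1990, "gender": "F"},
]


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[field] for field in quasi_identifiers) for row in rows)


def risky_rows(rows, quasi_identifiers, k=2):
    """Return rows whose quasi-identifier combination appears fewer than k times."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[field] for field in quasi_identifiers)] < k
    ]


flagged = risky_rows(records, ["zip", "birth_year", "gender"], k=2)
print(len(flagged))  # 2 of the 4 rows are unique on these three fields
```

Rows flagged this way are candidates for the reduction or shifting techniques described below, after which the check can be run again.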
Data Anonymization Techniques
There are a variety of techniques that can be used to anonymize data. The list below provides a significant amount of flexibility depending on the end use case for the anonymized data. These techniques can be applied iteratively, as part of the risk mitigation and evaluation cycles in the anonymization process.
Data Removal and Reduction
An obvious step to anonymization is simply to remove a field in the dataset. But while this is effective, it often changes the utility of the data, depending on your use case. Usually, you would only be removing fields that pose direct risks and that contain personal identifiers. The end use case should dictate when to do this, but here's a general rule of thumb that might help:
- If the data is being shared for purposes of aggregate analysis, often direct identifiers do not affect your desired results, and you can consider removing the fields.
- If the data is being used to power pre-production or development environments, you often want to keep some form of identifiers, since user interfaces and APIs should have realistic data to test with, for visualization and data validation purposes.
Data reduction is similar, but instead of removing the entire field you reduce the granularity of values. This might be useful for partial aggregate analysis on the data, where some position of an identifier is needed. Specific techniques for reduction are:
- Redaction or masking. This can include converting a zip code such as 90210 to 90XXX, where the first two digits are still useful in identifying a general area.
- Data removal. This might be removing entire parts of a value, leaving other parts intact. For example, reducing an address to only a city/state pair.
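Here is a small sketch of removal and reduction applied to a single record that is being prepared for aggregate analysis. The helper names and the sample record are hypothetical.

```python
def drop_fields(record: dict, fields_to_remove: set[str]) -> dict:
    """Data removal: return a copy of the record without the listed fields."""
    return {key: value for key, value in record.items() if key not in fields_to_remove}


def mask_zip(zip_code: str, keep: int = 2) -> str:
    """Data reduction: keep the leading digits and mask the rest, e.g. 90210 -> 90XXX."""
    return zip_code[:keep] + "X" * (len(zip_code) - keep)


# Made-up record for illustration.
record = {"name": "Rocky Balboa", "email": "rocky@example.com", "zip": "90210", "amount": 5.34}

# For aggregate analysis, direct identifiers are dropped and the zip code is coarsened.
shared = drop_fields(record, {"name", "email"})
shared["zip"] = mask_zip(shared["zip"])
print(shared)  # {'zip': '90XXX', 'amount': 5.34}
```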
Replacing personal identifiers with fake but realistic versions is one of the better approaches to preserving the semantics of data in its entirety. This method is also very important for data that powers development and testing environments, as it can help ensure schema consistency between real-world data and anonymized data. For example, if we chose to mask the phone number 867-5309 to 867-XXXX, this could effectively change the schema, because only digits and dashes would be expected as valid contents. With masking, we've introduced a masking letter, "X", which may break data validation for the downstream use case. Replacing the phone number with an entirely fake value would preserve the requirement of only allowing digits and dashes.
There are two distinct approaches to entity replacement to consider: deterministic and probabilistic.
- Probabilistic replacement would generate a randomized fake replacement for each occurrence of a real-world entity or identifier. For example, for every occurrence of the name "Rocky Balboa", a new fake name would get created. So, after processing that name three times, we could end up with "Apollo Creed", "Emily Johnson", and "Ryan Seacrest".
- Deterministic replacement would generate the same replaced value for a given real-world value. So, using our same example but with deterministic replacement, "Rocky Balboa" would get replaced with "Apollo Creed" every time.
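The difference between the two approaches can be sketched in a few lines of Python. This is only one possible way to get deterministic behavior: it keys a hash of the real value with a secret and uses the result to pick from a pool of fake names. The pool, the key, and the function names are assumptions made for this example.

```python
import hashlib
import hmac
import random

# Hypothetical pool of fake names; a small list keeps the example short.
FAKE_NAMES = ["Apollo Creed", "Emily Johnson", "Ryan Seacrest", "Dana Whitfield"]
SECRET_KEY = b"example-key-not-for-production"  # assumption: kept secret in practice


def probabilistic_replace(_real_value: str) -> str:
    """Pick a fresh random fake name on every call, ignoring the input."""
    return random.choice(FAKE_NAMES)


def deterministic_replace(real_value: str) -> str:
    """Map the same input to the same fake name via a keyed hash of the input."""
    digest = hmac.new(SECRET_KEY, real_value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


# The deterministic version always agrees with itself; the probabilistic one may not.
first = deterministic_replace("Rocky Balboa")
second = deterministic_replace("Rocky Balboa")
print(first == second)  # True
```

With a pool this small, two different real names can land on the same fake name, so a larger pool or a generated name would be needed where uniqueness matters.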
Numerical and Date Shifting
Numerical data and dates and/or timestamps are often leading sources of indirect risks, where these values can still uniquely be mapped back to a specific individual or entity. If your downstream use case involves aggregate or trend analysis, removing or reducing the granularity of this data might be sub-optimal.
Shifting techniques randomly increase or decrease numerical and date values, such that joinability with other datasets becomes more difficult for re-identification attacks, while preserving the ability to do analysis on the data.
Shifting will generally include defining some minimum and maximum value to change a specific value in the dataset. For example, if we were looking at purchasing data for customers, consider the following:
- User: A | Date: 1/1/2022 | Amount: 5.34
- User: B | Date: 1/1/2022 | Amount: 9
- User: A | Date: 1/2/2022 | Amount: 7.80
We could apply a numerical shift with the following parameters:
- Minimum: -3
- Maximum: 3
- Decimal Precision: 2
For each record in the data, a random shift amount will be applied to the "Amount" field. With the configuration above, a random value will be generated between -3 and +3, and then added to the existing value. This value may have up to 2 decimal places, which are also randomized. So, for our first record, our random value might be 2.73 which will mutate 5.34 to 8.07.
However, this becomes problematic if you want to do trend analysis on the data. The two records for "User A" could receive different random shift amounts, which would create significant drift from the trend in their spending.
A more advanced extension of shifting is to designate a reference field, so that every record sharing the same value in that field receives the same shift. In the example above, if we selected the "User" field as our reference field, then we would preserve the shift amount for that specific entity and re-use it everywhere. Since the first shift amount for user A was 2.73, we'll cache that value and re-apply it for the third record, mutating 7.80 to 10.53. Now, the difference between these two amounts will be preserved.
User B would receive a totally different shift value for the mutations applied to those fields.
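The sketch below shows one way the shifting described above might be implemented, including the optional reference field. The function name, its parameters, and the seed are our own choices for illustration; the purchase records mirror the example above.

```python
import random

purchases = [
    {"user": "A", "date": "1/1/2022", "amount": 5.34},
    {"user": "B", "date": "1/1/2022", "amount": 9.00},
    {"user": "A", "date": "1/2/2022", "amount": 7.80},
]


def shift_amounts(rows, minimum=-3.0, maximum=3.0, precision=2, reference_field=None, seed=None):
    """Add a random shift to each amount; reuse one shift per reference value if given."""
    rng = random.Random(seed)
    cached_shifts = {}
    shifted = []
    for row in rows:
        if reference_field is None:
            # Independent shift per record: good for privacy, bad for per-user trends.
            shift = round(rng.uniform(minimum, maximum), precision)
        else:
            # Consistent shift: draw once per reference value, then reuse it.
            key = row[reference_field]
            if key not in cached_shifts:
                cached_shifts[key] = round(rng.uniform(minimum, maximum), precision)
            shift = cached_shifts[key]
        shifted.append({**row, "amount": round(row["amount"] + shift, precision)})
    return shifted


consistent = shift_amounts(purchases, reference_field="user", seed=7)
# The gap between user A's two purchases (7.80 - 5.34 = 2.46) survives the shift.
gap = round(consistent[2]["amount"] - consistent[0]["amount"], 2)
print(gap)  # 2.46
```

Nothing here prevents a shifted amount from going negative, so a real pipeline might clamp values or choose the range relative to the data.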
The same shifting concept can be applied to timestamps, dates, and any other value that is rooted in a numerical base.
Certain elements of data might be so complex that a discrete transformation of the data is impossible or very inefficient. Here, we can apply a technique known as synthetic data generation. Synthetic data is artificially annotated information that is generated by computer algorithms or simulations, which can often be used in place of real-world data. For data anonymization purposes, synthetic data can be useful for creating realistic data for complex fields such as natural language or free text. These fields are more complex than categorical and numerical data because you cannot simply transform the values to something else. Instead, machine learning models can be used to learn how to synthesize new free text that has a similar semantic meaning to the real-world data.
Consider a dataset that is fairly structured, but that has a "notes" or "comments" field. The techniques mentioned previously may be too simplistic and ineffective for processing this text data. Synthetic data models can, however, learn the semantics of the text, and re-create versions of the text that maintain the same underlying meaning, but that are not identical to the real-world data or identifiable back to a specific person.
In some scenarios, certain elements of data must be restorable to their original form. For these use cases, encryption and tokenization enable data restoration.
Encryption utilizes one or more encryption keys to mutate data to a form that cannot be reversed without accessing the same keys and decrypting the data. Oftentimes, encryption methods may transform the data to be less useful or render it invalid. For example, encrypting a phone number might introduce characters that change the schema of the data.
One variant of encryption that can preserve data schemas is format-preserving encryption. This technique still encrypts the data, but the result (known as the ciphertext) has the same format as the input. So, encrypting something like a phone number will yield a new value that also contains digits. This can help ensure that the encrypted data can be inserted back into a database with the same schema as the real-world data.
Tokenization is a process that replaces a sensitive piece of information with a value that has no real exploitable meaning. The token is generated separately and the mapping of the real-world data to the token is tracked in a separate, secure system. If a token needs to be restored to its original value, a lookup can be done against the tokenization system and the original data can be extracted back out.
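To illustrate the lookup-based nature of tokenization, here is a minimal in-memory sketch. The `TokenVault` class is hypothetical and stands in for the separate, secure system described above; a real deployment would persist the mapping in a hardened, access-controlled store.

```python
import secrets


class TokenVault:
    """In-memory stand-in for the separate, secure tokenization system."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = secrets.token_hex(8)  # random, so it carries no information about the value
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; raises KeyError for unknown tokens."""
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("555-867-5309")
print(vault.detokenize(token))  # 555-867-5309
```

Because the token is random rather than derived from the value, restoring the original is only possible through the vault's mapping.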
Anonymized Data Use Cases
Now that we’ve looked at anonymization techniques, let’s consider some opportunities that truly anonymized, synthetic data offers:
- Healthcare – synthetically anonymized data can improve brain-computer interfaces that benefit individuals with speech impairment and mobility issues, provide shareable versions of complex genomic data that can advance medical research, and reduce AI bias to better diagnose heart disease.
- Finance – financial institutions are interested in data sharing and creating marketplaces, where algorithms can be developed on freely available synthetic data and then sold or licensed to financial institutions that have access to the real data. By sharing data between institutions and within the research community, better solutions can be found for technical problems faced by financial institutions. Sharing of synthetic data allows financial institutions to do this in a way that satisfies their data sharing restrictions.
- Government – to better treat millions of US military veterans, the Veterans Health Administration Innovation Ecosystem (VHA IE) is using synthetic data to improve patient care pathways in suicide prevention and chronic disease management, and to track trends in population health. This includes efforts to create predictive models for COVID-19 in Veteran populations, which improves patient care services, while ensuring the security of their personal health information.
- Education – in Maryland, the school system utilized synthetic data to increase the usefulness and accessibility of the Maryland Longitudinal Data System (MLDS) data to researchers, policy analysts, and stakeholders at the local, state, and national levels. While they are obligated to make data available, various confidentiality laws, like the Family Educational Rights and Privacy Act (FERPA), which mandates student privacy, limit the ability to share de-identified data with third parties. However, synthetic MLDS data could be safely shared and analyzed.
- Digital Media – in the wake of the Cambridge Analytica scandal, where personal data belonging to millions of Facebook users was collected without their consent by a British consulting firm, media companies have been looking to use synthetic data to enhance privacy, particularly for user data that’s sold to advertisers. Efforts also include the generation of privacy-preserving, less biased datasets to inform recommendation systems that power, for instance, Twitter’s news feed.
- Public Utilities – data anonymization is a necessity even for utility companies. For example, energy providers collect and process personal data to serve customers and process payments, as well as maintain systems and manage outages. In addition to other international regulations, utility companies must comply with state public records acts (like Washington’s RCW 19.29A) and provide data disclosures in the event of an investigation. And according to one annual report, the energy industry is ranked fifth in data breach costs, surpassed only by the health care, financial, pharmaceutical and technology verticals. Here, synthesizing customers’ utility data could provide increased privacy guarantees and protections against attacks.
Automate Anonymization Tasks with Gretel
When it comes to data anonymization, remember, it's a process! At Gretel, we focus heavily on developer workflows, so that data anonymization processes can be defined as code and injected directly into pipelines that developers are already building and maintaining.
By exposing data anonymization techniques through easy-to-use APIs, developers can iterate on the anonymization process, and generate data that can be shared more broadly.
We're constantly innovating on data anonymization tools, synthetic data generation, and data evaluators. Sign up for our free tier and give Gretel a try. You can get started right away and Synthesize, Transform, and Classify your data with no code required!