Automatically Reducing AI Bias With Synthetic Data

Create a fair, balanced, privacy-preserving version of the 1994 US Census dataset using gretel-synthetics.

In this post, we'll walk you through example code for automatically creating a fair, balanced, privacy-preserving version of the 1994 US Census dataset using Gretel SDKs.

What is Fair AI?

Let’s start with a definition. “In fair AI, the objective is to provide systems that both quantify bias and mitigate discrimination against subgroups.” [1] In this example, we will boost under-represented `race`, `gender`, and `income_bracket` classes from the 1994 US Census dataset. The Python notebook below (we call it a blueprint) can be used to balance datasets for AI fairness, as well as imbalanced datasets more generally.

Artificial Intelligence (AI) is now ubiquitous in our culture. It is often responsible for critical decisions such as who to hire and at what salary, who to give a loan or insurance policy to, and who is at risk for cancer or heart disease. Fair AI strives to eliminate discrimination against demographic groups. The code example below can help you achieve fair AI by boosting the representation of minority classes in your data with synthetic data. With this approach, only a single pass is required to correct representational bias across multiple fields in your dataset (gender and ethnicity, for example). Please note that correlations and distributions in non-bias fields (such as age) will transfer from your training data to your synthetic data.

The blueprint lets you choose between two modes for balancing your data. The first (mode="full") generates a complete synthetic dataset with representation bias removed. The second (mode="additive") generates only the synthetic samples that, when added to the original dataset, remove the bias.
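
To make the difference between the two modes concrete, here is a minimal sketch of how many synthetic records each class would need for a single field. The helper `records_to_generate` is hypothetical and not part of the Gretel SDK, and the balancing rule (equal counts per class) is an assumption about what "balanced" means here.

```python
from collections import Counter


def records_to_generate(values, mode, total=None):
    """Return the number of synthetic records to generate per class.

    Hypothetical helper for illustration only (not a Gretel SDK function).
    values: the class label of each original record, e.g. the "gender" column.
    mode="full": split `total` synthetic records evenly across classes.
    mode="additive": generate just enough records to bring every class
    up to the size of the largest class in the original data.
    """
    counts = Counter(values)
    if mode == "full":
        if total is None:
            raise ValueError("mode='full' needs the total number of records")
        base, extra = divmod(total, len(counts))
        # Hand out any remainder one record at a time, in sorted class order.
        return {
            label: base + (1 if i < extra else 0)
            for i, label in enumerate(sorted(counts))
        }
    if mode == "additive":
        target = max(counts.values())
        return {label: target - n for label, n in counts.items()}
    raise ValueError(f"unknown mode: {mode!r}")


# Illustrative labels, not real Census values.
labels = ["Male"] * 7 + ["Female"] * 3
print(records_to_generate(labels, "additive"))
print(records_to_generate(labels, "full", total=1000))
```

In "additive" mode the majority class needs no new records, while in "full" mode every class receives an equal share of a new, entirely synthetic dataset.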

To get started, log into the Gretel Console with a GitHub or Google account, and create a new project. 

  • Choose "From Blueprint" and then select the "Automatically balance your data" blueprint from the "Recommended" section. 
  • Next, a Gretel Project with sample data is created.
  • After your data is uploaded, select the "Transform" tab. Copy the Project's Connection URI from the "Integration" menu in the top right.
  • Launch the "Automatically balance your data" notebook.

As always, you can check out the complete code for this Blueprint on our GitHub.

Our imbalanced dataset

The blueprint starts out with the necessary code to install Gretel's SDK. The data used for this blueprint was created from the 1994 US Census Database. It consists of anonymous information such as occupation, age, native country, race, capital gain, capital loss, education, work class and more. Each row is labelled with an income bracket of either ">50K" or "<=50K". In this example, we use 14,000 records from the dataset, set the mode to “full”, and request 1,000 synthetic records.

The blueprint then lets you view the existing distribution of each categorical field in the dataset.
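
As a rough sketch of what inspecting a categorical distribution involves, the snippet below computes the share of each value in one field using only the standard library. The function name and the sample rows are illustrative assumptions, not the blueprint's code or real Census records. In a notebook with the data loaded into a pandas DataFrame, `df[field].value_counts(normalize=True)` gives the same information.

```python
from collections import Counter


def categorical_distribution(rows, field):
    """Return {value: fraction of rows} for one categorical field.

    Hypothetical helper for illustration. `rows` is a list of dicts,
    one per record. Values are ordered from most to least common.
    """
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common()}


# Made-up rows that only mimic the shape of the Census data.
sample_rows = [
    {"gender": "Male", "income_bracket": "<=50K"},
    {"gender": "Male", "income_bracket": ">50K"},
    {"gender": "Male", "income_bracket": "<=50K"},
    {"gender": "Female", "income_bracket": "<=50K"},
]

for field in ("gender", "income_bracket"):
    print(field, categorical_distribution(sample_rows, field))
```

A field whose fractions are far from equal is a candidate for boosting.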


Boosting the Census with synthetic data

Next, we’ll choose to remove the demographic bias in the “race”, “gender”, and “income_bracket” fields.

After choosing your fields, the blueprint takes you through the typical cells to train your synthetic data model. During generation, we seed the model with the classes that need boosting so that it generates additional records for them. Upon completion, your new data will be perfectly balanced across the chosen fields.
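
The seeding step can be pictured with the following sketch, which builds one seed per record that still needs to be generated in "additive" mode. It assumes that "balanced" means every observed combination of the chosen fields ends up with the same count, which may differ from the rule the blueprint actually applies. `build_seeds` is a hypothetical helper, and passing the seeds to a trained model is left out because that call depends on the SDK.

```python
from collections import Counter
from itertools import product


def build_seeds(rows, fields):
    """Return a list of seed dicts, one per synthetic record needed.

    Hypothetical helper for illustration (not a Gretel SDK function).
    Each seed fixes the values of `fields`; a conditional generator
    would then fill in the remaining columns of the record.
    """
    # Values seen in the original data for each field, in sorted order.
    observed = {f: sorted({row[f] for row in rows}) for f in fields}
    combo_counts = Counter(tuple(row[f] for f in fields) for row in rows)
    target = max(combo_counts.values())

    seeds = []
    for combo in product(*(observed[f] for f in fields)):
        # Counter returns 0 for combinations that never occur in the data.
        missing = target - combo_counts[combo]
        seeds.extend(dict(zip(fields, combo)) for _ in range(missing))
    return seeds


# Made-up rows that only mimic the shape of the Census data.
rows = (
    [{"gender": "Male", "income_bracket": ">50K"}] * 3
    + [{"gender": "Male", "income_bracket": "<=50K"}] * 2
    + [{"gender": "Female", "income_bracket": "<=50K"}]
)
seeds = build_seeds(rows, ["gender", "income_bracket"])
print(len(seeds), "seed records needed")
```

Note that balancing over combinations grows quickly with the number of fields and values, so a combination that never appears in the original data still receives a full set of seeds.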


The blueprint concludes by letting you save your new synthetic data either to a CSV file or back to a Gretel Project. You’re also able to generate a full synthetic Performance Report.

Final remarks

Discrimination in data from which important conclusions are drawn can have disastrous consequences. At Gretel, creating bias-free synthetic data is a core use case. Stay tuned for new features in this area! We’d love to hear about your use cases. Feel free to reach out to us for a more in-depth discussion in the comments or on Twitter, and follow us to keep up with the latest trends in synthetic data!


[1] Ahsen, Mehmet Eren, Mehmet Ulvi Saygi Ayvaci, and Srinivasan Raghunathan. "When algorithmic predictions use human-generated data: A bias-aware classification algorithm for breast cancer diagnosis." Information Systems Research 30.1 (2019): 97-116.