Comprehensive Data Cleaning for AI and ML
Data cleaning is a fundamental first step in any ML or AI project. You’ll often hear the phrase “garbage in, garbage out”: if the data you feed into your pipeline is messy, your results are likely to be messy as well. In this blog, we step you through some approaches to data cleaning with a particular eye toward AI/ML. Please feel free to follow along with our Jupyter notebook here.
The issues we address are listed below, ordered so that earlier steps benefit the analysis in later steps.
- Standardize empty values
- Remove duplicate records
- Drop columns with mostly missing data
- Impute missing values
- Remove redundant fields
- Cap high float precision
- Remove constant fields
- Address field level outliers
- Handle record level outliers
The dataset we use is the popular Adult Census Income dataset, modified to include an example of each type of problem listed above. We can read in the dataset as follows:
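The snippets throughout this post are minimal sketches rather than the notebook’s exact code; here we assume the modified dataset lives in a local CSV whose name, `adult_income_modified.csv`, is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical file name for the modified Adult Census Income dataset.
df = pd.read_csv("adult_income_modified.csv")
df.head()
```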
Standardize empty values
Missing values can be represented by a variety of strings, such as "?", "Missing", "N/A", or "Not applicable". Standardize all of these values to `np.nan`; this will simplify the imputation of missing values. You can see in our dataset that "?" is often used to represent missing data.
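A minimal sketch of this step, assuming the markers listed above are the ones present in the data:

```python
import numpy as np

# Map every known "missing" marker onto a single representation: np.nan.
missing_markers = ["?", "Missing", "N/A", "Not applicable"]
df = df.replace(missing_markers, np.nan)
```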
Remove duplicate records
First, show the first occurrence of all duplicated rows. For brevity's sake, we’ll just show the first five.
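One way to do this with pandas (the notebook may slice the duplicates differently):

```python
# keep=False flags every row that has at least one duplicate; dropping the
# repeats then leaves just the first occurrence of each duplicated record.
first_occurrences = df[df.duplicated(keep=False)].drop_duplicates()
first_occurrences.head()
```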
Your job now is to study the duplicated records and decide if they are in error or not. If they are in error, you can remove them with the following command:
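For example, to keep the first copy of each record and drop the rest:

```python
# Remove exact duplicate rows, keeping the first occurrence of each.
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```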
Drop columns with mostly missing data
In our experience, you're better off dropping columns with 60% or more of their data missing. When so much data is missing, the signal from the field is weak and noisy. Below, we show the percent missing for each field.
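A simple way to compute this with pandas:

```python
# Percent of missing values per column, worst offenders first.
percent_missing = (df.isna().mean() * 100).sort_values(ascending=False)
print(percent_missing)
```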
Note, only `dummy_col1` has more than 60% data missing, so now we drop that column.
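A sketch of the drop, using the 60% threshold from above:

```python
# Drop any column with 60% or more of its values missing
# (in this dataset, only dummy_col1).
percent_missing = df.isna().mean() * 100
df = df.drop(columns=percent_missing[percent_missing >= 60].index)
```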
Impute missing values
There are several simple ways to fill in missing data. If a field is numeric, you can fill missing values with the mean or the median; if it is categorical, you can fill them with the most frequent category. Machine learning-based imputation approaches are typically far more effective. KNN imputation is one such algorithm that has seen success, but it requires tuning the parameter k and inherits many of KNN’s weaknesses, such as sensitivity to outliers and noise.
In this blog, we’ll be using MissForest, another machine learning-based imputation algorithm. It trains a random forest on the observed values of a data matrix to predict the missing values. MissForest has several benefits; for one, it can be applied to mixed data types, both numerical and categorical.
We’ll start by listing the missing percent of records per field:
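For example, the same computation as before, on the remaining columns:

```python
# Percent missing per field after the earlier cleanup steps.
print((df.isna().mean() * 100).round(2))
```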
Now let’s create several functions for massaging the data into the numeric format MissForest needs:
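Below is one possible set of helpers, assuming scikit-learn’s `LabelEncoder` for the categorical fields; the notebook’s helpers may differ.

```python
from sklearn.preprocessing import LabelEncoder

def get_categorical_columns(df):
    """Return the names of the non-numeric (categorical) columns."""
    return df.select_dtypes(include=["object", "category"]).columns.tolist()

def encode_categoricals(df, categorical_columns):
    """Label-encode each categorical column (skipping NaNs) and return the
    encoded frame along with the fitted encoders for later decoding."""
    encoded = df.copy()
    encoders = {}
    for col in categorical_columns:
        le = LabelEncoder()
        not_null = encoded[col].notna()
        encoded.loc[not_null, col] = le.fit_transform(encoded.loc[not_null, col])
        encoded[col] = encoded[col].astype(float)
        encoders[col] = le
    return encoded, encoders

def decode_categoricals(df, encoders):
    """Map numeric category codes back to their original string labels."""
    decoded = df.copy()
    for col, le in encoders.items():
        decoded[col] = le.inverse_transform(decoded[col].round().astype(int))
    return decoded
```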
Let’s train and run the MissForest algorithm:
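A sketch of the imputation step, assuming the `missingpy` implementation of MissForest (which takes the categorical column indices via `cat_vars`) and the helper functions defined above:

```python
from missingpy import MissForest

categorical_columns = get_categorical_columns(df)
df_encoded, encoders = encode_categoricals(df, categorical_columns)

# MissForest needs to know which column indices hold categorical values.
cat_indices = [df_encoded.columns.get_loc(col) for col in categorical_columns]

imputer = MissForest(random_state=42)
imputed = imputer.fit_transform(df_encoded.values, cat_vars=cat_indices)
df_imputed = pd.DataFrame(imputed, columns=df_encoded.columns)
```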
Now we translate the categorical values (which are currently numeric) back into strings:
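Using the decoder helper from above:

```python
# Convert the imputed numeric category codes back to their string labels.
df = decode_categoricals(df_imputed, encoders)
```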
Finally, let’s take a look at our new data. Note that there are no longer any missing values.
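For example:

```python
df.head()
print(df.isna().sum().sum())  # expected to be 0
```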
Remove redundant fields
When two fields are completely correlated, they're redundant and one of them could be removed. First, here's some code to compute field-by-field correlations.
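Here is one way to do it, as a sketch: since the data mixes numeric and categorical fields, this helper simply flags pairs of columns whose values map one-to-one onto each other, which is what “100% correlated” means for our purposes. The notebook may use a different correlation measure.

```python
from itertools import combinations

def perfectly_correlated_pairs(df):
    """Return pairs of columns whose values map one-to-one onto each other."""
    pairs = []
    for col_a, col_b in combinations(df.columns, 2):
        a_determines_b = df.groupby(col_a)[col_b].nunique().max() == 1
        b_determines_a = df.groupby(col_b)[col_a].nunique().max() == 1
        if a_determines_b and b_determines_a:
            pairs.append((col_a, col_b))
    return pairs

print(perfectly_correlated_pairs(df))
```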
Running the code flags `education` and `education.num` as 100% correlated. The best thing to do is to remove one of them and then proceed with your AI/ML pipeline. If you really need both fields in the data afterwards (as can be the case when creating synthetic data), then before dropping one, create a map between the two fields; at the end, use the map to restore the removed field. Here’s example code removing `education`:
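A sketch that builds the map and drops the redundant column:

```python
# Build a map from education.num to education so the dropped field can be
# restored later, then drop the redundant column.
education_map = (
    df[["education.num", "education"]]
    .drop_duplicates()
    .set_index("education.num")["education"]
    .to_dict()
)
df = df.drop(columns=["education"])

# Later, restore it with: df["education"] = df["education.num"].map(education_map)
```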
Cap high float precision
Sometimes float values carry excessively long precision, particularly when they are the result of a mathematical computation. Some models, like Gretel’s LSTM, are sensitive to values with excessive precision and do much better when the floats are rounded. In some cases (such as when creating synthetic data), how much precision you should keep depends on how much precision you need in the output. If you can reduce the precision, rounding to two digits to the right of the decimal point generally works well. In our dataset, only one column has long floating-point precision: `dummy_col3`.
Now we drop some of the precision by rounding every float to have two spots to the right of the decimal point, and look at the data again.
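For example:

```python
# Round every float column to two digits after the decimal point.
float_cols = df.select_dtypes(include="float").columns
df[float_cols] = df[float_cols].round(2)
df[float_cols].head()
```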
Remove constant fields
When a field only has one consistent value, it has very low predictive power and you're better off removing that field. This next code snippet will detect any constant fields:
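A minimal snippet; the print format mirrors the output shown below:

```python
# Flag any column that holds a single constant value across all records.
for col in df.columns:
    unique_values = df[col].unique()
    if len(unique_values) == 1:
        print(f"Column {col} is a constant column of {unique_values[0]}")
```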
The above returns the following line:
`Column dummy_col2 is a constant column of 0.0`
Now you can remove that column with the below code snippet:
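For example:

```python
# Drop the constant column found above.
df = df.drop(columns=["dummy_col2"])
```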
Address field level outliers
Records that contain outlier values can be very disruptive to an AI/ML pipeline. It's important to look at each outlier found and determine whether it is an outlier because there's an error in the data. You can then either fix the error or remove the whole record. Remember: garbage in, garbage out. If there are errors in the training data, there will be errors in the AI/ML analysis. In this section we look for outliers at the field level; in the next section, we look at outliers at the record level.
We’ll start by displaying the numeric columns:
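For example:

```python
# The numeric columns are the candidates for box plot inspection.
print(df.select_dtypes(include="number").columns.tolist())
```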
Now we’ll pick the column `age` and graph a box plot. The nice thing about a box plot is you can see how outliers fit within the context of properties like minimum, first quartile, median, third quartile, and maximum.
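A sketch using matplotlib:

```python
import matplotlib.pyplot as plt

# Box plot of age: the box spans the first to third quartile, the center line
# is the median, the whiskers extend to the min/max within 1.5 * IQR, and the
# points beyond the whiskers are the outliers.
plt.figure(figsize=(4, 6))
plt.boxplot(df["age"].dropna())
plt.title("age")
plt.show()
```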
You can see that ages above 80 are outliers. In this case, we decide that these are valid records (not errors) and leave the records in. If you had wanted to remove the outliers, the following code would do the trick:
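For example, using the 80-year cutoff seen in the box plot:

```python
# Drop records where age falls above the outlier cutoff.
df = df[df["age"] <= 80]
```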
Handle record level outliers
A record level outlier is one where the combination of values across a record's fields is anomalous, even if no single field value is an outlier on its own. A nice routine for measuring record level outliers is Isolation Forest, an unsupervised, decision-tree-based algorithm originally developed for outlier detection in tabular data. You can read more about it here.
We start by defining a set of functions we’ll be using:
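The helpers below are a sketch built on scikit-learn (the notebook’s versions may differ): one encodes and min-max scales the data so every field contributes on a comparable scale, and one converts scikit-learn’s `score_samples` output into the original paper’s [0, 1]-style anomaly score, where higher means more anomalous.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def encode_and_scale(df):
    """Convert categoricals to numeric codes and min-max scale all fields."""
    encoded = df.copy()
    for col in encoded.select_dtypes(include=["object", "category"]).columns:
        encoded[col] = encoded[col].astype("category").cat.codes
    scaled = MinMaxScaler().fit_transform(encoded)
    return pd.DataFrame(scaled, columns=encoded.columns, index=encoded.index)

def anomaly_scores(model, X):
    """Isolation Forest anomaly scores in the paper's convention:
    close to 1 means anomalous, around 0.5 or below means normal."""
    # scikit-learn's score_samples returns the opposite of the paper's score.
    return -model.score_samples(X)
```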
Now we normalize the data, train, and score the outlier model:
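A sketch of training and scoring with scikit-learn’s `IsolationForest`, using the helpers above:

```python
from sklearn.ensemble import IsolationForest

X = encode_and_scale(df)

model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X)

# Attach a per-record outlier score to the dataframe.
df["outlier_score"] = anomaly_scores(model, X)
```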
Our scoring mechanism has been normalized such that an outlier score of 0.7 or above means the record is for sure an outlier. A score of 0.6 to 0.7 means the record is possibly an outlier and should be looked at manually.
Let’s start by looking at the for sure outliers:
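For example:

```python
# Records with a definite-outlier score.
df[df["outlier_score"] >= 0.7]
```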
There is only one record, a dummy error record we entered where the person is a 6-year-old elementary school student earning more than 50K. The following code shows how to remove this outlier:
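For example:

```python
# Drop the definite outliers and the helper score column.
df = df[df["outlier_score"] < 0.7].drop(columns=["outlier_score"])
```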
Data cleaning can often make or break the success of an AI/ML project. Gretel encourages all of its users to fully clean their data before initiating a Gretel project. Most of Gretel’s models for creating synthetic data use AI/ML, so data cleaning increases the odds of creating high-quality synthetic data, which in turn increases the odds of a successful downstream AI/ML project. Please feel free to contact us with any questions or comments: email@example.com.