Improving massively imbalanced datasets in machine learning with synthetic data
Handling imbalanced datasets in machine learning is a difficult challenge, and arises in domains such as payment fraud, cancer and disease diagnosis, and cyber security attacks. What all of these have in common is that only a very small percentage of the overall records belong to the class we actually care about detecting. In this post, we will boost accuracy on a popular Kaggle fraud dataset by training a generative synthetic data model to create additional fraudulent records. Uniquely, this model will incorporate features from both fraudulent records and their nearest neighbors, which are labeled as non-fraudulent but are close enough to the fraudulent records to be a little “shady”.
Our imbalanced dataset
For this post, we selected the popular “Credit Card Fraud Detection” dataset on Kaggle. This dataset contains labeled transactions from European credit card holders in September 2013. To protect user identities, sensitive features have been reduced via dimensionality reduction into 28 floating point columns (V1–V28), plus a Time column (the number of seconds elapsed between each transaction and the first in the dataset). For this post, we will work with the first 10k records of the dataset; click below to reproduce the graphs in this post in Google Colaboratory.
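If you would rather follow along locally, a minimal sketch of this step might look like the following (assuming the Kaggle CSV has been downloaded as creditcard.csv):

```python
# Load the Kaggle CSV and keep the first 10k records
import pandas as pd

df = pd.read_csv("creditcard.csv").head(10000)
print(df["Class"].value_counts())  # Class == 1 marks fraudulent transactions
```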
The metric trap
Let’s see what kind of performance we can get detecting fraudulent records with a cutting-edge ML classifier, XGBoost. We will start by dividing our dataset into a Train and Test set.
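The notebook has the authoritative version, but a minimal sketch of this step, assuming an out-of-the-box XGBoost classifier and a stratified 80/20 split, could look like:

```python
# Baseline: stratified split plus a default XGBoost classifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = df.drop(columns=["Class"])
y = df["Class"]

# Stratify so the rare fraud records land in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier()
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
```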
Wow, 99.75% detection. That’s awesome, right?! Maybe not. Overall model accuracy only shows how well the model performed across the entire set, not how well it did at detecting fraudulent records. To see how well we really performed, print a confusion matrix and classification report.
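Something like the following, reusing the model and split from above:

```python
# A closer look at per-class performance on the held-out test set
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["non-fraud", "fraud"]))
```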
Above we can see that despite our 99.75% overall accuracy, we misclassified 43% of fraud examples in our test set!
Augmenting fraud examples with synthetic data
In this section we will focus on improving model performance and generalization for the fraud records by using gretel-synthetics. Our goal is to generate additional samples of fraudulent records that will help our classifier generalize and better detect the fraudulent records in our test set.
Synthetic Minority Oversampling Technique
A popular technique in the data science community to achieve this is called SMOTE (Synthetic Minority Oversampling Technique), described by Nitesh Chawla et al. in their 2002 paper. SMOTE works by selecting examples from the minority class, finding their nearest neighbors in the minority class, and effectively interpolating new points between them. SMOTE does not have the ability to incorporate data from records outside of the minority class, which in our case may contain useful information, including fraudulent-like or mislabeled records.
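For reference, classic SMOTE ships with the imbalanced-learn package. This is the technique we are contrasting against, not the approach used in this post, but a minimal sketch looks like this:

```python
# For comparison only: classic SMOTE via the imbalanced-learn package.
# It interpolates new minority points between nearest minority neighbors.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)  # random_state is illustrative
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# y_resampled now contains as many fraud labels as non-fraud labels
```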
Gretel synthetics with concepts from SMOTE
Having only 31 examples of fraudulent data in our training set presents a unique challenge for generalization: gretel-synthetics uses deep learning to learn and generate new samples, and deep learning models traditionally require a lot of data to converge. Open the notebook below to generate your own synthetic fraud dataset for free with Google Colab.
By borrowing SMOTE’s approach of finding the nearest neighbors to the fraudulent set, and incorporating a few of those nearest neighbors from the majority class, we have the opportunity both to expand our training examples and to fold in some learnings from our fraudulent-like (let’s just call them shady) records. This approach requires no changes to Gretel Synthetics; we are just intelligently building the training set from the fraudulent records plus their shady nearest neighbors. Let’s get started!
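Here is a rough sketch of that selection step using scikit-learn’s NearestNeighbors; the variable names are illustrative rather than taken from the notebook:

```python
# Illustrative sketch: build the synthetic model's training set from all
# fraud records plus their nearest non-fraud ("shady") neighbors.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

NEAREST_NEIGHBOR_COUNT = 5  # per the post; try 3 if you see false positives

train_df = X_train.copy()
train_df["Class"] = y_train

fraud_df = train_df[train_df["Class"] == 1]
non_fraud_df = train_df[train_df["Class"] == 0]

# For each fraud record, find its nearest neighbors among non-fraud records
nn = NearestNeighbors(n_neighbors=NEAREST_NEIGHBOR_COUNT)
nn.fit(non_fraud_df.drop(columns=["Class"]))
_, neighbor_idx = nn.kneighbors(fraud_df.drop(columns=["Class"]))

shady_df = non_fraud_df.iloc[np.unique(neighbor_idx.ravel())]
synthetic_train_df = pd.concat([fraud_df, shady_df])
```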
To build our synthetic model, we will use Gretel’s new DataFrame training mode defaults, with a few parameters set below to optimize results (a full training sketch follows the list):
- epochs: 7. Set epochs to the lowest setting that still produces valid records, to avoid overfitting on our limited training set.
- dp: False. No need to take the accuracy hit from running differential privacy in this case.
- gen_lines: 1000. We will generate 1000 records to boost our existing 31 positive examples. Note that not all of the generated records will be positive, since we incorporated some negative examples, but we should get at least several hundred new positive examples.
- batch_size: 32. Fit all 30 columns into a single neural network batch to retain all field-to-field correlations, at the cost of more records failing validation.
- Train the model, generate lines, and keep only the “fraudulent” records created by the synthetic data model.
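Putting the settings above together, a training and generation sketch based on gretel-synthetics’ DataFrame batch mode might look like this (treat it as illustrative; config keys can vary by library version):

```python
# Illustrative sketch of DataFrame batch training with gretel-synthetics;
# exact config keys may vary across library versions.
from pathlib import Path
from gretel_synthetics.batch import DataFrameBatch

config_template = {
    "epochs": 7,          # lowest setting that still yields valid records
    "dp": False,          # skip differential privacy for this experiment
    "gen_lines": 1000,    # generate 1000 candidate records
    "checkpoint_dir": str(Path.cwd() / "checkpoints"),
    "overwrite": True,
}

# batch_size=32 keeps all fields in a single batch to retain correlations
batcher = DataFrameBatch(df=synthetic_train_df, batch_size=32, config=config_template)
batcher.create_training_data()
batcher.train_all_batches()
batcher.generate_all_batch_lines()
synthetic_df = batcher.batches_to_df()

# Keep only the records the model labeled as fraudulent; generated values
# may come back as strings, so cast before comparing
synthetic_fraud_df = synthetic_df[synthetic_df["Class"].astype(float) == 1]
```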
Examining our synthetic dataset
Now let’s take a look at our synthetic data and see if we can visually confirm that our synthetic records are representative of the fraudulent records they were trained on. Our dataset has 30 dimensions, so we will use a dimensionality reduction technique called Principal Component Analysis (PCA) to visualize the data in 2D and 3D.
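Here is a sketch of the 2D projection, assuming the data frames built in the earlier snippets share the same column order:

```python
# A sketch of the 2D PCA view over training, synthetic, and test fraud records
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on the full training feature set, then project each group
pca = PCA(n_components=2).fit(X_train)

groups = {
    "train fraud": fraud_df.drop(columns=["Class"]),
    "synthetic fraud": synthetic_fraud_df.drop(columns=["Class"]).astype(float),
    "test fraud": X_test[y_test == 1],
}
for name, frame in groups.items():
    pts = pca.transform(frame)
    plt.scatter(pts[:, 0], pts[:, 1], label=name, alpha=0.6)

plt.legend()
plt.title("PCA projection: training, synthetic, and test fraud records")
plt.show()
```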
Below we can see our training, synthetic, and test datasets compressed down to two dimensions. Visually, it looks like the 883 new fraudulent synthetic records may be very helpful to the classifier, in addition to the 31 original training examples. We also plotted the 7 test-set positive examples (of which our baseline model misclassified 3), and we’re hoping that the augmented synthetic data will help boost detection.
From what we can see in our graph, it appears our synthetically generated fraudulent examples may be really useful! Note what appear to be a few false positive examples near the Training Negative set. If you see a lot of these, try reducing NEAREST_NEIGHBOR_COUNT from 5 to 3 for better results. Let’s view the same PCA projection in 3 dimensions.
Looking at the datasets above, it appears that boosting our minority set of fraudulent records with synthetic data may help significantly with model performance. Let’s try it!
Boosting our training dataset with synthetic data
Now we reload the train and test datasets, but this time augment our existing training data with the newly generated synthetic records.
Train XGBoost on the augmented dataset, run the model against the test dataset and examine the confusion matrix.
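A minimal sketch of these two steps, reusing names from the earlier snippets:

```python
# Augment the original training data with the synthetic fraud records,
# then retrain and re-evaluate against the untouched test set.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

augmented_df = pd.concat([train_df, synthetic_fraud_df.astype(float)])
X_aug = augmented_df.drop(columns=["Class"])
y_aug = augmented_df["Class"].astype(int)

model = XGBClassifier()
model.fit(X_aug, y_aug)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["non-fraud", "fraud"]))
```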
As we have seen, it is a hard challenge to train machine learning models to accurately detect extreme minority classes. But synthetic data offers a way to boost accuracy and potentially improve a model’s ability to generalize to new datasets, and it can uniquely incorporate features and correlations from the entire dataset into the synthetic fraud examples.
For next steps, try running the notebooks above on your own data. Want to learn more about synthetic data? Check out Towards Data Science articles mentioning Gretel-Synthetics here and here.
Final remarks
At Gretel.ai we are super excited about the possibility of using synthetic data to augment training sets and create ML and AI models that generalize better against unknown data, with reduced algorithmic bias. We’d love to hear about your use cases; feel free to reach out for a more in-depth discussion in the comments, on Twitter, or at hi@gretel.ai. Follow us to keep up with the latest trends in synthetic data!
Interested in training on your own dataset? Gretel-synthetics is free and open source, and you can start experimenting in seconds via Colaboratory. If you like gretel-synthetics give us a ⭐ on GitHub!