Downstream ML classification with Gretel ACTGAN and PyCaret

Learn about downstream machine learning tasks and synthetic data with Gretel’s new ACTGAN model and the PyCaret library

Updated

December 2, 2022

In a recent post we discussed how to safely avoid linkage attacks by using synthetic data. The synthetic data was used to train a downstream machine learning classifier. Classifiers require accurate, high-quality data before they can be usefully deployed. In this post, we dive into using synthetic data to train a machine learning model to determine if a customer will purchase a certain product.

When practitioners use the word “downstream,” they're typically referring to a step in the system that happens after the data has been processed and transformed. For example, if we were to train a machine learning classifier and then use it to make predictions about some customer behavior, those predictions could be used downstream to take further business action on behalf of the customer.

However, we can also refer to the machine learning classifier itself as being downstream if we do a series of transformations to the training data. That's the case here, where we create a synthetic version of the data that we then use to train a downstream classifier which itself may have other downstream effects. We have a notebook where you can follow along.

I’ve generated synthetic data, now what?

The data we use as an example here is our grocery store data.

order_id	order_dow	order_hour_of_day	days_since_prior_order	baby food formula	...	tortillas flat bread	yogurt
1597	1	8	4	0	...	0	1
2011	4	10	30	0	...	0	0
2822	0	8	29	1	...	2	2
2889	1	15	8	0	...	1	1
3971	2	18	8	0	...	0	0

Every food column after the first few metadata columns records how many of that item a customer bought in a single transaction. The column we’ll focus on predicting in this case is `frozen pizza`, but we could choose any column of interest.
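Because the target is a purchase count rather than a yes/no flag, it helps to look at how its values are distributed before training. The following is a minimal sketch on a made-up toy frame: the `order_id` values echo the sample rows above, but the `frozen pizza` and `yogurt` counts, and the name `toy_orders`, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame mimicking the layout above; the counts are made up.
toy_orders = pd.DataFrame(
    {
        "order_id": [1597, 2011, 2822, 2889, 3971, 4102],
        "order_dow": [1, 4, 0, 1, 2, 6],
        "frozen pizza": [0, 0, 1, 0, 2, 0],
        "yogurt": [1, 0, 2, 1, 0, 3],
    }
)

target = "frozen pizza"

# Count how many orders fall into each purchase-count class.
class_counts = toy_orders[target].value_counts().sort_index()

# Share of the most common class: the accuracy a classifier would reach
# by always predicting that class.
majority_share = class_counts.max() / class_counts.sum()

print(class_counts.to_dict())
print(round(majority_share, 3))
```

On real data, a large majority share is worth keeping in mind when reading accuracy figures later on.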

Since we are going to train both a synthetic data generating model and a downstream classification model, we need to hold out a small validation set (5% of the data here). This validation set isn’t seen by the synthetic model or the classification model. Its purpose is to test how a classifier trained purely on synthetic data performs on unseen real data.

This is an additional step in the traditional machine learning pipeline and ensures that our classification model trained on synthetic data can be used for real world data without data leakage.

Figure 1: Diagram showing data split into training, test, and validation sets

We can use the remaining 95% of the data to train our synthetic model.

from sklearn.model_selection import train_test_split
train_df, validation_data = train_test_split(all_original_data, test_size=0.05)
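As a minimal sketch of this holdout step, the snippet below runs the same split on a randomly generated stand-in for `all_original_data` and checks that no row ends up on both sides. The stand-in data, the 1,000-row size, and the `random_state` argument are assumptions added for repeatability; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for all_original_data: 1,000 rows of random counts.
rng = np.random.default_rng(0)
all_original_data = pd.DataFrame(
    rng.integers(0, 3, size=(1000, 4)),
    columns=["order_dow", "frozen pizza", "tortillas flat bread", "yogurt"],
)

# Hold out 5% of the real rows; random_state makes the split repeatable.
train_df, validation_data = train_test_split(
    all_original_data, test_size=0.05, random_state=42
)

# The two index sets should not overlap, and together they cover every row.
overlap = train_df.index.intersection(validation_data.index)
print(len(train_df), len(validation_data), len(overlap))
```

Checking the index overlap is a cheap guard against accidental leakage between the held-out rows and the rows used to train the synthetic model.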

In this instance, because we have over 100 columns, we want to use a model that handles high-dimensional data well. Additionally, we want to make sure that our model can handle any columns that might have a large number of purchases. For that reason, we’ll use our newly released Gretel ACTGAN model. This model is GAN-based, and highly effective for tabular data generation. Its improved memory usage, speed, and accuracy make it an excellent choice for this use case.

Once we’ve trained our synthetic model and verified its quality we can generate records to train our downstream classifier. We generate the same number of records as we had in our training data, which results in a synthetic dataset with high downstream utility.

We split this synthetically generated data in the same way we would split the original data if we were using that directly for our model training.

synthetic_train_data, synthetic_test_data = train_test_split(synthetic_df, test_size=0.2)
original_train_data, original_test_data = train_test_split(train_df, test_size=0.2)

We also split the original training data (the 95% that was not held out) in the same way, which gives us a real test set as an additional check on our downstream model’s performance.

Train and evaluate downstream ML models with PyCaret

A data practitioner would spend a great deal of time selecting a model and validating its performance. These steps can be accelerated using an AutoML tool like PyCaret, which we do here.

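from pycaret.classification import setup, compare_models, predict_model
# Configure the experiment on the synthetic training split, then rank candidate models.
s = setup(synthetic_train_data, target='frozen pizza')
best = compare_models()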

We first see a table that outlines information about the data and proposed training run. This table shows the size of our data, number of features, and other useful information. A subset of the information is shown here:

Description	Value
Target	frozen pizza
Target Type	Multiclass
Original Data	(4000, 137)
Fold Number	10

After the 14 candidate models have been trained and scored with 10-fold cross-validation, we see the results. In this case, the top-ranked model by accuracy is a Random Forest Classifier, which ties with the Extra Trees and Dummy classifiers at 0.9085. The actual numbers will change between runs due to the stochastic nature of synthetic data generation.

	Model	Accuracy	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
rf	Random Forest Classifier	0.9085	0.2500	0.8254	0.8650	0.0000	0.0000	0.0930
et	Extra Trees Classifier	0.9085	0.2500	0.8254	0.8650	0.0000	0.0000	0.1020
dummy	Dummy Classifier	0.9085	0.2500	0.8254	0.8650	0.0000	0.0000	0.0130
knn	K Neighbors Classifier	0.9078	0.2498	0.8257	0.8648	0.0018	0.0032	0.1900
ridge	Ridge Classifier	0.9068	0.2495	0.8253	0.8641	-0.0030	-0.0065	0.0160
lightgbm	Light Gradient Boosting Machine	0.9060	0.2493	0.8252	0.8637	-0.0042	-0.0097	0.0640
lr	Logistic Regression	0.9021	0.2505	0.8329	0.8632	0.0055	0.0168	0.7870
svm	SVM - Linear Kernel	0.8957	0.2476	0.8270	0.8594	-0.0077	-0.0106	0.0550
gbc	Gradient Boosting Classifier	0.8950	0.2556	0.8300	0.8595	-0.0008	0.0003	0.4820
ada	Ada Boost Classifier	0.8746	0.2418	0.8299	0.8500	0.0071	0.0121	0.0550
lda	Linear Discriminant Analysis	0.8592	0.2387	0.8268	0.8422	-0.0151	-0.0145	0.0470
dt	Decision Tree Classifier	0.8253	0.2429	0.8281	0.8264	-0.0157	-0.0158	0.1560
nb	Naive Bayes	0.2551	0.1984	0.8356	0.3561	0.0034	0.0069	0.1600
qda	Quadratic Discriminant Analysis	0.0622	0.2346	0.0054	0.0099	-0.0011	-0.0031	0.0390
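To make the comparison step and the Kappa column easier to read, here is a rough sketch of the core idea using scikit-learn directly: score a few candidate models with 10-fold cross-validation and rank them by mean accuracy. This is not how PyCaret is implemented and it covers only three models; the data is made up (random features, 10% positive labels), and the names `candidates` and `results` are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced data: random features, 50 positives out of 500 rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.array([1] * 50 + [0] * 450)

candidates = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}
scoring = {"accuracy": "accuracy", "kappa": make_scorer(cohen_kappa_score)}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "kappa": scores["test_kappa"].mean(),
    }

# Rank by mean accuracy, as the table above does.
for name, metrics in sorted(results.items(), key=lambda kv: -kv[1]["accuracy"]):
    print(f"{name:6s} accuracy={metrics['accuracy']:.4f} kappa={metrics['kappa']:.4f}")
```

In this toy setup the dummy model, which always predicts the most frequent class, reaches an accuracy equal to that class's share while its Kappa stays at zero. Reading accuracy alongside Kappa or MCC shows how much a model adds beyond that baseline.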

We then evaluate the models on various subsets of the data as we see fit.

Model trained on synthetic data and evaluated on real test data:

test_predictions = predict_model(best, data=original_test_data)

Model	Accuracy	Recall	Prec.	F1	Kappa	MCC
Random Forest Classifier	0.9495	0.3333	0.9015	0.9249	0.0000	0.0000

Model trained on synthetic data and evaluated on real validation data:

valid_predictions = predict_model(best, data=validation_data)

Model	Accuracy	Recall	Prec.	F1	Kappa	MCC
Random Forest Classifier	0.9320	0.3333	0.8686	0.9249	0.0000	0.0000
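The pattern behind both evaluations is "train on synthetic, test on real". As a sketch of that pattern outside PyCaret, the snippet below fits a scikit-learn random forest on a stand-in for synthetic rows and scores it on separate "real" rows. Everything here is assumed for illustration: the data generator `make_rows`, the noise added to imitate synthetic data, and the model settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)


def make_rows(n):
    """Hypothetical generator: two features and a label that depends on them."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


# "Real" held-out rows, plus a stand-in for synthetic rows:
# data from the same process with a little noise added to the features.
X_real_test, y_real_test = make_rows(200)
X_synth, y_synth = make_rows(800)
X_synth = X_synth + rng.normal(scale=0.1, size=X_synth.shape)

# Fit only on the synthetic stand-in, then score on rows the model never saw.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_synth, y_synth)
accuracy = accuracy_score(y_real_test, model.predict(X_real_test))
print(f"accuracy on held-out real rows: {accuracy:.4f}")
```

The key design point is that the real rows used for scoring never touch either training step, which mirrors the role of the held-out validation set.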

Importantly, we can compare these results to those of downstream models trained on the original data and evaluated on the same test and valid data splits.

Model trained on original data and evaluated on real test data:

Model	Accuracy	Recall	Prec.	F1	Kappa	MCC
Random Forest Classifier	0.9474	0.2500	0.8975	0.9218	0.0000	0.0000

Model trained on original data and evaluated on real validation data:

Model	Accuracy	Recall	Prec.	F1	Kappa	MCC
Random Forest Classifier	0.9560	0.3333	0.9139	0.9345	0.0000	0.0000

In this run, the model trained on synthetic data scored slightly higher accuracy on the real test split (0.9495 vs. 0.9474) and somewhat lower accuracy on the held-out validation split (0.9320 vs. 0.9560) than the model trained on the original data. The exact gap varies from run to run, and often performance is quite comparable between downstream classifiers trained on synthetic data and those trained on the original data.
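To line the numbers up, the accuracy values reported in the four result tables above can be placed side by side. This is only a small bookkeeping sketch; the column and row labels are chosen here for illustration.

```python
import pandas as pd

# Accuracy values copied from the four result tables above.
accuracy = pd.DataFrame(
    {
        "trained_on_synthetic": {"real test": 0.9495, "real validation": 0.9320},
        "trained_on_original": {"real test": 0.9474, "real validation": 0.9560},
    }
)

# Positive values mean the synthetic-trained model scored higher.
accuracy["difference"] = (
    accuracy["trained_on_synthetic"] - accuracy["trained_on_original"]
).round(4)

print(accuracy)
```

Since the figures change between runs, repeating the experiment with several seeds would give a steadier picture than a single comparison.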

Conclusion

In this post we saw that we can train a downstream classifier on 100% synthetic data and achieve performance comparable to a downstream classifier trained purely on the original data. This is encouraging as it suggests we can reap the benefits of synthetic data (e.g., privacy, volume, cost) and still achieve acceptable performance for downstream machine learning use cases.

Check out our CPO Alex Watson working through this notebook and discussing downstream ML!