Test Data Generation: Uses, Benefits, and Tips

Test data generation is the process of creating new data that replicates an original dataset. Here’s how developers and data engineers use it.
Copyright © 2022 Gretel.ai

Introduction

If you're a developer, data scientist, or machine learning engineer, you know that accurate and timely test data is essential to the success of your projects. But what are the best ways to generate test data? And what are the benefits of using test data generation tools? 

In this post, we'll explore those questions and give you some tips on how to get started with test data generation. We’ll also discuss why technical teams are synthesizing test data to meet the ever-growing data demand for training state-of-the-art AI models and evaluating software applications. 

What is Test Data Generation?

Test data generation is the process of creating new data that mimics aspects of an original real-world dataset, to test applications, develop features, and even train ML/AI models. This fictional data can run the gamut from completely random to being almost statistically identical to the dataset it’s based upon. The data you create depends on your use case, and privacy and security risks, among other factors. 

There are almost as many techniques and tools for making test data as there are use cases for it. Before we discuss those, let’s look at why we need to create artificial data at all. 

Why is Test Data Important?

The definition of ‘test data’ varies across teams in an organization. However, it’s used for two main purposes: evaluating software applications and improving ML/AI algorithms. Test data generation techniques have changed over the years to keep up with the ever-evolving software development process. For ML/AI engineers, though, test data generation has only recently emerged as a powerful tool for unlocking better predictions, capabilities, and model performance.

Testing Software Applications

For software applications and services, test data is most often used in pre-production environments. Quality assurance managers usually take a snapshot of an entire dataset, manually anonymize it, and create similar mock data for use as ‘staging’ test data. This process gives developers and software engineers effective, reliable testing environments for work on the business’s products and services, and it lets teams evaluate the expected customer experience before anything is released publicly.

But managing this process is a slow, error-prone grind, and even when it’s implemented successfully, the data quickly becomes outdated.

Until recently, it was common to use raw production data that wasn’t properly obfuscated or masked in these testing environments. However, modern privacy laws and standards like the GDPR, CCPA, and SOC 2 have specific language prohibiting these uses and require the data to be effectively anonymized first. If you don’t comply, you risk legal action and heavy fines. You also increase the risk of cyberattacks: data breaches are a serious threat to customer privacy, and when they do occur they often lead to reputational damage, too.

Relative costs and compliance level for the four types of data used in testing. Source: SQLShack

As we’ll discuss later, test data can be a fully synthetic, anonymized version of real-world data that engineers use for development environments and new features. It can look just like your production data without the privacy concerns of working with real customer records.

Testing ML/AI Models

In contrast to software testing, machine learning engineers and data scientists use a different test data technique to train algorithms, called cross-validation. They split the entire dataset into three parts to train, validate, and then test model performance. The percentages can vary, but the idea is that roughly 80% of the data is used to train the algorithm (training data), 10% is set aside to check each training iteration and make sure the model doesn’t overfit (validation data), and the remaining 10% is used to measure the accuracy of the final trained model (test data).
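For a concrete picture of that split, here is a minimal sketch using scikit-learn. The file name, column name, and exact ratios are illustrative assumptions, not a reference to any particular dataset.

```python
# A minimal sketch of an 80/10/10 train/validation/test split with scikit-learn.
# "patients.csv" and the "label" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("patients.csv")
X, y = df.drop(columns=["label"]), df["label"]

# First hold out 20% of the rows, then split that holdout in half so the
# validation and test sets each end up with roughly 10% of the total.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # ~80% / ~10% / ~10% of the rows
```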

In addition to the common cross-validation procedures outlined above, machine learning engineers are increasingly using synthetic data to train their ML algorithms. This can be the same test data as software and QA engineers use - assuming the synthetic data is highly accurate and representative of real-world data. 

Synthetic data can be less biased and less expensive to acquire than real-world data. For example, using a popular Heart Disease dataset that was skewed almost 2-to-1 towards male patients, we added synthetically generated patient records to a training set to boost female representation in the dataset. 

The result: we increased accuracy in 5 out of 6 classification algorithms, achieving 96.7% overall accuracy for KNN (up from 88.5%), 93% for Random Forest, and a 13% gain for the Decision Tree classifier compared with the models trained on the non-synthetic dataset. For a tutorial on how you can do this yourself, watch our co-founder and CTO Alex Watson’s workshop at the 2022 NVIDIA GTC Spring conference.

Synthetics average accuracy: 90.16%, original: 85.79%. Average improvement: 4.37%

We demonstrated similar performance improvements when testing Gretel Synthetics on popular Kaggle tabular datasets, as seen below; you can read more about those results in this post.

ML Accuracy Using Original vs Synthetic Data on 8 Popular Kaggle Datasets

Our results reflect findings in computer vision and speech-trained models, too, where extra data has been proven to improve performance:

Source: Stanford Institute for Human-Centered Artificial Intelligence (HAI) AI Index 2022

CX, Testing, and Black Swans

For both software applications and AI models, test data must be accurate to be useful. Specifically, it should reflect real-world experiences, such as how a customer might interact with your product or service under certain circumstances. You must identify not only the likely customer events for testing, but also the unlikely ‘black swan’ events. These edge cases might not improve your software or model’s overall performance by much, but it’s crucial to account for them so you’re prepared if they do occur.

When implemented properly, test data can discover bugs in your systems that consumers would otherwise encounter. Here are three common examples of test data surfacing issues: 

  • Failure modes – scenarios where the software application or model doesn’t perform well. For example, an image identifier that doesn’t do well with low-lit images since it was only trained with pictures taken during the daytime. Each failure creates new questions that must be tested, too.  
  • Behavioral regressions – scenarios where performance decreases in specific situations. For example, the performance of a financial market forecasting model that bases its predictions only on historical data will tend to drift down since it’s not accounting for new information. 
  • Performance degradation – scenarios where the application does not scale with the load generated by users. In this scenario one would want to see how an application holds up with two, three, or even four times the current production load (a rough load-test sketch follows this list).
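To make the last bullet concrete, here is a rough load-test sketch that replays requests against an endpoint at one to four times a baseline request count. The URL and baseline count are assumptions made up for this example, not a recommendation of any particular tool.

```python
# A rough sketch of checking how an endpoint holds up at 1x-4x a baseline load.
# BASE_URL and BASELINE_REQUESTS are hypothetical placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "https://staging.example.com/api/search"
BASELINE_REQUESTS = 100  # stand-in for the current production request volume

def hit_endpoint(_):
    start = time.perf_counter()
    response = requests.get(BASE_URL, timeout=10)
    return response.status_code, time.perf_counter() - start

for multiplier in (1, 2, 3, 4):
    total = BASELINE_REQUESTS * multiplier
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(hit_endpoint, range(total)))
    errors = sum(1 for status, _ in results if status >= 500)
    p95 = sorted(latency for _, latency in results)[int(0.95 * len(results)) - 1]
    print(f"{multiplier}x load: {total} requests, {errors} server errors, p95 latency {p95:.2f}s")
```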

There are several high-profile examples of companies that didn’t test their models rigorously enough before putting them into production. For instance, Twitter had a tool that automatically cropped photos people posted to focus on faces, but it would often ignore non-white faces. Similar bias was discovered with Apple’s credit card, where users would receive different credit limits based on whether they were male or female.

Types of Test Data

Test data comes in all shapes and sizes. It all depends on the model or software you’re building. But on a basic level, test data could be charted along a number line, and range from normal to extreme and invalid. To illustrate, consider a range from 1 to 5:

  • Normal Data – data that falls in the middle of the range e.g., 3
  • Borderline or Extreme Data (Outlier) – data that falls on the edges of the range e.g., 1 & 5
  • Invalid Data – data that falls outside the valid range, e.g., -0 & 7

You can also test a model or application with no data at all to see how it responds to null values. Each type of test data is useful, as together they define the contours and behaviors of the information system you’re designing. The sketch below shows how these categories might translate into simple test cases.
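As a simple illustration of those categories, the sketch below exercises a hypothetical `set_rating` function with normal, borderline, invalid, and null inputs on a 1-to-5 range, using pytest. Both the function and the test values are assumptions made up for the example.

```python
# A small sketch of normal, borderline, invalid, and null test inputs on a 1-5 range.
# `set_rating` is a hypothetical function under test, not a real API.
import pytest

def set_rating(value):
    """Accept an integer rating from 1 to 5; reject everything else."""
    if not isinstance(value, int) or not 1 <= value <= 5:
        raise ValueError(f"invalid rating: {value!r}")
    return value

@pytest.mark.parametrize("value", [3])           # normal data
def test_normal_data(value):
    assert set_rating(value) == value

@pytest.mark.parametrize("value", [1, 5])        # borderline or extreme data
def test_borderline_data(value):
    assert set_rating(value) == value

@pytest.mark.parametrize("value", [0, 7, None])  # invalid data and null values
def test_invalid_data(value):
    with pytest.raises(ValueError):
        set_rating(value)
```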

So with that background aside, now all you have to do is make the data!

Ways to Create Test Data

Technical teams have come a long way in the techniques they use to generate test data. Below is a brief walkthrough of this evolution, from the old-school “dummy” test data method to what is fast becoming the gold standard today: synthetic test data. A short sketch after the list illustrates a few of these approaches in code.

  • Dummy Data — this is simply replacing certain fields in the dataset with similar data types, like text for text. E.g., replacing first names in a field with random text. 
  • Mock Data — this is a step up from dummy data since it considers the semantics of the data, too. E.g., it would replace a first name field with other (made up) first names, so it’s more consistent with the original dataset. 
  • Production Data — this is the most relevant data for testing but also runs the highest risk of exposing sensitive information to your models or disclosing personal data publicly. It requires additional steps, like masking, to make it safe for use. 
  • Subsetting Data — this process uses smaller parts of a production dataset that are of interest for a specific use case. It’s a form of data minimization which mitigates privacy and security risks. However, relying solely on this data often only reflects a fraction of the customer experience with a product or service you’re trying to test for, and ignores many crucial edge cases. 
  • Pseudonymized or “Fake” Data — this is similar to mock data but replaces only the “identifying” fields with fake inputs. E.g., first name, last name, and DOB.
  • Anonymized Data – there are several techniques for achieving anonymization such as information redaction, detail scrambling, and encryption. 
  • Synthetic Data — the most advanced way to create test data is to synthesize it. Synthetic data is inexpensive to create, unlimited, and can be pre-labeled to fit whatever purpose or use cases you have. It also mitigates edge-case failures, as one can make fast, targeted synthetic additions to one’s dataset with each iteration. 
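To make a few of those approaches tangible, here is a hedged sketch contrasting dummy, mock, and pseudonymized versions of a single record. The record layout is an assumption, and Faker is simply one popular open-source library for generating realistic-looking fake values.

```python
# A sketch contrasting dummy, mock, and pseudonymized test records.
# The record fields are hypothetical; Faker generates realistic-looking fake values.
import random
import string

from faker import Faker

fake = Faker()
original = {"first_name": "Alice", "last_name": "Smith", "dob": "1987-04-12", "plan": "pro"}

# Dummy data: same data type, no semantics (random text stands in for a text field).
dummy = {**original, "first_name": "".join(random.choices(string.ascii_letters, k=8))}

# Mock data: semantically consistent replacements (a name stays a plausible name).
mock = {**original, "first_name": fake.first_name(), "last_name": fake.last_name()}

# Pseudonymized data: only the identifying fields are swapped out.
pseudonymized = {
    **original,
    "first_name": fake.first_name(),
    "last_name": fake.last_name(),
    "dob": fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
}

print(dummy, mock, pseudonymized, sep="\n")
```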

Putting Your Test Data to the Test

Below are several methods teams use when testing models and software. Each should be included as part of a comprehensive testing process: 

  • White box testing — Testing of all the internal structures, design, and code that are visible to the testers to ensure all workflows are efficient, usable, and secure. For example, how a model or application responds to invalid parameters in the code. 
  • Black box testing — Testing the end-user experience, also known as behavioral testing. For example, how a model handles different date formats.
  • Performance testing — Testing how fast a system responds to various workloads to identify and relieve certain bottlenecks. For example, a scenario where there is a sudden rush of user traffic to your website. 
  • Security testing — Verifying that only authorized users can access applications, and that information systems are protected from malicious actors.

There are also new technologies for testing the quality of the test data itself. For example, Gretel.ai offers a synthetic quality score (SQS) that provides metrics on the accuracy, utility, and levels of privacy inherent in the test data you are using. 
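Gretel’s SQS combines a number of measures; as a very rough intuition for the kind of signal such metrics can draw on, the sketch below compares each numeric column’s distribution in a real dataset against its synthetic counterpart. This is an illustration only, not how Gretel’s SQS is implemented, and the file names are placeholders.

```python
# A rough sketch of comparing per-column distributions in real vs. synthetic data.
# This illustrates one possible quality signal; it is not Gretel's SQS implementation.
import pandas as pd
from scipy.stats import ks_2samp

real = pd.read_csv("real.csv")            # hypothetical file names
synthetic = pd.read_csv("synthetic.csv")

for column in real.select_dtypes(include="number").columns:
    # The two-sample Kolmogorov-Smirnov statistic is 0 when the distributions
    # match exactly and approaches 1 as they diverge.
    statistic, _ = ks_2samp(real[column].dropna(), synthetic[column].dropna())
    print(f"{column}: KS statistic = {statistic:.3f}")
```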

Generate Test Data With Gretel – No Code Required

In this post, we walked through the evolution of test data generation techniques, and why high-quality test data is critical to the development of modern software applications and ML/AI model training. Proper testing is often the only way to discover certain bugs, potential crashes, lag and latency issues, and other design problems that could negatively impact your business if released publicly. 

While techniques like relying on a subset of your production data may cover the bare minimum privacy requirements under GDPR and other privacy laws, they are insufficient methods for building high-performance state-of-the-art AI/ML models. That requires anywhere from 2-4x the volume of your original production data, not a mere slice of it. 

The best way then is to generate new data that captures the full gamut of instances you must test for. That’s why synthetics are becoming the new gold standard for testing. With synthetics, you can simply boost your original dataset and even seed new variations to cover rare events, too. This flexible approach is not only safer, faster, and more cost-effective, but it also opens the door to unlocking other innovations across organizations as well. 

At Gretel, we want to ensure that everyone has easy access to safe, compliant, high-quality test data. Whether you’re building the next great software application in your basement or running an AI initiative at a Fortune 500 company, synthetic test data can help power your projects. Speaking of tests, you can give Gretel Synthetics a try for free today by signing up for our Developer tier – no code required. Give our console a try, and we promise we can get you generating synthetic data in 60 seconds.

If you have any questions, join our Slack Community and share your thoughts there, or shoot us an email at hi@gretel.ai. We’d love to hear about your experience using Gretel. Thanks!