# Data Simulation: Tools, Benefits, and Use Cases

Data simulation is the process of generating data that mimics real-world scenarios in order to predict events and validate models.

## What Is Data Simulation?

Data simulation is the process of taking a large amount of data and using it to mimic real-world scenarios or conditions. In technical terms, it can be described as the generation of random numbers or data from a stochastic process specified by a probability distribution (e.g., Normal: X~N(μ, σ²)). It can be used to predict future events, determine the best course of action, or validate AI/ML models.
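
For instance, drawing random numbers from a specified distribution is a one-liner in most scientific computing libraries. A minimal sketch using NumPy, with illustrative parameter values for μ and σ:

```python
import numpy as np

# Draw 10,000 samples from X ~ N(mu, sigma^2); mu and sigma are example values.
rng = np.random.default_rng(seed=42)
mu, sigma = 5.0, 2.0
samples = rng.normal(loc=mu, scale=sigma, size=10_000)

# As the sample grows, the empirical mean and standard deviation
# approach mu and sigma.
print(round(samples.mean(), 1), round(samples.std(), 1))
```

The seeded generator makes the simulation reproducible, which matters when results need to be validated or shared.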

This post will highlight the many benefits and applications of data simulation as well as how synthetic data can power and improve its utility.

## Benefits of Data Simulation

Data simulation has proven highly valuable across nearly every industry and field of study, with business executives, engineers and researchers all using it in their work. Among other benefits, data simulation can:

1. Enable the creation of comprehensive models of complex, dynamic systems;
2. Empower data-driven decision making and strategic planning;
3. Help test hypotheses, understand relationships, and improve predictions;
4. Allow the study of phenomena that are difficult or impossible to investigate directly; and
5. Generate synthetic data that is representative of specific populations or conditions, which can then be used for ML and AI development.

To some technical readers, these benefits may sound like those gained from data interpolation, so it’s worth briefly discussing the distinction.

### Data Simulation vs. Data Interpolation

When working with data, it is often necessary to make estimates about what lies outside the scope of the dataset. There are two main methods for doing this: simulation and interpolation. Both methods have their advantages and disadvantages, and the best approach depends on the specific situation.

#### Data Simulation

Simulation can be thought of as an imitation of a real-world process over time. In machine learning, a simulation is an algorithm that mimics a real-world environment and can be used to test different courses of action. While simulations can never be as accurate as their real-world equivalents, the ability to test millions of scenarios against a simulation has been shown to help machine learning models learn quickly, and even outperform models trained on more limited real-world data. However, a simulation is only as accurate as the underlying model it is based on, so it is important to understand that model well before using this approach.

#### Data Interpolation

Data interpolation involves using known data points to estimate values for points in between. This can be done using mathematical functions or by making comparisons with similar datasets. One advantage of interpolation is that it is typically more accurate than simulation since it is based on actual data points. However, interpolation can only be used when there is a close relationship between the variables, so it may not be appropriate in all situations.
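
A minimal sketch of linear interpolation with NumPy; the hourly sensor readings below are invented known data points, used only to illustrate estimating a value in between them:

```python
import numpy as np

# Known data points: hypothetical temperature readings at hours 0, 2, and 4.
hours = np.array([0.0, 2.0, 4.0])
temps = np.array([10.0, 14.0, 22.0])

# Linear interpolation estimates the value at an in-between point (hour 3)
# directly from the two neighboring known points.
estimate = np.interp(3.0, hours, temps)
print(estimate)  # → 18.0
```

Because the estimate is anchored to real observations on both sides, it inherits their accuracy, which is the advantage over simulation described above.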

## Use Cases for Data Simulation

With the advent of high-quality synthetic data generation technology and state-of-the-art ML/AI models, there are some fascinating use cases for data simulation that have emerged in recent years. Here are some specific examples across various fields:

### Data Science & Research

One of the biggest roadblocks to building better machine learning models is the constant need for more and newer data. As an example, let’s say we are working as data scientists for a local authority and are asked to optimize an emergency evacuation plan for a natural disaster (e.g., a tsunami or earthquake). Our data would likely be too imbalanced to be useful, since there are few similar events for which we have comparison data.

Here, data simulation could be used to generate synthetic datasets that look like those of other real-world natural disasters. This would allow our machine learning model to be trained on data that is more representative of these rare real-world events, and thus better inform our evacuation plan.

These methods can be used for studying how technology might impact societies, too. For example, in 2020, Gretel conducted simulations to analyze what results Americans might expect from Apple & Google’s hotly debated Contact Tracing proposal, which would provide privacy-preserving capabilities that notify people when they have been in close proximity to someone infected with COVID-19. The project code is available on GitHub.

The big takeaway is that Apple and Google’s specification substantially lowers the privacy risk of contact tracing: rather than collecting existing data, it generates new, custom data specific to the use case. This privacy preservation happens at the lowest possible level. The handset can compute as much data as possible but share as little as possible, providing effective contact tracing while preserving user privacy.

### Software Development

A key part of developing any software is testing how it will perform under different conditions. By creating data simulations that mimic real-world conditions, developers can put the software through its paces and identify any potential problems. This process can be used to test everything from the user interface to the backend algorithms.
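
As a small illustration, synthetic test records can be produced from simple random draws. The field names and value ranges below are invented for the example, not a real schema:

```python
import random
import string

def make_test_user(rng: random.Random) -> dict:
    """Generate one synthetic user record for exercising an application.

    The fields and ranges here are illustrative assumptions only.
    """
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "username": name,
        "age": rng.randint(18, 90),
        "signup_source": rng.choice(["web", "mobile", "api"]),
    }

# A seeded generator gives a reproducible batch of test data.
rng = random.Random(0)
test_users = [make_test_user(rng) for _ in range(100)]
print(len(test_users))
```

Feeding batches like this through the UI or backend surfaces edge cases long before real users do.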

### Oil & Gas

Data simulation is increasingly being used in the oil and gas industry, too. By creating models of reservoirs, geologists can better understand how oil and gas flow through rock and whether they’re present in different geological strata. These models can be used to predict what will happen when new wells are drilled, and they can help engineers design better production facilities, too.

Companies and researchers also study the impact of environmental factors on the industry. By simulating the effects of climate change, researchers gain a better understanding of how rising temperatures might affect the production of oil and gas.

### Manufacturing

Data simulation is also being used to create “digital twins” which are virtual copies of physical objects, such as a car or production factory. These models enable the study of real-world objects and their operations without ever touching them. Manufacturers can easily identify the most efficient and effective production process for a particular product, and avoid disruptions as they transition to new methods.

### Autonomous Vehicles

And of course, we can’t talk about data simulation without acknowledging its most high-profile use case: the training of self-driving cars, drones and robots. Trying to test and train these systems in the real world is slow, costly and dangerous. But with synthetic data, you can create virtual training environments for improving these emerging technologies.

Simulations like this give us a glimpse of the future, so we can better prepare for it today. But to make good predictions, simulations must rely on a wealth of past experience (in the form of raw data) to learn patterns. That’s where using synthetic data can help.

## How Do You Simulate Data?

In order to simulate data, we need to identify the patterns in the data, and we need information about the data’s features and how they are distributed.

In general, there are three main steps for simulating data:

1. Develop a hypothesis and understand your sample distribution
2. Generate a random data sample based on the distribution
3. Visualize and analyze the data simulation result
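
The three steps above can be sketched end to end in a few lines of Python. The bus-wait scenario and its parameters are illustrative assumptions:

```python
import numpy as np

# Step 1: hypothesis -- suppose bus wait times follow an exponential
# distribution with a mean of 10 minutes (an illustrative assumption).
mean_wait = 10.0

# Step 2: generate a random sample from that distribution.
rng = np.random.default_rng(seed=0)
waits = rng.exponential(scale=mean_wait, size=5_000)

# Step 3: analyze the simulated result -- a simple text histogram here,
# standing in for a plot.
counts, edges = np.histogram(waits, bins=[0, 5, 10, 20, 40, np.inf])
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:>5.0f}-{hi:<5.0f}: {count}")
```

Each step is expanded on below.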

### 1. Form a hypothesis and understand your sample distribution

Before you simulate a data distribution, you first need a hypothesis regarding your data. This is an idea about what you think might happen when you generate your data simulation. Some distributions don’t require hypothesis testing to know how the data will behave, but most do.

There are a variety of distributions that can be analyzed when utilizing data simulation, and each type has its own unique characteristics for modeling different types of data. By understanding different distributions, researchers can better utilize data simulation to study a variety of phenomena.

Below are descriptions of a handful of common distributions:

• Normal distribution – a probability function that is symmetrical around the mean, showing that data points are more likely to cluster around the mean than to be evenly distributed throughout the curve. The peak of the curve indicates the mean, median, and mode, and the spread of the curve indicates variability. The normal distribution is used in statistics to predict future events, to calculate probabilities, and to understand relationships between variables. It can also be used to create models and test hypotheses. A normal distribution is sometimes referred to as the bell curve because of its shape.
• Uniform distribution – a probability distribution in which all values in a given range have the same probability. It’s used to model events that are equally likely to occur. For example, a fair coin flip follows a discrete uniform distribution: heads and tails each occur with probability 1/2. Over 100 flips we would expect roughly equal counts of heads and tails, while over only 10 flips the results may deviate noticeably from an even split.
• Exponential distribution – a continuous probability distribution that describes the time between successive events in a Poisson process. It is often used to model the waiting time until an event occurs, such as the time until a component fails or an earthquake strikes. The exponential distribution has a mean of 1/λ and a variance of 1/λ², where λ is the rate parameter. For example, if the average wait time for a bus is 10 minutes, then the wait time follows an exponential distribution with rate parameter λ = 1/10 per minute. The probability of waiting more than 10 minutes is then e⁻¹ ≈ 0.37, and longer waits become increasingly unlikely.
• Poisson distribution – a probability distribution that calculates the probability of a given number of events occurring in a fixed time period. The function is used when the events are independent and occur at a constant rate. For example, if the events are occurrences of a particular crime, the Poisson distribution can be used to calculate the probability of a certain number of crimes occurring in a given time period.
• Multinomial distribution – a probability distribution that describes the outcomes of a multinomial experiment, where there are two or more possible outcomes for each trial. Its function is to calculate the probability of each possible combination of outcomes. For example, if you flip a coin three times there are eight possible ordered sequences, which group into four combinations by count: three heads, two heads and one tail, one head and two tails, or three tails.
• Laplace distribution – a probability distribution defined by two parameters: the location parameter (which determines the mean) and the scale parameter (which determines the spread). The Laplace distribution is often used to model data that has heavier tails than the normal distribution or contains outliers, i.e., noise that is usually small but occasionally large.
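
Each of these distributions is available directly from NumPy’s random generator. A quick sketch that draws a sample from each (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_samples = 1_000

# One sample array per distribution described above.
normal      = rng.normal(loc=0.0, scale=1.0, size=n_samples)       # N(0, 1)
uniform     = rng.uniform(low=0.0, high=1.0, size=n_samples)       # U(0, 1)
exponential = rng.exponential(scale=10.0, size=n_samples)          # mean 1/lambda = 10
poisson     = rng.poisson(lam=3.0, size=n_samples)                 # 3 events per period
multinomial = rng.multinomial(3, [0.5, 0.5], size=n_samples)       # 3 coin flips per row
laplace     = rng.laplace(loc=0.0, scale=1.0, size=n_samples)

# The empirical exponential mean should be near 1/lambda = 10.
print(round(exponential.mean(), 1))
```

Comparing each sample’s empirical statistics against the distribution’s theoretical ones is a quick sanity check before building a larger simulation on top.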

### 2. Generate Random Data Samples

Random sampling is a process used to select a group of individuals from a population in which each individual has an equal chance of being selected. Now that we understand our sample distribution, we can use that function to produce a random data sample. This is the most common approach. However, there are many ways to generate random simulation data, and the method used will depend on the type of research being conducted. Let’s look at two popular methods:

#### Monte Carlo Simulation

Monte Carlo simulation is done by repeatedly drawing random samples from a given set of probability distributions. The distributions can be of any type, e.g., normal, exponential, or uniform.

Monte Carlo simulations are used to analyze risk in projects before they’re put into practice. They build on mathematical models that use empirical data from real system inputs and outputs (e.g., supply intake vs. production yield). This allows for the early identification of potential threats that could impact your bottom line or time-to-market goals.
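
A minimal Monte Carlo risk sketch: estimate the chance that a project overruns its budget when each task’s cost is uncertain. The task names, cost distributions, and budget below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n_trials = 100_000

# Each task's cost (in $k) is drawn from a normal distribution (mean, std);
# every trial is one complete random scenario for the project.
design = rng.normal(50, 10, n_trials)
build  = rng.normal(120, 25, n_trials)
test   = rng.normal(30, 8, n_trials)

# The fraction of simulated scenarios that exceed the budget estimates the
# probability of an overrun.
total = design + build + test
budget = 230.0
overrun_probability = (total > budget).mean()
print(round(overrun_probability, 2))
```

Running many thousands of scenarios like this reveals the tail risk that a single point estimate of total cost would hide.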

#### Markov Chain Monte Carlo Simulation

Markov Chain Monte Carlo sampling is a technique for simulating draws from high-dimensional probability distributions. Unlike Monte Carlo sampling methods, which draw independent samples from a distribution, Markov Chain Monte Carlo methods draw samples in which each new sample depends on the existing one. Each randomly generated sample relies only on the current state, so the simulation progresses iteratively as a “chain” of events unfolds.
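
A minimal sketch of Metropolis-Hastings (one common MCMC algorithm, used here purely for illustration), sampling from a standard normal target; each proposal depends only on the chain’s current state:

```python
import numpy as np

def target_pdf(x):
    # Unnormalized N(0, 1) density; MCMC only needs ratios, so the
    # normalizing constant can be dropped.
    return np.exp(-0.5 * x * x)

rng = np.random.default_rng(seed=3)
n_steps = 50_000
chain = np.empty(n_steps)
x = 0.0
for i in range(n_steps):
    proposal = x + rng.normal(scale=1.0)        # step from the current state
    accept_prob = min(1.0, target_pdf(proposal) / target_pdf(x))
    if rng.random() < accept_prob:
        x = proposal                            # accept the move
    chain[i] = x                                # otherwise stay put

# After discarding an initial burn-in period, the chain approximates N(0, 1).
samples = chain[5_000:]
print(round(samples.mean(), 1), round(samples.std(), 1))
```

Because consecutive samples are correlated, a burn-in period is discarded and long chains are run; the payoff is the ability to sample distributions too complex for direct draws.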

Gretel used a form of Markov Chain simulation when creating simulations of e-bike location data across cities, in an effort to predict traffic patterns. The chart below is an example of simulated e-bike data. You can read more about this case study and how you can simulate location data in this post.