How to conduct a synthetic data experiment

Simulation studies (sometimes called synthetic data experiments or Monte Carlo simulations) are useful tools for generating evidence about whether a statistical claim is true. For example:

  • You developed a new method for estimating some quantity. You think that for certain kinds of data, your estimator will beat some existing estimator.
  • You want to check if something is true about a statistic (definition: anything that is calculated from data) that is hard to evaluate by hand. For example, maybe you want to know something about the minimum eigenvalue of a random matrix that has a particular distribution.

Here’s the idea:

  1. Repeat the following procedure for many iterations:
    • Generate a random data set according to some specification.
    • Calculate your statistics on that data set. (For example, use your estimator and an already-existing competitor estimator to get separate estimates of the quantity you’re interested in.)
    • Evaluate your statistics according to some metric.
  2. Generate summaries of the metrics, whether through plots, tables, or both. You could also consider using statistical inference, like a t-test of the null hypothesis that your method performs no better than a competitor method, against the alternative that it performs better.
  3. Share the code in a reproducible way.
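As a rough illustration of the inference step, here is a minimal Python sketch (the tutorial itself used R) that computes a paired t statistic from per-replicate squared errors. The function name paired_t_statistic, the choice of 200 replicates, and the standard Gaussian data are all assumptions made for this example. Squared errors are skewed, so treat the t statistic as a rough guide rather than an exact test.

```python
import math
import random
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees of freedom) for the paired differences errors_a - errors_b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / standard_error, n - 1


rng = random.Random(0)  # fixed seed so the run is repeatable
mean_errors, median_errors = [], []
for _ in range(200):  # hypothetical number of replicates
    data = [rng.gauss(0.0, 1.0) for _ in range(50)]  # true mean is 0
    mean_errors.append(statistics.fmean(data) ** 2)
    median_errors.append(statistics.median(data) ** 2)

t_stat, df = paired_t_statistic(mean_errors, median_errors)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

A negative t statistic here would suggest that the sample mean has smaller squared error than the sample median on these data sets. To get a p-value, compare the statistic to a t distribution with the reported degrees of freedom.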

Recently I taught a tutorial on the basics of simulation studies for undergraduate students as part of the USC JumpStart program. We walked through an example simulation study in R using the simulator package. You can view the code and slides I made here on GitHub.

Here’s the simplified example simulation study we did in the tutorial (the code is available on GitHub):

  1. 20 repetitions of the following:
    • Random data set: generate 50 independent and identically distributed Gaussian random variables.
    • Statistic: calculate the sample mean on each data set. (We didn’t use a competitor method in class, but the code I posted to GitHub includes an additional simulation study that uses the sample median as a competitor estimator.)
    • Metric: calculate squared error of the estimated mean relative to the true mean.
  2. Then we generated boxplots of the results as well as a table.
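The study outlined above can be sketched in a few lines of Python. This is not the tutorial's code, which is written in R with the simulator package. It is a standalone approximation using only the standard library. The true mean of 0, the standard deviation of 1, and the name run_simulation are assumptions, since the outline does not specify them.

```python
import random
import statistics

TRUE_MEAN = 0.0  # assumed; the outline does not state the true mean
TRUE_SD = 1.0    # assumed; the outline does not state the variance
N_REPS = 20      # repetitions, as in the tutorial example
N_OBS = 50       # observations per data set, as in the tutorial example


def run_simulation(n_reps=N_REPS, n_obs=N_OBS, seed=1):
    """Return per-replicate squared errors for the sample mean and sample median."""
    rng = random.Random(seed)
    results = {"mean": [], "median": []}
    for _ in range(n_reps):
        # Random data set: i.i.d. Gaussian draws
        data = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n_obs)]
        # Statistics and metric: squared error relative to the true mean
        results["mean"].append((statistics.fmean(data) - TRUE_MEAN) ** 2)
        results["median"].append((statistics.median(data) - TRUE_MEAN) ** 2)
    return results


results = run_simulation()
for name, errors in results.items():
    print(f"{name:>6}: average squared error = {statistics.fmean(errors):.4f}")
```

The per-replicate errors in results are what you would pass to a boxplot routine or summarize in a table.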

I’ll talk briefly about some of the topics I covered in the tutorial. For more details, check out the slides or contact me on Twitter.

Simplicity

As much as possible, choose data generation methods, competitor methods, and metrics that are simple and widely-recognized.

This streamlines your message, making it easy for your audience to understand the results of your simulation study. Your new method might be more complicated than previous proposals, or you may need to generate data in a specific way to illustrate your point. So keep everything else that does not need to be complicated as simple as possible.

It also makes your simulation study more convincing. If a simple metric shows that your method doesn’t perform as well as you hoped, it might be tempting to blame the metric and look for another one that illustrates the point you hoped to make. But first, consider the possibility that your method just doesn’t work as well as you thought it would. (This is why it’s a good idea to implement a method and test it as early as possible in the research process.)

From your audience’s perspective, if you use a strange metric or an oddly specific data-generating process, it might look like you kept trying different things until you found one that made your method look good.

Variety

Calculate your statistic in a variety of settings to show that the results hold up beyond a single scenario.

Like simplicity, variety also makes your simulation study more convincing: it doesn’t look like you just cherry-picked the only setting where your method seems to work.

It’s also good to vary the settings to the point where your method doesn’t outperform competitors anymore in order to show where your method breaks. (For example, maybe a new fancy machine learning method works better when you have a lot of data, but when the sample size is smaller, a basic linear regression model works better.) This helps your audience understand your method and its advantages better.
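One way to explore where a method stops winning is to loop over a grid of settings. The Python sketch below is an illustration and was not part of the tutorial. It compares the sample mean and sample median across several sample sizes and two distributions that are both centered at 0: a standard Gaussian and a heavier-tailed Laplace distribution. The distributions, sample sizes, replicate count, and function names are all assumptions chosen for the example.

```python
import random
import statistics


def gaussian(rng):
    """One draw from a standard Gaussian (center 0)."""
    return rng.gauss(0.0, 1.0)


def laplace(rng):
    """One draw from a standard Laplace (center 0): difference of two exponentials."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def compare_estimators(sampler, n_obs, n_reps=500, seed=0):
    """Average squared error of the sample mean and median for one setting."""
    rng = random.Random(seed)
    mean_errors, median_errors = [], []
    for _ in range(n_reps):
        data = [sampler(rng) for _ in range(n_obs)]
        mean_errors.append(statistics.fmean(data) ** 2)    # true center is 0
        median_errors.append(statistics.median(data) ** 2)
    return {"mean": statistics.fmean(mean_errors),
            "median": statistics.fmean(median_errors)}


for dist_name, sampler in [("gaussian", gaussian), ("laplace", laplace)]:
    for n_obs in [10, 50, 200]:
        mse = compare_estimators(sampler, n_obs)
        print(f"{dist_name:>8}, n={n_obs:>3}: "
              f"mean={mse['mean']:.4f}  median={mse['median']:.4f}")
```

Standard theory suggests the mean should do better on Gaussian data and the median on Laplace data, so a grid like this shows both where an estimator wins and where it loses.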

Reproducibility and Documentation

Set a random seed at or near the top of your code (certainly before any non-deterministic function is called) so that the same results are generated every time the code is run.
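Here is a small Python sketch of the idea (in R, the analogous call is set.seed()). The seed value and the run_study function are arbitrary choices for illustration. The point is only that two runs with the same seed produce identical results.

```python
import random
import statistics

SEED = 20240101  # arbitrary value; any fixed integer works


def run_study(seed):
    """Run a tiny simulation whose randomness is fully determined by `seed`."""
    rng = random.Random(seed)  # seeded before any random draw is made
    errors = []
    for _ in range(20):
        data = [rng.gauss(0.0, 1.0) for _ in range(50)]
        errors.append(statistics.fmean(data) ** 2)
    return errors


first_run = run_study(SEED)
second_run = run_study(SEED)
print(first_run == second_run)  # True: same seed, same results
```

Using a dedicated random.Random instance, rather than the module-level generator, keeps the study's randomness isolated from any other code that draws random numbers.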

It’s okay to tweak your simulation settings, but once your code is finalized, restart your computing environment and run your code from scratch. That way, someone else who runs your code with the same software versions will generate exactly the same plots and tables that you did. (See the code for the simulation studies for our ICML paper for an example of this.)

Sharing your code will help a lot with documenting what you did, but remember to write down the version of your programming language that you used, as well as the version numbers of any other packages, libraries, or other software you used. Also write down the computing environment you used (for example, the processor on your computer if you run the simulations locally).
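Recording this information can be automated. In R, sessionInfo() prints much of it. The Python sketch below shows one possible equivalent using only the standard library. The function name environment_report and the package list passed to it are placeholders, so substitute the packages your study actually uses.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Collect language, OS, processor, and package versions into a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "processor": platform.processor() or "unknown",
        "packages": versions,
    }


# "numpy" is only a placeholder package name for this example
report = environment_report(["numpy"])
for key, value in report.items():
    print(f"{key}: {value}")
```

Saving this output alongside your results gives readers what they need to match your setup.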

Parallelization

Parallel computing is very well-suited for simulation studies, where many data sets are generated and processed completely independently. Simulation studies can be greatly sped up with code that uses parallelization. The simulator package makes this easy; see Section 4.5 of Prof. Jacob Bien’s paper describing the simulator package for details.
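To illustrate why simulation studies parallelize so easily, here is a Python sketch using the standard library's concurrent.futures. It is not the simulator package's mechanism, which the paper describes. Each replicate gets its own seed, so a replicate's result does not depend on which worker runs it or in what order. The function names and seed range are assumptions for the example.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor


def one_replicate(seed):
    """Generate one data set from its own seed and return the squared error."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(50)]  # true mean is 0
    return statistics.fmean(data) ** 2


def run_serial(seeds):
    """Run the replicates one after another."""
    return [one_replicate(seed) for seed in seeds]


def run_parallel(seeds, max_workers=2):
    """Run the replicates across worker processes; results keep the seed order.

    Call this from under an `if __name__ == "__main__":` guard in a script.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one_replicate, seeds))


seeds = range(1000, 1020)  # one hypothetical seed per replicate
serial_errors = run_serial(seeds)
print(f"average squared error = {statistics.fmean(serial_errors):.4f}")
```

Because every replicate depends only on its seed, run_parallel(seeds) should return the same list as run_serial(seeds), just faster when each replicate is expensive.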

Additional Resources