3.5. Summary

In this chapter, we used the analogy of drawing marbles from an urn to model random sampling from populations and random assignment of subjects to treatments in experiments. This framework enables us to run simulation studies for hypothetical surveys, experiments, or other chance processes in order to study their behavior. We found the chance of observing particular results from a clinical trial under the assumption that the treatment was not effective, and we studied the support for Clinton and Trump with samples based on actual votes cast in the election. These simulation studies enabled us to quantify the typical deviations in the chance process and to approximate the sampling distribution of summary statistics, like Trump’s lead over Clinton. That distribution, in turn, helped us answer questions about the likelihood of observing results like ours under the urn model.
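
As a reminder of what such a simulation study looks like, here is a minimal sketch that approximates the sampling distribution of a candidate's lead. The urn size, the 48/52 split, the sample size, and the number of repetitions are made-up values chosen for illustration; they are not the election figures used in the chapter, and the function name `simulated_lead` is hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical urn: 10,000 marbles, 48% marked "T" and 52% marked "C".
urn = ["T"] * 4_800 + ["C"] * 5_200
sample_size = 500   # marbles drawn per simulated poll (assumed value)
n_reps = 2_000      # number of simulated polls (assumed value)


def simulated_lead(urn, sample_size, rng):
    """Draw one sample without replacement and return T's lead over C
    as a proportion of the sample."""
    sample = rng.sample(urn, k=sample_size)
    n_t = sample.count("T")
    n_c = sample_size - n_t
    return (n_t - n_c) / sample_size


# Repeating the chance process many times approximates the
# sampling distribution of the lead.
leads = [simulated_lead(urn, sample_size, rng) for _ in range(n_reps)]

center = statistics.mean(leads)    # should sit near the urn's true lead, -0.04
spread = statistics.stdev(leads)   # typical deviation of the lead across polls
share_positive = sum(lead > 0 for lead in leads) / n_reps  # how often "T" appears ahead

print(f"mean lead: {center:.3f}, SD: {spread:.3f}, "
      f"share of polls with T ahead: {share_positive:.3f}")
```

The list `leads` plays the role of the simulated sampling distribution: its spread quantifies the typical deviation from chance alone, and `share_positive` estimates how often a poll of this size would show the trailing candidate ahead.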

The urn model reduces to a few basics: the number of marbles in the urn; what is written on each marble; the number of marbles to draw from the urn; and whether or not they are replaced between draws. From there, we can simulate increasingly complex data designs. However, the crux of the urn’s usefulness is the mapping from the data design to the urn. If samples are not randomly drawn, subjects are not randomly assigned to treatments, or measurements are not made on well-calibrated equipment, then this framework falls short in helping us understand our data and make decisions. We also need to remember that the urn is a simplification of the actual data collection process. If, in reality, there is bias in data collection, then the randomness we observe in the simulation doesn’t capture the complete picture. Too often, data scientists wave these annoyances aside and address only the variability described by the urn model. That was one of the main problems in the surveys predicting the outcome of the 2016 US presidential election.
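
The four basics of the urn can be written down directly in code. The sketch below uses only Python's standard library; the helper name `draw_from_urn` and the small 0/1 urn are hypothetical choices for illustration, not code from the chapter.

```python
import random


def draw_from_urn(marbles, n_draws, replace, rng):
    """Draw n_draws marbles from the urn.

    marbles : list of labels, one entry per marble in the urn
    n_draws : number of marbles to draw
    replace : True to return each marble to the urn before the next draw
    rng     : a random.Random instance (passed in for reproducibility)
    """
    if replace:
        # Draws are independent; the same marble can come up repeatedly.
        return rng.choices(marbles, k=n_draws)
    # Each marble can be drawn at most once.
    return rng.sample(marbles, k=n_draws)


rng = random.Random(1)

# Hypothetical urn: 10 marbles, 3 marked 1 and 7 marked 0.
urn = [1] * 3 + [0] * 7

without_replacement = draw_from_urn(urn, n_draws=5, replace=False, rng=rng)
with_replacement = draw_from_urn(urn, n_draws=20, replace=True, rng=rng)

print(without_replacement)
print(with_replacement)
```

Without replacement, the draw can never contain more than the three marbles marked 1, and asking for more marbles than the urn holds raises a `ValueError`; with replacement, any number of draws is allowed. Note that the code captures only the chance variation described by the urn. Any bias in how the real data were collected lies outside it.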

In each of the examples in this chapter, the summary statistics we studied were given to us as part of the example. In the next chapter, we address the question of how to choose a summary statistic to represent the data.
