2. Questions and Data Scope

As data scientists, we use data to answer questions, and the quality of the data collection process can significantly impact the validity and accuracy of the data, the strength of the conclusions we draw from an analysis, and the decisions we make. In this chapter, we describe a general approach for understanding data collection and evaluating the usefulness of the data in addressing the question of interest. Ideally, we aim for data to be representative of the phenomenon that we are studying, whether that phenomenon is a population characteristic, a physical model, or some type of social behavior. Typically, our data do not contain complete information (the scope is restricted in some way), yet we want to use the data to accurately describe a population, estimate a scientific quantity, infer the form of a relationship between features, or predict future outcomes. In all of these situations, if our data are not representative of the object of our study, then our conclusions can be limited, possibly misleading, or even wrong.

To motivate the need to think about these issues, we begin with an example of the power of big data and what can go wrong. We then provide a framework that can help you connect the goal of your study (your question) with the data collection process. We refer to this as the data scope,[1] and we provide terminology to help describe data scope, along with examples from surveys, government data, scientific instruments, and online resources. Later in this chapter, we consider what it means for data to be accurate. There, we introduce different forms of bias and variation, and describe situations where they can arise. Throughout, the examples cover the spectrum of the sorts of data that you may be using as a data scientist; these examples are from science, politics, public health, and online communities.


[1] The notion of “scope” has been adapted from Hellerstein’s course notes on scope, temporality, and faithfulness.