2.1. Big Data and New Opportunities#

The tremendous increase in openly available data has created new roles and opportunities in data science. For example, data journalists look for interesting stories in data much like how traditional beat reporters hunt for news stories. The data lifecycle for the data journalist begins with the search for existing data that might have an interesting story, rather than beginning with a research question and looking for how to collect new or use existing data to address the question.

Citizen science projects are another example. They engage many people (and instruments) in data collection. Collectively, these data are made available to researchers who organize the project, and often they are made available in repositories for the general public to further investigate.

The availability of administrative and organizational data creates other opportunities. Researchers can link data collected from scientific studies with, say, medical data that have been collected for health-care purposes; these administrative data have been collected for reasons that don’t directly stem from the question of interest, but they can be useful in other settings. Such linkages can help data scientists expand the possibilities of their analyses and cross-check the quality of their data. In addition, found data can include digital traces, such as your web-browsing activity, your posts on social media, and your online network of friends and acquaintances, and they can be quite complex.

When we have large amounts of administrative data or expansive digital traces, it can be tempting to treat them as more definitive than data collected from traditional, smaller research studies. We might even consider these large datasets to be a replacement for scientific studies and essentially a census. This overreach is referred to as the “big data hubris”. Data with a large scope does not mean that we can ignore foundational issues of how representative the data are, nor can we ignore issues with measurement, dependency, and reliability. (And it can be easy to discover meaningless or nonsensical relationships just by coincidence.) One well-known example is the Google Flu Trends tracking system.