Data Sources

All of the data analyzed in this book are available on the book’s website, LearningDS.org, and in the book’s GitHub repository. These datasets come from open repositories and from individuals. We acknowledge them all here and include, as appropriate, the file name for the data stored in our repository, a link to the original source, a related publication, and the author(s)/owner(s).

To begin, we provide the sources for the four case studies in the book. Our analysis of the data in these case studies is based on research articles or, in one case, a blog post. We generally follow the line of inquiry in these sources, but we have usually simplified the analyses to match the level of the book.

seattle_bus_times.csv: The Seattle Transit data were provided by Hallenbeck of the Washington State Transportation Center. Our analysis is based on The Waiting Time Paradox, or, Why Is My Bus Always Late? by VanderPlas.
aqs_06-067-0010.csv, list_of_aqs_sites.csv, matched_pa_aqs.csv, list_of_purpleair_sensors.json, purpleair_AMTS: The datasets used in the study of air quality monitors were made available to us by Barkjohn of the Environmental Protection Agency. These were originally acquired by Barkjohn and collaborators from the US Air Quality System and from PurpleAir. Our analysis is based on Development and Application of a United States-Wide Correction for PM2.5 Data Collected with the PurpleAir Sensor by Barkjohn, Gantt, and Clements.
donkeys.csv: The data for the Kenyan donkey study were collected by Kate Milner on behalf of the UK Donkey Sanctuary and made available by Rougier in the paranomo package. Our analysis is based on How to Weigh a Donkey in the Kenyan Countryside by Milner and Rougier.
fake_news.csv: The hand-classified fake news data are from FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media by Shu, Mahudeswaran, Wang, Lee, and Liu.

In addition to these case studies, another 20-plus datasets were used as examples throughout the book. We acknowledge the people and organizations that made these datasets available in the order in which they appeared in the book.

gft.csv: The data on Google Flu Trends are available from the Gary King Dataverse, and the plot made from these data is based on The Parable of Google Flu: Traps in Big Data Analysis by Lazer, Kennedy, King, and Vespignani.
WikipediaExp.csv: The data for the Wikipedia experiment were made available by van de Rijt. These data were analyzed in Experimental Study of Informal Rewards in Peer Production by Restivo and van de Rijt.
co2_mm_mlo.txt: The CO₂ concentrations measured at Mauna Loa by the National Oceanic and Atmospheric Administration (NOAA) are available from the Global Monitoring Laboratory.
pm30.csv: These air quality measurements were downloaded for one day and one sensor from the PurpleAir Map.
babynames.csv: The US Social Security Administration provides the names from all Social Security card applications.
DAWN-Data.txt: The 2011 DAWN survey of drug-related emergency room visits is administered by the U.S. Substance Abuse and Mental Health Services Administration.
businesses.csv, inspections.csv, violations.csv: The data on restaurant inspection scores in San Francisco are from DataSF.
akc.csv: The data on dog breeds come from the Information is Beautiful Best in Show visualization and were originally acquired from the American Kennel Club.
sfhousing.csv: The housing sale prices for the San Francisco Bay Area were scraped from the San Francisco Chronicle real estate pages.
cherryBlossomMen.csv: The run times in the annual Cherry Blossom 10 Mile Run were scraped from the race results pages.
earnings2020.csv: The weekly earnings data are made available by the U.S. Bureau of Labor Statistics.
co2_by_country.csv: The annual CO₂ emissions by country are available from Our World in Data.
100m_sprint.csv: The times for the 100-meter sprint are from FiveThirtyEight, and the figure is based on The Fastest Men in the World Are Still Chasing Usain Bolt by Planos.
stateoftheunion1790-2022.txt: The State of the Union Addresses are compiled from the American Presidency Project.
CDS_ERA5_22-12.nc: These data were collected from the Climate Data Store, which is supported by the European Centre for Medium-Range Weather Forecasts.
world_record_1500m.csv: The 1500-meter world records were scraped from the Wikipedia page 1500 metres world record progression.
the_clash.csv: The Clash songs were obtained using the Spotify Web API. The retrieval of the data follows Exploring the Spotify API in Python by Morse.
catalog.xml: The XML plant catalog document is from the W3Schools plant catalog.
ECB_EU_exchange.csv: The exchange rates are available from the European Central Bank.
mobility.csv: These data were made available by Opportunity Insights, and our example follows Where Is the Land of Opportunity? The Geography of Intergenerational Mobility in the United States by Chetty, Hendren, Kline, and Saez.
utilities.csv: The home energy consumption data are available from Kaplan and appeared in the first edition of his Statistical Modeling: A Fresh Approach.
market-analysis.csv: These data were provided by Lipovetsky, and they correspond to the data used in his paper Regressions Regularized by Correlations.
crabs.data: The crab measurements are from the California Department of Fish and Wildlife and were made available online at the Stat Labs Data repository.
black_spruce.csv: The wind-damaged tree data were collected by Rich for his thesis, Large Wind Disturbance in the Boundary Waters Canoe Area Wilderness: Forest Dynamics and Development Changes Associated with the July 4th 1999 Blowdown, and were made available online in the alr4 package. The analysis is based on Chapter 12 of Applied Linear Regression by Weisberg.
