To the Data 100 Student


This book is currently undergoing major updates as we prepare it for publication. Thanks for your patience as we rearrange and fill in missing content. We have not removed content from the book entirely. Instead, we have moved content and marked sections as [In Progress] to denote new content that is actively being worked on.

Data 100 aims to prepare you for real-world data analyses. In theory, drawing conclusions from data is simple – load a data table, make a plot, and fit a model. In practice, it is not. Data are messy. Data sources collect data in different formats. Data values go missing. A simple linear model is not always appropriate. How do we pick from many possible alternative models? And how do we generalize our conclusions outside our limited data sample?

The work of a data scientist is to understand and address these questions. To this end, Data 100 has several key concepts that we revisit throughout the course.

  1. Data Lifecycle: The data lifecycle begins with question formulation, where we take a question of interest and refine it into one that can be studied with data. The data that we plan to use may need to be collected or may already be available. Either way, we need to clean, explore, and visualize these data before modeling them or drawing any conclusions. Depending on the purpose of our investigation, we often want to generalize our findings beyond our data. In short, the lifecycle involves roughly five steps: (A) Question formulation, (B) Data collection and cleaning, (C) Exploratory data analysis and visualization, (D) Modeling, and (E) Generalizing and reporting findings.

  2. From Raw Data to Analyzable Data: Data scientists must develop proficiency in working with data in a variety of forms. This course introduces tools and techniques commonly required for real-world data analyses, including the pandas Python package, data visualization, regular expressions, one-hot encodings, data transformations, and querying data from databases.

  3. Loss and Estimation: Data 100 uses model loss as a general framework for model fitting. Rather than derive a separate analytic solution for each model that we introduce, we rely on gradient descent, an elegant optimization-based approach that works well in practice and can be applied to many types of models.

  4. Descriptive, Predictive, and Inferential Modeling: Given a single dataset, we can pose questions that fall into several categories. Descriptive questions ask about the dataset itself – given a dataset about a subway system, which lines are most frequently used? Predictive questions ask about future data values – next Sunday, how many trains are needed to meet expected demand? And inferential questions ask about uncertainty in predictions – how many trains should we hold in reserve, in case demand spikes more than expected? A data scientist needs to distinguish between these types of questions and understand what questions can and cannot be answered with the data at hand.
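As a small preview of the third concept, here is a minimal sketch of gradient descent fitting the simplest possible model: a single constant `theta` chosen to minimize mean squared error. The data values, learning rate, step count, and function names are illustrative assumptions, not code from the course.

```python
# Minimal sketch: fit a constant model theta by gradient descent on
# mean squared error. All values below are made up for illustration.

def mse_gradient(theta, ys):
    """Derivative of (1/n) * sum((y - theta)^2) with respect to theta."""
    return -2 * sum(y - theta for y in ys) / len(ys)

def gradient_descent(ys, theta=0.0, learning_rate=0.1, steps=200):
    """Repeatedly move theta a small step against the gradient of the loss."""
    for _ in range(steps):
        theta -= learning_rate * mse_gradient(theta, ys)
    return theta

ys = [2.0, 4.0, 9.0]  # hypothetical sample
theta_hat = gradient_descent(ys)
print(theta_hat)
```

For this particular loss, the minimizer is the sample mean, so the loop should settle close to the average of `ys`. The same update loop can be reused for other models by swapping in a different gradient function.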

Chapters 1-13 focus on the first two concepts, with a light introduction to the last two. Chapters 14-27 focus on the last two concepts. This book is further divided into parts, and each part can be read on its own with minimal reliance on the other parts.