Collected here are a variety of resources for a more in-depth treatment of the larger themes in this book. In addition to recommendations for these broad topics, we provide a list of resources for several smaller topics and big topics that we only lightly touched on. These resources are organized in the order in which the topics appear in the book.
For how to analyze time-series data, like Google Flu Trends, we refer you to Time Series Analysis and Its Applications by Shumway and Stoffer.
To learn more about the interplay between questions and data, we recommend Questions, Answers, and Statistics by Speed. In addition, Leek and Peng connect questions with the type of analysis needed in their article What is the question?, whose subtitle reads: Mistaking the type of question being considered is the most common error in data analysis.
More on sampling topics can be found in Sampling: Design and Analysis by Lohr. Lohr also contains a treatment of the target population, access frame, sampling methods, and sources of bias.
To learn more about the human contexts and ethics of data, see the HCE Toolkit and Tuskegee University’s National Center for Bioethics in Research and Health Care.
To learn more about data privacy, see Big Data: Seizing Opportunities, Preserving Values, a concise White House report that provides guidelines and rationale for data privacy.
Ramdas gave a fun, informative talk in our class on bias, Simpson’s paradox, p-hacking, and related topics, and we recommend his slides.
For an introductory treatment of the urn model, confidence intervals, and hypothesis tests, we recommend Statistics by Freedman, Pisani, and Purves.
Owen’s online text, Monte Carlo theory, methods and examples, provides a solid introduction to simulation.
A proof that the median minimizes absolute error can be found in Mathematical Statistics: Basic Ideas and Selected Topics Volume I by Bickel and Doksum.
Python for Data Analysis by Wes McKinney provides in-depth coverage of
The classic The Essence of Databases by Roland offers a formal introduction to SQL, and the basics can be found in W3Schools’ Introduction to SQL. Designing Data-Intensive Applications surveys and compares different data storage systems, including SQL databases.
A good resource for data wrangling is Principles of Data Wrangling: Practical Techniques for Data Preparation by Rattenbury, Hellerstein, Heer, Kandel, and Carreras.
For how to handle missing data, see Chapter 8 in Lohr and Statistical Analysis with Missing Data by Little and Rubin.
The original text by Tukey, Exploratory Data Analysis, offers an excellent introduction to the topic.
The smooth density curve is covered in detail in Density Estimation for Statistics and Data Analysis by Silverman.
See Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures by Wilke for more on visualization. Our guidelines do not entirely match Wilke’s, but they come close, and it’s helpful to see a variety of opinions on the topic.
To learn more about color palettes, see Brewer’s online ColorBrewer2.0.
See Statistical Calibration: A Review by Osborne for more on calibration.
For practice with regular expressions, there are many online resources, such as the W3Schools tutorial Python RegEx, regular expression checkers like Regular Expressions 101, and introductions to the topic like An introduction to regular expressions by Nield. For a text, see Mastering Regular Expressions by Friedl.
Tompkins has a helpful online tutorial on how to work with netCDF climate data: The Beauty of NetCDF.
There are many resources on web services. Some accessible introductory material can be found in RESTful Web Services by Richardson and Ruby.
For more on XML, we recommend XML and Web Technologies for Data Sciences with R by Nolan and Temple Lang.
The many topics related to modeling, including transformations, one-hot encoding, model selection, cross-validation, and regularization, can be found in several sources. We recommend Linear Models with Python by Faraway; Applied Regression Analysis and Generalized Linear Models by Fox; An Introduction to Statistical Learning: With Applications in Python by James, Witten, Hastie, Tibshirani, and Taylor; and Applied Linear Regression by Weisberg.
Chapter 10 in Fox gives an informative treatment of the vector geometry of least squares.
Chapter 12 in Fox and Chapter 5 in Faraway cover the topic of weighted regression.
Andrew Ng’s interview is an interesting read on the gap between the test set and the real world.
Chapter 7 of James et al. introduces polynomial regression using orthogonal polynomials.
For more on broken-stick regression, see Bent-Cable Regression Theory and Applications by Chiu, Lockhart, and Routledge.
A more formal treatment of confidence intervals, prediction intervals, testing, and the bootstrap can be found in Mathematical Statistics and Data Analysis by Rice.
The ASA Statement on p-Values: Context, Process, and Purpose by Wasserstein and Lazar provides valuable insights into \(p\)-values. Additionally, the topic of p-hacking is addressed in The Statistical Crisis in Science by Gelman and Loken.
Information about rank tests and other nonparametric statistics can be found in Nonparametric Rank Tests by Hettmansperger.
The technique for developing linear models to use in the field is addressed in The lost art of nomography by Doerfler.
Chapter 14 in Fox covers the maximum likelihood approach to logistic regression, and Chapter 4 in James et al. covers sensitivity and specificity in more detail.
An in-depth treatment of loss functions and risk can be found in Chapter 12 of All of Statistics: A Concise Course in Statistical Inference by Wasserman.
Programming Collective Intelligence by Segaran covers the topic of optimization.
See Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning by Bengfort, Bilbro, and Ojeda for more on text analysis.