(ax:extra_reading)=
# Additional Material

Collected here are resources that treat the larger themes of this book in more depth. In addition to recommendations for these broad topics, we list resources for several smaller topics and for larger topics that we touched on only lightly. The resources are organized in the order in which the topics appear in the book.

- For how to analyze time-series data, like Google Flu Trends, we refer you to [_Time Series Analysis and Its Applications_](https://doi.org/10.1007/978-3-319-52452-8) by Shumway and Stoffer.

- To learn more about the interplay between questions and data, we recommend [Questions, Answers, and Statistics](https://iase-web.org/documents/papers/icots2/Speed.pdf) by Speed. In addition, Leek and Peng connect questions with the type of analysis needed in [What is the question? Mistaking the type of question being considered is the most common error in data analysis](https://doi.org/10.1126/science.aaa6146).

- More on sampling topics can be found in [_Sampling: Design and Analysis_](https://doi.org/10.1201/9780429298899) by Lohr. Lohr also treats the target population, access frame, sampling methods, and sources of bias.

- To learn more about the human contexts and ethics of data, see the [HCE Toolkit](https://data.berkeley.edu/hce-toolkit) and Tuskegee University's [National Center for Bioethics in Research and Health Care](https://www.tuskegee.edu/about-us/centers-of-excellence/bioethics-center).

- To learn more about data privacy, see [_Big Data: Seizing Opportunities, Preserving Values_](https://obamawhitehouse.archives.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf), a concise White House report that provides guidelines and rationale for data privacy.

- Ramdas gave a fun, informative talk in our class on bias, Simpson's paradox, p-hacking, and related topics, and we recommend his [slides](https://drive.google.com/file/d/0B7gkaDYGT5X5c245RV93MVRRSjQ/view?resourcekey=0-8nQDM50Tta2SuLkFqAXEqQ).

- For an introductory treatment of the urn model, confidence intervals, and hypothesis tests, we recommend [_Statistics_](https://wwnorton.com/books/Statistics/) by Freedman, Pisani, and Purves.

- Owen's online text, [_Monte Carlo theory, methods and examples_](https://artowen.su.domains/mc/), provides a solid introduction to simulation.

- For a fuller treatment of probability, we suggest [_Probability_](https://doi.org/10.1007/978-1-4612-4374-8) by Pitman and [_Introduction to Probability_](https://doi.org/10.1201/b17221) by Blitzstein and Hwang.

- A proof that the median minimizes absolute error can be found in [_Mathematical Statistics: Basic Ideas and Selected Topics Volume I_](https://www.routledge.com/Mathematical-Statistics-Basic-Ideas-and-Selected-Topics-Volume-I-Second/Bickel-Doksum/p/book/9781498723800) by Bickel and Doksum.

- [_Python for Data Analysis_](https://wesmckinney.com/book/) by Wes McKinney provides in-depth coverage of `pandas`.

- The classic [_The Essence of Databases_](https://dl.acm.org/doi/book/10.5555/274800) by Rolland offers a formal introduction to SQL, and the basics can be found in W3Schools' [Introduction to SQL](https://www.w3schools.com/sql/sql_intro.asp). [_Designing Data-Intensive Applications_](https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/) by Kleppmann surveys and compares different data storage systems, including SQL databases.

- A good resource for data wrangling is [_Principles of Data Wrangling: Practical Techniques for Data Preparation_](https://www.oreilly.com/library/view/principles-of-data/9781491938911/) by Rattenbury, Hellerstein, Heer, Kandel, and Carreras.

- For how to handle missing data, see Chapter 8 in Lohr and [_Statistical Analysis with Missing Data_](https://www.wiley.com/en-us/Statistical+Analysis+with+Missing+Data,+3rd+Edition-p-9780470526798) by Little and Rubin.

- The original text by Tukey, [_Exploratory Data Analysis_](https://archive.org/details/exploratorydataa00tuke_0), offers an excellent introduction to the topic.

- The smooth density curve is covered in detail in [_Density Estimation for Statistics and Data Analysis_](https://www.routledge.com/Density-Estimation-for-Statistics-and-Data-Analysis/Silverman/p/book/9780412246203) by Silverman.

- See [_Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures_](https://clauswilke.com/dataviz/) by Wilke for more on visualization. Our guidelines do not entirely match Wilke's, but they come close, and it's helpful to see a variety of opinions on the topic.

- To learn more about color palettes, see Brewer's online [ColorBrewer2.0](https://colorbrewer2.org/).

- See [Statistical Calibration: A Review](https://doi.org/10.2307/1403690) by Osborne for more on calibration.

- For practice with regular expressions, there are many online resources, such as the W3Schools tutorial [Python RegEx](https://www.w3schools.com/python/python_regex.asp), regular expression checkers like [Regular Expressions 101](https://regex101.com/), and introductions to the topic like [An introduction to regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) by Nield. For a text, see [_Mastering Regular Expressions_](https://dl.acm.org/doi/10.5555/1209014) by Friedl.

- Chapter 13 in [Fox](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) and Chapter 10 in [James, et al.](https://www.statlearning.com/) discuss principal components. (See below for the titles of these resources.)

- Tompkins has a helpful online tutorial on how to work with netCDF climate data: [The Beauty of NetCDF](https://www.youtube.com/watch?v=UvNBnjiTXa0).

- There are many resources on web services. Some accessible introductory material can be found in [_RESTful Web Services_](https://dl.acm.org/doi/10.5555/1406352) by Richardson and Ruby.

- For more on XML, we recommend [_XML and Web Technologies for Data Sciences with R_](https://doi.org/10.1007/978-1-4614-7900-0) by Nolan and Temple Lang.

- The many topics related to modeling, including transformations, one-hot encoding, model selection, cross-validation, and regularization, are covered in several sources. We recommend [_Linear Models with Python_](https://julianfaraway.github.io/LMP/) by Faraway, [_Applied Regression Analysis and Generalized Linear Models_](https://us.sagepub.com/en-us/nam/applied-regression-analysis-and-generalized-linear-models/book237254) by Fox, [_An Introduction to Statistical Learning: With Applications in Python_](https://www.statlearning.com/) by James, Witten, Hastie, Tibshirani, and Taylor, and [_Applied Linear Regression_](https://doi.org/10.1002/0471704091) by Weisberg.

- Chapter 10 in Fox gives an informative treatment of the vector geometry of least squares.

- Chapter 12 in Fox and Chapter 5 in Faraway cover the topic of weighted regression.

- Andrew Ng's [interview](https://spectrum.ieee.org/andrew-ng-xrays-the-ai-hype) is an interesting read on the gap between the test set and the real world.

- Chapter 7 of James, et al. introduces polynomial regression using orthogonal polynomials.

- For more on broken-stick regression, see [Bent-Cable Regression Theory and Applications](https://doi.org/10.1198/016214505000001177) by Chiu, Lockhart, and Routledge.

- A more formal treatment of confidence intervals, prediction intervals, testing, and the bootstrap can be found in [_Mathematical Statistics and Data Analysis_](https://www.cengage.com/c/mathematical-statistics-and-data-analysis-3e-rice/9780534399429/) by Rice.

- [The ASA Statement on p-Values: Context, Process, and Purpose](https://doi.org/10.1080/00031305.2016.1154108) by Wasserstein and Lazar provides valuable insights into $p$-values. Additionally, the topic of $p$-hacking is addressed in [The Statistical Crisis in Science](https://doi.org/10.1511/2014.111.460) by Gelman and Loken.

- Information about rank tests and other nonparametric statistics can be found in [_Nonparametric Rank Tests_](https://doi.org/10.1007/978-3-642-04898-2_417_) by Hettmansperger.

- The technique for developing linear models to use in the field is addressed in [The lost art of nomography](https://deadreckonings.files.wordpress.com/2008/01/nomography.pdf) by Doerfler.

- Chapter 14 in Fox covers the maximum likelihood approach to logistic regression, and Chapter 4 in James, et al. covers sensitivity and specificity in more detail.

- An in-depth treatment of loss functions and risk can be found in Chapter 12 of [_All of Statistics: A Concise Course in Statistical Inference_](https://doi.org/10.1007/978-0-387-21736-9) by Wasserman.

- [_Programming Collective Intelligence_](https://www.oreilly.com/library/view/programming-collective-intelligence/9780596529321/) by Segaran covers the topic of optimization.

- See [_Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning_](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/) by Bengfort, Bilbro, and Ojeda for more on text analysis.