16. Model Selection


So far, when fitting models, we have decided which features to include by:

  • assessing model fit with residual plots

  • connecting the statistical model to a physical model

  • keeping the model simple

  • comparing improvements in average loss between increasingly complex models

For example, when we examined the one-variable model of upward mobility in Chapter 15, we found curvature in the residual plot. Adding a second variable greatly improved the fit in terms of average loss (MSE and, relatedly, multiple R-squared), but some curvature remained in the residuals. A seven-variable model decreased the MSE only slightly compared with the two-variable model, so we opted for the simpler two-variable model even though its residuals still showed some patterns.
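The comparison of average loss across increasingly complex models can be sketched as follows. This is a minimal illustration on synthetic data, not the upward mobility data: the seven simulated features, the coefficients, and the helper `fit_mse` are all assumptions made for the example. Only the first two features actually drive the response, so we expect a large drop in MSE from one feature to two and only a small drop from two to seven.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (an assumption, not the mobility data):
# seven candidate features, but only the first two affect y.
X = rng.normal(size=(n, 7))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

def fit_mse(X_sub, y):
    """Fit least squares with an intercept; return the MSE on the same data."""
    design = np.column_stack([np.ones(len(y)), X_sub])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return np.mean(residuals ** 2)

# Nested models that use the first 1, 2, and 7 features.
mse_by_size = {k: fit_mse(X[:, :k], y) for k in (1, 2, 7)}

for k, mse in mse_by_size.items():
    print(f"{k} feature(s): MSE = {mse:.3f}")
```

Because the models are nested, the MSE computed on the fitting data cannot increase as features are added, so a small decrease on its own is weak evidence that the extra features are worth keeping.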

As another example, when we modeled the weight of a donkey (see Chapter 18), we took guidance from a physical model. We ignored the donkey’s appendages and drew on the similarity between a barrel and a donkey’s body to begin fitting a model that explained the donkey’s weight by its length and girth (comparable to a barrel’s height and circumference). We continued to adjust that model by adding categorical features related to the donkey’s physical condition and age, collapsing categories and excluding other possible features to keep the model simple.

The decisions we made in building these models were based on judgment calls, and in this chapter we augment them with more formal criteria. To begin, we provide an example that shows why it’s typically not a good idea to include too many features in a model. Including too many features often leads to overfitting: the model follows the data too closely and captures some of the noise in the data. Then, when new observations come along, the predictions are worse than those from a simpler model. The remainder of the chapter provides techniques for limiting the impact of overfitting. These techniques are especially helpful when there are many potential features to include in a model. We also delve into the theory that explains overfitting through a closer examination of loss minimization (first introduced in Chapter 4).
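To make the idea of overfitting concrete, here is a small sketch on simulated data. It is not the example developed later in the chapter: the sine-shaped signal, the noise level, the sample sizes, and the polynomial degrees are all assumptions chosen for illustration. We fit polynomials of increasing degree to a small training set and then compute the MSE on both the training set and a large set of new observations.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(42)

def simulate(n):
    """Draw n observations from an assumed signal plus noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = simulate(20)    # small data set used for fitting
x_test, y_test = simulate(1000)    # "new observations" that arrive later

def mse(coefs, x, y):
    """Mean squared error of the polynomial with coefficients coefs."""
    return np.mean((y - P.polyval(x, coefs)) ** 2)

results = {}
for degree in (1, 3, 12):
    coefs = P.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train),
                       mse(coefs, x_test, y_test))

for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, "
          f"test MSE = {test_mse:.3f}")
```

The training MSE can only go down as the degree grows, since each polynomial contains the lower-degree ones as special cases. The MSE on new observations has no such guarantee, and in a setup like this one the highest-degree fit typically predicts new data worse than a moderate-degree fit, which is the behavior described above.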