16.1. Overfitting

When we have many features available, choosing which ones to include in a model rapidly gets complicated. Even with a two-variable model and eight features to choose from, there are 28 pairs of features that we could fit and check. And all together, there are over 250 models to choose from if we consider all one-, two-, …, eight-variable models. It can be hard to tell where to stop, to decide how simple is simple enough, and to examine hundreds of residual plots. Unfortunately, the notion of loss and minimizing MSE isn’t entirely helpful either. With each variable that we add to the model, the MSE typically gets smaller. Recall from the geometric perspective of model fitting (Chapter 15) that adding a feature to a model adds an \(n\)-dimensional vector to the feature space, and the error between the outcome vector and its projection into the space spanned by the explanatory variables gets smaller. We might view this as a good thing because our model fits the data more closely, but there is a danger of overfitting.

Overfitting happens when the model follows the data too closely and picks up the variability in the random noise in the outcomes. When this happens, new observations are not well predicted. An example helps clarify this idea.

16.1.1. Example: Energy Consumption

The following dataset contains the utility bills from a private residence in Minnesota. We have records of the monthly gas usage in the home (cubic feet) and the average daily temperature (degrees Fahrenheit)1.

Let’s begin by looking at a scatter plot of gas consumption as a function of temperature.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Scatter(x=heat_df['temp'], y=heat_df['ccf'], mode="markers"),
    row=1, col=1)

fig.update_xaxes(title='Temperature (F)', row=1, col=1)
fig.update_yaxes(title='Gas Consumption (ft^3)', row=1, col=1)

fig.add_trace(
    go.Scatter(x=heat_df['temp'], y=heat_df['ccf'], mode="markers"),
    row=1, col=2)

fig.update_xaxes(type="log", title='Temperature (F)', row=1, col=2)

fig.update_layout(height=250, width=500, showlegend=False)

The relationship shows curvature, but when we try to straighten it with a log transformation, a different curvature arises. Additionally, there are two unusual points. If we refer back to the documentation, we find that these points represent recording errors, so we remove them.
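As a sketch of this cleaning step (the actual rows flagged in the documentation aren't reproduced here, so the data and the indices below are purely hypothetical):

```python
import pandas as pd

# Toy stand-in for heat_df with a couple of anomalous rows.
heat_df = pd.DataFrame({
    'temp': [29.0, 31.0, 15.0, 76.0, 55.0, 39.0, 8.0],
    'ccf': [166.0, 179.0, 224.0, 11.0, 32.0, 600.0, 210.0],
})

# Hypothetical: suppose the documentation flags rows 3 and 5 as
# recording errors; we drop them by index label.
error_rows = [3, 5]
heat_df = heat_df.drop(index=error_rows).reset_index(drop=True)
print(len(heat_df))  # 5 rows remain
```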

Let’s see if a quadratic curve can capture the relationship between gas usage and temperature. Polynomials are still considered linear models because they are linear in their polynomial features. For example, we can express a quadratic model as,

\[ y = \theta_0 + \theta_1 x + \theta_2 x^2 + \text{error} \]

This model is linear in the features \(x\) and \(x^2\), and in matrix notation we can write this model as \({\mathbf{y}} = {\textbf{X}} {\boldsymbol{\theta}} + {\text{error}}\), where \( \textbf{X} \) is the design matrix,

\[\begin{split} \left[ \begin{matrix} 1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots\\ 1 & x_n & x_n^2\\ \end{matrix} \right] \end{split}\]

We can create the polynomial features of the design matrix with the PolynomialFeatures tool in scikit-learn.

y = heat_df[['ccf']]
X = heat_df[['temp']]

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(X)
poly_features[:6]

array([[  29.,  841.],
       [  31.,  961.],
       [  15.,  225.],
       [  76., 5776.],
       [  55., 3025.],
       [  39., 1521.]])

We set the parameter include_bias to False because we plan to fit the polynomial with the LinearRegression method in scikit-learn, and by default it includes the constant term in the model.

from sklearn.linear_model import LinearRegression

model_deg2 = LinearRegression().fit(poly_features, y)
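To see the interplay between include_bias=False and LinearRegression's default fit_intercept=True, here is a self-contained check on made-up data (the quadratic coefficients below are invented for illustration). Even though the feature matrix has no column of ones, the fitted model recovers the constant term:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(1.0, 6.0).reshape(-1, 1)
y = 2 + 3 * x[:, 0] + 0.5 * x[:, 0] ** 2   # exact quadratic, no noise

# Features are [x, x^2]; the intercept comes from LinearRegression itself.
feats = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(feats, y)   # fit_intercept=True by default

print(model.intercept_)   # ≈ 2.0
print(model.coef_)        # ≈ [3.0, 0.5]
```

Had we kept include_bias=True, the design matrix would contain a constant column in addition to the intercept that LinearRegression fits, which is redundant.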

We overlay the fitted quadratic on the scatter plot and examine the residuals to get a quick idea as to the quality of the fit.


The quadratic captures the curve in the data quite well, but the residuals show a slight upward trend in the 70-80 degree range of temperature, which indicates some lack of fit.
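The residual check described above can be computed directly; here is a minimal sketch on simulated data (the quadratic signal and noise level are invented, not the heating data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
temp = np.sort(rng.uniform(0, 80, size=100))
ccf = 250 - 5 * temp + 0.03 * temp**2 + rng.normal(0, 8, size=100)

feats = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
    temp.reshape(-1, 1))
model = LinearRegression().fit(feats, ccf)

# Residuals: observed minus fitted. Plotting these against temp and
# looking for a trend is the lack-of-fit check described above.
residuals = ccf - model.predict(feats)
print(residuals.mean())   # essentially 0 for least squares with an intercept
```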

For comparison, we fit a few more models with higher degree polynomials and collectively examine the fitted curves.

poly12 = PolynomialFeatures(degree=12, include_bias=False)
poly_features12 = poly12.fit_transform(X)

degrees = [1, 2, 3, 6, 8, 12]

mods = [LinearRegression().fit(poly_features12[:, :deg], y)
        for deg in degrees]
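Because each model's features are the leading columns of the next model's, these fits are nested, and the training MSE can only decrease (or stay flat) as the degree grows. A sketch on simulated data (the quadratic signal below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
x = rng.uniform(0, 2, size=60).reshape(-1, 1)
y = 1 - 2 * x[:, 0] + 0.5 * x[:, 0]**2 + rng.normal(0, 0.1, size=60)

feats12 = PolynomialFeatures(degree=12, include_bias=False).fit_transform(x)

degrees = [1, 2, 3, 6, 8, 12]
mses = {}
for deg in degrees:
    mod = LinearRegression().fit(feats12[:, :deg], y)
    mses[deg] = mean_squared_error(y, mod.predict(feats12[:, :deg]))
print(mses)   # nonincreasing as the degree grows
```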


Fitting polynomials in this way is not advisable. We have used this traditional form of polynomials to focus on the notion of overfitting.

Unfortunately, these polynomial features tend to be highly correlated. For example, the correlation between \(x\) and \(x^2\) for the energy data is 0.98. Highly correlated features give unstable coefficients, where a small change in an \(x\)-value can lead to a large change in the coefficients of the polynomial. Also, when the \(x\) values are large, the normal equations are poorly conditioned, and the coefficients can be difficult to interpret and compare.

A better practice is to use polynomials that have been constructed to be orthogonal to one another. These polynomials span the same space as the original polynomials, but they are uncorrelated with one another and give a more stable fit.
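One way to sketch this idea with numpy is to QR-decompose the polynomial design matrix (constant column included) and use the resulting orthogonal columns as features; this is similar in spirit to what R's poly() computes. The data here are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(20, 80, size=50)   # simulated temperatures

# Raw polynomial features are highly correlated.
raw_corr = np.corrcoef(x, x**2)[0, 1]
print(raw_corr)                    # close to 1

# Orthogonalize: QR-decompose [1, x, x^2]. The Q columns beyond the
# first span the same space as x and x^2 but are orthogonal to each
# other and to the constant column, hence uncorrelated.
design = np.column_stack([np.ones_like(x), x, x**2])
Q, _ = np.linalg.qr(design)
ortho_corr = np.corrcoef(Q[:, 1], Q[:, 2])[0, 1]
print(ortho_corr)                  # essentially 0
```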

The plot below places all of the fits on the same graph so that we can see how the higher degree polynomials tend to bend more and more strangely.


We can also visualize the different polynomial fits in separate facets.


The degree 1 curve (the straight line) in the top left facet misses the curved pattern in the data. The degree 2 curve begins to capture it, and the degree 3 looks like an improvement, but notice the upward bend at the right side of the plot. The polynomials of degree 6, 8, and 12 follow the data more and more closely as they get more and more curvy. These polynomials seem to fit spurious bumps in the data. All together, these six curves illustrate under- and overfitting. The fitted line in the top left underfits and misses the curvature entirely. And the degree 12 polynomial in the bottom right definitely overfits, with a wiggly pattern that we don’t think makes sense for this context.

In general, as we add more features, models get more complex and the MSE drops, but at the same time, the fitted model grows increasingly erratic and sensitive to the data. When we overfit, the model follows the data too closely, and predictions are poor for new observations. One simple technique to assess a fitted model is to compute the MSE on new data, data that were not used in building the model. Since we don’t typically have the capacity to acquire more data, we set aside some of the original data to evaluate the fitted model. This technique is the topic of the next section.
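A minimal sketch of that idea on simulated data (the sample sizes, degrees, and quadratic signal are all invented for illustration): fit on one sample, then compare MSEs on data the models never saw. The high-degree model typically wins on the training set but loses on the fresh data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)

def simulate(n):
    x = rng.uniform(0, 2, size=n).reshape(-1, 1)
    y = 1 - 2 * x[:, 0] + 0.5 * x[:, 0]**2 + rng.normal(0, 0.3, size=n)
    return x, y

X_train, y_train = simulate(20)    # small training set: easy to overfit
X_test, y_test = simulate(200)     # fresh data the models never saw

results = {}
for deg in [2, 12]:
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    results[deg] = {
        'train': mean_squared_error(y_train,
                                    model.predict(poly.transform(X_train))),
        'test': mean_squared_error(y_test,
                                   model.predict(poly.transform(X_test))),
    }
print(results)
```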


These data are from Daniel T. Kaplan, Statistical Modeling: A Fresh Approach, 2009. https://www.key2stats.com/Utility_bills_1294_92.csv