Fitting the Multiple Linear Model

15.4. Fitting the Multiple Linear Model

In the previous section we considered the case of two explanatory variables; one of these was called \(x\) and the other \(v\). Now we want to generalize the approach to \(p\) explanatory variables. Our approach of choosing new letters to represent additional variables quickly fails us. Instead, we use a more formal and general approach that represents multiple predictors as a matrix, as depicted in Figure 15.4. We call \(\textbf{X}\) the design matrix. Notice that \(\textbf{X}\) has shape \( n \times (p + 1)\). Each column of \(\textbf{X}\) represents a variable, and each row represents an observation. That is, \(x_{i,j}\) is the measurement taken on observation \(i\) of variable \(j\).


Fig. 15.4 Each row and column of \(X\) represent an observation and a feature.


One technicality: the design matrix is defined as a mathematical matrix, not a dataframe, so you might notice that a matrix doesn’t include the column or row labels that the dataframe has.

That said, we usually don’t have to worry about converting dataframes into matrices since most Python libraries for modeling treat dataframes of numbers as if they were matrices.

For a given observation, say, the second row in \(\textbf{X}\), we represent the outcome \(y_2\) by the linear approximation:

\[ \begin{aligned} y_2 \approx {\hat{y}_2} = \theta_0 + \theta_1 x_{2,1} + \ldots + \theta_p x_{2,p} \end{aligned} \]

Similar to the simple linear model, our multiple linear model predicts \(\hat{y}_2\) as a linear combination of the values \([x_{2,1}, \ldots,x_{2,p}]\).

Also, we can write the model parameters as a \(p+1\) column vector \({\boldsymbol{\theta}}\),

\[\begin{split} \boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} \end{split}\]

Putting these notational definitions together, we can write the vector of \(\mathbf{\hat{y}}\) predictions for the entire dataset using matrix multiplication:

\[ \begin{aligned} \hat{\mathbf{y}} &= {\textbf{X}} {\boldsymbol{\theta}} \end{aligned} \]

Additionally, the errors in using \(\hat{\mathbf{y}}\) to predict \(\mathbf{y}\) can be expressed as the \(n\)-dimensional column vector:

\[ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}.\]

If we check the dimensions of \(\textbf{X}\) and \(\boldsymbol{\theta}\), it confirms that \(\hat{\mathbf{y}}\) is an \(n\)-dimensional column vector. Each element in this vector is the model’s prediction for one observation.

and we also represent \(\mathbf{y}\) as a column vector of outcomes:

\[\begin{split} \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \end{split}\]

This matrix representation of the multiple linear model can help us fit the model. Similar to the simple linear model, we find the best model using squared loss. That is, we want to find the model parameters \(\hat{\boldsymbol{\theta}}\) that minimize the average squared loss:

\[\begin{split} \begin{aligned} L({\boldsymbol{\theta}}, \textbf{X}, {\mathbf{y}}) &= \frac{1}{n} \sum_i [y_i - (\theta_0 + \theta_1 x_{i,1} + \cdots + \theta_p x_{i,p})]^2 \\ &= \frac{1}{n} \lVert \mathbf{y} - {\textbf{X}} {\boldsymbol{\theta}} \rVert^2 \end{aligned} \end{split}\]

Here, we use the notation \( \lVert \mathbf{v} \rVert^2 \) for a vector \(\mathbf{v}\) as a shorthand for the sum of each vector element squared 1: \(\lVert \mathbf{v} \rVert^2 = \sum_i v_i^2\).

We can fit our model using calculus as we did for the simple linear model. However, this approach gets cumbersome, and instead we use a geometric argument that is more intuitive and easily leads to useful properties of the design matrix, errors, and predicted values.

15.4.1. A Geometric Problem

Again, our goal is the find the \(\boldsymbol{\hat{\theta}}\) that minimizes our loss function—we want to make \(L({\boldsymbol{\theta}}, \textbf{X}, \mathbf{y})\) as small as possible for a given \(\textbf{X}\) and \(\mathbf{y}\). The key insight is that we can restate this goal in a geometric way. Since the model predictions \( \mathbf{\hat{y}} \) and the true outcomes \(\mathbf{y}\) are both vectors, we can think of them as points in variable space. When we change our model parameters \( {\boldsymbol{\theta}} \), the model makes different predictions \( \mathbf{\hat{y}} \), and the space of all possible model predictions is \( \text{span}(\mathbf{X}) \), as illustrated in Figure 15.5.


Fig. 15.5 In this simplified diagram, the space of all possible model prediction vectors \( \text{span}(\mathbf{X}) \) is illustrated as a gray plane in 3-dimensional space, and the observed \( \mathbf{y} \) as a vector.


As Figure 15.5 shows, the true outcome vector \( \mathbf{y} \) doesn’t usually lie in \( \text{span}(\mathbf{X}) \). This means that the model can’t get perfect accuracy on the training set. Although our loss can’t be exactly zero, we can instead find the point in \( \text{span}(\mathbf{X}) \) that lies as close to \( \mathbf{y} \) as possible.

The error vector \( \mathbf{e} = \mathbf{y} - \mathbf{\hat{y}} \) represents the distance between the true outcomes and our model’s predictions. Visually, \( \mathbf{e} \) has the smallest magnitude when it is perpendicular to the subspace \( \text{span}(\mathbf{X}) \), as shown in Figure 15.6. The proof of this fact is omitted for brevity.


Fig. 15.6 The error is minimized when it lies perpendicular to \( \text{span}(\mathbf{X}) \).

In other words, there are many possible model parameters \( \boldsymbol{\theta} \), but only one \( \hat{\boldsymbol{\theta}} \) minimizes the loss by making the error vector perpendicular to \( \text{span}(\mathbf{X}) \).

This fact lets us derive a formula for \( \hat{\boldsymbol{\theta}} \) as follows:

\[\begin{split} \begin{aligned} \textbf{X} \boldsymbol{\hat{\theta}} + \mathbf{e} &= \mathbf{y} \\ {\textbf{X}}^\top \textbf{X} \hat{\boldsymbol{\theta}} + {\textbf{X}}^\top \mathbf{e} &= {\textbf{X}}^\top \mathbf{y} & (\text{left-multiply by } {\textbf{X}}^\top) \\ {\textbf{X}}^\top \textbf{X} \hat{\boldsymbol{\theta}} &= {\textbf{X}}^\top \mathbf{y} & (\mathbf{e} \perp \text{span}(\textbf{X})) \\ \boldsymbol{\hat{\theta}} &= ({\textbf{X}}^\top \textbf{X})^{-1} {\textbf{X}}^\top \mathbf{y} & (\text{left-multiply by } ({\textbf{X}}^\top \textbf{X})^{-1}) \end{aligned} \end{split}\]

In this derivation, we used the fact that \( {\textbf{X}}^\top \mathbf{e} = 0 \) when \( \mathbf{e} \) is perpendicular to \( \text{span}(\mathbf{X}) \).

This general approach to derive \(\boldsymbol{\hat{\theta}}\) for the multiple linear model also gives us \(\hat{\theta}_0\) and \(\hat{\theta}_1\) for the simple linear model. If we set \({\textbf{X}}\) to contain the intercept column and one column of features, using this formula for \(\boldsymbol{\hat{\theta}}\) and some linear algebra, we can get the intercept and slope of the least-squares-fitted simple linear model. In fact, if \({\textbf{X}}\) is simply a single column of \(1\)s, then we can use this formula to show that \(\boldsymbol{\hat{\theta}}\) is just the mean of \(\mathbf{x}\). This ties back to the constant model that we introduced in Chapter 4.


While, we can write a simple function to derive the \(\boldsymbol{\hat{\theta}}\) based on the formula,

\[ \boldsymbol{\hat{\theta}} = ({\textbf{X}}^\top \textbf{X})^{-1} {\textbf{X}}^\top \mathbf{y}, \]

we recommend leaving the calculation of \(\boldsymbol{\hat{\theta}}\) to the optimally tuned methods provided in the scikit-learn or statsmodels libraries. They handle cases where the design matrix is sparse, highly co-linear, and not invertible.

This solution for \(\boldsymbol{\hat{\theta}}\) (along with the pictures) reveal some useful properties of the fitted coefficients and the predictions:

  • The residuals are orthogonal to the predicted values.

  • The average of the residuals is \(0\), if the model has an intercept term.

  • The variance of the residuals is just the ASE.

These properties explain why we examine plots of the residuals against the predictions. When we fit a multiple linear model, we also plot the residuals against variables that we are considering adding to the model. If they show a linear pattern, then we would add them to the model.

In addition to examining the SD of the errors, the ratio of the ASE for a multiple linear model to the ASE for the constant model gives a measure of the model fit. This is called the Multiple \(R^2\) and is defined as:

\[ R^2 = 1 - \frac {\lVert \mathbf{y} - {\textbf{X}}{\boldsymbol{\hat{\theta}}} \rVert^2} {\lVert {\mathbf{y}} - \bar{y} \rVert^2}. \]

As the model fits the data closer and closer, the multiple \(R^2\) gets nearer to \(1\). That might seem like a good thing, but there are problems with this approach because \(R^2\) continues to grow even as we add meaningless features to our model, as long as the features fill out the \(\text{span}(\textbf{X})\). To account for the size of a model, we often adjust the numerator and denominator by the number of fitted coefficients in the models. That is, we would normalize the numerator by \(1/[n-(p+1)]\) and the denominator by \(1/(n-1)\). Better approaches to selecting a model are covered in Chapter 16.


\(\lVert \mathbf{v} \rVert^2\) is also called the \(\ell_2\) norm of a vector \(\mathbf{v}\).