15.2. Fitting the Simple Linear Model¶

To fit a model, we use the notion of loss minimization that was first introduced in Chapter 4). The first step is to choose the loss function. Then, we find which values of $$\theta_0$$ and $$\theta_1$$ give the smallest loss for our data. We call $$\theta_0$$ and $$\theta_1$$ model parameters, and we often represent them as the vector $${\boldsymbol{\theta}} = [\theta_0, \theta_1]$$.

To fit a linear model with a numeric outcome variable, we typically choose squared loss, where for a given data point $$(x,y)$$ and model parameters $${\boldsymbol{\theta}}$$, the squared loss is:

${\cal l}({\boldsymbol{\theta}}, x, y) = [y - (\theta_0 + \theta_1 x)]^2$

Notice that this loss function squares the error between the outcome $$y$$ and the prediction, $$\theta_0 + \theta_1 x$$, given $$x$$. That’s why squared loss is also called squared error. We saw in the previous section that these errors are in the vertical direction on the scatter plot, meaning for a specific $$x$$, the error is the vertical difference between the points $$(x, y)$$ and $$(x, \theta_0 + \theta_1 x)$$. (See Figure 15.1 for a diagram that includes two sample errors).

For a data set with $$n$$ points: $$(x_1, y_1), \ldots, (x_n, y_n)$$, the average squared loss (ASE) is:

\begin{split} \begin{aligned} L({\boldsymbol{\theta}}, {\mathbf{x}}, {\mathbf{y}}) &= \frac{1}{n} \sum_{i} {\cal l}({\boldsymbol{\theta}}, x_i, y_i) \\ &= \frac{1}{n} \sum_{i}[y_i - (\theta_0 + \theta_1 x_i)]^2. \end{aligned} \end{split}

Again, $${\mathbf{x}}$$ is the vector $$[x_1, \ldots, x_n]$$ and $${\mathbf{y}}$$ is similarly defined.

Now that we have chosen the loss function, we want to find $$\hat{\boldsymbol{\theta}}$$ that minimizes the loss. We use calculus to do this.

15.2.1. Minimizing the Loss¶

With the simple linear model, the average squared error is a function of two model parameters. This means that if we use calculus to find the minimizing parameter values, we need to find the partial derivatives of the ASE with respect to $$\theta_0$$ and $$\theta_1$$. We can also find these minimizing values by other techniques:

• Gradient descent. We can use numerical optimization techniques, such as gradient descent, when the loss function is more complex and calculus is not an option (see Chapter 20.

• Quadratic formula. Since $$L({\boldsymbol{\theta}},{\mathbf{x}},{\mathbf{y})}$$ is a quadratic function of $${\boldsymbol{\theta}}$$ we can use the quadratic formula (along with a lot of algebra) to solve for the minimizing model parameter values. We guide you through the quadratic approach in the Exercises.

• Geometric argument. Later in this chapter, we use a geometric interpretation of least squares to fit multiple linear models. This approach relates to the Pythagorean theorem and has several intuitive benefits.

We choose calculus to optimize the simple linear model since it is quick and straightforward. To begin, we take the partial derivatives of the average squared error with respect to each parameter:

\begin{split} \begin{aligned} \frac{\partial}{\partial \theta_0} L({\boldsymbol{\theta}}, {\mathbf{x}}, {\mathbf{y}}) &= \frac{1}{n} \sum_{i} 2 (y_i - \theta_0 - \theta_1 x_i ) (-1)\\ & \\ \frac{\partial}{\partial \theta_1} L({\boldsymbol{\theta}}, {\mathbf{x}}, {\mathbf{y}}) &= \frac{1}{n} \sum_{i} 2 (y_i - \theta_0 - \theta_1 x_i) (-x_i) \end{aligned} \end{split}

Then, we set the partial derivatives equal to 0, and simplify a bit by multiplying both sides of the equations by $$-n/2$$ to get:

\begin{split} \begin{aligned} 0 &= \sum_{i} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) \\ 0 &= \sum_{i} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i)x_i \\ \end{aligned} \end{split}

These equations are called the normal equations. In the first equation, we see that $$\hat{\theta}_0$$ can be represented as a function of $$\hat{\theta}_1$$.

$\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}$

Plugging this value into the second equation, gives us:

\begin{split} \begin{aligned} 0 &= \sum_{i} (y_i - \bar y + \hat{\theta}_1 \bar x - \hat{\theta}_1 x_i ) x_i \\ &= \sum_{i} [(y_i - \bar y) - \hat{\theta}_1 ( x_i - \bar x)]x_i \\ \hat{\theta}_1 &= \frac{\sum_{i} (y_i - \bar y)x_i} {\sum_{i}( x_i - \bar x)x_i} \\ \end{aligned} \end{split}

After some algebra, we can represent $$\hat{\theta}_1$$ in terms of quantities that we are familiar with:

$\hat{\theta}_1 = r({\mathbf{x}},{\mathbf{y}}) \frac{SD({\mathbf{y}})}{SD({\mathbf{x}})}$

This representation says that, a point on the fitted line at $$x$$ can be written as

\begin{split} \begin{aligned} \hat{y} &= \hat{\theta}_0 + \hat{\theta}_1 x \\ &= \bar{y} - \hat{\theta}_1 \bar{x} + \hat{\theta}_1 x \\ &= \bar{y} + r({\mathbf{x}},{\mathbf{y}}) SD({\mathbf{y}}) \frac{(x - \bar{x})}{SD({\mathbf{x}})}\\ \end{aligned} \end{split}

This equation for the line has a nice interpretation: for a given $$x$$ value, we find how many standard deviations above (or below) average it is, and then predict $$y$$ to be $$r$$ times as many standard deviations above (or below) its average.

Since pandas.Series objects have built-in methods to compute $$SD(\mathbf{x})$$, $$SD(\mathbf{y})$$, and $$r(\mathbf{x}, \mathbf{y})$$, we can quickly define functions that fit the simple linear model to our data:

def theta_1(x, y):
r = x.corr(y)
return r * y.std() / x.std()

def theta_0(x, y):
return y.mean() - theta_1(x, y) * x.mean()


Now, we can fit the model by computing $$\hat{\theta}_0$$ and $$\hat{\theta}_1$$ on the data set.

t1 = theta_1(GA['pm25aqs'], GA['pm25pa'])
t0 = theta_0(GA['pm25aqs'], GA['pm25pa'])
print(f'Model: f(x) = {t0:.2f} + {t1:.2f}x')

Model: f(x) = -3.36 + 2.10x


The model we fitted matches the line that we used in the previous section. This is not by accident. We purposely chose the least squares line as our fitted model. Of course, can use the functionality provided in scikit learn for finding the model parameters.

from sklearn.linear_model import LinearRegression

y = GA[['pm25pa']]
X = GA[['pm25aqs']]

reg = LinearRegression().fit(X, y)

print(f"Intercept: {reg.intercept_}  Slope: {reg.coef_[0]}")

Intercept: [-3.36]  Slope: [2.1]


Rather than computing the model parameters with summary statistics, we will want to use the LinearRegression method in scikit learn because it offers numerically stable algorithms. This is especially important when fitting multiple variables. We introduce the multiple linear model in the next section.