15.6. Feature Engineering for Numeric Measurements

All of the models that we have fit so far in this chapter have used existing numeric features from a data frame. In this section, we look at variables that are created from transformations of numeric features. Transforming variables to use in modeling is called feature engineering.

We first introduced feature engineering in Chapter 9 and Chapter 10. There, we transformed features so that they had symmetric distributions. Transformations can also capture more kinds of patterns in the data and lead to more accurate models.

We return to the data set we used as an example in Chapter 10: house sale prices in the San Francisco Bay Area. This time, we restrict the data to houses sold in 2006, when sale prices were relatively stable, so we don’t need to account for trends in price over time.

We wish to model sale price. Recall that visualizations in Chapter 10 showed us that sale price was related to several features, like the size of the house, size of the lot, number of bedrooms, and city. We transformed both sale price and the size of the house to improve their relationship, and we saw that box plots of sale price by the number of bedrooms and box plots by city revealed interesting relationships. In this section, we include transformed numeric features in a linear model. In the next section, we also consider adding to the model an ordinal feature (the number of bedrooms) and a nominal feature (the city).

We begin by modeling sale price on house size, considering both variables and their log transformations. The correlation matrix tells us which of our numeric explanatory variables (original and transformed) is most strongly correlated with sale price.

           price     br  lsqft  bsqft  ...  log_bsqft  log_lsqft   ppsf  log_ppsf
price       1.00   0.45   0.59   0.79  ...       0.74       0.62   0.49      0.47
br          0.45   1.00   0.29   0.67  ...       0.71       0.38  -0.18     -0.21
lsqft       0.59   0.29   1.00   0.46  ...       0.44       0.85   0.29      0.27
...          ...    ...    ...    ...  ...        ...        ...    ...       ...
log_lsqft   0.62   0.38   0.85   0.52  ...       0.52       1.00   0.29      0.27
ppsf        0.49  -0.18   0.29  -0.08  ...      -0.11       0.29   1.00      0.96
log_ppsf    0.47  -0.21   0.27  -0.10  ...      -0.14       0.27   0.96      1.00

9 rows × 9 columns
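
As a rough illustration of how such a correlation matrix can be computed, here is a minimal sketch. It uses synthetic data (not the housing data) and NumPy only, so the column names and the numbers it prints are assumptions made for the example. With a pandas data frame, calling `corr()` on the data frame produces the same kind of matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for house size and price (NOT the real housing data).
bsqft = rng.lognormal(mean=7.3, sigma=0.35, size=n)            # building size, sq ft
noise = rng.lognormal(mean=0.0, sigma=0.25, size=n)
price = 400 * bsqft**0.9 * noise                               # sale price, dollars

# Original and derived features: base-10 logs and price per square foot.
features = {
    "price": price,
    "bsqft": bsqft,
    "log_price": np.log10(price),
    "log_bsqft": np.log10(bsqft),
    "ppsf": price / bsqft,
}
names = list(features)

# np.corrcoef treats each row as one variable, so stack the features as rows.
corr = np.corrcoef(np.vstack([features[k] for k in names]))

# Print the correlations with price, strongest first.
price_row = corr[names.index("price")]
for name, r in sorted(zip(names, price_row), key=lambda pair: -abs(pair[1])):
    print(f"{name:>10}  {r:5.2f}")
```

Sorting one row of the matrix by absolute correlation is a quick way to shortlist candidate explanatory variables, although a high correlation alone does not tell us whether the relationship is linear.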

Sale price correlates most highly with house size, called bsqft for building square feet. We make a scatter plot of sale price against house size to confirm the association is linear.


The relationship does look roughly linear, but the very large and expensive houses are far from the center of the distribution and can overly influence the model fit. As shown in Chapter 10, the log transformation makes the distributions of price and size more symmetric (both are log base 10 to make it easier to convert the values into the original units).
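
To make the effect of the log transformation concrete, the following sketch compares the skewness of a right-skewed variable before and after taking base-10 logs. The data are synthetic (an assumption for illustration, not the housing data), and `skewness` is a small helper written for this example.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the average cubed z-score (near 0 for a symmetric distribution)."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z**3))

rng = np.random.default_rng(1)

# Synthetic right-skewed "prices" (NOT the real housing data).
price = rng.lognormal(mean=13.0, sigma=0.5, size=5000)
log_price = np.log10(price)

print(f"skewness of price:        {skewness(price):.2f}")
print(f"skewness of log10(price): {skewness(log_price):.2f}")

# Base 10 makes values easy to read back in dollars: 5 is $100,000 and 6 is $1,000,000.
recovered = 10 ** log_price
print(f"largest back-transformation error: {np.max(np.abs(recovered - price)):.6f}")
```

A skewness near 0 after the transformation indicates a roughly symmetric distribution, so the largest values no longer sit far from the bulk of the data.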


Ideally, a model that uses transformations should make sense in the context of the data. If we fit a simple linear model for log(price) based on log(size), then a fixed percentage increase in house size is associated with a fixed percentage change in sale price: a \(10\)% increase in house size is associated with roughly a \(\theta \times 10\)% change in sale price. Or, if the model relates log-transformed price to the number of bedrooms, then each additional bedroom is associated with the same percentage change in sale price (with base-10 logs, one more bedroom multiplies the price by \(10^{\theta}\)). Both of these models make sense in the context of house sale prices, so we use them.

Let’s begin by fitting a model that explains log-transformed price by the house’s log-transformed size. But first, we note that this model is still considered a linear model. If we represent sale price by \(y\) and house size by \(x\), then the model is

\[ \begin{aligned} \log(y) ~&=~ \theta_0 + \theta_1\log(x) \end{aligned} \]

Note that we have ignored the variation around the line in this equation to make the linear relationship clearer. This equation may not seem linear, but, if we rename \(\log(y)\) to \(w\) and \(\log(x)\) to \(v\), then we can express this “log-log” relationship as a linear model:

\[ w ~=~ \theta_0 + \theta_1 v \]

That is, this model is linear in the log-transformed variables, \(\log(y)\) and \(\log(x)\).
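
To connect the log-log model back to the original units, we can undo the logs. The short derivation below assumes base-10 logs, as used for price and size in this section, and again ignores the variation around the line:

\[ \begin{aligned} y ~&=~ 10^{\theta_0 + \theta_1\log(x)} ~=~ 10^{\theta_0} \, x^{\theta_1} \end{aligned} \]

So multiplying \(x\) by a factor \(c\) multiplies \(y\) by \(c^{\theta_1}\):

\[ \begin{aligned} 10^{\theta_0} \, (c x)^{\theta_1} ~&=~ c^{\theta_1} \, 10^{\theta_0} \, x^{\theta_1} \end{aligned} \]

For example, with \(c = 1.1\) we have \(1.1^{\theta_1} \approx 1 + 0.1\,\theta_1\), so a \(10\)% increase in \(x\) is associated with roughly a \(\theta_1 \times 10\)% change in \(y\).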

Other examples of models that can be expressed as linear combinations of transformed features appear below.

\[\begin{split} \begin{aligned} \log(y) ~&=~ \theta_0 + \theta_1 x \\ y ~&=~ \theta_0 + \theta_1 x + \theta_2 x^2 \\ y ~&=~ \theta_0 + \theta_1 x + \theta_2 z + \theta_3 x z \end{aligned} \end{split}\]

Again, if we rename \(\log(y)\) to \(w\), \(x^2\) to \(u\), and \(x z\) to \(t\), then we can express each of the above models as linear in these renamed features. The above models are now, in order,

\[\begin{split} \begin{aligned} w ~&=~ \theta_0 + \theta_1 x \\ y ~&=~ \theta_0 + \theta_1 x + \theta_2 u\\ y ~&=~ \theta_0 + \theta_1 x + \theta_2 z + \theta_3 t \\ \end{aligned} \end{split}\]

In short, we can think of models that include nonlinear transformations of features and/or combinations of features as linear in their derived features. In practice, we don’t rename the transformed features when we describe the model; instead, we write the model using the transformations of the original features because it’s important to keep track of the transformations, especially when interpreting the coefficients and checking residual plots.

When we refer to these models, we include mention of the transformations. That is, we call a model log-log when both the outcome and explanatory variables are log-transformed; we say it’s log-linear when the outcome is log-transformed but not the explanatory variable; we describe a model as having polynomial features of, say, degree 2 when the first and second power transformations of the explanatory variable are included; and we say a model includes an interaction term between two explanatory features when the product of these two features is included in the model.
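
The sketch below illustrates the point that all four kinds of models are fit the same way: we build columns of derived features and run ordinary least squares on them. It uses noise-free synthetic data and NumPy's `np.linalg.lstsq`. The helper `fit_linear` and the coefficient values are hypothetical choices for this example, not part of the chapter's analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
z = rng.uniform(0.0, 5.0, size=n)
ones = np.ones(n)  # column of 1s for the intercept

def fit_linear(columns, outcome):
    """Least-squares coefficients for outcome ~ columns (first column is the intercept)."""
    design = np.column_stack(columns)
    theta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return theta

# Noise-free synthetic outcomes, so each fit should recover the coefficients we chose.
y_loglog = 10 ** (3.0 + 0.9 * np.log10(x))         # log-log
y_loglin = 10 ** (1.0 + 0.2 * x)                   # log-linear
y_poly = 2.0 + 0.5 * x - 0.1 * x**2                # polynomial, degree 2
y_inter = 1.0 + 2.0 * x + 3.0 * z + 0.5 * x * z    # interaction between x and z

# Each model is linear in its derived features.
theta_loglog = fit_linear([ones, np.log10(x)], np.log10(y_loglog))
theta_loglin = fit_linear([ones, x], np.log10(y_loglin))
theta_poly = fit_linear([ones, x, x**2], y_poly)
theta_inter = fit_linear([ones, x, z, x * z], y_inter)

for label, theta in [("log-log", theta_loglog), ("log-linear", theta_loglin),
                     ("polynomial", theta_poly), ("interaction", theta_inter)]:
    print(f"{label:>12}: {np.round(theta, 3)}")
```

Only the design matrix and the outcome change from one model to the next; the fitting procedure is the same in every case.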

Let’s fit a log-log model of price on size.

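from sklearn.linear_model import LinearRegression

# Explanatory variable and outcome, both log base 10 (columns of the sfh data frame)
X1_log = sfh[['log_bsqft']]
y_log = sfh[['log_price']]
model1_log_log = LinearRegression().fit(X1_log, y_log)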

The coefficients and predicted values from this model cannot be directly compared to those from a model fitted using the untransformed features, because the units are the log of dollars and the log of square feet, not dollars and square feet.

print(f"Model log(price) ~ log(building size):\n Intercept (log $): {model1_log_log.intercept_}\n",
      f"Coefficient of building size (log $/log ft^2): {model1_log_log.coef_}")

Model log(price) ~ log(building size):
 Intercept (log $): [2.97]
 Coefficient of building size (log $/log ft^2): [[0.9]]
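
To read these coefficients in the original units, we can back-transform a prediction. The sketch below plugs the rounded values printed above into the log-log equation. The house size of 2,000 square feet and the helper `predict_price` are hypothetical choices for illustration, and because the coefficients are rounded, the dollar amounts are only approximate.

```python
import math

# Rounded values copied from the printed model output.
theta_0 = 2.97   # intercept, in log10 dollars
theta_1 = 0.9    # coefficient of log10(building square feet)

def predict_price(bsqft):
    """Predicted sale price in dollars for a house of `bsqft` square feet."""
    log_price = theta_0 + theta_1 * math.log10(bsqft)
    return 10 ** log_price

size = 2000  # hypothetical house size in square feet
base = predict_price(size)
bigger = predict_price(1.1 * size)

print(f"predicted price at {size} sq ft: ${base:,.0f}")
print(f"price ratio for a 10% larger house: {bigger / base:.3f}")
```

The ratio equals \(1.1^{0.9}\) regardless of the starting size, which is the sense in which the log-log model relates percentage changes in size to percentage changes in price.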

Next, we examine the residuals and predicted values with a plot.

The residual plot looks reasonable, but it contains thousands of points, which makes it hard to see curvature.
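
One way to look for curvature when a plot is crowded is to average the residuals within bins of the predicted values. This is a sketch of that idea on synthetic data, not the approach taken in this chapter; the data, the choice of 10 bins, and the use of `np.polyfit` for the fit are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Synthetic log-log data with a truly linear relationship (NOT the real housing data).
log_size = rng.normal(loc=3.2, scale=0.15, size=n)
log_price = 2.97 + 0.9 * log_size + rng.normal(scale=0.1, size=n)

# Fit the line by least squares; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(log_size, log_price, deg=1)
predicted = intercept + slope * log_size
residuals = log_price - predicted   # actual minus predicted

# Split the points into 10 equal-count bins by predicted value and average each bin.
order = np.argsort(predicted)
bins = np.array_split(order, 10)
bin_centers = np.array([predicted[idx].mean() for idx in bins])
bin_means = np.array([residuals[idx].mean() for idx in bins])

for center, mean in zip(bin_centers, bin_means):
    print(f"predicted ~ {center:.2f}: mean residual {mean:+.4f}")
```

If the bin averages stay close to 0 across the range of predicted values, there is little sign of curvature; a systematic rise and fall in the averages would suggest trying a different transformation or adding a polynomial term.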

To see if additional variables might be helpful, we can plot the residuals from the fitted model against a variable that is not in the model. If we see patterns, that indicates we might want to include this additional feature, or a transformation of it. Earlier, we saw that the distribution of price was related to the city where the house is located, so let’s examine the relationship between the residuals and city.


This plot shows us that the distribution of errors appears shifted by city. Ideally, the median of each city’s box plot lines up with 0. Instead, we see that more than 75% of the houses sold in Piedmont have positive errors, meaning the actual sale price is above the predicted value. At the other extreme, roughly 75% of sale prices in Richmond fall below their predicted values. These patterns suggest that we should include city in the model. From a context point of view, it makes sense for location to affect sale price. In the next section, we show how to incorporate a nominal variable in a linear model.
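
The same comparison can be made numerically by summarizing the residuals within each city. The sketch below uses synthetic data in which each city shifts the log price by a made-up amount; the city labels, shifts, and sample size are all hypothetical, so the printed values say nothing about the actual cities discussed above.

```python
import numpy as np

rng = np.random.default_rng(4)
cities = ["city_A", "city_B", "city_C"]                    # hypothetical labels
shift = {"city_A": 0.15, "city_B": 0.0, "city_C": -0.15}   # made-up city effects (log10 $)

n = 3000
city = rng.choice(cities, size=n)
log_size = rng.normal(loc=3.2, scale=0.15, size=n)
city_effect = np.array([shift[c] for c in city.tolist()])
log_price = 2.97 + 0.9 * log_size + city_effect + rng.normal(scale=0.1, size=n)

# Fit log(price) ~ log(size) only, leaving city out of the model.
slope, intercept = np.polyfit(log_size, log_price, deg=1)
residuals = log_price - (intercept + slope * log_size)

# Summarize the residuals by city: quartiles and the share that are positive.
summary = {}
for c in cities:
    r = residuals[city == c]
    q25, q50, q75 = np.percentile(r, [25, 50, 75])
    summary[c] = {"q25": q25, "median": q50, "q75": q75,
                  "frac_positive": float(np.mean(r > 0))}
    print(f"{c}: q25={q25:+.3f} median={q50:+.3f} q75={q75:+.3f} "
          f"positive={summary[c]['frac_positive']:.0%}")
```

When the omitted feature matters, the medians sit well away from 0 and the share of positive residuals differs sharply between groups, which is the same signal the box plots convey.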