
15.7. Feature Engineering for Categorical Measurements

The first model we ever fit was the constant model in Chapter 4. There, we minimized squared loss to find the best fitting constant:

\[ \min_c \sum_i (y_i - c)^2 \]
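As a quick numerical check (a sketch with hypothetical toy responses, not data from this chapter), we can confirm that the constant minimizing squared loss is the mean:

```python
import numpy as np

# Hypothetical toy responses; scan candidate constants for the smallest loss
y = np.array([2.0, 4.0, 9.0])
c_candidates = np.linspace(0, 10, 1001)
losses = [np.sum((y - c) ** 2) for c in c_candidates]
best_c = c_candidates[np.argmin(losses)]

print(best_c)    # the minimizing constant
print(y.mean())  # matches the mean
```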

We can think of including a nominal feature in a model in a similar fashion. That is, we find the best fitting constant to each subgroup of the data. In other words,

\[\begin{split} \begin{aligned} \min_{c_B} \sum_{i \in \textrm{Berkeley}} (y_i - c_B)^2, ~~~&~~~ \min_{c_L} \sum_{i \in \textrm{Lamorinda}} (y_i - c_L)^2, \\ \min_{c_P} \sum_{i \in \textrm{Piedmont}} (y_i - c_P)^2, ~~~&~~~ \min_{c_R} \sum_{i \in \textrm{Richmond}} (y_i - c_R)^2. \end{aligned} \end{split}\]

A much simpler way to describe this model is with one-hot encoding.

One-hot encoding takes a categorical feature and creates multiple numeric features that have only \(0\) and \(1\) values. To one-hot encode a feature, we create new features, one for each unique category. In this case, since we have four cities (Berkeley, Lamorinda, Piedmont, and Richmond), we create four new features in a design matrix called \(X_{\textrm{city}}\). Each row in \(X_{\textrm{city}}\) contains a single value of \(1\), which appears in the column that matches the city; all other columns in that row contain \(0\). We illustrate this process in Figure 15.7.


Fig. 15.7 One-hot encoding for a small example dataset (left) and its resulting design matrix (right).
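To make the encoding concrete, here is a minimal sketch (with a hypothetical toy city column standing in for the dataset's) that builds the same kind of 0-1 columns with pandas:

```python
import pandas as pd

# Hypothetical toy column standing in for the dataset's city feature
toy = pd.DataFrame({"city": ["Berkeley", "Piedmont", "Richmond", "Berkeley"]})

# One 0-1 column per unique category; each row has exactly one 1
dummies = pd.get_dummies(toy["city"]).astype(int)
print(dummies)
```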

Now we can concisely represent the model as follows:

\[\begin{split} y_i ~\approx~ \theta_B x_{i,B} ~+~ \theta_L x_{i,L} \\ ~~~~~~~~~~ ~+~ \theta_P x_{i,P} ~+~ \theta_R x_{i,R} \end{split}\]

(Again, we have ignored the errors in this equation.) Here, we have indexed the columns of the design matrix by \(B\), \(L\), \(P\), and \(R\), rather than \(j\), to make it clear that each column contains 0s and 1s, where, say, a 1 appears in \(x_{i,P}\) if the \(i\)th house is located in Piedmont.


One-hot encoding creates features that have only 0-1 values. These features are known as dummy variables or indicator variables. The term “dummy variable” is more common in econometrics, and the usage of “indicator variable” is more common in statistics.

Our goal is to minimize squared loss over \(\boldsymbol{\theta}\),

\[\begin{split} \begin{aligned} \| \mathbf{y} - \textbf{X}\boldsymbol{\theta} \|^2 &= \sum_i (y_i - \theta_B x_{i,B} ~-~ \theta_L x_{i,L} ~-~ \theta_P x_{i,P} ~-~ \theta_R x_{i,R})^2 \\ & = \sum_{i \in \textrm{Berkeley}} (y_i - \theta_B x_{i,B})^2 ~+~ \sum_{i \in \textrm{Lamorinda}} (y_i -\theta_L x_{i,L})^2 \\ ~~~~~~~~& ~+~ \sum_{i \in \textrm{Piedmont}} (y_i -\theta_P x_{i,P})^2 ~+~ \sum_{i \in \textrm{Richmond}} (y_i -\theta_R x_{i,R})^2 \end{aligned} \end{split}\]

Here \(\boldsymbol{\theta}\) is the column vector \([\theta_B, \theta_L, \theta_P, \theta_R]\).
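We can verify on a tiny hypothetical example that least squares over a design matrix of dummy columns recovers the group means, just as minimizing each subgroup's squared loss separately would:

```python
import numpy as np

# Hypothetical toy data: two groups, encoded with one-hot columns
y = np.array([1.0, 3.0, 10.0, 14.0])
X = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])

# Least squares solution: one coefficient per group
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # the group means
```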

We can use OneHotEncoder to create this design matrix.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

enc = OneHotEncoder(
    # The categories argument sets the column order
    categories=[["Berkeley", "Lamorinda", "Piedmont", "Richmond"]],
    sparse_output=False,
)

X_city = enc.fit_transform(sfh[['city']])

categories_city = ["Berkeley", "Lamorinda", "Piedmont", "Richmond"]
X_city_df = pd.DataFrame(X_city, columns=categories_city)
X_city_df

Berkeley Lamorinda Piedmont Richmond
0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0
... ... ... ... ...
2664 0.0 0.0 0.0 1.0
2665 0.0 0.0 0.0 1.0
2666 0.0 0.0 0.0 1.0

2667 rows × 4 columns

We fit a model using these one-hot encoded features.

from sklearn.linear_model import LinearRegression

y_log = sfh[['log_price']]

model_city = LinearRegression(fit_intercept=False).fit(X_city_df, y_log)

And examine the multiple \(R^2\).

print(f"R-square for city model: {model_city.score(X_city_df, y_log):.2f}\n")
R-square for city model: 0.57

Knowing only the city where the house is located, the model does a reasonably good job of estimating the sale price. Below are the coefficients of the fit.

array([[5.87, 6.03, 6.1 , 5.67]])

As expected from the box plots, the estimated sale price (in log $) depends on the city. But if we also know the size of the house, we should have an even better model. We saw earlier that the simple log-log model that explains sale price by house size fits reasonably well, so we expect that the city feature (as one-hot encoded variables) should further improve the model.

Such a model looks like:

\[\begin{split} y_i ~\approx~ \theta_1x_i + \theta_B x_{i,B} ~+~ \theta_L x_{i,L} \\ ~~~~~~~~~~ ~+~ \theta_P x_{i,P} ~+~ \theta_R x_{i,R} \end{split}\]

Notice that this model describes the relationship between log(price), represented as \(y\), and log(size), represented as \(x\), as linear with the same coefficient for log(size) in each city. But the intercept term depends on the city.

\[\begin{split} \begin{aligned} y_i ~&\approx~ \theta_1x_i + \theta_B ~~&\textrm{for houses in Berkeley} \\ y_i ~&\approx~ \theta_1x_i + \theta_L ~~&\textrm{for houses in Lamorinda}\\ y_i ~&\approx~ \theta_1x_i + \theta_P ~~&\textrm{for houses in Piedmont}\\ y_i ~&\approx~ \theta_1x_i + \theta_R ~~&\textrm{for houses in Richmond} \end{aligned} \end{split}\]

We make a facet of scatter plots, one for each city, to see if this relationship roughly holds.

fig = px.scatter(x=sfh['log_bsqft'], y=sfh['log_price'], 
                 facet_col=sfh['city'], facet_col_wrap=2,
                 labels={"y": "price (log $)",
                         "x": "building size (sq ft)"})

The shift is evident in the scatter plot, but it may be the case that the slopes also differ by city. We concatenate our two design matrices together to fit the model that includes size and city.

X_size = sfh[['log_bsqft']] 

X_city_size = pd.concat([X_size.reset_index(drop=True), X_city_df], axis=1)
log_bsqft Berkeley Lamorinda Piedmont Richmond
1 3.14 1.0 0.0 0.0 0.0
2 3.31 1.0 0.0 0.0 0.0
3 2.96 1.0 0.0 0.0 0.0
... ... ... ... ... ...
2664 3.16 0.0 0.0 0.0 1.0
2665 3.47 0.0 0.0 0.0 1.0
2666 3.44 0.0 0.0 0.0 1.0

2666 rows × 5 columns

model_city_size = LinearRegression(fit_intercept=False).fit(X_city_size, y_log)

The intercepts reflect which cities have more expensive houses, even taking into account the size of the house.

array([[0.62, 3.89, 3.98, 4.03, 3.75]])
print(f"R-square for city and log(size): {model_city_size.score(X_city_size, y_log):.2f}\n")
R-square for city and log(size): 0.79

This fit, which includes the nominal variable city and the log-transformed house size, is better than both the simple log-log model with house size and the model that fits constants for each city.

Notice that we dropped the intercept from the model so that each subgroup has its own intercept. However, we usually remove one of the feature columns from the design matrix and keep the intercept. For example, if we drop the feature for Berkeley houses and add the intercept, then the model is:

\[\begin{split} y_i ~\approx~ \theta_0 ~+~ \theta_1x_i ~+~ \theta_L x_{i,L} \\ ~~~~~~~~~~ ~+~ \theta_P x_{i,P} ~+~ \theta_R x_{i,R} \end{split}\]

The meaning of the coefficients for the dummy variables has changed in this representation. For example, consider this equation for a house in Berkeley and a house in Piedmont:

\[\begin{split} \begin{aligned} y_i ~&\approx~ \theta_0 + \theta_1x_i ~~&\textrm{for a house in Berkeley} \\ y_i ~&\approx~ \theta_0 ~+~ \theta_1x_i + \theta_P ~~&\textrm{for a house in Piedmont} \end{aligned} \end{split}\]

In this representation, the intercept is for Berkeley houses, and the coefficient \(\theta_P\) measures the typical difference between a Piedmont house and a Berkeley house. This makes it easier to compare \(\theta_P\) to 0, to see whether these two cities have essentially the same model.

If we had included the intercept and all of the city variables, then the columns of the design matrix would be linearly dependent, which means that we couldn't solve for the coefficients. Our predictions would be the same in either case, but there would not be a unique solution to the minimization. We've left this proof as an exercise.
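A small sketch (with a hypothetical four-row design matrix) shows this linear dependence: the dummy columns sum to the intercept column, so the matrix loses a rank:

```python
import numpy as np

# Hypothetical design matrix: intercept column plus all four city dummies
X = np.array([[1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 0, 1]])

# The four dummy columns sum to the intercept column,
# so the five columns are linearly dependent
print(np.linalg.matrix_rank(X))  # 4, not 5
```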

We also prefer the representation that drops one dummy variable and includes an intercept term when we include one-hot encodings of two categorical variables. This practice maintains consistency in the interpretation of the coefficients.


We demonstrate how to build a model with two sets of dummy variables, using the statsmodels library and its formula language.

import statsmodels.formula.api as smf

We first fit the model with the nominal variable city and house size again to show how to use the formula.

model_size_city = smf.ols(formula='log_price ~ log_bsqft + city',
                          data=sfh).fit()

The value for the formula parameter describes the model to fit. The model has log_price as the outcome and fits a linear combination of log_bsqft and city as explanatory variables. Notice that we do not need to create dummy variables to fit the model; smf takes care of one-hot encoding the city feature. The fitted coefficients of the model, shown below, include an intercept term and drop the Berkeley indicator variable.

Intercept            3.89
city[T.Lamorinda]    0.09
city[T.Piedmont]     0.14
city[T.Richmond]    -0.15
log_bsqft            0.62
dtype: float64

If we want to drop the intercept, we add -1 to the formula, and a 0-1 variable is included for all four categories.

smf.ols(formula='log_price ~ log_bsqft + city - 1', data=sfh).fit().params
city[Berkeley]     3.89
city[Lamorinda]    3.98
city[Piedmont]     4.03
city[Richmond]     3.75
log_bsqft          0.62
dtype: float64
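The two parameterizations are related by simple arithmetic: adding a dummy coefficient to the intercept recovers that category's per-group constant (for example, 3.89 + 0.09 = 3.98 for Lamorinda). A sketch with hypothetical toy data confirms this:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with one categorical feature
df = pd.DataFrame({
    "y": [1.0, 2.0, 4.0, 5.0, 7.0, 9.0],
    "g": ["a", "a", "b", "b", "c", "c"],
})

with_int = smf.ols("y ~ g", data=df).fit().params    # drops category a
no_int = smf.ols("y ~ g - 1", data=df).fit().params  # one constant per category

# Intercept plus dummy coefficient equals the per-category constant
print(with_int["Intercept"] + with_int["g[T.b]"])  # equals no_int["g[b]"]
```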

Additionally, we can add interaction terms between the city and size variables, which allows each city to have both a different intercept and a different coefficient for size. We can easily specify this in the formula by adding the term log_bsqft:city.

model_size_city_interact = \
    smf.ols(formula='log_price ~ log_bsqft + city + log_bsqft:city',
            data=sfh).fit()
log_bsqft                      0.59
log_bsqft:city[T.Lamorinda]    0.21
log_bsqft:city[T.Piedmont]     0.21
log_bsqft:city[T.Richmond]    -0.05
dtype: float64

We do see that the effect of size on price is nearly the same for Berkeley and Richmond, but it is larger for Lamorinda and Piedmont. However, when we examine the multiple \(R^2\), we find that the model is not much improved over the simpler one where all cities had the same size effect.


Since the multiple \(R^2\) is adjusted for the number of parameters in the model, any gain in adding 3 more terms to the model is offset by the adjustment.

We demonstrate how to fit a model with two categorical variables: the number of bedrooms and the city. Recall that we earlier reassigned bedroom counts above 6 to 6, which essentially collapses 6, 7, 8, … into the category 6+. Below is a box plot of price (log $) by bedrooms.

fig = px.box(sfh, x="br", y="log_price")


The relationship does not appear linear: for each additional bedroom, the sale price does not increase by the same amount. Given that the number of bedrooms is discrete, we can treat this feature as categorical, which allows each bedroom encoding to contribute a different amount to the cost.

model_size_city_br = smf.ols(formula='log_price ~ log_bsqft + city + C(br)', data=sfh).fit()

Since the number of bedrooms has a numeric data type, we have used the term C(br) in the formula to indicate that we want the number of bedrooms to be treated as a categorical variable.
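To see what C() changes, here is a sketch with hypothetical toy data: treated as numeric, br contributes a single slope; wrapped in C(), it contributes one dummy per level:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: a numeric bedroom-count feature with three levels
df = pd.DataFrame({"y": [1.0, 2.0, 5.0, 6.0, 7.0, 9.0],
                   "br": [2, 2, 3, 3, 4, 4]})

numeric = smf.ols("y ~ br", data=df).fit().params         # intercept + one slope
categorical = smf.ols("y ~ C(br)", data=df).fit().params  # intercept + dummies

print(len(numeric))      # 2
print(len(categorical))  # 3 (intercept plus dummies for levels 3 and 4)
```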

Let’s examine the coefficients and the multiple \(R^2\) from the fit.

Intercept            3.85
city[T.Lamorinda]    0.08
city[T.Piedmont]     0.14
C(br)[T.5.0]         0.02
C(br)[T.6.0]        -0.04
log_bsqft            0.64
Length: 10, dtype: float64

Again, the multiple \(R^2\) has not increased because it is adjusted for the number of parameters in the model. This time, we added 5 more terms and do no better than the model based on size and city alone.

In this section, we introduced feature engineering for qualitative features. We saw how the one-hot encoding technique lets us include categorical data in linear models, and it gives a natural interpretation of model parameters.