21.4. Modeling#

Now that we’ve obtained, cleaned, and explored our data, let’s fit models to predict whether articles are real or fake. In this section, we use logistic regression because we have a binary classification problem. We fit three different models that increase in complexity. First, we fit a model that just uses the presence of a single hand-picked word in the document as an explanatory feature. Then, we fit a model that uses multiple hand-picked words. Finally, we fit a model that uses all the words in the training set, vectorized using the tf-idf transform (first introduced in Chapter 13). Let’s start with the simple single-word model.

21.4.1. A Single-Word Model#

Our EDA showed that the word vote is related to whether an article is labeled real or fake. To test this, we fit a logistic regression model using a single binary feature: 1 if the word vote appears in the article and 0 if not. We start by defining a function to lowercase the article content:

def lowercase(df):
    return df.assign(content=df['content'].str.lower())

For our first classifier, we only use the word vote:

one_word = ['vote']

We can chain the lowercase function and the function make_word_features from our EDA into a scikit-learn pipeline. This provides a convenient way to transform and fit data all at once:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import FunctionTransformer
model1 = make_pipeline(
    FunctionTransformer(lowercase),
    FunctionTransformer(make_word_features, kw_args={'words': one_word}),
    LogisticRegressionCV(Cs=10, solver='saga', n_jobs=4, max_iter=10000),
)

When used, the pipeline above converts the characters in the article content to lowercase, creates a word features data frame, and fits a logistic regression model on the data using \( L_2 \) regularization. Additionally, LogisticRegressionCV uses cross-validation (five-fold by default) to select the best regularization parameter. (See Chapter 16 for more on regularization and cross-validation.)
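The make_word_features function comes from our EDA and isn't reproduced in this section. As a reminder of what the pipeline expects, a minimal sketch consistent with how we use it might look like the following (the EDA version may differ in details, e.g., it could match whole words rather than substrings):

```python
import pandas as pd

def make_word_features(df, words):
    # One 0/1 column per word: 1 if the word appears in the article's content.
    # Note: a simple substring check, so 'vote' also matches 'voter'.
    return pd.DataFrame({
        word: df['content'].str.contains(word, regex=False).astype(int)
        for word in words
    })
```

For example, `make_word_features(lowercase(df), words=['vote'])` produces a one-column data frame of 0s and 1s, which is what the logistic regression step consumes.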

Let’s use the pipeline to fit the training data:


model1.fit(X_train, y_train)
print(f'{model1.score(X_train, y_train):.1%} accuracy on training set.')
64.9% accuracy on training set.
CPU times: user 116 ms, sys: 35.2 ms, total: 151 ms
Wall time: 133 ms

Overall, the single-word classifier only classifies 65% of articles correctly. We plot the confusion matrix of the classifier on the training set to see what kinds of mistakes it makes:

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    model1, X_train, y_train, cmap=plt.cm.Blues, colorbar=False
)
Our model often misclassifies real articles (0) as fake (1). Since this model is simple, we can take a look at the probabilities for the two cases: the word “vote” is in the article or not:

vote_present = pd.DataFrame({'vote': [1]})
vote_absent = pd.DataFrame({'vote': [0]})

log_reg = model1.named_steps['logisticregressioncv']
print(f'"vote" present: {log_reg.predict_proba(vote_present)}')
print(f'"vote" absent: {log_reg.predict_proba(vote_absent)}')
"vote" present: [[0.72 0.28]]
"vote" absent: [[0.48 0.52]]

When an article contains the word “vote”, the model gives a high probability to the article being real, and when “vote” is absent, the probability leans slightly toward the article being fake. We encourage readers to verify this for themselves using the definition of the logistic regression model and the fitted coefficients:

print(f'Intercept: {log_reg.intercept_[0]:.2f}')
[[coef]] = log_reg.coef_
print(f'"vote" Coefficient: {coef:.2f}')
Intercept: 0.08
"vote" Coefficient: -1.00

As we saw in Chapter 19, a coefficient gives the change in the odds for a unit change in the explanatory variable. With a 0-1 variable like the presence or absence of a word in an article, this has a particularly intuitive meaning: for an article with “vote” in it, the odds of being fake change by a factor of \(\exp(\theta_{vote})\), which is:

np.exp(coef)
0.37
Remember that in this modeling scenario, a label of 0 corresponds to a real article and a label of 1 corresponds to a fake article. This might seem a bit counterintuitive—we’re saying that a “true positive” is when a model correctly predicts a fake article as fake. In binary classification, we typically say a “positive” result is the one with the presence of something unusual. For example, a person who tests positive for an illness would expect to have it.

Let’s make our model a bit more sophisticated by introducing additional word features.

21.4.2. A Multiple-Word Model#

We create a model that uses all of the words we examined in our EDA of the training set, except for “the”. Let’s fit a model using these 15 features:

model2 = make_pipeline(
    FunctionTransformer(lowercase),
    FunctionTransformer(make_word_features, kw_args={'words': word_features}),
    LogisticRegressionCV(Cs=10, solver='saga', n_jobs=4, max_iter=10000),
)

model2.fit(X_train, y_train)
print(f'{model2.score(X_train, y_train):.1%} accuracy on training set.')
74.8% accuracy on training set.
CPU times: user 1.45 s, sys: 22 ms, total: 1.47 s
Wall time: 687 ms

This model is about 10 percentage points more accurate than the one-word model. It may seem a bit surprising that going from a one-word to a 15-word model only gains 10 percentage points. The confusion matrix is helpful in teasing out the kinds of errors made:

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    model2, X_train, y_train, cmap=plt.cm.Blues, colorbar=False
)
We can see that this classifier does a better job classifying real articles accurately. However, it makes more mistakes than the simple one-word model when classifying fake articles – 59 of the fake articles were classified as real. In this scenario, we might be more concerned about misclassifying an article as fake when it is real. So we wish to have a high precision: the ratio of fake articles correctly predicted as fake to articles predicted fake:

model1_precision = 238 / (238 + 179)
model2_precision = 205 / (205 + 88)

[round(num, 2) for num in [model1_precision, model2_precision]]
[0.57, 0.7]
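The same calculation can be written as a small helper, with the counts read off the two confusion matrices above:

```python
def precision(tp, fp):
    # Of the articles predicted fake (positive), the fraction that truly are fake
    return tp / (tp + fp)

# Counts taken from the confusion matrices above
print(round(precision(tp=238, fp=179), 2))  # model1
print(round(precision(tp=205, fp=88), 2))   # model2
```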

The precision of our larger model is improved, but about 30% of the articles it labels fake are actually real. Let’s take a look at the model’s coefficients:

log_reg2 = model2.named_steps['logisticregressioncv']
coefs = (pd.DataFrame({'word': word_features, 'coef': log_reg2.coef_[0]})
         .sort_values('coef'))
fig = px.scatter(coefs, x='coef', y='word',
                 labels={'coef': 'Model Coefficient'},
                 width=350, height=250)
fig.add_vline(x=0, line_width=2, line_dash="dash", opacity=1)

We can make a quick interpretation of the coefficients by looking at their signs. The large positive values on trump and investig indicate that the model predicts that news articles containing these words have a higher probability of being fake. The reverse is true for words like congress and vote, which have negative weights. We can use these coefficients to compare the odds that an article is fake when it does or does not contain a particular word.
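To make that comparison concrete: containing a word multiplies the odds of being fake by exp(coef). With illustrative coefficient values (hypothetical numbers in the spirit of the fitted coefficients, not the exact ones):

```python
import numpy as np

# Illustrative coefficients, not the fitted values
coef_investig = 1.1   # positive: word raises the odds of "fake"
coef_congress = -0.8  # negative: word lowers the odds of "fake"

print(np.exp(coef_investig))  # odds multiplied by about 3.0
print(np.exp(coef_congress))  # odds multiplied by about 0.45
```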

Although this larger model performs better than the simple one-word model, we had to hand-pick the word features using our knowledge of the news. What if we missed the words that are highly predictive? To address this, we can incorporate all the words in the articles using the tf-idf transform.

21.4.3. Predicting with the tf-idf Transform#

For the third and final model, we use the term frequency-inverse document frequency (tf-idf) transform from Chapter 13 to vectorize the entire text of all articles in the training set. Recall that with this transform, an article is converted into a vector with one element for each word that appears in any of the 584 training articles. Each element holds the count of the number of times the word appears in the article, weighted by the rarity of the word: tf-idf puts more weight on words that appear in only a few documents. This means that our classifier uses all the words in the training set’s news articles for prediction. As we did when we introduced tf-idf, we first remove stopwords, then tokenize the words, then use the TfidfVectorizer from scikit-learn.

import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(nltk.corpus.stopwords.words('english'))
porter_stemmer = PorterStemmer()

def stemming_tokenizer(document):
    return [porter_stemmer.stem(word)
            for word in nltk.word_tokenize(document)
            if word not in stop_words]
tfidf = TfidfVectorizer(tokenizer=stemming_tokenizer, token_pattern=None)
from sklearn.compose import make_column_transformer

model3 = make_pipeline(
    FunctionTransformer(lowercase),
    make_column_transformer((tfidf, 'content')),
    LogisticRegressionCV(Cs=10, solver='saga', n_jobs=4, max_iter=10000),
    verbose=True,
)

model3.fit(X_train, y_train)
print(f'{model3.score(X_train, y_train):.1%} accuracy on training set.')
[Pipeline]  (step 1 of 3) Processing functiontransformer, total=   0.0s
[Pipeline] . (step 2 of 3) Processing columntransformer, total=  14.1s
[Pipeline]  (step 3 of 3) Processing logisticregressioncv, total=   6.6s
100.0% accuracy on training set.
CPU times: user 51.1 s, sys: 531 ms, total: 51.6 s
Wall time: 34.6 s

We find that this model achieves 100% accuracy on the training set. We can take a look at the tf-idf transformer to better understand the model. Let’s start by finding out how many unique tokens the classifier uses:

tfidf = model3.named_steps.columntransformer.named_transformers_.tfidfvectorizer
n_unique_tokens = len(tfidf.vocabulary_.keys())
print(f'{n_unique_tokens} tokens appeared across {len(X_train)} examples.')
23812 tokens appeared across 584 examples.

This means that our classifier has 23,812 features, a large increase from our previous model, which had only 15. Since we can’t display that many model weights, we display the ten most negative and ten most positive weights below:

log_reg = model3.named_steps.logisticregressioncv
coefs = (pd.DataFrame({'word': tfidf.get_feature_names_out(),
                       'coef': log_reg.coef_[0]})
         .sort_values('coef')
         .reset_index(drop=True))
fig1 = px.scatter(coefs[:10][::-1], x='coef', y='word')
fig2 = px.scatter(coefs[-10:], x='coef', y='word')
fig = left_right(fig1, fig2, horizontal_spacing=0.15)
fig.update_xaxes(title='Model Coefficient')
fig.update_yaxes(title='word', row=1, col=1)

These coefficients reveal a few quirks of this model. We see that several influential features correspond to punctuation in the original text. It’s unclear whether we should clean the punctuation out of the model. On one hand, punctuation doesn’t seem to convey as much meaning as words do. On the other, it seems plausible that, for example, lots of exclamation points in an article could help a model decide whether the article is real or fake. In this case, we’ve decided to keep punctuation, but curious readers can repeat this analysis after stripping the punctuation out to see how the resulting model is affected.

We conclude by displaying the test set accuracy for all three models:

pd.DataFrame({
    'test set accuracy': [model1.score(X_test, y_test),
                          model2.score(X_test, y_test),
                          model3.score(X_test, y_test)]
}, index=['model1', 'model2', 'model3'])
test set accuracy
model1 0.61
model2 0.70
model3 0.88

As we might expect, the models became more accurate as we introduced more features. The model that used tf-idf performed much better than the models with binary hand-picked word features, although it did not match the 100% accuracy it obtained on the training set. This illustrates a common tradeoff in modeling: given enough data, more complex models can often outperform simpler ones, especially in situations like this case study where simpler models have too much model bias to perform well. However, complex models can also be more difficult to interpret. For example, our tf-idf model has 23,812 features, which makes it essentially impossible to explain how the model makes its decisions. In addition, the tf-idf model takes much longer to make predictions: it’s over 100 times slower than model 2. All of these factors need to be considered when deciding which model to use in practice.

In addition, we need to be careful about what our models are useful for. In this case, our models use the content of the news articles for prediction, making them highly dependent on the words that appear in the training set. However, our models will likely not perform as well on future news articles that use words that didn’t appear in the training set. For example, our models use the US election candidates’ names in 2016 for prediction, but won’t know to incorporate the names of the candidates in 2020 or 2024. To use our models in the longer term, we would need to address this issue of drift.

That said, it’s surprising that a logistic regression model can perform well with a relatively small amount of feature engineering (tf-idf). We’ve addressed our original research question: our tf-idf model appears effective for detecting fake news in our dataset, and could plausibly generalize to other news published in the same time period covered in the training data.