21. Case Study: Detecting Fake News


This chapter is under development. When it’s finished, this note will be removed.

Fake news—false information created in order to deceive others—is an important issue because it can harm people. For example, the social media post in Figure 21.1 confidently stated that hand sanitizer doesn’t work on coronaviruses. Though factually incorrect, it spread through social media anyway: it was shared nearly 100,000 times and was likely seen by millions of people.


Fig. 21.1 A popular post on Twitter from March 2020 falsely claimed that sanitizer doesn’t kill coronavirus.

Can we create models to automatically detect fake news? In this case study, we’ll use logistic regression to predict whether news articles are real or fake. As usual, we’ll go through the steps of the data science lifecycle. We’ll start by refining our research question and obtaining a dataset of news articles and labels. Then, we’ll wrangle and transform the data. Next, we’ll explore the data to understand its content and devise features that we can use for modeling. Finally, we’ll fit logistic regression models and evaluate their performance.
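To make the idea concrete before we begin, here is a minimal sketch of the kind of model we have in mind: logistic regression on word-presence features. Everything in it is an assumption made for illustration. The six text snippets and their labels are invented (they are not the dataset used in this case study), the `vocab` list is a hypothetical hand-picked set of words, and the model is fit with a small hand-written gradient descent loop so that the example has no dependencies. It is not the fitting procedure used later in the chapter.

```python
import math

# Made-up toy snippets for illustration only (NOT the case study's dataset).
# Label 1 = "fake", 0 = "real"; both texts and labels are invented.
docs = [
    ("shocking secret cure doctors hate", 1),
    ("miracle cure shocking truth revealed", 1),
    ("secret plot revealed shocking", 1),
    ("senate passes budget bill", 0),
    ("city council approves budget", 0),
    ("senate committee holds hearing", 0),
]

# Hypothetical hand-picked words; each feature is 1 if the word appears.
vocab = ["shocking", "secret", "cure", "senate", "budget"]

def featurize(text):
    """Turn a text into a list of 0/1 word-presence features."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Predicted probability that features x belong to class 1 ("fake")."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit(X, y, lr=0.5, epochs=1000):
    """Fit logistic regression by batch gradient descent on the average log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

X = [featurize(text) for text, _ in docs]
y = [label for _, label in docs]
weights, bias = fit(X, y)

# Classify as "fake" when the predicted probability is at least 0.5.
preds = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
```

On this toy data the fitted weights are positive for words that appear only in the "fake" snippets and negative for words that appear only in the "real" ones. Note that `preds` is computed on the same snippets used for fitting, so it says nothing about how well such a model would generalize to new articles.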

We’ve included this case study because it lets us reiterate several important ideas in data science. First, natural language data appear often, and even basic techniques can enable interesting analyses. Second, model selection is an important part of data analysis, and this case study lets us apply what we’ve learned about cross-validation and the bias-variance tradeoff. Finally, even models that perform well on the test set might have inherent limitations when we try to use them in practice, as we’ll soon see.

Let’s start by refining our research question and understanding the scope of our data.