21.1. Question and Scope#

Our initial research question is: can we automatically detect fake news? To refine this question, we consider the kind of information that we might use to build a model for detecting fake news. If we have hand-classified news stories where people have read each story and determined whether it is fake or not, then our question becomes: can we build a model to accurately predict whether a news story is fake based on its content?

To address this question, we can use the FakeNewsNet data repository as described in Shu et al. This repository contains content from news and social media websites, as well as metadata like user engagement metrics. For simplicity, we only look at the dataset’s political news articles. This subset of the data includes only articles that were fact-checked by Politifact, a non-partisan organization with a good reputation. Each article in the dataset has a “real” or “fake” label based on Politifact’s evaluation, which we use as the ground truth.
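To make the structure of these data concrete, here is a minimal sketch of assembling a labeled table of articles with `pandas`. The tiny inline tables are toy stand-ins for the repository's fake and real article files (the actual files contain Politifact-checked articles with titles, URLs, and metadata); the column names are illustrative assumptions, not the repository's exact schema.

```python
import pandas as pd

# Toy stand-ins for the repository's tables of fake and real articles
# (the actual files hold Politifact-checked articles with titles, URLs, etc.).
fake = pd.DataFrame({"title": ["Claim A", "Claim B"]})
real = pd.DataFrame({"title": ["Story C"]})

# Attach the ground-truth label from Politifact's evaluation,
# then stack both tables into one table of articles we can analyze.
fake["label"] = "fake"
real["label"] = "real"
articles = pd.concat([fake, real], ignore_index=True)

print(articles["label"].value_counts().to_dict())  # {'fake': 2, 'real': 1}
```

With the real files, the same `concat` pattern produces a single table where the `label` column serves as the response variable for a classifier.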

Politifact uses a non-random sampling method to select articles to fact-check. According to their website, their journalists select the “most newsworthy and significant” claims each day. Politifact started in 2007 and the repository was published in 2020, so most of the articles were published between 2007 and 2020.

Summarizing this information, we determine that the target population consists of all political news stories published online in the time period from 2007 to 2020 (we would also want to list the sources of the stories). The access frame is determined by Politifact’s identification of the most newsworthy claims of the day. So, the main sources of bias for this data include:

  • coverage bias: The news outlets are limited to those Politifact monitored, which may miss obscure or short-lived sites.

  • selection bias: The data are limited to articles Politifact decided were interesting enough to fact-check, which means that articles might skew towards ones that are both widely shared and controversial.

  • measurement bias: The fake/real labels are determined by a single organization (Politifact) and reflect any biases, unintentional or otherwise, in that organization’s fact-checking methodology.

  • drift: Since we have only articles published between 2007 and 2020, the content is likely to drift over that period. Topics rise and fall in popularity, and the subjects of fabricated stories change, as news trends evolve rapidly.
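One quick way to gauge drift is to tabulate articles by publication year: if most articles cluster in a few years, the topics of those years will dominate the dataset. The sketch below uses a hypothetical `publish_date` column and made-up dates; in practice these would come from the dataset's metadata.

```python
import pandas as pd

# Hypothetical publish dates; the real values would come from the
# dataset's article metadata.
articles = pd.DataFrame({
    "publish_date": pd.to_datetime(
        ["2008-05-01", "2012-06-15", "2012-11-03", "2019-09-20"]
    ),
    "label": ["real", "real", "fake", "fake"],
})

# Count articles per year as a simple drift check: an uneven spread
# means some years' news topics are overrepresented.
per_year = articles["publish_date"].dt.year.value_counts().sort_index()
print(per_year.to_dict())  # {2008: 1, 2012: 2, 2019: 1}
```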

We will keep these limitations of the data in mind as we begin to wrangle it into a form that we can analyze.