21.1. Question and Scope

Our initial research question is: can we create models to automatically detect fake news?

To address this question, we’ll use the FakeNewsNet dataset [Shu et al., 2020]. This dataset contains content from news and social media websites, as well as metadata like user engagement metrics. For simplicity, we’ll look only at the dataset’s political news articles. The dataset includes only articles that were fact-checked by Politifact, a non-partisan organization with a good reputation. Each article in the FakeNewsNet dataset has a “real” or “fake” label based on Politifact’s evaluation, which we’ll use as the ground truth for whether an article is true or false.
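Before modeling, it’s worth checking how the “real” and “fake” labels are distributed, since a skewed split can make accuracy a misleading metric. The sketch below illustrates this check on a tiny made-up table; the column names (`title`, `label`) are assumptions for illustration and the actual FakeNewsNet files use their own schema.

```python
import pandas as pd

# Toy stand-in for the political-news subset of FakeNewsNet.
# Column names here are illustrative assumptions, not the real schema.
articles = pd.DataFrame({
    "title": ["Senator claims X", "Candidate said Y", "Bill passed Z"],
    "label": ["fake", "real", "fake"],
})

# Proportion of each label -- a first look at class balance.
label_counts = articles["label"].value_counts(normalize=True)
print(label_counts)
```

With real data, a strong imbalance here would suggest reporting precision and recall (or a balanced accuracy) alongside raw accuracy.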

Note that Politifact uses a non-random sampling method to select articles to fact-check. According to their website, their journalists select the “most newsworthy and significant” claims each day. Because Politifact started in 2007, the articles in the dataset were all published between 2007 and 2022.

With this in mind, the main sources of bias for this data include:

  • coverage and selection bias: The data is limited to the articles Politifact decided were interesting enough to fact-check, so the articles likely skew toward ones that were widely shared and controversial.

  • label bias: The labels come from a single organization (Politifact), so they also reflect any biases in that organization’s fact-checking methodology.