21.2. Obtaining and Wrangling the Data

Let’s get the data into Python. The dataset’s GitHub page provides a script that downloads the data; running it for the Politifact data takes about an hour. After running the script, we placed the data files into the data/politifact folder. The articles that Politifact labeled as fake and real are in data/politifact/fake and data/politifact/real, respectively. Let’s take a look at one of the articles labeled real.

!ls -l data/politifact/real | head -n 5
total 0
drwxr-xr-x  2 sam  staff  64 Jul 14 15:23 politifact100
drwxr-xr-x  3 sam  staff  96 Jul 14 15:48 politifact1013
drwxr-xr-x  3 sam  staff  96 Jul 14 15:37 politifact1014
drwxr-xr-x  2 sam  staff  64 Jul 14 15:28 politifact10185

!ls -lh data/politifact/real/politifact1013/
total 16
-rw-r--r--  1 sam  staff   5.7K Jul 14 15:48 news content.json
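
Each labeled article lives in its own politifact<id> folder. As a quick check on how many articles we have per class, we could count these folders with pathlib; this is a minimal sketch that assumes the directory layout shown above:

from pathlib import Path

# Count the article folders under each label
for label in ['fake', 'real']:
    folders = [p for p in Path('data/politifact', label).iterdir() if p.is_dir()]
    print(f'{label}: {len(folders)} article folders')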

Each article’s data is stored in a JSON file named news content.json. Let’s load the JSON for one article into a Python dictionary.

import json
from pathlib import Path

article_path = Path('data/politifact/real/politifact1013/news content.json')
article_json = json.loads(article_path.read_text())

Below, we’ve displayed the keys and values in article_json as a table:

import pandas as pd

pd.DataFrame(article_json.items(), columns=['key', 'value']).set_index('key')
                                                          value
key
url           http://www.senate.gov/legislative/LIS/roll_cal...
text          Roll Call Vote 111th Congress - 1st Session\n\...
images        [http://statse.webtrendslive.com/dcs222dj3ow9j...
top_img       http://www.senate.gov/resources/images/us_sen.ico
keywords      []
authors       []
title         U.S. Senate: U.S. Senate Roll Call Votes 111th...
meta_data     {'viewport': 'width=device-width, initial-scal...
movies        []
publish_date  None
source        http://www.senate.gov
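
Because article_json is an ordinary Python dictionary, we can pull out individual fields by key. For example, using the values shown in the table above:

print(article_json['source'])        # http://www.senate.gov
print(article_json['publish_date'])  # None for this article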

There are many fields in the JSON file, but this analysis will only look at a few: the article’s title, text, url, and publish_date. We’ll create a dataframe where each row represents one article. To do this, we’ll load each available JSON file into a Python dictionary, extract the fields of interest, and store the results in a pandas dataframe named df_raw.

def df_row(content_json):
    return {
        'url': content_json['url'],
        'text': content_json['text'],
        'title': content_json['title'],
        'publish_date': content_json['publish_date'],
    }

def load_json(folder, label):
    # Each article folder contains a single file named news content.json
    filepath = folder / 'news content.json'
    data = df_row(json.loads(filepath.read_text())) if filepath.exists() else {}
    return {
        **data,
        'label': label,
    }

fakes = Path('data/politifact/fake')
reals = Path('data/politifact/real')

df_raw = pd.DataFrame([load_json(path, 'fake') for path in fakes.iterdir()] +
                      [load_json(path, 'real') for path in reals.iterdir()])
# Raw data from JSON, without any processing
df_raw
url text title publish_date label
0 dailybuzzlive.com/cannibals-arrested-florida/ Police in Vernal Heights, Florida, arrested 3-... Cannibals Arrested in Florida Claim Eating Hum... 1.62e+09 fake
1 https://web.archive.org/web/20171228192703/htt... WASHINGTON — Rod Jay Rosenstein, Deputy Attorn... BREAKING: Trump fires Deputy Attorney General ... 1.45e+09 fake
2 https://web.archive.org/web/20160924061356/htt... Keanu Reeves has long been known to be a stell... Keanu Reeves Shook The World With Another POWE... 1.46e+09 fake
... ... ... ... ... ...
1053 NaN NaN NaN NaN real
1054 https://web.archive.org/web/20090701202353/htt... This is a rush transcript from "On the Record,... An Open Letter to 'All Barack Channel' - Greta... NaN real
1055 https://web.archive.org/web/20130209000637/htt... The State of the Union 2012\n\n“We can either ... State of the Union 2013 NaN real

1056 rows × 5 columns

Exploring this dataframe reveals some issues we’d like to address before continuing with the analysis (the quick checks after this list quantify the first two). For example:

  1. Some articles couldn’t be downloaded. When this happened, the url column contains NaN.

  2. Some articles don’t have text (e.g. a webpage with only video content). We’ll drop these articles from our analysis.

  3. The publish_date column stores timestamps in Unix format (seconds since the Unix epoch), not as pandas.Timestamp objects.

  4. The url column has the full URL (e.g. dailybuzzlive.com/cannibals-arrested-florida/), but we’re interested in the base URL (e.g. dailybuzzlive.com).

  5. Some articles were downloaded from an archival website (web.archive.org). When this happens, we want to recover the actual base URL by removing the web.archive.org prefix from the original URL.

  6. For our classifier, we want to concatenate the title and text columns into a single content column.
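
Before wrangling, we can quantify the first two issues with a couple of quick checks. This sketch relies only on the df_raw column names introduced above:

# How many articles failed to download (missing url)?
num_missing_url = df_raw['url'].isna().sum()

# How many articles downloaded successfully but contain no text?
num_empty_text = (df_raw['text'].str.strip() == '').sum()

print(f'Missing url: {num_missing_url}')
print(f'Empty text:  {num_empty_text}')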

We can tackle these data issues using a combination of pandas functions and regular expressions. After data wrangling, we end up with the following dataframe named df:

import re

# Regular expressions for cleaning up URLs. The exact patterns below are
# one reasonable choice for this dataset:
# - archive_prefix_re strips a leading https://web.archive.org/web/<digits>/ prefix
# - site_prefix_re strips a leading http:// or https://
# - port_re strips a trailing :<port> from the site name
archive_prefix_re = re.compile(r'https://web\.archive\.org/web/\d+/')
site_prefix_re = re.compile(r'https?://')
port_re = re.compile(r':\d+')

def combine_content(df):
    return df.assign(content=df['title'] + ' ' + df['text'])

def drop_nans(df):
    return df[~(df['url'].isna() |
                (df['text'].str.strip() == '') |
                (df['title'].str.strip() == ''))]

def url_basename(url):
    if archive_prefix_re.match(url):
        url = archive_prefix_re.sub('', url)
    site = site_prefix_re.sub('', url).split('/')[0]
    return port_re.sub('', site)

def subset_df(df):
    return df[['timestamp', 'baseurl', 'content', 'label']]

df = (df_raw
      .pipe(drop_nans)
      .assign(baseurl=lambda df: df['url'].apply(url_basename))
      .assign(timestamp=lambda df: pd.to_datetime(df['publish_date'],
                                                  unit='s', errors='coerce'))
      .pipe(combine_content)
      .pipe(subset_df))
df
timestamp baseurl content label
0 2021-04-05 16:39:51 dailybuzzlive.com Cannibals Arrested in Florida Claim Eating Hum... fake
1 2016-01-01 23:17:43 houstonchronicle-tv.com BREAKING: Trump fires Deputy Attorney General ... fake
2 2016-03-06 15:50:39 higherperspectives.com Keanu Reeves Shook The World With Another POWE... fake
... ... ... ... ...
776 NaT msnbc.msn.com Oct. 11: Levin, Graham, McCaffrey, Myers, roun... real
777 NaT foxnews.com An Open Letter to 'All Barack Channel' - Greta... real
778 NaT whitehouse.gov State of the Union 2013 The State of the Union... real

779 rows × 4 columns
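
As a spot check on the URL cleanup, we can call url_basename directly. The first URL below is made up to illustrate the web.archive.org case; the second comes from the raw data shown earlier:

url_basename('https://web.archive.org/web/20200101000000/'
             'http://example.com/some-article')
# 'example.com'

url_basename('dailybuzzlive.com/cannibals-arrested-florida/')
# 'dailybuzzlive.com'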

Now that we’ve loaded and cleaned the data, we can proceed to exploratory data analysis.