## 21.2. Obtaining and Wrangling the Data
Let’s get the data into Python. The dataset has a GitHub page with the code that downloads the data; running the script to download the Politifact data takes about an hour. After running the script, we placed the data files into the `data/politifact` folder. The articles that Politifact labeled as fake and real are in `data/politifact/fake` and `data/politifact/real`, respectively. Let’s take a look at one of the articles labeled real.
```python
!ls -l data/politifact/real | head -n 5
```

```
total 0
drwxr-xr-x 2 sam staff 64 Jul 14 15:23 politifact100
drwxr-xr-x 3 sam staff 96 Jul 14 15:48 politifact1013
drwxr-xr-x 3 sam staff 96 Jul 14 15:37 politifact1014
drwxr-xr-x 2 sam staff 64 Jul 14 15:28 politifact10185
```
```python
!ls -lh data/politifact/real/politifact1013/
```

```
total 16
-rw-r--r-- 1 sam staff 5.7K Jul 14 15:48 news content.json
```
Each article’s data is stored in a JSON file named `news content.json`. Let’s load the JSON for one article into a Python dictionary.
```python
import json
from pathlib import Path

article_path = Path('data/politifact/real/politifact1013/news content.json')
article_json = json.loads(article_path.read_text())
```
Below, we’ve displayed the keys and values in `article_json` as a table:

```python
import pandas as pd

# display_df is a helper (defined in the book's setup code) that
# displays a dataframe with a set number of rows
display_df(
    pd.DataFrame(article_json.items(), columns=['key', 'value']).set_index('key'),
    rows=13)
```
| key | value |
|---|---|
| url | http://www.senate.gov/legislative/LIS/roll_cal... |
| text | Roll Call Vote 111th Congress - 1st Session\n\... |
| images | [http://statse.webtrendslive.com/dcs222dj3ow9j... |
| top_img | http://www.senate.gov/resources/images/us_sen.ico |
| keywords | [] |
| authors | [] |
| canonical_link | |
| title | U.S. Senate: U.S. Senate Roll Call Votes 111th... |
| meta_data | {'viewport': 'width=device-width, initial-scal... |
| movies | [] |
| publish_date | None |
| source | http://www.senate.gov |
| summary | |
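Since `article_json` is an ordinary Python dictionary, individual fields are plain key lookups. As a quick illustrative check:

```python
# Pull out a single field by key
article_json['title']
```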
There are many fields in the JSON file, but this analysis only looks at a few: the article’s `title`, `text`, `url`, and `publish_date`.
We’ll create a dataframe where each row represents one article. To do this, we load each available JSON file as a Python dictionary, extract the fields of interest, and store the results in a pandas dataframe named `df_raw`.
```python
# Folders holding the labeled articles; each contains one subfolder
# per article with a 'news content.json' file
fakes = Path('data/politifact/fake')
reals = Path('data/politifact/real')
filename = 'news content.json'

def df_row(content_json):
    return {
        'url': content_json['url'],
        'text': content_json['text'],
        'title': content_json['title'],
        'publish_date': content_json['publish_date'],
    }

def load_json(folder, label):
    filepath = folder / filename
    data = df_row(json.loads(filepath.read_text())) if filepath.exists() else {}
    return {
        **data,
        'label': label,
    }

df_raw = pd.DataFrame([load_json(path, 'fake') for path in fakes.iterdir()] +
                      [load_json(path, 'real') for path in reals.iterdir()])

# Raw data from JSON, without any processing
df_raw
```
| | url | text | title | publish_date | label |
|---|---|---|---|---|---|
| 0 | dailybuzzlive.com/cannibals-arrested-florida/ | Police in Vernal Heights, Florida, arrested 3-... | Cannibals Arrested in Florida Claim Eating Hum... | 1.62e+09 | fake |
| 1 | https://web.archive.org/web/20171228192703/htt... | WASHINGTON — Rod Jay Rosenstein, Deputy Attorn... | BREAKING: Trump fires Deputy Attorney General ... | 1.45e+09 | fake |
| 2 | https://web.archive.org/web/20160924061356/htt... | Keanu Reeves has long been known to be a stell... | Keanu Reeves Shook The World With Another POWE... | 1.46e+09 | fake |
| ... | ... | ... | ... | ... | ... |
| 1053 | NaN | NaN | NaN | NaN | real |
| 1054 | https://web.archive.org/web/20090701202353/htt... | This is a rush transcript from "On the Record,... | An Open Letter to 'All Barack Channel' - Greta... | NaN | real |
| 1055 | https://web.archive.org/web/20130209000637/htt... | The State of the Union 2012\n\n“We can either ... | State of the Union 2013 | NaN | real |

1056 rows × 5 columns
Exploring this dataframe reveals some issues we’d like to address before we continue with the analysis (a quick check of the first two appears after the list). For example:

- Some articles couldn’t be downloaded. When this happened, the `url` column contains `NaN`.
- Some articles don’t have text (e.g., a webpage with only video content). We’ll drop these articles from our analysis.
- The `publish_date` column stores timestamps in Unix format (seconds since the Unix epoch), not as `pandas.Timestamp` objects.
- The `url` column has the full URL (e.g., `dailybuzzlive.com/cannibals-arrested-florida/`), but we’re interested in the base URL (e.g., `dailybuzzlive.com`).
- Some articles were downloaded from an archival website (`web.archive.org`). When this happens, we want to extract the article’s actual base URL by removing the `web.archive.org` prefix.
- For our classifier, we want to concatenate the `title` and `text` columns into a single `content` column.
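As a quick sanity check of the first two issues, we can count the articles that failed to download and the articles with empty text. This is a minimal sketch; the exact counts depend on the snapshot of the data you downloaded.

```python
# Articles that failed to download have NaN in the url column
num_missing_url = df_raw['url'].isna().sum()

# Articles with no text, such as pages that only contain video
num_empty_text = (df_raw['text'].str.strip() == '').sum()

num_missing_url, num_empty_text
```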
We can tackle these data issues using a combination of `pandas` functions and regular expressions. After data wrangling, we end up with the following dataframe, named `df`:
```python
import re

# Regular expressions for cleaning URLs. These definitions follow how
# they're used in url_basename below: strip a web.archive.org prefix,
# strip the scheme and a leading 'www.', and strip a port number.
archive_prefix_re = re.compile(r'https?://web\.archive\.org/web/\d+/')
site_prefix_re = re.compile(r'^(https?://)?(www\.)?')
port_re = re.compile(r':\d+')

def combine_content(df):
    return df.assign(content=df['title'] + ' ' + df['text'])

def drop_nans(df):
    return df[~(df['url'].isna() |
                (df['text'].str.strip() == '') |
                (df['title'].str.strip() == ''))]

def url_basename(url):
    if archive_prefix_re.match(url):
        url = archive_prefix_re.sub('', url)
    site = site_prefix_re.sub('', url).split('/')[0]
    return port_re.sub('', site)

def subset_df(df):
    return df[['timestamp', 'baseurl', 'content', 'label']]

df = (df_raw
      .pipe(drop_nans)
      .reset_index(drop=True)
      .assign(baseurl=lambda df: df['url'].apply(url_basename))
      .assign(timestamp=lambda df: pd.to_datetime(df['publish_date'],
                                                  unit='s', errors='coerce'))
      .pipe(combine_content)
      .pipe(subset_df)
)
df
```
| | timestamp | baseurl | content | label |
|---|---|---|---|---|
| 0 | 2021-04-05 16:39:51 | dailybuzzlive.com | Cannibals Arrested in Florida Claim Eating Hum... | fake |
| 1 | 2016-01-01 23:17:43 | houstonchronicle-tv.com | BREAKING: Trump fires Deputy Attorney General ... | fake |
| 2 | 2016-03-06 15:50:39 | higherperspectives.com | Keanu Reeves Shook The World With Another POWE... | fake |
| ... | ... | ... | ... | ... |
| 776 | NaT | msnbc.msn.com | Oct. 11: Levin, Graham, McCaffrey, Myers, roun... | real |
| 777 | NaT | foxnews.com | An Open Letter to 'All Barack Channel' - Greta... | real |
| 778 | NaT | whitehouse.gov | State of the Union 2013 The State of the Union... | real |

779 rows × 4 columns
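To spot-check the URL cleaning, we can call `url_basename` directly. This is a quick sketch; the second URL is a hypothetical `web.archive.org` address with the same shape as those in `df_raw`, since the real ones are truncated in the display above.

```python
# A plain URL: strip the path, keep the base domain
url_basename('dailybuzzlive.com/cannibals-arrested-florida/')
# 'dailybuzzlive.com'

# A hypothetical archived URL: strip the web.archive.org prefix first
url_basename('https://web.archive.org/web/20171228192703/http://example.com/story')
# 'example.com'
```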
Now that we’ve loaded and cleaned the data, we can proceed to exploratory data analysis.