Data Wrangling
5.2. Data Wrangling¶
Before we start our analysis, we check the quality of the data, simplify the structure where possible, and derive any new measurements that might help us in our analysis. These sorts of actions are described in more detail in Chapter 9.
The first few rows in the data table are shown here.
bus = pd.read_csv('~/Downloads/arrival_times.csv')
bus = bus.dropna(axis=0, how='any')
bus.head()
OPD_DATE | VEHICLE_ID | RTE | DIR | ... | STOP_ID | STOP_NAME | SCH_STOP_TM | ACT_STOP_TM | |
---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | 6201 | 673 | S | ... | 431 | 3RD AVE & PIKE ST (431) | 01:11:57 | 01:13:19 |
1 | 2016-03-26 | 6201 | 673 | S | ... | 431 | 3RD AVE & PIKE ST (431) | 23:19:57 | 23:16:13 |
2 | 2016-03-26 | 6201 | 673 | S | ... | 431 | 3RD AVE & PIKE ST (431) | 21:19:57 | 21:18:46 |
3 | 2016-03-26 | 6201 | 673 | S | ... | 431 | 3RD AVE & PIKE ST (431) | 19:04:57 | 19:01:49 |
4 | 2016-03-26 | 6201 | 673 | S | ... | 431 | 3RD AVE & PIKE ST (431) | 16:42:57 | 16:42:39 |
5 rows × 9 columns
(The raw data are available as comma separated values in a file, which we have loaded into this table; see Chapter 8 for details on this process).
It looks like some of the columns in the table might be redundant, like the columns labeled STOP_ID
and STOP_NAME
. We can find the number of unique values and their counts to confirm this.
bus[['STOP_ID','STOP_NAME']].value_counts()
STOP_ID STOP_NAME
578 3RD AVE & PIKE ST (578) 19599
431 3RD AVE & PIKE ST (431) 19318
dtype: int64
There are two Third & Pike names for the stop. Could they be related to the direction of the bus? We can check the possible combinations of direction, stop ID and stop name.
bus[['DIR','STOP_ID','STOP_NAME']].value_counts()
DIR STOP_ID STOP_NAME
N 578 3RD AVE & PIKE ST (578) 19599
S 431 3RD AVE & PIKE ST (431) 19318
dtype: int64
Indeed, the northern direction corresponds to stop number 578 and the southern direction corresponds to stop 431. Since we are looking only at one stop in our analysis, we don’t really need anything more than the direction.
We can also check the number of unique route names:
bus['RTE'].value_counts()
673 13228
674 13179
675 12510
Name: RTE, dtype: int64
These routes are numbered so they don’t match the names C, D, and E from the original description of the problem. This issue involves another aspect of data wrangling: we need to dig up information that connects the route letters and numbers. We can get this info from the Seattle transit site. Yet another part of wrangling is to translate data values into ones that are easier to understand so we replace the route numbers with their letters.
bus['route'] = bus['RTE'].replace({673: 'C', 674: 'D', 675: 'E'})
bus['direction'] = bus['DIR'].replace({'N': 'northbound', 'S': 'southbound'})
We can also create new columns in the table that help us in our investigations. For example, we can use the scheduled and actual arrival times to calculate how late a bus is. To do this requires some work with date and time formats, which is covered in Chapter 9.
bus['scheduled'] = pd.to_datetime(bus['OPD_DATE'] + ' ' + bus['SCH_STOP_TM'])
bus['actual'] = pd.to_datetime(bus['OPD_DATE'] + ' ' + bus['ACT_STOP_TM'])
minute = np.timedelta64(1, 'm')
hour = 60 * minute
diff_hrs = (bus['actual'] - bus['scheduled']) / hour
bus.loc[diff_hrs > 20, 'actual'] -= 24 * hour
bus.loc[diff_hrs < -20, 'actual'] += 24 * hour
bus['minutes_late'] = (bus['actual'] - bus['scheduled']) / minute
Let’s examine the values of this new quantity to make sure that our calculations are correct.
smallest amount late: -12.87 minutes
greatest amount late: 150.28 minutes
median amount late: 0.52 minutes
It’s a bit surprising that there are negative values for how late a bus is, but this just means the bus arrived earlier than scheduled. While, the median lateness is only about half a minute, some of the buses are 2.5 hours late! Let’s take a look at the histogram of how many minutes late the buses are.
fig = px.histogram(bus, x='minutes_late',
nbins=120, width=450, height=300)
fig.update_xaxes(range=[-13, 40])
fig.show()
We saw a similar shaped histogram in Chapter 4. The distribution of how late the buses are is highly skewed to the right, but many arrive close to on time.
Finally, our wrangling concludes with creating a simplified version of the data table. Since we only need to keep track of the route, direction, scheduled and actual arrival time, and how late the bus is, we create a smaller table, and we give the columns names that are a bit easier to read.
bus = bus[['route', 'direction', 'scheduled',
'actual', 'minutes_late']].copy()
bus.head()
route | direction | scheduled | actual | minutes_late | |
---|---|---|---|---|---|
0 | C | southbound | 2016-03-26 01:11:57 | 2016-03-26 01:13:19 | 1.37 |
1 | C | southbound | 2016-03-26 23:19:57 | 2016-03-26 23:16:13 | -3.73 |
2 | C | southbound | 2016-03-26 21:19:57 | 2016-03-26 21:18:46 | -1.18 |
3 | C | southbound | 2016-03-26 19:04:57 | 2016-03-26 19:01:49 | -3.13 |
4 | C | southbound | 2016-03-26 16:42:57 | 2016-03-26 16:42:39 | -0.30 |
These table manipulations are covered in Chapter 6.
Before we begin to model bus lateness, we want to explore and learn more about these data. We do that next.