Joining
Contents
6.3. Joining#
Data scientists very frequently want to join two or more dataframes together in order to connect data values across dataframes. For instance, an online bookstore might have one dataframe with the books each user has ordered and a second dataframe with the genres of each book. By joining the two dataframes together, the data scientist can see what genres each user prefers.
We’ll continue looking at the baby names data. We’ll use joins to check some trends mentioned in the New York Times article about baby names. The article talks about how certain categories of names have become more or less popular over time. For instance, it mentions that mythological names like Julius and Cassius have become popular, while baby boomer names like Susan and Debbie have become less popular. How has the popularity of these categories changed over time?
We’ve taken the names and categories in the NYT article and put them in a small dataframe:
nyt = pd.read_csv('nyt_names.csv')
nyt
nyt_name | category | |
---|---|---|
0 | Lucifer | forbidden |
1 | Lilith | forbidden |
2 | Danger | forbidden |
... | ... | ... |
20 | Venus | celestial |
21 | Celestia | celestial |
22 | Skye | celestial |
23 rows × 2 columns
To see how popular the categories of names are, we join the nyt
dataframe
with the baby
dataframe to get the name counts from baby
:
baby = pd.read_csv('babynames.csv')
baby
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 19659 | 2020 |
1 | Noah | M | 18252 | 2020 |
2 | Oliver | M | 14147 | 2020 |
... | ... | ... | ... | ... |
2020719 | Verona | F | 5 | 1880 |
2020720 | Vertie | F | 5 | 1880 |
2020721 | Wilma | F | 5 | 1880 |
2020722 rows × 4 columns
For intuition, we can imagine going down each row in baby
and asking: is this name in the nyt
table? If so, then add the value in the category
column to the row. This is the basic idea behind a join. Let’s look at a few examples on smaller dataframes first.
6.3.1. Inner Joins#
We start by making smaller versions of the baby
and nyt
tables so that it’s easier to see what happens when we join tables together:
nyt_small
nyt_name | category | |
---|---|---|
0 | Karen | boomer |
1 | Julius | mythology |
2 | Freya | mythology |
baby_small
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Noah | M | 18252 | 2020 |
1 | Julius | M | 960 | 2020 |
2 | Karen | M | 6 | 2020 |
3 | Karen | F | 325 | 2020 |
4 | Noah | F | 305 | 2020 |
To join tables in pandas
, we’ll use the .merge()
method:
baby_small.merge(nyt_small,
left_on='Name', # column in left table to match
right_on='nyt_name') # column in right table to match
Name | Sex | Count | Year | nyt_name | category | |
---|---|---|---|---|---|---|
0 | Julius | M | 960 | 2020 | Julius | mythology |
1 | Karen | M | 6 | 2020 | Karen | boomer |
2 | Karen | F | 325 | 2020 | Karen | boomer |
Notice that the new table has the columns of both the baby_small
and nyt_small
tables. The rows with the name Noah
are gone. And the remaining rows have their
matching category
from nyt_small
.
Note
Readers should also be aware that pandas
has a .join()
method for joining two dataframes together. However, the .merge()
method has more flexibility for how the dataframes are joined, which is why we focus on .merge()
. We encourage readers to consult the pandas
documentation for the exact difference between the two.
When we join two tables together, we tell pandas
the column(s) from each
table that we want to use to make the join (the left_on
and right_on
arguments).
pandas
matches rows together when the values in the joining columns match, as
shown in Fig. 6.4.
By default, pandas
does an inner join. If either table has rows that don’t
have matches in the other table, pandas
drops those rows from the result. In
this case, the Noah
rows in baby_small
don’t have matches in
nyt_small
, so they are dropped. Also, the Freya
row in nyt_small
doesn’t have matches in baby_small
, so it’s dropped as well. Only the
rows with a match in both tables stay in the final result.
6.3.2. Left, Right, and Outer Joins#
We sometimes want to keep rows without a match instead of dropping them entirely. There are other types of joins—left, right, and outer—that keep rows even when they don’t have a match.
In a left join, rows in the left table without a match are kept in the final result, as shown in Fig. 6.5.
To do a left join in pandas
, use how='left'
in the call to .merge()
:
baby_small.merge(nyt_small,
left_on='Name',
right_on='nyt_name',
how='left') # left join instead of inner
Name | Sex | Count | Year | nyt_name | category | |
---|---|---|---|---|---|---|
0 | Noah | M | 18252 | 2020 | NaN | NaN |
1 | Julius | M | 960 | 2020 | Julius | mythology |
2 | Karen | M | 6 | 2020 | Karen | boomer |
3 | Karen | F | 325 | 2020 | Karen | boomer |
4 | Noah | F | 305 | 2020 | NaN | NaN |
Notice that the Noah
rows are kept in the final table. Since those rows didn’t
have a match in the nyt_small
dataframe, the join leaves NaN
values in the
nyt_name
and category
columns. Also, notice that the Freya
row in
nyt_small
is still dropped.
A right join works similarly to the left join, except that nonmatching rows in the right table are kept instead of the left table:
baby_small.merge(nyt_small,
left_on='Name',
right_on='nyt_name',
how='right')
Name | Sex | Count | Year | nyt_name | category | |
---|---|---|---|---|---|---|
0 | Karen | M | 6.0 | 2020.0 | Karen | boomer |
1 | Karen | F | 325.0 | 2020.0 | Karen | boomer |
2 | Julius | M | 960.0 | 2020.0 | Julius | mythology |
3 | NaN | NaN | NaN | NaN | Freya | mythology |
Finally, an outer join keeps rows from both tables even when they don’t have a match:
baby_small.merge(nyt_small,
left_on='Name',
right_on='nyt_name',
how='outer')
Name | Sex | Count | Year | nyt_name | category | |
---|---|---|---|---|---|---|
0 | Noah | M | 18252.0 | 2020.0 | NaN | NaN |
1 | Noah | F | 305.0 | 2020.0 | NaN | NaN |
2 | Julius | M | 960.0 | 2020.0 | Julius | mythology |
3 | Karen | M | 6.0 | 2020.0 | Karen | boomer |
4 | Karen | F | 325.0 | 2020.0 | Karen | boomer |
5 | NaN | NaN | NaN | NaN | Freya | mythology |
6.3.3. Example: Popularity of NYT Name Categories#
Now let’s return to the full dataframes baby
and nyt
:
# .head() slices out the first few rows - convenient for saving space
baby.head(2)
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 19659 | 2020 |
1 | Noah | M | 18252 | 2020 |
nyt.head(2)
nyt_name | category | |
---|---|---|
0 | Lucifer | forbidden |
1 | Lilith | forbidden |
We want to know how the popularity of name categories in nyt
has changed
over time. To answer this question:
Inner join
baby
withnyt
.Group the table by
category
andYear
.Aggregate the counts using a sum:
cate_counts = (
baby.merge(nyt, left_on='Name', right_on='nyt_name') # [1]
.groupby(['category', 'Year']) # [2]
['Count'] # [3]
.sum() # [3]
.reset_index()
)
cate_counts
category | Year | Count | |
---|---|---|---|
0 | boomer | 1880 | 292 |
1 | boomer | 1881 | 298 |
2 | boomer | 1882 | 326 |
... | ... | ... | ... |
647 | mythology | 2018 | 2944 |
648 | mythology | 2019 | 3320 |
649 | mythology | 2020 | 3489 |
650 rows × 3 columns
Now we can plot the popularity of boomer
names and mythology
names:
boomers = px.line(cate_counts.query('category == "boomer"'),
x='Year', y='Count')
myths = px.line(cate_counts.query('category == "mythology"'),
x='Year', y='Count')
fig = left_right(boomers, myths, width=500, height=200,
subplot_titles=['Boomer names', 'Mythology names'])
margin(fig, t=30)
fig
As the NYT article claims, baby boomer names have become less popular since 2000, while mythology names have become more popular.
We can also plot the popularities of all the categories at once. Take a look at the following plots and see whether they support the claims made in the New York Times article:
fig = px.line(cate_counts, x='Year', y='Count',
facet_col='category', facet_col_wrap=3,
facet_row_spacing=0.15,
width=600, height=400)
margin(fig, t=30)
fig.update_yaxes(matches=None, showticklabels=False)
In this section, we introduced joins for dataframes.
When joining dataframes together, we match rows using the .merge()
function.
It’s important to consider the type of join (inner, left, right, or outer)
when joining dataframes.
In the next section, we’ll explain how to transform values in a dataframe.