To connect records between two data tables, SQL relations can be joined together similar to dataframes. In this section, we introduce SQL joins to replicate our analysis of the baby names data. Recall that Chapter 6 mentions a New York Times article that talks about how certain name categories, like mythological and baby boomer names, have become more or less popular over time.
We’ve taken the names and categories in the NYT article and put them in a small relation named
# Set up connection to database import sqlalchemy db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = ''' SELECT * FROM nyt; ''' pd.read_sql(query, db)
23 rows × 2 columns
Notice that the preceding code runs a query on
babynames.db, the same database
that contains the larger
baby relation from the previous sections. SQL
databases can hold more than one relation, making them very useful when we need
to work with many data tables at once. CSV files, on the other hand, typically
contain one data table each—if we perform a data analysis that uses 20
data tables, we might need to keep track of the names, locations, and versions
of 20 CSV files. Instead, it could be simpler to store all the data tables
in a SQLite database stored in a single file.
To see how popular the categories of names are, we join the
nyt relation with
baby relation to get the name counts from
7.3.1. Inner Joins#
As in Chapter 6, we’ve made smaller versions of the
nyt tables so that it’s easier to
see what happens when we join tables together. The relations are called
query = ''' SELECT * FROM baby_small; ''' pd.read_sql(query, db)
query = ''' SELECT * FROM nyt_small; ''' pd.read_sql(query, db)
To join relations in SQL, we use the
INNER JOIN clause to say which tables we want to join and the
ON clause to specify a predicate for joining the tables. Here’s an example:
query = ''' SELECT * FROM baby_small INNER JOIN nyt_small ON baby_small.Name = nyt_small.nyt_name ''' pd.read_sql(query, db)
Notice that this result is the same as doing an inner join in
pandas: the new table has the columns of both the
tables. The rows with the name Noah are gone, and the remaining rows have their matching
To join two tables together, we tell SQL the column(s) from each
table that we want to use to do the join using a predicate with the
SQL matches rows together when the values in the joining columns fulfill the predicate, as
shown in Fig. 7.2.
pandas, SQL gives more flexibility on how rows are joined. The
pd.merge() method can only join using simple equality, but the predicate in the
ON clause can be arbitrarily complex. As an example, we take advantage of this extra versatility in Section 12.2.
7.3.2. Left and Right Joins#
pandas, SQL also supports left joins. Instead of saying
INNER JOIN, we use
query = ''' SELECT * FROM baby_small LEFT JOIN nyt_small ON baby_small.Name = nyt_small.nyt_name ''' pd.read_sql(query, db)
As we might expect, the “left” side of the join refers to the table that appears on the left side of the
LEFT JOIN keyword.
We can see the
Noah rows are kept in the resulting relation even when they don’t have a match in the righthand relation.
Note that SQLite doesn’t support
right joins directly, but we can perform the same join by swapping the order of relations, then using
query = ''' SELECT * FROM nyt_small LEFT JOIN baby_small ON baby_small.Name = nyt_small.nyt_name ''' pd.read_sql(query, db)
SQLite doesn’t have a built-in keyword for outer joins. In cases where
an outer join is needed, we have to either use a different SQL engine or perform an outer join via
pandas. However, in our (the authors’) experience, outer joins are rarely used in practice compared to inner and left joins.
7.3.3. Example: Popularity of NYT Name Categories#
Now let’s return to the full
We want to know how the popularity of name categories in
nyt has changed
over time. To answer this question, we should:
nyt, matching rows where the names are equal.
Group the table by
Aggregate the counts using a sum:
query = ''' SELECT category, Year, SUM(Count) AS count --  FROM baby INNER JOIN nyt --  ON baby.Name = nyt.nyt_name --  GROUP BY category, Year --  ''' cate_counts = pd.read_sql(query, db) cate_counts
650 rows × 3 columns
The numbers in square brackets (
) in the preceding query show how each step in our plan maps to the parts of the SQL query. The code re-creates the dataframe from Chapter 6, where we created plots to verify the claims of the New York Times article. For brevity, we omit duplicating the plots here.
Notice that in the SQL code in this example, the numbers appear out of order—
. As a rule of thumb for first-time SQL learners, we can often think of the
SELECT statement as the last piece of the query to execute even though it appears first.
In this section, we introduced joins for relations. When joining relations together, we match rows using the
INNER JOIN or
LEFT JOIN keyword and a boolean predicate. In the next section, we’ll explain how to transform values in a relation.