This section introduces operations for taking subsets of relations. When data scientists begin working with a relation, they often want to subset the specific data that they plan to use. For example, a data scientist can slice out the ten relevant features from a relation with hundreds of columns. Or, they can filter a relation to remove rows with incomplete data. For the rest of this chapter, we’ll introduce relation operations using a dataset of baby names.
To work with relations, we’ll introduce a domain-specific programming language called SQL (Structured Query Language). We commonly pronounce “SQL” like “sequel” instead of spelling out the acronym. SQL is a specialized language for working with relations—as such, SQL has its own syntax that makes it easier to write programs that operate on relational data.
7.1.1. About the Data¶
There’s a 2021 New York Times article that talks about Prince Harry and Meghan’s unique choice for their new baby daughter’s name: Lilibet [Williams, 2021]. The article has an interview with Pamela Redmond, an expert on baby names, who talks about interesting trends in how people name their kids. For example, she says that names that start with the letter “L” have become very popular in recent years, while names that start with the letter “J” were most popular in the 1970s and 1980s. Are these claims reflected in data? We can use SQL to find out.
In this chapter, we’ll use SQL queries within Python programs. This illustrates
a common workflow—data scientists often process and subset data in SQL before
loading the data into Python for further analysis. SQL databases make it easier
to work with large amounts of data compared to
pandas programs. However,
loading data into
pandas makes it easier to visualize the data and build
So, in this chapter we’ll use the
pandas.read_sql function which runs a SQL
query and stores the output in a dataframe. Using this function requires some
setup. We start by importing the
sqlalchemy Python packages.
import pandas as pd import sqlalchemy
Our database is stored in a file called
babynames.db. This file is a SQLite
database , so we’ll set up a
sqlalchemy object that can process
db = sqlalchemy.create_engine('sqlite:///babynames.db')
SQL is a programming language that is implemented differently across database systems. In this book, we use SQLite, one popular database system. Other systems make different tradeoffs that are useful for different domains. For instance, PostgreSQL and MySQL are systems that are useful for large web applications where many end users are writing data at the same time.
To make matters more complicated, each SQL system has slight differences. In this book, we’ll rely on core parts of SQL syntax that are unlikely to change across implementations. While we won’t go over other systems in detail, we’ll point out where different SQL systems may differ in capabilities.
Now, we can use
pd.read_sql to run SQL queries on this database. This
database has two relations:
nyt. Here’s a simple example that
reads in the entire
# SQL query saved in a Python string query = ''' SELECT * FROM baby; ''' pd.read_sql(query, db)
2020722 rows × 4 columns
The text inside the
query variable contains SQL code.
SQL keywords. We read the query above like:
SELECT * -- Get all the columns... FROM baby; -- ...from the baby relation
The data in the
baby relation comes from the US Social Security department,
which records the baby name and birth sex for birth certificate purposes. They
make the baby names data available on their website [Department, 2021].
The Social Security website has a page that describes the data in more detail (link). We won’t go in-depth in this chapter about the data’s limitations, but we’ll point out this relevant quote from the website:
All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
All data are from a 100% sample of our records on Social Security card applications as of March 2021.
7.1.2. What’s a Relation?¶
Let’s examine the
baby relation in more detail. A relation has rows and
columns. Every column has a label, as illustrated in
Fig. 7.1. Unlike dataframes, however, individual rows in a
relation don’t have labels. Also, unlike dataframes, rows of a relation aren’t
Relations have a long history. More formal treatments of relations use the term “tuple” to refer to the rows of a relation, and “attribute” to refer to the columns. There is also a rigorous way to define data operations using relational algebra, which is derived from mathematical set algebra. Interested data scientists can find a more in-depth treatment of relations in books on database systems such as the one by Garcia-Molina, Ullman, and Widom [Garcia-Molina et al., 2008].
Slicing is an operation that creates a new relation by taking a subset of
rows or columns out of another relation. Think about slicing a tomato—slices
can go both vertially and horizontally. To slice columns of a relation, we give
SELECT statement the columns we want.
query = ''' SELECT Name FROM baby; ''' pd.read_sql(query, db)
2020722 rows × 1 columns
query = ''' SELECT Name, Count FROM baby; ''' pd.read_sql(query, db)
2020722 rows × 2 columns
To slice out a specific number of rows, use the
query = ''' SELECT Name FROM baby LIMIT 10; ''' pd.read_sql(query, db)
10 rows × 1 columns
In sum, we use the
LIMIT keywords to slice columns and rows
of a relation.
7.1.4. Filtering Rows¶
So far, we’ve shown how to use
LIMIT to slice columns and rows
of a relation.
However, data scientists often want to filter rows—they want to take
subsets of rows using some criteria. Let’s say you want to find the most
popular baby names in 2020. To do this, you can filter rows to keep only the
rows where the
Year is 2020.
To filter a relation, use the
WHERE keyword with a predicate:
query = ''' SELECT * FROM baby WHERE Year = 2020; ''' pd.read_sql(query, db)
31270 rows × 4 columns
Note that when comparing for equality, SQL uses a single equals sign:
SELECT * FROM baby WHERE Year = 2020; -- ↑ -- Single equals sign
In Python, however, single equals signs are used for variable assignment. The
Year = 2020 will assign the value
2020 to the variable
compare for equality, Python code uses double equals signs:
# Assignment my_year = 2021 # Comparison, which evaluates to False my_year == 2020
To add more predicates to the filter, use the
OR keywords. For
instance, to find the names that have more than 10000 babies in either 2020 or
2019, we write:
query = ''' SELECT * FROM baby WHERE Count > 10000 AND (Year = 2020 OR Year = 2019); ''' pd.read_sql(query, db)
44 rows × 4 columns
Finally, to find the ten most common names in 2020, we can sort the dataframe
Count in descending order using the
ORDER BY keyword with the
option (short for DESCending).
query = ''' SELECT * FROM baby WHERE Year = 2020 ORDER BY Count DESC LIMIT 10; ''' pd.read_sql(query, db)
10 rows × 4 columns
We see that Liam, Noah, and Emma were the most popular baby names in 2020.
7.1.5. Example: How recently has Luna become a popular name?¶
The New York Times article mentions that the name “Luna” was almost nonexistent before 2000 but has since grown to become a very popular name for girls. When exactly did Luna become popular? We can check this using slicing and filtering. When approaching a data manipulation task, we recommend breaking the problem down into smaller steps. For example, we could think:
Filter: keep only rows with
Slice: keep the
Now, it’s a matter of translating each step into code.
query = ''' SELECT * FROM baby WHERE Name = "Luna" AND Sex = "F"; ''' luna = pd.read_sql(query, db) luna
128 rows × 4 columns
pd.read_sql returns a
pandas.DataFrame object, which we can use to make
a plot. This illustrates a common workflow: process the data using SQL, load
it into a
pandas dataframe, then visualize the results.
px.line(luna, x='Year', y='Count', width=350, height=250)
It’s just as the article says. Luna wasn’t popular at all until the year 2000 or so. Think about that—if someone tells you that their name is Luna, you can take a pretty good guess at their age even without any other information about them!
Just for fun, here’s the same plot for the name Siri.
query = ''' SELECT * FROM baby WHERE Name = "Siri" AND Sex = "F"; ''' siri = pd.read_sql(query, db) px.line(siri, x='Year', y='Count', width=350, height=250)
Why might the popularity have dropped so suddenly after 2010? Well, Siri happens to be the name of Apple’s voice assistant and was introduced in 2011. Let’s draw a line for the year 2011 and take a look…
fig = px.line(siri, x='Year', y='Count', width=350, height=250) fig.add_vline(x=2011, line_color='red', line_width=4)
It looks like parents don’t want their kids to be confused when other people say “Hey Siri” to their phones.
This section introduced relations in SQL. We’ve covered the common ways that data scientists subset relations—slicing with column labels and filtering using a boolean condition. In the next section, we explain how to aggregate rows together.