7.1. Subsetting

This section introduces operations for taking subsets of relations. When data scientists begin working with a relation, they often want to subset the specific data that they plan to use. For example, a data scientist can slice out the ten relevant features from a relation with hundreds of columns. Or, they can filter a relation to remove rows with incomplete data. For the rest of this chapter, we’ll introduce relation operations using a dataset of baby names.

To work with relations, we’ll introduce a domain-specific programming language called SQL (Structured Query Language). We commonly pronounce “SQL” like “sequel” instead of spelling out the acronym. SQL is a specialized language for working with relations—as such, SQL has its own syntax that makes it easier to write programs that operate on relational data.

7.1.1. About the Data

There’s a 2021 New York Times article that talks about Prince Harry and Meghan’s unique choice for their new baby daughter’s name: Lilibet. The article has an interview with Pamela Redmond, an expert on baby names, who talks about interesting trends in how people name their kids. For example, she says that names that start with the letter “L” have become very popular in recent years, while names that start with the letter “J” were most popular in the 1970s and 1980s. Are these claims reflected in data? We can use SQL to find out.

In this chapter, we’ll use SQL queries within Python programs. This illustrates a common workflow—data scientists often process and subset data in SQL before loading the data into Python for further analysis. SQL databases make it easier to work with large amounts of data compared to pandas programs. However, loading data into pandas makes it easier to visualize the data and build statistical models.

So, in this chapter we’ll use the pandas.read_sql function which runs a SQL query and stores the output in a dataframe. Using this function requires some setup. We start by importing the pandas and sqlalchemy Python packages.

import pandas as pd
import sqlalchemy

Our database is stored in a file called babynames.db. This file is a SQLite database, so we’ll set up a sqlalchemy object that can process this format.

db = sqlalchemy.create_engine('sqlite:///babynames.db')

Note

SQL is a programming language that is implemented differently across database systems. In this book, we use SQLite, one popular database system. Other systems make different tradeoffs that are useful for different domains. For instance, PostgreSQL and MySQL are systems that are useful for large web applications where many end users are writing data at the same time.

To make matters more complicated, each SQL system has slight differences. In this book, we’ll rely on core parts of SQL syntax that are unlikely to change across implementations. While we won’t go over other systems in detail, we’ll point out where different SQL systems may differ in capabilities.

Now, we can use pd.read_sql to run SQL queries on this database. This database has two relations: baby and nyt. Here’s a simple example that reads in the entire baby relation.

# SQL query saved in a Python string
query = ''' 
SELECT *
FROM baby;
'''

pd.read_sql(query, db)
Name Sex Count Year
0 Liam M 19659 2020
1 Noah M 18252 2020
2 Oliver M 14147 2020
... ... ... ... ...
2020719 Verona F 5 1880
2020720 Vertie F 5 1880
2020721 Wilma F 5 1880

2020722 rows × 4 columns

The text inside the query variable contains SQL code. SELECT and FROM are SQL keywords. We read the query above like:

SELECT *    -- Get all the columns...
FROM baby;  -- ...from the baby relation

The data in the baby relation comes from the US Social Security department, which records the baby name and birth sex for birth certificate purposes. They make the baby names data available on their website.

The Social Security website has a page that describes the data in more detail. We won’t go in-depth in this chapter about the data’s limitations, but we’ll point out this relevant quote from the website:

All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

All data are from a 100% sample of our records on Social Security card applications as of March 2021.

7.1.2. What’s a Relation?

Let’s examine the baby relation in more detail. A relation has rows and columns. Every column has a label, as illustrated in Fig. 7.1. Unlike dataframes, however, individual rows in a relation don’t have labels. Also, unlike dataframes, rows of a relation aren’t ordered.

relation-labels

Fig. 7.1 The baby relation has labels for columns (boxed).

Relations have a long history. More formal treatments of relations use the term “tuple” to refer to the rows of a relation, and “attribute” to refer to the columns. There is also a rigorous way to define data operations using relational algebra, which is derived from mathematical set algebra. Interested data scientists can find a more in-depth treatment of relations in books on database systems such as the one by Garcia-Molina, Ullman, and Widom.

7.1.3. Slicing

Slicing is an operation that creates a new relation by taking a subset of rows or columns out of another relation. Think about slicing a tomato—slices can go both vertially and horizontally. To slice columns of a relation, we give the SELECT statement the columns we want.

query = ''' 
SELECT Name
FROM baby;
''' 

pd.read_sql(query, db)
Name
0 Liam
1 Noah
2 Oliver
... ...
2020719 Verona
2020720 Vertie
2020721 Wilma

2020722 rows × 1 columns

query = ''' 
SELECT Name, Count
FROM baby;
''' 

pd.read_sql(query, db)
Name Count
0 Liam 19659
1 Noah 18252
2 Oliver 14147
... ... ...
2020719 Verona 5
2020720 Vertie 5
2020721 Wilma 5

2020722 rows × 2 columns

To slice out a specific number of rows, use the LIMIT keyword:

query = ''' 
SELECT Name
FROM baby
LIMIT 10;
''' 

pd.read_sql(query, db)
Name
0 Liam
1 Noah
2 Oliver
... ...
7 Lucas
8 Henry
9 Alexander

10 rows × 1 columns

In sum, we use the SELECT and LIMIT keywords to slice columns and rows of a relation.

7.1.4. Filtering Rows

So far, we’ve shown how to use SELECT and LIMIT to slice columns and rows of a relation.

However, data scientists often want to filter rows—they want to take subsets of rows using some criteria. Let’s say you want to find the most popular baby names in 2020. To do this, you can filter rows to keep only the rows where the Year is 2020.

To filter a relation, use the WHERE keyword with a predicate:

query = ''' 
SELECT *
FROM baby
WHERE Year = 2020;
'''

pd.read_sql(query, db)
Name Sex Count Year
0 Liam M 19659 2020
1 Noah M 18252 2020
2 Oliver M 14147 2020
... ... ... ... ...
31267 Zylynn F 5 2020
31268 Zynique F 5 2020
31269 Zynlee F 5 2020

31270 rows × 4 columns

Warning

Note that when comparing for equality, SQL uses a single equals sign:

SELECT *
FROM baby
WHERE Year = 2020;
--         ↑
--         Single equals sign

In Python, however, single equals signs are used for variable assignment. The statement Year = 2020 will assign the value 2020 to the variable Year. To compare for equality, Python code uses double equals signs:

# Assignment
my_year = 2021

# Comparison, which evaluates to False
my_year == 2020

To add more predicates to the filter, use the AND and OR keywords. For instance, to find the names that have more than 10000 babies in either 2020 or 2019, we write:

query = ''' 
SELECT *
FROM baby
WHERE Count > 10000
  AND (Year = 2020
       OR Year = 2019);
'''

pd.read_sql(query, db)
Name Sex Count Year
0 Liam M 19659 2020
1 Noah M 18252 2020
2 Oliver M 14147 2020
... ... ... ... ...
41 Mia F 12452 2019
42 Harper F 10464 2019
43 Evelyn F 10412 2019

44 rows × 4 columns

Finally, to find the ten most common names in 2020, we can sort the dataframe by Count in descending order using the ORDER BY keyword with the DESC option (short for DESCending).

query = ''' 
SELECT *
FROM baby
WHERE Year = 2020
ORDER BY Count DESC
LIMIT 10;
'''

pd.read_sql(query, db)
Name Sex Count Year
0 Liam M 19659 2020
1 Noah M 18252 2020
2 Emma F 15581 2020
... ... ... ... ...
7 Sophia F 12976 2020
8 Amelia F 12704 2020
9 William M 12541 2020

10 rows × 4 columns

We see that Liam, Noah, and Emma were the most popular baby names in 2020.