Transforming and Common Table Expressions
Contents
7.4. Transforming and Common Table Expressions#
In this section, we show how to call functions to transform columns of data using built-in SQL functions. We also demonstrate how to use common table expressions to build up complex queries from simpler ones. As usual, we start by loading the database:
# Set up connection to database
import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')
7.4.1. SQL Functions#
SQLite provides a variety of scalar functions, or functions that transform single data
values. When called on a column of data, SQLite will apply these functions on
each value in the column. In contrast, aggregation functions like SUM
and
COUNT
take a column of values as input and compute a single value as output.
SQLite provides a comprehensive list of the built-in scalar functions in its online documentation. For instance, to find the number of characters in each name, we use the
LENGTH
function:
query = '''
SELECT Name, LENGTH(Name)
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | LENGTH(Name) | |
---|---|---|
0 | Liam | 4 |
1 | Noah | 4 |
2 | Oliver | 6 |
... | ... | ... |
7 | Lucas | 5 |
8 | Henry | 5 |
9 | Alexander | 9 |
10 rows × 2 columns
Notice that the LENGTH
function is applied to each value within the Name
column.
Note
Like aggregation functions, each implementation of SQL provides a different set
of scalar functions. SQLite has a relatively minimal set of functions, while
PostgreSQL has many more. That said, almost all SQL implementations provide some
equivalent to SQLite’s LENGTH
, ROUND
, SUBSTR
, and LIKE
functions.
Although scalar functions use the same syntax as an aggregation function, they behave differently. This can result in confusing output if the two are mixed together in a single query:
query = '''
SELECT Name, LENGTH(Name), AVG(Count)
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | LENGTH(Name) | AVG(Count) | |
---|---|---|---|
0 | Liam | 4 | 174.47 |
Here, the AVG(Name)
computes the average of the entire Count
column, but the output is confusing—a reader could easily think the average is related to the name Liam. For this reason, we must be careful when scalar and aggregation functions
appear together within a SELECT
statement.
To extract the first letter of each name, we can use the SUBSTR
function
(short for substring). As described in the documentation, the SUBSTR
function takes three arguments. The first is the input string, the second is
the position to begin the substring (1-indexed), and the third is the
length of the substring:
query = '''
SELECT Name, SUBSTR(Name, 1, 1)
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | SUBSTR(Name, 1, 1) | |
---|---|---|
0 | Liam | L |
1 | Noah | N |
2 | Oliver | O |
... | ... | ... |
7 | Lucas | L |
8 | Henry | H |
9 | Alexander | A |
10 rows × 2 columns
We can use the AS
keyword to rename the column:
query = '''
SELECT *, SUBSTR(Name, 1, 1) AS Firsts
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | Sex | Count | Year | Firsts | |
---|---|---|---|---|---|
0 | Liam | M | 19659 | 2020 | L |
1 | Noah | M | 18252 | 2020 | N |
2 | Oliver | M | 14147 | 2020 | O |
... | ... | ... | ... | ... | ... |
7 | Lucas | M | 11281 | 2020 | L |
8 | Henry | M | 10705 | 2020 | H |
9 | Alexander | M | 10151 | 2020 | A |
10 rows × 5 columns
After calculating the first letter of each name, our analysis aims to understand the popularity of first letters over time. To do this, we want to take the output of this SQL query and use it as a single step within a longer chain of operations.
SQL provides several options to break queries into smaller steps, which is helpful in more complex analyses like this one. The most common options for doing this are to create a new relation using a CREATE TABLE
statement, create a new view using CREATE VIEW
, or create a temporary relation using WITH
. Each of these methods have different use-cases. For simplicity, we only describe the WITH
statement in this section and suggest that readers look over the SQLite documentation for details.
7.4.2. Multistep Queries Using a WITH Clause#
The WITH
clause lets us assign a name to any SELECT
query. Then we can
treat that query as though it exists as a relation in the database just for the duration of the query.
SQLite calls these temporary relations common table expressions.
For
instance, we can take the earlier query that calculates the first letter of each
name and call it letters
:
query = '''
-- Create a temporary relation called letters by calculating
-- the first letter for each name in baby
WITH letters AS (
SELECT *, SUBSTR(Name, 1, 1) AS Firsts
FROM baby
)
-- Then, select the first ten rows from letters
SELECT *
FROM letters
LIMIT 10;
'''
pd.read_sql(query, db)
Name | Sex | Count | Year | Firsts | |
---|---|---|---|---|---|
0 | Liam | M | 19659 | 2020 | L |
1 | Noah | M | 18252 | 2020 | N |
2 | Oliver | M | 14147 | 2020 | O |
... | ... | ... | ... | ... | ... |
7 | Lucas | M | 11281 | 2020 | L |
8 | Henry | M | 10705 | 2020 | H |
9 | Alexander | M | 10151 | 2020 | A |
10 rows × 5 columns
WITH
statements are very useful since they can be chained together. We can
create multiple temporary relations in a WITH
statement that each perform a
bit of work on the previous result, which lets us gradually build complicated
queries a step at a time.
7.4.3. Example: Popularity of “L” Names#
We can use WITH
statements to look at the popularity of names that start with the letter L over time. We’ll group the temporary letters
relation by the
first letter and year, then aggregate the Count
column using a sum, then filter to get only names with the letter L:
query = '''
WITH letters AS (
SELECT *, SUBSTR(Name, 1, 1) AS Firsts
FROM baby
)
SELECT Firsts, Year, SUM(Count) AS Count
FROM letters
WHERE Firsts = "L"
GROUP BY Firsts, Year;
'''
letter_counts = pd.read_sql(query, db)
letter_counts
Firsts | Year | Count | |
---|---|---|---|
0 | L | 1880 | 12799 |
1 | L | 1881 | 12770 |
2 | L | 1882 | 14923 |
... | ... | ... | ... |
138 | L | 2018 | 246251 |
139 | L | 2019 | 249315 |
140 | L | 2020 | 239760 |
141 rows × 3 columns
This relation contains the same data as the one from Chapter 6. In that chapter, we make a plot of the Count
column over time, which we omit here for brevity.
In this section, we introduced data transformations.
To transform values in a relation, we commonly use SQL functions like
LENGTH()
or SUBSTR()
.
We also explained how to build up complex queries using the WITH
clause.