Transforming
Contents
7.4. Transforming¶
Data scientists transform columns when they need to change each value in a feature in the same way. For example, if a feature contains heights of people in feet, a data scientist might want to transform the heights to centimeters. In this section, we’ll show how to apply functions that transform columns of data using SQL.
# Set up connection to database
import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = '''
SELECT *
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 19659 | 2020 |
1 | Noah | M | 18252 | 2020 |
2 | Oliver | M | 14147 | 2020 |
... | ... | ... | ... | ... |
7 | Lucas | M | 11281 | 2020 |
8 | Henry | M | 10705 | 2020 |
9 | Alexander | M | 10151 | 2020 |
10 rows × 4 columns
In the baby names New York Times article [Williams, 2021], Pamela
mentions that names starting with the letter “L” and “K” became popular
after 2000. On the other hand, names starting with the letter “J” peaked in
popularity in the 1970s and 1980s and have dropped off in popularity since. We
can verify these claims using the baby
dataset.
We approach this problem using the following steps:
Transform the
Name
column into a new column that contains the first letters of each value inName
.Group the dataframe by the first letter and year.
Aggregate the name counts by summing.
To complete the first step, we’ll apply a function to the Name
column.
7.4.1. SQL Functions¶
SQLite provides scalar functions, or functions that transform single data
values. When called on a column of data, SQLite will apply these functions on
each value in the column. In contrast, aggregation functions like SUM
and
COUNT
take a column of values as input and compute a single value as output.
SQLite provides a comprehensive list of the built-in scalar functions on its
website 1. For instance, to find the lengths of each name, we use the
LENGTH
function.
query = '''
SELECT Name, LENGTH(Name)
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | LENGTH(Name) | |
---|---|---|
0 | Liam | 4 |
1 | Noah | 4 |
2 | Oliver | 6 |
... | ... | ... |
7 | Lucas | 5 |
8 | Henry | 5 |
9 | Alexander | 9 |
10 rows × 2 columns
Notice that the LENGTH
function is applied to each value within the Name
column.
Note
Like aggregation functions, each implementation of SQL provides a different set
of scalar functions. SQLite has a relatively minimal set of functions, while
PostgreSQL has many more 2. Most SQL implementations provide some
equivalent to SQLite’s LENGTH
, ROUND
, SUBSTR
, and LIKE
functions.
Calling a scalar function uses the same syntax as an aggregation function. This can result in confusing output if the two are mixed together in a single query:
query = '''
SELECT Name, LENGTH(Name), COUNT(Name)
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | LENGTH(Name) | COUNT(Name) | |
---|---|---|---|
0 | Liam | 4 | 2020722 |
For this reason, we must be careful when when scalar and aggregation functions
appear together within a SELECT
statement.
To extract the first letter of each name, we can use the SUBSTR
function
(short for “substring”). As described in the documentation, the SUBSTR
function takes three arguments. The first is the input string, the second is
the position to begin the substring (1-indexed), and the third is the
length of the substring.
query = '''
SELECT Name, SUBSTR(Name, 1, 1)
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | SUBSTR(Name, 1, 1) | |
---|---|---|
0 | Liam | L |
1 | Noah | N |
2 | Oliver | O |
... | ... | ... |
7 | Lucas | L |
8 | Henry | H |
9 | Alexander | A |
10 rows × 2 columns
Now, we can use the AS
keyword to rename the column:
query = '''
SELECT *, SUBSTR(Name, 1, 1) AS Firsts
FROM baby
LIMIT 10;
'''
pd.read_sql(query, db)
Name | Sex | Count | Year | Firsts | |
---|---|---|---|---|---|
0 | Liam | M | 19659 | 2020 | L |
1 | Noah | M | 18252 | 2020 | N |
2 | Oliver | M | 14147 | 2020 | O |
... | ... | ... | ... | ... | ... |
7 | Lucas | M | 11281 | 2020 | L |
8 | Henry | M | 10705 | 2020 | H |
9 | Alexander | M | 10151 | 2020 | A |
10 rows × 5 columns
This completes step 1 of our analysis plan. SQL provides several options to
break queries into smaller steps, which is helpful in a more complex analysis
like this one. One option is to create an entirely new relation using the
CREATE TABLE
statement. Another option is to use the WITH
keyword to
create a temporary relation just for the query at hand. We’ll demonstrate
the WITH
keyword for this example.
7.4.2. Multistep Queries Using a WITH
Clause¶
The WITH
clause lets us assign a name to any SELECT
query. Then, we can
treat that query as though it exists as a relation in the database. For
instance, we can take the query above that calculates the first letter of each
name and call it letters
:
query = '''
WITH letters AS (
SELECT *, SUBSTR(Name, 1, 1) AS Firsts
FROM baby
)
SELECT *
FROM letters
LIMIT 10;
'''
pd.read_sql(query, db)
Name | Sex | Count | Year | Firsts | |
---|---|---|---|---|---|
0 | Liam | M | 19659 | 2020 | L |
1 | Noah | M | 18252 | 2020 | N |
2 | Oliver | M | 14147 | 2020 | O |
... | ... | ... | ... | ... | ... |
7 | Lucas | M | 11281 | 2020 | L |
8 | Henry | M | 10705 | 2020 | H |
9 | Alexander | M | 10151 | 2020 | A |
10 rows × 5 columns
We can read this query as follows:
-- Create a temporary relation called letters by calculating the first
-- letters for each name in baby
WITH letters AS (
SELECT *,
SUBSTR(Name, 1, 1) AS Firsts
FROM baby
)
-- Then, select the first ten rows from letters
SELECT *
FROM letters
LIMIT 10;
WITH
statements are highly useful since they can be chained together. We can
create multiple temporary relations in a WITH
statement that each perform a
bit of work on the previous result, which lets us gradually build complicated
queries a step at a time.
7.4.3. Example: Popularity of “L” Names¶
Now, we can perform the next steps in our analysis. We’ll group the relation by the
first letter and year, then aggregate the Count
column using a sum.
query = '''
WITH letters AS (
SELECT *, SUBSTR(Name, 1, 1) AS Firsts
FROM baby
)
SELECT Firsts, Year, SUM(Count) AS Count
FROM letters
GROUP BY Firsts, Year;
'''
letter_counts = pd.read_sql(query, db)
letter_counts
Firsts | Year | Count | |
---|---|---|---|
0 | A | 1880 | 16740 |
1 | A | 1881 | 16257 |
2 | A | 1882 | 18790 |
... | ... | ... | ... |
3638 | Z | 2018 | 55996 |
3639 | Z | 2019 | 55293 |
3640 | Z | 2020 | 54011 |
3641 rows × 3 columns
Finally, we use plotly
to plot the popularity of “L” names over time:
fig = px.line(letter_counts.loc[letter_counts['Firsts'] == 'L'],
x='Year', y='Count', title='Popularity of "L" names',
width=350, height=250)
margin(fig, t=30)
The plot shows that “L” names were popular in the 1960s, dipped in the decades after, but have indeed resurged in popularity after 2000.
What about “J” names?
fig = px.line(letter_counts.loc[letter_counts['Firsts'] == 'J'],
x='Year', y='Count', title='Popularity of "J" names',
width=350, height=250)
margin(fig, t=30)
The NYT article says that “J” names were popular in the 1970s and 80s. The plot agrees, and also shows that they have become less popular after 2000.
In this section, we introduced data transformations.
To transform values in a relation, we commonly use SQL functions like
LENGTH()
or SUBSTR()
.
We also explained how to build up complex queries using the WITH
clause.
In the next section, we’ll compare relations with other ways to represent and
manipulate data tables.