Aggregating
Contents
7.2. Aggregating¶
This section introduces operations for aggregating rows in a relation. Data scientists aggregate rows together to make summaries of data. For instance, a dataset containing daily sales can be aggregated to show monthly sales instead. Specifically, we’ll introduce grouping, a common operation for aggregating data.
We’ll work with the baby names data, as introduced in the previous section:
# Set up connection to database
import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = '''
SELECT *
FROM baby
LIMIT 10
'''
pd.read_sql(query, db)
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 19659 | 2020 |
1 | Noah | M | 18252 | 2020 |
2 | Oliver | M | 14147 | 2020 |
... | ... | ... | ... | ... |
7 | Lucas | M | 11281 | 2020 |
8 | Henry | M | 10705 | 2020 |
9 | Alexander | M | 10151 | 2020 |
10 rows × 4 columns
7.2.1. Basic Group-Aggregate¶
Let’s say we want to find out the total number of babies born as recorded in
this data. This is simply the sum of the Count
column. SQL provides
functions that we use in the SELECT
statement, like SUM
:
query = '''
SELECT SUM(Count)
FROM baby
'''
pd.read_sql(query, db)
SUM(Count) | |
---|---|
0 | 352554503 |
Summing up the name counts is one simple way to aggregate the data—it combines data from multiple rows.
But let’s say we instead want to answer a more interesting question: are U.S.
births trending upwards over time? To answer this question, we can sum the
Count
column within each year rather than taking the sum over the entire
dataset. In other words, we split the data into groups based on Year
, then
sum up the Count
values within each group.
This process is depicted in A depiction of grouping then aggregating for example data..
Fig. 7.2 A depiction of grouping then aggregating for example data.¶
We call this operation grouping followed by aggregating. In SQL, we
specify what column we wish to use for grouping through the GROUP BY
clause,
then using aggregation functions in SELECT
:
query = '''
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''
pd.read_sql(query, db)
Year | SUM(Count) | |
---|---|---|
0 | 1880 | 194419 |
1 | 1881 | 185772 |
2 | 1882 | 213385 |
... | ... | ... |
138 | 2018 | 3487193 |
139 | 2019 | 3437438 |
140 | 2020 | 3287724 |
141 rows × 2 columns
Notice that the code is nearly the same as the non-grouped version, except that
it contains a GROUP BY
clause using the Year
column. We also add the Year
column to the SELECT
clause so that each row of the result also contains the
year.
The result is a relation with the total babies born for each year in the data.
Notice that the Year column contains the unique Year
values—there are no
duplicate Year
values anymore since we grouped them together. Now we can plot
the counts over time:
counts_by_year = pd.read_sql(query, db)
px.line(counts_by_year, x='Year', y='SUM(Count)', width=350, height=250)
What do we see in this plot? First, we might notice that there seem to be suspiciously few babies born before 1920. One likely explanation is that the Social Security Administration was created in 1935, so its data for prior births could be less complete.
We might also notice the dip when World War II began in 1939, and the post-war Baby Boomer era from 1946-1964.
Here’s the basic recipe for grouping in SQL
:
SELECT
col1, -- column used for grouping
SUM(col2) -- aggregation of another column
FROM table_name -- relation to use
GROUP BY col1 -- the column(s) to group by
7.2.2. Grouping on Multiple Columns¶
We pass multiple columns into GROUP BY
to group by multiple columns at
once. This is useful when we need to further subdivide our groups. For
example, we can group by both year and sex to see how many male and female
babies were born over time.
query = '''
SELECT Year, Sex, SUM(Count)
FROM baby
GROUP BY Year, Sex
'''
pd.read_sql(query, db)
Year | Sex | SUM(Count) | |
---|---|---|---|
0 | 1880 | F | 83929 |
1 | 1880 | M | 110490 |
2 | 1881 | F | 85034 |
... | ... | ... | ... |
279 | 2019 | M | 1785527 |
280 | 2020 | F | 1581301 |
281 | 2020 | M | 1706423 |
282 rows × 3 columns
Notice how the code closely follows the grouping recipe.
7.2.3. Other Aggregation Functions¶
The SQLite database has several other built-in aggregation functions, such as
COUNT
, AVG
, MIN
, and MAX
. The full list of functions is available on
the SQLite website 1.
We’ve already seen the SUM
function:
query = '''
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''
pd.read_sql(query, db)
Year | SUM(Count) | |
---|---|---|
0 | 1880 | 194419 |
1 | 1881 | 185772 |
2 | 1882 | 213385 |
... | ... | ... |
138 | 2018 | 3487193 |
139 | 2019 | 3437438 |
140 | 2020 | 3287724 |
141 rows × 2 columns
To use another aggregation function, we call it in the SELECT
clause. For
instance, we can use MAX
instead of SUM
:
query = '''
SELECT Year, MAX(Count)
FROM baby
GROUP BY Year
'''
pd.read_sql(query, db)
Year | MAX(Count) | |
---|---|---|
0 | 1880 | 9655 |
1 | 1881 | 8769 |
2 | 1882 | 9557 |
... | ... | ... |
138 | 2018 | 19924 |
139 | 2019 | 20555 |
140 | 2020 | 19659 |
141 rows × 2 columns
Note
The available aggregation functions are one of the first places a data
scientist may encounter differences in SQL implementations. For instance,
SQLite has a relatively minimal set of aggregation functions, while PostgreSQL
has many more 2. Most SQL implementations provide SUM
, COUNT
,
MIN
, MAX
, and AVG
.
7.2.4. Example: Have People Become More Creative With Baby Names?¶
Have people become more creative with baby names over time? One way to measure this is to see whether the number of unique baby names per year has increased over time.
To do this aggregation in SQL, we use the COUNT
function and the DISTINCT
keyword. The DISTINCT
keyword tells SQL to only keep the unique values within
a set of columns.
# Finds the unique baby names
query = '''
SELECT DISTINCT Name
FROM baby
'''
pd.read_sql(query, db)
Name | |
---|---|
0 | Liam |
1 | Noah |
2 | Oliver |
... | ... |
100361 | Crete |
100362 | Roll |
100363 | Zilpah |
100364 rows × 1 columns
To count the number of distinct names, we can aggregate using the COUNT
function. We’ll also use the AS
keyword to rename the resulting column.
# Finds the number of unique baby names
query = '''
SELECT COUNT(DISTINCT Name) AS n_names
FROM baby
'''
pd.read_sql(query, db)
n_names | |
---|---|
0 | 100364 |
Finally, we group by the Year
column to aggregate over each year rather than
over the entire dataset:
query = '''
SELECT Year, COUNT(DISTINCT Name) AS n_names
FROM baby
GROUP BY Year
'''
unique_names_by_year = pd.read_sql(query, db)
unique_names_by_year
Year | n_names | |
---|---|---|
0 | 1880 | 1889 |
1 | 1881 | 1829 |
2 | 1882 | 2012 |
... | ... | ... |
138 | 2018 | 29619 |
139 | 2019 | 29417 |
140 | 2020 | 28613 |
141 rows × 2 columns
Now, we can plot the number of unique names over time:
px.line(unique_names_by_year,
x='Year', y='n_names',
labels={'n_names': '# unique names'},
width=350, height=250)
We see that the number of unique names has generally increased over time, even though the number of babies born has mostly stabilized since the 1960s.
Note
Unlike pandas
, SQLite doesn’t provide a simple way to pivot a relation.
Instead, we use GROUP BY
on two columns in SQL, read the result into a
dataframe, and then use the unstack()
dataframe method 3.
This section covered common ways to aggregate data in SQL using
the GROUP BY
keyword with one or more columns.
In the next section, we’ll explain how to join relations together.