Aggregating
Contents
7.2. Aggregating#
This section introduces grouping and aggregating in SQL. We’ll work with the baby names data, as in the previous section:
import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = '''
SELECT *
FROM baby
LIMIT 10
'''
pd.read_sql(query, db)
Name | Sex | Count | Year | |
---|---|---|---|---|
0 | Liam | M | 19659 | 2020 |
1 | Noah | M | 18252 | 2020 |
2 | Oliver | M | 14147 | 2020 |
... | ... | ... | ... | ... |
7 | Lucas | M | 11281 | 2020 |
8 | Henry | M | 10705 | 2020 |
9 | Alexander | M | 10151 | 2020 |
10 rows × 4 columns
7.2.1. Basic Group-Aggregate Using GROUP BY#
Let’s say we want to find out the total number of babies born as recorded in
this data. This is simply the sum of the Count
column. SQL provides
functions that we use in the SELECT
statement, like SUM
:
query = '''
SELECT SUM(Count)
FROM baby
'''
pd.read_sql(query, db)
SUM(Count) | |
---|---|
0 | 352554503 |
In Chapter 6, we used grouping and aggregation to figure out whether US births are trending upward over time. We grouped the dataset by year using .groupby()
, then summed the counts within each group using .sum()
.
In SQL, we instead group using the GROUP BY
clause, then call aggregation functions in SELECT
:
query = '''
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''
pd.read_sql(query, db)
Year | SUM(Count) | |
---|---|---|
0 | 1880 | 194419 |
1 | 1881 | 185772 |
2 | 1882 | 213385 |
... | ... | ... |
138 | 2018 | 3487193 |
139 | 2019 | 3437438 |
140 | 2020 | 3287724 |
141 rows × 2 columns
As with dataframe grouping, notice that the Year
column contains the unique Year
values—there are no duplicate Year
values anymore since we grouped them together. When grouping in pandas
, the grouping columns become the index of the resulting dataframe. However, relations don’t have row labels, so the Year
values are just a column in the resulting relation.
Here’s the basic recipe for grouping in SQL
:
SELECT
col1, -- column used for grouping
SUM(col2) -- aggregation of another column
FROM table_name -- relation to use
GROUP BY col1 -- the column(s) to group by
Note that the order of clauses in a SQL statement is important. To avoid a syntax error, SELECT
needs to appear first, then FROM
, then WHERE
, then GROUP BY
.
When using GROUP BY
we need to be careful about the columns given to SELECT
. In general, we can only include columns without an aggregation when we use those columns to group. For instance, in the preceding example we grouped by the Year
column, so we can include Year
in the SELECT
clause. All other columns included in SELECT
should be aggregated, as we did earlier with SUM(Count)
. If we included a “bare” column like Name
that wasn’t used for grouping, it’s ambiguous which name within the group should be returned. Although bare columns won’t cause an error for SQLite, they cause other SQL engines to error, so we recommend avoiding them.
7.2.2. Grouping on Multiple Columns#
We pass multiple columns into GROUP BY
to group by multiple columns at
once. This is useful when we need to further subdivide our groups. For
example, we can group by both year and sex to see how many male and female
babies were born over time:
query = '''
SELECT Year, Sex, SUM(Count)
FROM baby
GROUP BY Year, Sex
'''
pd.read_sql(query, db)
Year | Sex | SUM(Count) | |
---|---|---|---|
0 | 1880 | F | 83929 |
1 | 1880 | M | 110490 |
2 | 1881 | F | 85034 |
... | ... | ... | ... |
279 | 2019 | M | 1785527 |
280 | 2020 | F | 1581301 |
281 | 2020 | M | 1706423 |
282 rows × 3 columns
Notice that the preceding code is very similar to grouping by a single column, except that it gives multiple columns to GROUP BY
to group by both Year
and Sex
.
Note
Unlike pandas
, SQLite doesn’t provide a simple way to pivot a relation.
Instead, we can use GROUP BY
on two columns in SQL, read the result into a
dataframe, and then use the unstack()
dataframe method.
7.2.3. Other Aggregation Functions#
SQLite has several other built-in aggregation functions besides SUM
, such as
COUNT
, AVG
, MIN
, and MAX
. For the full list of functions, consult the
SQLite website.
To use another aggregation function, we call it in the SELECT
clause. For
instance, we can use MAX
instead of SUM
:
query = '''
SELECT Year, MAX(Count)
FROM baby
GROUP BY Year
'''
pd.read_sql(query, db)
Year | MAX(Count) | |
---|---|---|
0 | 1880 | 9655 |
1 | 1881 | 8769 |
2 | 1882 | 9557 |
... | ... | ... |
138 | 2018 | 19924 |
139 | 2019 | 20555 |
140 | 2020 | 19659 |
141 rows × 2 columns
Note
The built-in aggregation functions are one of the first places a data
scientist may encounter differences in SQL implementations. For instance,
SQLite has a relatively minimal set of aggregation functions while PostgreSQL
has many more. That said, almost all SQL implementations provide SUM
, COUNT
,
MIN
, MAX
, and AVG
.
This section covered common ways to aggregate data in SQL using
the GROUP BY
keyword with one or more columns.
In the next section, we’ll explain how to join relations together.