Aggregating
Contents
7.2. Aggregating#
This section introduces grouping and aggregating in SQL. We’ll work with the baby names data, as in the previous section:
import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = '''
SELECT *
FROM baby
LIMIT 10
'''
pd.read_sql(query, db)
| Name | Sex | Count | Year | |
|---|---|---|---|---|
| 0 | Liam | M | 19659 | 2020 |
| 1 | Noah | M | 18252 | 2020 |
| 2 | Oliver | M | 14147 | 2020 |
| ... | ... | ... | ... | ... |
| 7 | Lucas | M | 11281 | 2020 |
| 8 | Henry | M | 10705 | 2020 |
| 9 | Alexander | M | 10151 | 2020 |
10 rows × 4 columns
7.2.1. Basic Group-Aggregate Using GROUP BY#
Let’s say we want to find out the total number of babies born as recorded in
this data. This is simply the sum of the Count column. SQL provides
functions that we use in the SELECT statement, like SUM:
query = '''
SELECT SUM(Count)
FROM baby
'''
pd.read_sql(query, db)
| SUM(Count) | |
|---|---|
| 0 | 352554503 |
In Chapter 6, we used grouping and aggregation to figure out whether US births are trending upward over time. We grouped the dataset by year using .groupby(), then summed the counts within each group using .sum().
In SQL, we instead group using the GROUP BY clause, then call aggregation functions in SELECT:
query = '''
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''
pd.read_sql(query, db)
| Year | SUM(Count) | |
|---|---|---|
| 0 | 1880 | 194419 |
| 1 | 1881 | 185772 |
| 2 | 1882 | 213385 |
| ... | ... | ... |
| 138 | 2018 | 3487193 |
| 139 | 2019 | 3437438 |
| 140 | 2020 | 3287724 |
141 rows × 2 columns
As with dataframe grouping, notice that the Year column contains the unique Year values—there are no duplicate Year values anymore since we grouped them together. When grouping in pandas, the grouping columns become the index of the resulting dataframe. However, relations don’t have row labels, so the Year values are just a column in the resulting relation.
Here’s the basic recipe for grouping in SQL:
SELECT
col1, -- column used for grouping
SUM(col2) -- aggregation of another column
FROM table_name -- relation to use
GROUP BY col1 -- the column(s) to group by
Note that the order of clauses in a SQL statement is important. To avoid a syntax error, SELECT needs to appear first, then FROM, then WHERE, then GROUP BY.
When using GROUP BY we need to be careful about the columns given to SELECT. In general, we can only include columns without an aggregation when we use those columns to group. For instance, in the preceding example we grouped by the Year column, so we can include Year in the SELECT clause. All other columns included in SELECT should be aggregated, as we did earlier with SUM(Count). If we included a “bare” column like Name that wasn’t used for grouping, it’s ambiguous which name within the group should be returned. Although bare columns won’t cause an error for SQLite, they cause other SQL engines to error, so we recommend avoiding them.
7.2.2. Grouping on Multiple Columns#
We pass multiple columns into GROUP BY to group by multiple columns at
once. This is useful when we need to further subdivide our groups. For
example, we can group by both year and sex to see how many male and female
babies were born over time:
query = '''
SELECT Year, Sex, SUM(Count)
FROM baby
GROUP BY Year, Sex
'''
pd.read_sql(query, db)
| Year | Sex | SUM(Count) | |
|---|---|---|---|
| 0 | 1880 | F | 83929 |
| 1 | 1880 | M | 110490 |
| 2 | 1881 | F | 85034 |
| ... | ... | ... | ... |
| 279 | 2019 | M | 1785527 |
| 280 | 2020 | F | 1581301 |
| 281 | 2020 | M | 1706423 |
282 rows × 3 columns
Notice that the preceding code is very similar to grouping by a single column, except that it gives multiple columns to GROUP BY to group by both Year and Sex.
Note
Unlike pandas, SQLite doesn’t provide a simple way to pivot a relation.
Instead, we can use GROUP BY on two columns in SQL, read the result into a
dataframe, and then use the unstack() dataframe method.
7.2.3. Other Aggregation Functions#
SQLite has several other built-in aggregation functions besides SUM, such as
COUNT, AVG, MIN, and MAX. For the full list of functions, consult the
SQLite website.
To use another aggregation function, we call it in the SELECT clause. For
instance, we can use MAX instead of SUM:
query = '''
SELECT Year, MAX(Count)
FROM baby
GROUP BY Year
'''
pd.read_sql(query, db)
| Year | MAX(Count) | |
|---|---|---|
| 0 | 1880 | 9655 |
| 1 | 1881 | 8769 |
| 2 | 1882 | 9557 |
| ... | ... | ... |
| 138 | 2018 | 19924 |
| 139 | 2019 | 20555 |
| 140 | 2020 | 19659 |
141 rows × 2 columns
Note
The built-in aggregation functions are one of the first places a data
scientist may encounter differences in SQL implementations. For instance,
SQLite has a relatively minimal set of aggregation functions while PostgreSQL
has many more. That said, almost all SQL implementations provide SUM, COUNT,
MIN, MAX, and AVG.
This section covered common ways to aggregate data in SQL using
the GROUP BY keyword with one or more columns.
In the next section, we’ll explain how to join relations together.