7.2. Aggregating#

This section introduces grouping and aggregating in SQL. We’ll work with the baby names data, as in the previous section:

import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')

query = ''' 
SELECT *
FROM baby
LIMIT 10
'''

pd.read_sql(query, db)

	Name	Sex	Count	Year
0	Liam	M	19659	2020
1	Noah	M	18252	2020
2	Oliver	M	14147	2020
...	...	...	...	...
7	Lucas	M	11281	2020
8	Henry	M	10705	2020
9	Alexander	M	10151	2020

10 rows × 4 columns

7.2.1. Basic Group-Aggregate Using GROUP BY#

Let’s say we want to find out the total number of babies born as recorded in this data. This is simply the sum of the Count column. SQL provides functions that we use in the SELECT statement, like SUM:

query = ''' 
SELECT SUM(Count)
FROM baby
'''

pd.read_sql(query, db)

	SUM(Count)
0	352554503

In Chapter 6, we used grouping and aggregation to figure out whether US births are trending upward over time. We grouped the dataset by year using .groupby(), then summed the counts within each group using .sum().

In SQL, we instead group using the GROUP BY clause, then call aggregation functions in SELECT:

query = ''' 
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)

	Year	SUM(Count)
0	1880	194419
1	1881	185772
2	1882	213385
...	...	...
138	2018	3487193
139	2019	3437438
140	2020	3287724

141 rows × 2 columns

As with dataframe grouping, notice that the Year column contains the unique Year values—there are no duplicate Year values anymore since we grouped them together. When grouping in pandas, the grouping columns become the index of the resulting dataframe. However, relations don’t have row labels, so the Year values are just a column in the resulting relation.

Here’s the basic recipe for grouping in SQL:

SELECT
  col1,           -- column used for grouping
  SUM(col2)       -- aggregation of another column
FROM table_name   -- relation to use
GROUP BY col1     -- the column(s) to group by

Note that the order of clauses in a SQL statement is important. To avoid a syntax error, SELECT needs to appear first, then FROM, then WHERE, then GROUP BY.

When using GROUP BY we need to be careful about the columns given to SELECT. In general, we can only include columns without an aggregation when we use those columns to group. For instance, in the preceding example we grouped by the Year column, so we can include Year in the SELECT clause. All other columns included in SELECT should be aggregated, as we did earlier with SUM(Count). If we included a “bare” column like Name that wasn’t used for grouping, it’s ambiguous which name within the group should be returned. Although bare columns won’t cause an error for SQLite, they cause other SQL engines to error, so we recommend avoiding them.

7.2.2. Grouping on Multiple Columns#

We pass multiple columns into GROUP BY to group by multiple columns at once. This is useful when we need to further subdivide our groups. For example, we can group by both year and sex to see how many male and female babies were born over time:

query = ''' 
SELECT Year, Sex, SUM(Count)
FROM baby
GROUP BY Year, Sex
'''

pd.read_sql(query, db)

	Year	Sex	SUM(Count)
0	1880	F	83929
1	1880	M	110490
2	1881	F	85034
...	...	...	...
279	2019	M	1785527
280	2020	F	1581301
281	2020	M	1706423

282 rows × 3 columns

Notice that the preceding code is very similar to grouping by a single column, except that it gives multiple columns to GROUP BY to group by both Year and Sex.

Note

Unlike pandas, SQLite doesn’t provide a simple way to pivot a relation. Instead, we can use GROUP BY on two columns in SQL, read the result into a dataframe, and then use the unstack() dataframe method.

7.2.3. Other Aggregation Functions#

SQLite has several other built-in aggregation functions besides SUM, such as COUNT, AVG, MIN, and MAX. For the full list of functions, consult the SQLite website.

To use another aggregation function, we call it in the SELECT clause. For instance, we can use MAX instead of SUM:

query = ''' 
SELECT Year, MAX(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)

	Year	MAX(Count)
0	1880	9655
1	1881	8769
2	1882	9557
...	...	...
138	2018	19924
139	2019	20555
140	2020	19659

141 rows × 2 columns

Note

The built-in aggregation functions are one of the first places a data scientist may encounter differences in SQL implementations. For instance, SQLite has a relatively minimal set of aggregation functions while PostgreSQL has many more. That said, almost all SQL implementations provide SUM, COUNT, MIN, MAX, and AVG.

This section covered common ways to aggregate data in SQL using the GROUP BY keyword with one or more columns. In the next section, we’ll explain how to join relations together.

Learning Data Science

Aggregating

Contents

7.2. Aggregating#

7.2.1. Basic Group-Aggregate Using GROUP BY#

7.2.2. Grouping on Multiple Columns#

7.2.3. Other Aggregation Functions#