{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import sys\n", "import os\n", "if not any(path.endswith('textbook') for path in sys.path):\n", " sys.path.append(os.path.abspath('../../..'))\n", "from textbook_utils import *" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "(sec:sql_aggregating)=\n", "# Aggregating\n", "\n", "This section introduces grouping and aggregating in SQL.\n", "We'll work with the baby names data, as in the previous section:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sqlalchemy\n", "db = sqlalchemy.create_engine('sqlite:///babynames.db')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYear
0LiamM196592020
1NoahM182522020
2OliverM141472020
...............
7LucasM112812020
8HenryM107052020
9AlexanderM101512020
\n", "

10 rows × 4 columns

\n", "
" ], "text/plain": [ " Name Sex Count Year\n", "0 Liam M 19659 2020\n", "1 Noah M 18252 2020\n", "2 Oliver M 14147 2020\n", ".. ... .. ... ...\n", "7 Lucas M 11281 2020\n", "8 Henry M 10705 2020\n", "9 Alexander M 10151 2020\n", "\n", "[10 rows x 4 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT *\n", "FROM baby\n", "LIMIT 10\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "## Basic Group-Aggregate Using GROUP BY" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Let's say we want to find out the total number of babies born as recorded in\n", "this data. This is simply the sum of the `Count` column. SQL provides\n", "functions that we use in the `SELECT` statement, like `SUM`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SUM(Count)
0352554503
\n", "
" ], "text/plain": [ " SUM(Count)\n", "0 352554503" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT SUM(Count)\n", "FROM baby\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "In Chapter 6, we used grouping and aggregation to figure out whether US births are trending upward over time. We grouped the dataset by year using `.groupby()`, then summed the counts within each group using `.sum()`.\n", "\n", "In SQL, we instead group using the `GROUP BY` clause, then call aggregation functions in `SELECT`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearSUM(Count)
01880194419
11881185772
21882213385
.........
13820183487193
13920193437438
14020203287724
\n", "

141 rows × 2 columns

\n", "
" ], "text/plain": [ " Year SUM(Count)\n", "0 1880 194419\n", "1 1881 185772\n", "2 1882 213385\n", ".. ... ...\n", "138 2018 3487193\n", "139 2019 3437438\n", "140 2020 3287724\n", "\n", "[141 rows x 2 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT Year, SUM(Count)\n", "FROM baby\n", "GROUP BY Year\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "As with dataframe grouping, notice that the `Year` column contains the unique `Year` values---there are no duplicate `Year` values anymore since we grouped them together. When grouping in `pandas`, the grouping columns become the index of the resulting dataframe. However, relations don't have row labels, so the `Year` values are just a column in the resulting relation." ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Here's the basic recipe for grouping in `SQL`:\n", "\n", "```sql\n", "SELECT\n", " col1, -- column used for grouping\n", " SUM(col2) -- aggregation of another column\n", "FROM table_name -- relation to use\n", "GROUP BY col1 -- the column(s) to group by\n", "```" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Note that the order of clauses in a SQL statement is important. To avoid a syntax error, `SELECT` needs to appear first, then `FROM`, then `WHERE`, then `GROUP BY`.\n", "\n", "When using `GROUP BY` we need to be careful about the columns given to `SELECT`. In general, we can only include columns without an aggregation when we use those columns to group. For instance, in the preceding example we grouped by the `Year` column, so we can include `Year` in the `SELECT` clause. All other columns included in `SELECT` should be aggregated, as we did earlier with `SUM(Count)`. If we included a \"bare\" column like `Name` that wasn't used for grouping, it's ambiguous which name within the group should be returned. Although bare columns won't cause an error for SQLite, they cause other SQL engines to error, so we recommend avoiding them." ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "## Grouping on Multiple Columns\n", "\n", "We pass multiple columns into `GROUP BY` to group by multiple columns at\n", "once. This is useful when we need to further subdivide our groups. For\n", "example, we can group by both year and sex to see how many male and female\n", "babies were born over time:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearSexSUM(Count)
01880F83929
11880M110490
21881F85034
............
2792019M1785527
2802020F1581301
2812020M1706423
\n", "

282 rows × 3 columns

\n", "
" ], "text/plain": [ " Year Sex SUM(Count)\n", "0 1880 F 83929\n", "1 1880 M 110490\n", "2 1881 F 85034\n", ".. ... .. ...\n", "279 2019 M 1785527\n", "280 2020 F 1581301\n", "281 2020 M 1706423\n", "\n", "[282 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT Year, Sex, SUM(Count)\n", "FROM baby\n", "GROUP BY Year, Sex\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Notice that the preceding code is very similar to grouping by a single column, except that it gives multiple columns to `GROUP BY` to group by both `Year` and `Sex`." ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ ":::{note}\n", "\n", "Unlike `pandas`, SQLite doesn't provide a simple way to pivot a relation.\n", "Instead, we can use `GROUP BY` on two columns in SQL, read the result into a\n", "dataframe, and then use the `unstack()` dataframe method.\n", "\n", ":::" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "## Other Aggregation Functions\n", "\n", "SQLite has several other built-in aggregation functions besides `SUM`, such as\n", "`COUNT`, `AVG`, `MIN`, and `MAX`. For the full list of functions, consult [the \n", "SQLite website][agg_funcs].\n", "\n", "[agg_funcs]: https://www.sqlite.org/lang_aggfunc.html" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "To use another aggregation function, we call it in the `SELECT` clause. For\n", "instance, we can use `MAX` instead of `SUM`:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearMAX(Count)
018809655
118818769
218829557
.........
138201819924
139201920555
140202019659
\n", "

141 rows × 2 columns

\n", "
" ], "text/plain": [ " Year MAX(Count)\n", "0 1880 9655\n", "1 1881 8769\n", "2 1882 9557\n", ".. ... ...\n", "138 2018 19924\n", "139 2019 20555\n", "140 2020 19659\n", "\n", "[141 rows x 2 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT Year, MAX(Count)\n", "FROM baby\n", "GROUP BY Year\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ ":::{note}\n", "\n", "The built-in aggregation functions are one of the first places a data\n", "scientist may encounter differences in SQL implementations. For instance,\n", "SQLite has a relatively minimal set of aggregation functions while [PostgreSQL\n", "has many more][pg_agg_funcs]. That said, almost all SQL implementations provide `SUM`, `COUNT`,\n", "`MIN`, `MAX`, and `AVG`.\n", "\n", ":::\n", "\n", "[pg_agg_funcs]: https://www.postgresql.org/docs/current/functions-aggregate.html" ] }, { "cell_type": "markdown", "metadata": { "tags": [], "user_expressions": [] }, "source": [ "This section covered common ways to aggregate data in SQL using\n", "the `GROUP BY` keyword with one or more columns.\n", "In the next section, we'll explain how to join relations together." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "toc": { "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }