{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "(sec:sql_aggregating)=\n",
    "# Aggregating\n",
    "\n",
    "This section introduces grouping and aggregating in SQL.\n",
    "We'll work with the baby names data, as in the previous section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlalchemy\n",
    "db = sqlalchemy.create_engine('sqlite:///babynames.db')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Count</th>\n",
       "      <th>Year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Liam</td>\n",
       "      <td>M</td>\n",
       "      <td>19659</td>\n",
       "      <td>2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Noah</td>\n",
       "      <td>M</td>\n",
       "      <td>18252</td>\n",
       "      <td>2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Oliver</td>\n",
       "      <td>M</td>\n",
       "      <td>14147</td>\n",
       "      <td>2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Lucas</td>\n",
       "      <td>M</td>\n",
       "      <td>11281</td>\n",
       "      <td>2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Henry</td>\n",
       "      <td>M</td>\n",
       "      <td>10705</td>\n",
       "      <td>2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Alexander</td>\n",
       "      <td>M</td>\n",
       "      <td>10151</td>\n",
       "      <td>2020</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         Name Sex  Count  Year\n",
       "0        Liam   M  19659  2020\n",
       "1        Noah   M  18252  2020\n",
       "2      Oliver   M  14147  2020\n",
       "..        ...  ..    ...   ...\n",
       "7       Lucas   M  11281  2020\n",
       "8       Henry   M  10705  2020\n",
       "9   Alexander   M  10151  2020\n",
       "\n",
       "[10 rows x 4 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = ''' \n",
    "SELECT *\n",
    "FROM baby\n",
    "LIMIT 10\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Basic Group-Aggregate Using GROUP BY"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "Let's say we want to find out the total number of babies born as recorded in\n",
    "this data. This is simply the sum of the `Count` column. SQL provides\n",
    "functions that we use in the `SELECT` statement, like `SUM`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SUM(Count)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>352554503</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   SUM(Count)\n",
       "0   352554503"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = ''' \n",
    "SELECT SUM(Count)\n",
    "FROM baby\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "In Chapter 6, we used grouping and aggregation to figure out whether US births are trending upward over time. We grouped the dataset by year using `.groupby()`, then summed the counts within each group using `.sum()`.\n",
    "\n",
    "In SQL, we instead group using the `GROUP BY` clause, then call aggregation functions in `SELECT`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>SUM(Count)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1880</td>\n",
       "      <td>194419</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1881</td>\n",
       "      <td>185772</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1882</td>\n",
       "      <td>213385</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>138</th>\n",
       "      <td>2018</td>\n",
       "      <td>3487193</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139</th>\n",
       "      <td>2019</td>\n",
       "      <td>3437438</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>2020</td>\n",
       "      <td>3287724</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>141 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Year  SUM(Count)\n",
       "0    1880      194419\n",
       "1    1881      185772\n",
       "2    1882      213385\n",
       "..    ...         ...\n",
       "138  2018     3487193\n",
       "139  2019     3437438\n",
       "140  2020     3287724\n",
       "\n",
       "[141 rows x 2 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = ''' \n",
    "SELECT Year, SUM(Count)\n",
    "FROM baby\n",
    "GROUP BY Year\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "As with dataframe grouping, notice that the `Year` column contains the unique `Year` values---there are no duplicate `Year` values anymore since we grouped them together. When grouping in `pandas`, the grouping columns become the index of the resulting dataframe. However, relations don't have row labels, so the `Year` values are just a column in the resulting relation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "Here's the basic recipe for grouping in `SQL`:\n",
    "\n",
    "```sql\n",
    "SELECT\n",
    "  col1,           -- column used for grouping\n",
    "  SUM(col2)       -- aggregation of another column\n",
    "FROM table_name   -- relation to use\n",
    "GROUP BY col1     -- the column(s) to group by\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "Note that the order of clauses in a SQL statement is important. To avoid a syntax error, `SELECT` needs to appear first, then `FROM`, then `WHERE`, then `GROUP BY`.\n",
    "\n",
    "When using `GROUP BY` we need to be careful about the columns given to `SELECT`. In general, we can only include columns without an aggregation when we use those columns to group. For instance, in the preceding example we grouped by the `Year` column, so we can include `Year` in the `SELECT` clause. All other columns included in `SELECT` should be aggregated, as we did earlier with `SUM(Count)`. If we included a \"bare\" column like `Name` that wasn't used for grouping, it's ambiguous which name within the group should be returned. Although bare columns won't cause an error for SQLite, they cause other SQL engines to error, so we recommend avoiding them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Grouping on Multiple Columns\n",
    "\n",
    "We pass multiple columns into `GROUP BY` to group by multiple columns at\n",
    "once. This is useful when we need to further subdivide our groups. For\n",
    "example, we can group by both year and sex to see how many male and female\n",
    "babies were born over time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>Sex</th>\n",
       "      <th>SUM(Count)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1880</td>\n",
       "      <td>F</td>\n",
       "      <td>83929</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1880</td>\n",
       "      <td>M</td>\n",
       "      <td>110490</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1881</td>\n",
       "      <td>F</td>\n",
       "      <td>85034</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>279</th>\n",
       "      <td>2019</td>\n",
       "      <td>M</td>\n",
       "      <td>1785527</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>280</th>\n",
       "      <td>2020</td>\n",
       "      <td>F</td>\n",
       "      <td>1581301</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281</th>\n",
       "      <td>2020</td>\n",
       "      <td>M</td>\n",
       "      <td>1706423</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>282 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Year Sex  SUM(Count)\n",
       "0    1880   F       83929\n",
       "1    1880   M      110490\n",
       "2    1881   F       85034\n",
       "..    ...  ..         ...\n",
       "279  2019   M     1785527\n",
       "280  2020   F     1581301\n",
       "281  2020   M     1706423\n",
       "\n",
       "[282 rows x 3 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = ''' \n",
    "SELECT Year, Sex, SUM(Count)\n",
    "FROM baby\n",
    "GROUP BY Year, Sex\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "Notice that the preceding code is very similar to grouping by a single column, except that it gives multiple columns to `GROUP BY` to group by both `Year` and `Sex`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    ":::{note}\n",
    "\n",
    "Unlike `pandas`, SQLite doesn't provide a simple way to pivot a relation.\n",
    "Instead, we can use `GROUP BY` on two columns in SQL, read the result into a\n",
    "dataframe, and then use the `unstack()` dataframe method.\n",
    "\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "## Other Aggregation Functions\n",
    "\n",
    "SQLite has several other built-in aggregation functions besides `SUM`, such as\n",
    "`COUNT`, `AVG`, `MIN`, and `MAX`. For the full list of functions, consult [the \n",
    "SQLite website][agg_funcs].\n",
    "\n",
    "[agg_funcs]: https://www.sqlite.org/lang_aggfunc.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    "To use another aggregation function, we call it in the `SELECT` clause. For\n",
    "instance, we can use `MAX` instead of `SUM`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>MAX(Count)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1880</td>\n",
       "      <td>9655</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1881</td>\n",
       "      <td>8769</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1882</td>\n",
       "      <td>9557</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>138</th>\n",
       "      <td>2018</td>\n",
       "      <td>19924</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>139</th>\n",
       "      <td>2019</td>\n",
       "      <td>20555</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>2020</td>\n",
       "      <td>19659</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>141 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Year  MAX(Count)\n",
       "0    1880        9655\n",
       "1    1881        8769\n",
       "2    1882        9557\n",
       "..    ...         ...\n",
       "138  2018       19924\n",
       "139  2019       20555\n",
       "140  2020       19659\n",
       "\n",
       "[141 rows x 2 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = ''' \n",
    "SELECT Year, MAX(Count)\n",
    "FROM baby\n",
    "GROUP BY Year\n",
    "'''\n",
    "\n",
    "pd.read_sql(query, db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "user_expressions": []
   },
   "source": [
    ":::{note}\n",
    "\n",
    "The built-in aggregation functions are one of the first places a data\n",
    "scientist may encounter differences in SQL implementations. For instance,\n",
    "SQLite has a relatively minimal set of aggregation functions while [PostgreSQL\n",
    "has many more][pg_agg_funcs]. That said, almost all SQL implementations provide `SUM`, `COUNT`,\n",
    "`MIN`, `MAX`, and `AVG`.\n",
    "\n",
    ":::\n",
    "\n",
    "[pg_agg_funcs]: https://www.postgresql.org/docs/current/functions-aggregate.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [],
    "user_expressions": []
   },
   "source": [
    "This section covered common ways to aggregate data in SQL using\n",
    "the `GROUP BY` keyword with one or more columns.\n",
    "In the next section, we'll explain how to join relations together."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}