{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import sys\n", "import os\n", "if not any(path.endswith('textbook') for path in sys.path):\n", " sys.path.append(os.path.abspath('../../..'))\n", "from textbook_utils import *" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "(sec:sql_transforming)=\n", "# Transforming and Common Table Expressions\n", "\n", "In this section, we show how to call functions to transform columns of data using built-in SQL functions. We also demonstrate how to use common table expressions to build up complex queries from simpler ones. As usual, we start by loading the database:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Set up connection to database\n", "import sqlalchemy\n", "db = sqlalchemy.create_engine('sqlite:///babynames.db')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Functions\n", "\n", "SQLite provides a variety of _scalar functions_, or functions that transform single data\n", "values. When called on a column of data, SQLite will apply these functions on\n", "each value in the column. In contrast, aggregation functions like `SUM` and\n", "`COUNT` take a column of values as input and compute a single value as output.\n", "\n", "SQLite provides a comprehensive list of the built-in scalar functions in [its online documentation][funcs]. For instance, to find the number of characters in each name, we use the\n", "`LENGTH` function:\n", "\n", "[funcs]: https://www.sqlite.org/lang_corefunc.html" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLENGTH(Name)
0Liam4
1Noah4
2Oliver6
.........
7Lucas5
8Henry5
9Alexander9
\n", "

10 rows × 2 columns

\n", "
" ], "text/plain": [ " Name LENGTH(Name)\n", "0 Liam 4\n", "1 Noah 4\n", "2 Oliver 6\n", ".. ... ...\n", "7 Lucas 5\n", "8 Henry 5\n", "9 Alexander 9\n", "\n", "[10 rows x 2 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT Name, LENGTH(Name)\n", "FROM baby\n", "LIMIT 10;\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the `LENGTH` function is applied to each value within the `Name`\n", "column. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", "\n", "Like aggregation functions, each implementation of SQL provides a different set\n", "of scalar functions. SQLite has a relatively minimal set of functions, while\n", "[PostgreSQL has many more][pg_funcs]. That said, almost all SQL implementations provide some\n", "equivalent to SQLite's `LENGTH`, `ROUND`, `SUBSTR`, and `LIKE` functions. \n", "\n", ":::\n", "\n", "[pg_funcs]: https://www.postgresql.org/docs/9.2/functions.html" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Although scalar functions use the same syntax as an aggregation function, they behave differently. This\n", "can result in confusing output if the two are mixed together in a single query:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameLENGTH(Name)AVG(Count)
0Liam4174.47
\n", "
" ], "text/plain": [ " Name LENGTH(Name) AVG(Count)\n", "0 Liam 4 174.47" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT Name, LENGTH(Name), AVG(Count)\n", "FROM baby\n", "LIMIT 10;\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here, the `AVG(Name)` computes the average of the entire `Count` column, but the output is confusing---a reader could easily think the average is related to the name Liam. For this reason, we must be careful when scalar and aggregation functions\n", "appear together within a `SELECT` statement. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To extract the first letter of each name, we can use the `SUBSTR` function\n", "(short for _substring_). As described in the documentation, the `SUBSTR`\n", "function takes three arguments. The first is the input string, the second is\n", "the position to begin the substring (1-indexed), and the third is the\n", "length of the substring:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSUBSTR(Name, 1, 1)
0LiamL
1NoahN
2OliverO
.........
7LucasL
8HenryH
9AlexanderA
\n", "

10 rows × 2 columns

\n", "
" ], "text/plain": [ " Name SUBSTR(Name, 1, 1)\n", "0 Liam L\n", "1 Noah N\n", "2 Oliver O\n", ".. ... ...\n", "7 Lucas L\n", "8 Henry H\n", "9 Alexander A\n", "\n", "[10 rows x 2 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT Name, SUBSTR(Name, 1, 1)\n", "FROM baby\n", "LIMIT 10;\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `AS` keyword to rename the column:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYearFirsts
0LiamM196592020L
1NoahM182522020N
2OliverM141472020O
..................
7LucasM112812020L
8HenryM107052020H
9AlexanderM101512020A
\n", "

10 rows × 5 columns

\n", "
" ], "text/plain": [ " Name Sex Count Year Firsts\n", "0 Liam M 19659 2020 L\n", "1 Noah M 18252 2020 N\n", "2 Oliver M 14147 2020 O\n", ".. ... .. ... ... ...\n", "7 Lucas M 11281 2020 L\n", "8 Henry M 10705 2020 H\n", "9 Alexander M 10151 2020 A\n", "\n", "[10 rows x 5 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "SELECT *, SUBSTR(Name, 1, 1) AS Firsts\n", "FROM baby\n", "LIMIT 10;\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "After calculating the first letter of each name, our analysis aims to understand the popularity of first letters over time. To do this, we want to take the output of this SQL query and use it as a single step within a longer chain of operations." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "SQL provides several options to break queries into smaller steps, which is helpful in more complex analyses like this one. The most common options for doing this are to create a new relation using a `CREATE TABLE` statement, create a new view using `CREATE VIEW`, or create a temporary relation using `WITH`. Each of these methods have different use-cases. For simplicity, we only describe the `WITH` statement in this section and suggest that readers look over the SQLite documentation for details." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Multistep Queries Using a WITH Clause\n", "\n", "The `WITH` clause lets us assign a name to any `SELECT` query. Then we can\n", "treat that query as though it exists as a relation in the database just for the duration of the query.\n", "SQLite calls these temporary relations *common table expressions*.\n", "For\n", "instance, we can take the earlier query that calculates the first letter of each\n", "name and call it `letters`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameSexCountYearFirsts
0LiamM196592020L
1NoahM182522020N
2OliverM141472020O
..................
7LucasM112812020L
8HenryM107052020H
9AlexanderM101512020A
\n", "

10 rows × 5 columns

\n", "
" ], "text/plain": [ " Name Sex Count Year Firsts\n", "0 Liam M 19659 2020 L\n", "1 Noah M 18252 2020 N\n", "2 Oliver M 14147 2020 O\n", ".. ... .. ... ... ...\n", "7 Lucas M 11281 2020 L\n", "8 Henry M 10705 2020 H\n", "9 Alexander M 10151 2020 A\n", "\n", "[10 rows x 5 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "-- Create a temporary relation called letters by calculating\n", "-- the first letter for each name in baby\n", "WITH letters AS (\n", " SELECT *, SUBSTR(Name, 1, 1) AS Firsts\n", " FROM baby\n", ")\n", "-- Then, select the first ten rows from letters\n", "SELECT *\n", "FROM letters\n", "LIMIT 10;\n", "'''\n", "\n", "pd.read_sql(query, db)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`WITH` statements are very useful since they can be chained together. We can\n", "create multiple temporary relations in a `WITH` statement that each perform a\n", "bit of work on the previous result, which lets us gradually build complicated\n", "queries a step at a time. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Popularity of \"L\" Names" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can use `WITH` statements to look at the popularity of names that start with the letter _L_ over time. We'll group the temporary `letters` relation by the\n", "first letter and year, then aggregate the `Count` column using a sum, then filter to get only names with the letter _L_:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FirstsYearCount
0L188012799
1L188112770
2L188214923
............
138L2018246251
139L2019249315
140L2020239760
\n", "

141 rows × 3 columns

\n", "
" ], "text/plain": [ " Firsts Year Count\n", "0 L 1880 12799\n", "1 L 1881 12770\n", "2 L 1882 14923\n", ".. ... ... ...\n", "138 L 2018 246251\n", "139 L 2019 249315\n", "140 L 2020 239760\n", "\n", "[141 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = ''' \n", "WITH letters AS (\n", " SELECT *, SUBSTR(Name, 1, 1) AS Firsts\n", " FROM baby\n", ")\n", "SELECT Firsts, Year, SUM(Count) AS Count\n", "FROM letters\n", "WHERE Firsts = \"L\"\n", "GROUP BY Firsts, Year;\n", "'''\n", "\n", "letter_counts = pd.read_sql(query, db)\n", "letter_counts" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This relation contains the same data as the one from {numref}`Chapter %s `. In that chapter, we make a plot of the `Count` column over time, which we omit here for brevity." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we introduced data transformations.\n", "To transform values in a relation, we commonly use SQL functions like\n", "`LENGTH()` or `SUBSTR()`.\n", "We also explained how to build up complex queries using the `WITH` clause." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" }, "toc": { "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }