{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"id": "68a728dc-f0cf-49c3-91ae-664eabeff0d2",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"if not any(path.endswith('textbook') for path in sys.path):\n",
" sys.path.append(os.path.abspath('../../..'))\n",
"from textbook_utils import *"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "17e576a5-9e23-47f3-a541-2218c5511a03",
"metadata": {},
"source": [
"# JSON Data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6b1b5fc3-85a0-4e68-ad82-ef0a2273cff3",
"metadata": {},
"source": [
"JavaScript Object Notation (JSON) is a popular format for exchanging data on the web. This plain-text format has a simple and flexible syntax that aligns well with Python dictionaries, and it is easy for machines to parse and people to read. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3bf57557-9a56-4bc6-b7ea-70ed54aba259",
"metadata": {},
"source": [
"Briefly, JSON has two main structures, the object and the array: \n",
"\n",
"Object\n",
": Like a Python `dict`, a JSON object is an unordered collection of\n",
" name-value pairs. These pairs are contained in curly braces; each is\n",
" formatted as `\"name\":value`, and separated by commas.\n",
"\n",
"Array\n",
": Like a Python `list`, a JSON array is an ordered collection of values\n",
" contained in square brackets, where the values are unnamed and separated by\n",
" commas. \n",
"\n",
"The values in an object and array can be of different types and can be nested. That is, an array can contain objects and vice versa. The primitive types are limited to string in double quotes; number in text representation; logical as true or false; and null. \n",
"\n",
"The following short JSON file demonstrates all of these syntactical features:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e7f97be9-b0ed-4c09-8873-78d69e2aaaab",
"metadata": {},
"source": [
"```json\n",
"{\"lender_id\":\"matt\", \n",
" \"loan_count\":23,\n",
" \"status\":[2, 1, 3], \n",
" \"sponsored\": false, \n",
" \"sponsor_name\": null,\n",
" \"lender_dem\":{\"sex\":\"m\",\"age\":77 } \n",
"}\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "15290f18-90ab-4c1c-88a3-5c5abb713bd2",
"metadata": {},
"source": [
"Here we have an object that contains six name-value pairs. The values are heterogeneous; four are primitive values: string, number, logical, and null. The `status` value consists of an array of three (ordered) numbers,\n",
"and `lender_dem` is an object with demographic information. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "372f43a7-58d6-48d6-aa4e-69b25910b509",
"metadata": {},
"source": [
"The built-in `json` package can be used to work with JSON files in Python. For example, we can load this small file into a Python dictionary: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "78832b4a-541e-4b85-9818-a1997a30ee11",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from pathlib import Path\n",
"\n",
"file_path = Path() / 'data' / 'js_ex' / 'ex.json'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "710dd752-0e97-4bce-b926-20be258af720",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'lender_id': 'matt',\n",
" 'loan_count': 23,\n",
" 'status': [2, 1, 3],\n",
" 'sponsored': False,\n",
" 'sponsor_name': None,\n",
" 'lender_dem': {'sex': 'm', 'age': 77}}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ex_dict = json.load(open(file_path))\n",
"ex_dict"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6f843271-4b6e-4dbe-ac37-8c6d1c43ba89",
"metadata": {},
"source": [
"The dictionary matches the format of the Kiva file. This format doesn't naturally translate to a data frame. The `json_normalize` method can organize this semistructured JSON data into a flat table:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4a8d06d0-33f5-42da-b2f3-2c6e4e51eb18",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" lender_id | \n",
" loan_count | \n",
" status | \n",
" sponsored | \n",
" sponsor_name | \n",
" lender_dem.sex | \n",
" lender_dem.age | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" matt | \n",
" 23 | \n",
" [2, 1, 3] | \n",
" False | \n",
" None | \n",
" m | \n",
" 77 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" lender_id loan_count status sponsored sponsor_name lender_dem.sex \n",
"0 matt 23 [2, 1, 3] False None m \\\n",
"\n",
" lender_dem.age \n",
"0 77 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ex_df = pd.json_normalize(ex_dict)\n",
"ex_df"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "818b1e43-20e3-4a67-9601-fbdc871b6968",
"metadata": {},
"source": [
"Notice how the third element in this one-row data frame is a list, whereas, the nested object was converted into two columns."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6237741d-9490-4ee7-8e9e-6be50f6ac296",
"metadata": {},
"source": [
"There's a tremendous amount of flexibility in how data can be structured in JSON, which means that if we want to create a data frame from JSON content, we need to understand how the data are organized in the JSON file. We provide three structures that translate easily into a data frame in the next example. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "bbcc6c58-d20f-45df-bd1f-18a41fce5fa0",
"metadata": {},
"source": [
"The list of PurpleAir sites used in the case study in {numref}`Chapter %s ` was JSON formatted. In that chapter, we didn't call attention to the format and simply read the file contents into a dictionary with the `json` library's `load` method and then into a data frame. Here, we have simplified that file while maintaining the general structure so that it's easier to examine. \n",
"\n",
"We begin with an examination of the original file, and then reorganize it into two other JSON structures that might also be used to represent a data frame. With these examples we aim to show the flexibility of JSON. The diagrams in {numref}`Figure %s ` give representations of the three possibilities. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "356ea6d1-5901-44b7-955b-0b3b6c2b0fc5",
"metadata": {},
"source": [
"```{figure} figures/JSON-diagram.png\n",
"---\n",
"name: JSON-diagram\n",
"---\n",
"\n",
"Three different approaches for a JSON-formatted to store a dataframe\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "952129e8-d651-4d9f-82a0-caa554fc9c6a",
"metadata": {},
"source": [
"The leftmost data frame in the diagram shows an organization by rows. Each row is an object of named values where the name corresponds to the column name of the data frame. Rows would then be collected in an array. This structure coincides with that of the original file. In the following code, we display the file contents: "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c27d30c0-052b-4ba0-9ac6-3f3b2316629d",
"metadata": {},
"source": [
"```\n",
"{\"Header\": [\n",
" {\"status\": \"Success\",\n",
" \"request_time\": \"2022-12-29T01:48:30-05:00\",\n",
" \"url\": \"https://aqs.epa.gov/data/api/dailyData/...\",\n",
" \"rows\": 4\n",
" }\n",
" ],\n",
" \"Data\": [\n",
" {\"site\": \"0014\", \"date\": \"02-27\", \"aqi\": 30},\n",
" {\"site\": \"0014\", \"date\": \"02-24\", \"aqi\": 17},\n",
" {\"site\": \"0014\", \"date\": \"02-21\", \"aqi\": 60},\n",
" {\"site\": \"0014\", \"date\": \"01-15\", \"aqi\": null}\n",
" ]\n",
"}\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6c479d41-3b53-41d5-becf-fa495b89b396",
"metadata": {},
"source": [
"We see that the file consists of one object with two elements, named `Header` and `Data`. The `Data` element is an array with an element for each row in the data frame, and as described earlier each element is an object. Let's load the file into a dictionary and check its contents (see {numref}`Chapter %s ` for more on finding a pathname to a file and printing its contents): "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "8f8cbde7-f94b-4fa8-90f7-a4518c744ce8",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import os\n",
"\n",
"epa_file_path = Path('data/js_ex/epa_row.json')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "266f352c-4a34-4637-811a-cbe7131dad29",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Header': [{'status': 'Success',\n",
" 'request_time': '2022-12-29T01:48:30-05:00',\n",
" 'url': 'https://aqs.epa.gov/data/api/dailyData/...',\n",
" 'rows': 4}],\n",
" 'Data': [{'site': '0014', 'date': '02-27', 'aqi': 30},\n",
" {'site': '0014', 'date': '02-24', 'aqi': 17},\n",
" {'site': '0014', 'date': '02-21', 'aqi': 60},\n",
" {'site': '0014', 'date': '01-15', 'aqi': None}]}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_row = json.loads(epa_file_path.read_text())\n",
"data_row"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7c04c0bc-ff5e-415f-885c-412cdcdeb86c",
"metadata": {},
"source": [
"We can quickly convert the array of objects into a data frame with the following call:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a5032c24-5632-4e8b-95ef-128a351228de",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" site | \n",
" date | \n",
" aqi | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0014 | \n",
" 02-27 | \n",
" 30.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0014 | \n",
" 02-24 | \n",
" 17.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0014 | \n",
" 02-21 | \n",
" 60.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0014 | \n",
" 01-15 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" site date aqi\n",
"0 0014 02-27 30.0\n",
"1 0014 02-24 17.0\n",
"2 0014 02-21 60.0\n",
"3 0014 01-15 NaN"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(data_row[\"Data\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "96fc214e-731c-4e3f-bc95-5e145a7cb4f7",
"metadata": {},
"source": [
"The middle diagram in {numref}`Figure %s ` takes a column approach to organizing the data. Here the columns are provided as arrays and collected into an object with names that match the column names. The following file demonstrates the concept:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "9b87ebea-a1ff-47c7-850e-d94c22044d4a",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"site\":[ \"0014\", \"0014\", \"0014\", \"0014\"],\n",
"\"date\":[\"02-27\", \"02-24\", \"02-21\", \"01-15\"],\n",
"\"aqi\":[30,17,60,null]}\n",
"\n"
]
}
],
"source": [
"epa_col_path = Path('data/js_ex/epa_col.json')\n",
"print(epa_col_path.read_text())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d08cf983-fda8-4ee3-a209-eeefa904c113",
"metadata": {},
"source": [
"Since `pd.read_json()` expects this format, we can read the file into a\n",
"dataframe directly without needing to first load it into a dictionary:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "f3ffb95b-8e63-497b-a51c-4ebaaf7e077a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" site | \n",
" date | \n",
" aqi | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 14 | \n",
" 02-27 | \n",
" 30.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 14 | \n",
" 02-24 | \n",
" 17.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 14 | \n",
" 02-21 | \n",
" 60.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 14 | \n",
" 01-15 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" site date aqi\n",
"0 14 02-27 30.0\n",
"1 14 02-24 17.0\n",
"2 14 02-21 60.0\n",
"3 14 01-15 NaN"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_json(epa_col_path)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "67f8c94d-a411-457e-9732-8f748b2a2d77",
"metadata": {},
"source": [
"Lastly, we organize the data into a structure that resembles a matrix (the diagram on the right in the figure) and separately provide the column names for the features. The data matrix is organized as an array of arrays:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "a773d545-d10e-49f2-8284-2c20631a134a",
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/plain": [
"{'vars': ['site', 'date', 'aqi'],\n",
" 'data': [['0014', '02-27', 30],\n",
" ['0014', '02-24', 17],\n",
" ['0014', '02-21', 60],\n",
" ['0014', '01-15', None]]}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"epa_mat_path = Path('data/js_ex/epa_val.json')\n",
"data_mat = json.loads(epa_mat_path.read_text())\n",
"data_mat"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e5e49b97-b166-4e08-8c2d-690517270ed6",
"metadata": {},
"source": [
"We can provide `vars` and `data` to create the data frame:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4e741c3e-5dcc-41e1-ad54-b5537630edad",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" site | \n",
" date | \n",
" aqi | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0014 | \n",
" 02-27 | \n",
" 30.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0014 | \n",
" 02-24 | \n",
" 17.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0014 | \n",
" 02-21 | \n",
" 60.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0014 | \n",
" 01-15 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" site date aqi\n",
"0 0014 02-27 30.0\n",
"1 0014 02-24 17.0\n",
"2 0014 02-21 60.0\n",
"3 0014 01-15 NaN"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(data_mat[\"data\"], columns=data_mat[\"vars\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ed4335e1-4700-46d4-8649-f0a762b4e1d1",
"metadata": {},
"source": [
"We've included these examples to show the versatility of JSON.\n",
"The main takeaway is that JSON files can arrange data in different ways, so we typically need to examine the file before we can read the data into a data frame successfully.\n",
"JSON files are very common for data stored on the web:\n",
"the examples in this section were files downloaded from the PurpleAir and Kiva websites.\n",
"Although we downloaded the data manually in this section, we often want to download many datafiles at a time,\n",
"or we want a reliable and reproducible record of the download.\n",
"In the next section, we introduce HTTP, a protocol that will let us write programs to download data from the web automatically. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}