{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"if not any(path.endswith('textbook') for path in sys.path):\n",
" sys.path.append(os.path.abspath('../../..'))\n",
"from textbook_utils import *"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"csv_file = 'data/cleaned_purpleair_aqs/Full24hrdataset.csv'\n",
"usecols = ['Date', 'ID', 'region', 'PM25FM', 'PM25cf1', 'TempC', 'RH', 'Dewpoint']\n",
"full_df = (pd.read_csv(csv_file, usecols=usecols, parse_dates=['Date'])\n",
" .dropna())\n",
"full_df.columns = ['date', 'id', 'region', 'pm25aqs', 'pm25pa', 'temp', 'rh', 'dew']\n",
"full_df = full_df.loc[(full_df['pm25aqs'] < 50)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(ch:pa_modeling)=\n",
"# Creating a Model to Correct PurpleAir Measurements "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've explored the relationship between PM2.5 readings from AQS and PurpleAir sensors,\n",
"we're ready for the final step of the analysis: creating a model that corrects PurpleAir measurements.\n",
"Barkjohn's original analysis fits many models to the data in order to find the most appropriate one.\n",
"In this section, we fit the simple linear using the techniques from {numref}`Chapter %s `.\n",
"We also briefly describe the final model Barkjohn chose for real-world use.\n",
"Since these models use methods that we introduce later in the book,\n",
"we won't explain the technical details very deeply here---we\n",
"instead encourage you to revisit this section after reading {numref}`Chapter %s `."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's go over our modeling goals.\n",
"We want to create a model that predicts PM2.5 as accurately as possible.\n",
"To do this, we build a model that adjusts PurpleAir measurements based on AQS measurements. \n",
"We treat the AQS measurements as the true PM2.5 values because \n",
"they are taken from carefully calibrated instruments and are\n",
"actively used by the US government for decision-making. So, we have reason to\n",
"trust the AQS PM2.5 values as precise and close to the truth."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we build the model that adjusts the PurpleAir measurements using AQS, we then flip the model around to use it to predict the true air quality in the future from PurpleAir measurements. \n",
"This is a *calibration* scenario.\n",
"Since the AQS measurements are close to the truth, we fit the more variable PurpleAir measurements to them;\n",
"this is the calibration procedure. \n",
"Then, we use the calibration curve to correct future PurpleAir measurements. \n",
"This two-step process is encapsulated in the simple linear model and its flipped form below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"\\begin{aligned}\n",
"\\text{PA} &\\approx b + m\\text{AQS} &~~~~~~\\text{Fit a line to calibrate PA from AQS.}\\\\\n",
"\\text{True Air Quality} &\\approx -b/m + 1/m \\text{PA} &~~~~~~\\text{Flip the line around to use PA to predict true values.}\n",
"\\end{aligned}\n",
"$$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scatter plot and histograms that we made during our exploratory data analysis suggests that\n",
"the PurpleAir measurements are more variable, which supports the calibration approach.\n",
"And we saw that the PurpleAir measurements are about twice as high as the AQS measurements,\n",
"which suggests that $ m $ may be close to 2 and $1/m$ close to $ 1/2 $. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{admonition} Why Two Steps?\n",
"\n",
"This calibration procedure might seem a bit roundabout.\n",
"Why not fit a linear model that directly uses PurpleAir to predict AQS?\n",
"That seems a lot easier, and we wouldn't need to invert anything.\n",
"\n",
"Since AQS measurements are \"true\" (or close to it), they have no error.\n",
"Intuitively, a linear model works conditionally on the value of the variable on the $ x $-axis, \n",
"and minimizes the error in the $ y $ direction: $ y-(b + mx) $. \n",
"So, to calibrate an instrument, we place the accurate measurements on the x-axis\n",
"and to fit the line, called the *calibration curve*. Then, in the future, we invert the line to\n",
"predict the truth from our instrument's measurement. That's why this process is also called *inverse regression*. It's a reasonable thing to do when the pairs of measurements are highly correlated. \n",
"\n",
"As a simpler example, let's say we're calibrating a weight scale.\n",
"We could do this by placing known weights---say 1 kg, 5 kg,\n",
"and 10 kg---onto the scale, and seeing what the scale reports.\n",
"We typically repeat this a few times, each time getting slightly different measurements from the scale. \n",
"Analogously, the AQS measurements are the known quantities, and we check\n",
"what the PurpleAir sensors report.\n",
"\n",
"For a more rigorous treatment of statistical calibration,\n",
"see {cite}`osborneStatistical1991`.\n",
"\n",
":::"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's fit the model. We follow the notion from {numref}`Chapter %s ` and choose a loss function and minimize the average error. Recall that a *loss function* measures how far away our model is from the actual data. We use squared loss: $ [y - (b+mx)]^2 $. And, to fit the model to our data, we minimize the average squared loss over our data,\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\frac{1}{n} \\sum_{i = 1}^{n} [y_i - (b + mx_i]^2\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"We use the linear modeling functionality provided by `scikit learn` to do this. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1.4436159909590747, 0.5306143677362742]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"\n",
"# Fit calibration model using sklearn\n",
"X, y = full_df[['pm25aqs']], full_df['pm25pa']\n",
" \n",
"model = LinearRegression().fit(X, y)\n",
"m, b = model.coef_[0], model.intercept_\n",
" \n",
"[-b/m, 1/m] "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our model is close to what we expected. The adjustment to PurpleAir measurements is close to $ 1/2 $:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\text{True Air Quality} \\approx 1.44 + 0.53 \\text{PA}\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model that Barkjohn settled on incorporated the relative humidity:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"\\begin{aligned}\n",
"\\text{PA} \\approx b + m_1 \\text{AQS} + m_2 \\text{RH}\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an example of a multivariable linear regression model---it uses\n",
"more than one variable to make predictions. We can fit it by minimizing the average squared error over the data:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\frac{1}{n} \\sum_{i = 1}^{n} [y_i - (b + m_1x_i + m_2 h_i]^2\n",
"\\end{aligned}\n",
"$$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, $y_i$ is the PurpleAir measurement, $x_i$ the AQS measurement, and $h_i$ the relative humidity for the $i$th row in the data frame. Then, we invert the calibration to find the prediction model using the equation:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"\\begin{aligned}\n",
"\\text{True Air Quality} &\\approx - \\frac{b}{m_1} + \\frac{1}{m_1} \\text{PA}\n",
" - \\frac{m_2}{m_1} \\text{RH}\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We fit this model and check the coefficients:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[5.744226372403694, 0.5316968091937639, -0.08763863228780978]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X, y = full_df[['pm25aqs', 'rh']], full_df['pm25pa']\n",
"model_h = LinearRegression().fit(X, y)\n",
"[m1, m2], b = model_h.coef_, model_h.intercept_\n",
" \n",
"[- b / m1, 1 / m1, - m2 / m1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So our *inverse regression* predicts the air quality from the PurpleAir measurement and relative humidity to be:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\text{True Air Quality} \\approx 5.74 + 0.53 \\cdot \\text{PA} - 0.09 \\cdot \\text{RH}\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Later in {numref}`Chapters %s ` and {numref}`%s ` we learn how to compare these two models by examining things like the size of and patterns in prediction error. For now, we note that the model that incorporates relative humidity performs the best."
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}