Exploring PurpleAir and AQS Measurements

12.5. Exploring PurpleAir and AQS Measurements

Let’s explore the cleaned dataset of matched AQS and PurpleAir PM2.5 readings and look for insights that might help us in modeling. Our main interest is in the relationship between the two sources of air quality measurements. But, we want to keep in mind the scope of the data, like how these data are situated in time and place. We learned from our data cleaning that we are working with daily averages of PM 2.5 for a couple of years and that we have data from dozens of locations across the US.

First we review the entire cleaned data frame.

csv_file = 'data/cleaned_purpleair_aqs/Full24hrdataset.csv'
usecols = ['Date', 'ID', 'region', 'PM25FM', 'PM25cf1', 'TempC', 'RH', 'Dewpoint']
full_df = (pd.read_csv(csv_file, usecols=usecols, parse_dates=['Date'])
        .dropna())
full_df.columns = ['date', 'id', 'region', 'pm25aqs', 'pm25pa', 'temp', 'rh', 'dew']
full_df
date id region pm25aqs pm25pa temp rh dew
0 2019-05-17 AK1 Alaska 6.7 8.62 18.03 38.56 3.63
1 2019-05-18 AK1 Alaska 3.8 3.49 16.12 49.40 5.44
2 2019-05-21 AK1 Alaska 4.0 3.80 19.90 29.97 1.73
... ... ... ... ... ... ... ... ...
12427 2019-02-20 WI6 North 15.6 25.30 1.71 65.78 -4.08
12428 2019-03-04 WI6 North 14.0 8.21 -14.38 48.21 -23.02
12429 2019-03-22 WI6 North 5.8 9.44 5.08 52.20 -4.02

12246 rows × 8 columns

We include an explanation for each of the columns in our data frame in the following table.

Column

Description

date

Date of the observation

id

A unique label for a site, formatted as the US state abbreviation with a number. (We performed data cleaning for site ID CA1.)

region

The name of the region, which corresponds to a group of sites. (The CA1 site is located in the West region.)

pm25aqs

The PM2.5 measurement from the AQS sensor.

pm25pa

The PM2.5 measurement from the PurpleAir sensor.

temp

Temperature, in Celcius.

rh

Relative humidity, ranging from 0 to 100%.

dew

The dew point. (Higher dew point means more moisture in the air.)

Let’s start with making a few simple visualizations to gain insight. Since the scope involves measurements over time at particular locations, we can choose one location with many measurements and make a line plot of the weekly average air quality. To choose, let’s find the sites with many records.

full_df['id'].value_counts()[:3]
IA3    830
NC4    699
CA2    659
Name: id, dtype: int64

The location labeled NC4 has nearly 700 hundred observations. To smooth the line plot a bit, Let’s plot weekly averages.

nc4 = full_df.loc[full_df['id'] =='NC4']
ts_nc4 = (nc4.set_index('date')
 .resample('W')
 ['pm25aqs', 'pm25pa']
 .mean()
 .reset_index()
)
fig = px.line(ts_nc4, x='date', y='pm25aqs',
              width=500, height=250)

fig.add_trace(go.Scatter(x=ts_nc4['date'], y=ts_nc4['pm25pa'],
                         line=dict(color='black', dash='dot')))
fig.update_xaxes(title='')
fig.update_yaxes(title='PM 2.5 weekly avg', range=[0,30])
fig.update_layout(showlegend=False)
fig.show()
../../_images/pa_eda_13_0.svg

We see that most PM2.5 values for the AQS sensor (solid line) range between 5.0 and 15.0 µg m⁻³. The PurpleAir sensor follows the up and down pattern of the AQS, which is reassuring. But, the measurements are consistently higher than AQS and in some cases quite a bit higher, which tells us that a correction might be helpful.

Next, let’s consider the distributions of the PM2.5 readings for the two sensors.

left= px.histogram(nc4, x='pm25aqs', histnorm='percent')
right = px.histogram(nc4, x='pm25pa', histnorm='percent')

fig = left_right(left, right, width=600, height=250)
fig.update_xaxes(title='AQS readings', col=1, row=1)
fig.update_xaxes(title='PurpleAir readings', col=2, row=1)
fig.update_yaxes(title='percent', col=1, row=1)
fig.show()
../../_images/pa_eda_15_0.svg

Both distributions are skewed right, which often happens when there’s a lower bound on values (in this case 0). A better way to compare these two distributions is with a quantile-quantile plot (see Chapter 10). With a Q-Q plot it can be easier to compare means, spreads, and tails.

percs = np.arange(1, 100, 1)
aqs_qs = np.percentile(nc4['pm25aqs'], percs, interpolation='lower')
pa_qs = np.percentile(nc4['pm25pa'], percs, interpolation='lower')
perc_df = pd.DataFrame({'percentile': percs, 'aqs_qs':aqs_qs, 'pa_qs':pa_qs})
fig = px.scatter(perc_df, x='aqs_qs', y='pa_qs',
                 labels={'aqs_qs': 'AQS Quantiles',
                         'pa_qs': 'PurpleAir Quantiles'},
                 width=350, height=250)

fig.update_layout(showlegend=False)
fig
../../_images/pa_eda_18_0.svg

The quantile-quantile plot is roughly linear. The slope of the curve is roughly 2, which indicates the spread of the PurpleAir measurements is about twice that of AQS.

What we can’t see in the Q-Q plot or the side-by-side histograms is how the sensor readings vary together. Let’s look at this next. First, we take a look at the distribution of difference between the two readings.

diffs = (nc4['pm25pa'] - nc4['pm25aqs'])
fig = px.histogram(diffs, histnorm='percent', 
                   width=350, height=250)

fig.update_xaxes(range=[-10,30], title="Difference: PA - AQS reading")
fig.update_traces(xbins=dict(
        start=-10.0, end=30.0, size=2)
                 )
fig.update_layout(showlegend=False)
fig.show()
../../_images/pa_eda_22_0.svg

If the instruments are in perfect agreement, we would see a spike at 0. If the instruments are in agreement and there is measurement error with no bias, we expect to see a distribution centered at 0. Instead, we see that 90% of the time, the PurpleAir measurement is larger than the AQS 24-hour average, and about 25% of the time it is more than 10 µg/m3 higher, which is a lot given the AQS averages tend to be between 5 and 10 µg/m3.

A scatter plot can give us additional insight into the relationship between the measurements from these two instruments. Since we are interested in finding a general relationship, regardless of time and location, we include all of our average readings in the plot.

px.scatter(full_df, x='pm25aqs', y='pm25pa', width=350, height=250)
../../_images/pa_eda_25_0.svg

While the relationship looks linear, all but a handful of readings are in the bottom left corner of the plot. Let’s remake the scatter plot and zoom in on the bulk of the data to get a better look. We also add a smooth curve to the plot to help us see the relationship better.

full_df = full_df.loc[(full_df['pm25aqs'] < 50)]
px.scatter(full_df, x='pm25aqs', y='pm25pa', 
           trendline='lowess', trendline_color_override="orange",
           width=350, height=250)
../../_images/pa_eda_28_0.svg

The relationship looks roughly linear, but there is a slight bend in the curve for small values of AQS. When the air is very clean, the PurpleAir sensor doesn’t pick up as much particulate matter and so is more accurate. Also, we can see that the curve should go through the point (0, 0). Despite the slight bend in the relationship, the linear association (correlation) between these two measurements is high:

np.corrcoef(full_df['pm25aqs'], full_df['pm25pa'])
array([[1.  , 0.88],
       [0.88, 1.  ]])

Before starting this analysis, we expected that PurpleAir measurements would generally overestimate the PM2.5. And indeed, this is reflected in the scatter plot, but we also see that there appears to be a strong linear relationship between the measurements from these two instruments that will be helpful in calibrating the PurpleAir sensor.