12.7. Summary

In this chapter, we replicated Barkjohn’s analysis. We created a model that corrects PurpleAir measurements so that they closely match AQS measurements. The accuracy of this model enables the PurpleAir sensors to be included on official US government maps, like the AirNow Fire and Smoke map. Importantly, this model gives people timely and accurate measurements of air quality.

We saw how crowd-sourced, open data can be improved with data from rigorously maintained, precise government-monitored equipment. In the process, we focused on cleaning and merging data from multiple sources, but we also fit models to adjust and improve air quality measurements.

We applied many concepts covered in this part of the book to this case study. As you saw, wrangling files and data tables into a form we can analyze is a large and important part of data science. We used file wrangling and the notions of granularity from Chapter 8 to prepare two sources for merging. We got them into structures where we could match neighboring air quality sensors. This “grungy” part of data science was essential to widening the reach of data from rigorously maintained, precise government-monitored equipment by augmenting it with crowd-sourced, open data.

This preparation process involved intensive, careful examination and cleaning of data to make them compatible across the two sources. We cleaned the data and improved quality so that we could trust in our analysis. Concepts from Chapter 9 helped us work with time data effectively and find and correct numerous issues like missing data points and even duplicated data values.
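As a reminder of what these checks can look like, here is a minimal sketch in pandas. The data frame, its column names (`timestamp`, `pm25`), and its values are made up for illustration and are not the chapter's data; the sketch also assumes the readings are expected once per hour.

```python
import pandas as pd

# Hypothetical sensor readings (not the chapter's data):
# one timestamp appears twice, and the 02:00 reading is absent.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-05-01 00:00",
        "2019-05-01 01:00",
        "2019-05-01 01:00",
        "2019-05-01 03:00",
    ]),
    "pm25": [5.1, 6.3, 6.3, 4.8],
})

# Count rows whose timestamp repeats an earlier row
n_duplicates = int(readings["timestamp"].duplicated().sum())

# Keep the first row for each timestamp and index by time
deduped = readings.drop_duplicates(subset="timestamp").set_index("timestamp")

# Build the complete hourly index we would expect (assumed hourly cadence),
# then list the expected timestamps that have no reading
full_index = pd.date_range(
    deduped.index.min(), deduped.index.max(), freq=pd.Timedelta(hours=1)
)
missing = full_index.difference(deduped.index)

print(n_duplicates)
print(missing)
```

Keeping the first row per timestamp is only one possible choice. When duplicated rows disagree in their measurements, it is worth examining them before deciding which to keep.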

File and data wrangling, exploratory data analysis, and visualization are major parts of many analyses. While fitting models may seem the most exciting part of data science, getting to know and trust the data is crucial and often leads to important insights in the modeling phase. Topics related to modeling make up most of the rest of this book. However, before we begin, we cover two more topics related to data wrangling. In the next chapter, we show how to create analyzable data from text, and in the following chapter we examine other formats for source files, mentioned in Chapter 8.

But before you head to the next chapter, now is a good time to take stock of what you’ve learned so far. Pat yourself on the back—you’ve already come a long way! The principles and techniques we’ve covered here are useful for nearly every type of data analysis, and you can readily start applying them towards analyses of your own.