# 9.1. Example: Wrangling CO2 Measurements from Mauna Loa Observatory¶

We saw in Chapter 2 that the National Oceanic and Atmospheric Administration (NOAA) monitors CO2 concentrations in the air at the Mauna Loa Observatory. We continue with this example and use it to introduce how to make data-quality checks, handle missing values, transform features, and reshape tables. These data are in the file data/co2_mm_mlo.txt. Let’s begin by checking the formatting, encoding, and size of the source before we load it into a data frame (see Chapter 8).

from pathlib import Path
import os
import chardet

co2_file_path = Path('data') / 'co2_mm_mlo.txt'

[os.path.getsize(co2_file_path),
[51131, 'ascii']

Our investigation has found that the file is plain text with ASCII encoding and about 50 KiB in size. Since the file is not particularly large, we should have no trouble loading it into a data frame, but first we need to determine the file’s format. Let’s look at the first few lines in the file.

len(lines)
811
lines[:6]
['# --------------------------------------------------------------------',
'# USE OF NOAA ESRL DATA',
'# ',
'# These data are made freely available to the public and the',
'# scientific community in the belief that their wide dissemination',
'# will lead to greater understanding and new scientific insights.']

We see that the file begins with information about the data source. We should read this documentation before starting our analysis, but some times the urge to plunge into the analysis wins over and we just start mucking about and discover properties of the data as we go. So let’s quickly find where the actual data values are located.

lines[69:79]
['#',
'#            decimal     average   interpolated    trend    #days',
'#             date                             (season corr)',
'1958   3    1958.208      315.71      315.71      314.62     -1',
'1958   4    1958.292      317.45      317.45      315.29     -1',
'1958   5    1958.375      317.50      317.50      314.71     -1',
'1958   6    1958.458      -99.99      317.10      314.85     -1',
'1958   7    1958.542      315.86      315.86      314.98     -1',
'1958   8    1958.625      314.93      314.93      315.94     -1',
'1958   9    1958.708      313.20      313.20      315.91     -1']

We find the data begins on the 73rd line of the file. We also spot some relevant characteristics:

• The values are separated by white space, possibly tabs.

• The data line up down the rows. For example, the month appears in 7th to 8th position of each line.

• The column headings split over two lines.

We can use read_csv to read the data into a pandas DataFrame and provide arguments to specify that the separators are white space, there is no header (we will set our own column names), and to skip the first 72 rows of the file.

names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'days'])
Yr Mo DecDate Avg Int Trend days
0 1958 3 1958.21 315.71 315.71 314.62 -1
1 1958 4 1958.29 317.45 317.45 315.29 -1
2 1958 5 1958.38 317.50 317.50 314.71 -1
3 1958 6 1958.46 -99.99 317.10 314.85 -1
4 1958 7 1958.54 315.86 315.86 314.98 -1

We have successfully loaded the file contents into a data frame, and we can see that the granularity of the data is a monthly average CO2, beginning in 1958 through 2019. Also, the table shape is 738 by 7. Now we’re ready to check data quality.

## 9.1.1. Quality Checks¶

Since scientific studies tend to have very clean data, it’s tempting to jump right in and make a plot to see how CO2 monthly averages have changed. The field DecDate conveniently represents the month and year as a numeric feature so we can easily make a line plot.

px.line(co2, x='DecDate', y='Avg', width=350, height=250)

Yikes! Plotting the data has uncovered a problem. What happened here? We can check a few rows of the data frame to see if we can spot the problem.

Looking a bit more closely this time, we see that some data have unusual values like -1 and -99.99. If we read the information at the top of the file more carefully, we find that -99.99 denotes a missing monthly average and -1 signifies a missing value for the number of days the equipment was in operation that month.

Let’s step back for a moment and perform some quality checks. First, we consider the shape of the data. How many rows should we have? From looking at the head and tail of the data frame, the data appear to be in chronological order, beginning with March 1958 and ending with August 2019. This means we should have $$12 \times (2019-1957) - 2 - 4 = 738$$ records. Our calculations match the number of rows in the data table.

Next, let’s check the quality of the features, starting with Mo. We expect the values to range from 1 to 12, and each month should have 2019-1957 = 62 or 61 instances (since the recordings begin in March of the first year and end in August of the most recent year).

co2["Mo"].value_counts().reindex(range(1,13)).tolist()
[61, 61, 62, 62, 62, 62, 62, 62, 61, 61, 61, 61]

As expected Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest 62.

Let’s turn to examine the column called days with a histogram.

px.histogram(co2, x='days', width=350, height=250)

In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values. A scatter plot can help us check if there is any connection between missingness and the year of the recording.

px.scatter(co2, x='Yr', y='days', width=350, height=250)

All of the missing data are in the early years of operation. The number of days of operation of the equipment may not have been collected in the early days of operation. It also appears that there might have been problems with the equipment in the mid to late 80s. What do we do with these conjectures? We can try to confirm them by looking through documentation about the historical readings. If we are concerned about the impact on the CO2 averages for records with missing values for the feature days, then a simple solution would be to drop the earliest recordings. However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.

Next, let’s return to the -99.99 values for the average CO2 measurement and begin our checks with a histogram.

px.histogram(co2, x='Avg', width=350, height=250)

The non-missing values are in the 300-400 range, which is what we expect based on our research into CO2 levels. We also see that there are only a few missing values. Since there aren’t many missing values, we can examine all of them:

co2[co2["Avg"] < 0]
Yr Mo DecDate Avg Int Trend days
3 1958 6 1958.46 -99.99 317.10 314.85 -1
7 1958 10 1958.79 -99.99 312.66 315.61 -1
71 1964 2 1964.12 -99.99 320.07 319.61 -1
72 1964 3 1964.21 -99.99 320.73 319.55 -1
73 1964 4 1964.29 -99.99 321.77 319.48 -1
213 1975 12 1975.96 -99.99 330.59 331.60 0
313 1984 4 1984.29 -99.99 346.84 344.27 2

We are faced with the question of what to do with the -99.99 values. We have seen already the problems of leaving these values as is in a line plot. There are several options, and we describe them next.

The -99.99s for average CO2 levels indicate missing recordings. These interfere with our statistical summaries and plots. It’s good to know which values are missing, but we need to do something about them. We might drop those records, replace -99.99 with NaN, or substitute 99.99 with a likely value for the average CO2. Let’s examine each of these three options.

Note that the table already comes with a substitute value for the -99.99. The column labeled, Int, has values that exactly match those in Avg, except when Avg is -99.99, and then a “reasonable” estimate is used instead.

To see the effect of each option, let’s zoom in on a short time period, say the measurements in 1958, where we know we have two missing values. We can create a time series plot for the three cases: drop the records with -99.99s (left plot); use NaN for missing values (middle plot); substitute an estimate for -99.99 (right plot).

When we look closely, we can see the difference between each of these plots. The leftmost plot connects dots across a two month time period, rather than one month. In the middle, the line breaks where the data are missing, and on the right, we can see months 6 and 10 now have values. In the big picture since there are only seven values missing from the 738 months, any of these options works. However, there is some appeal to the right plot since the seasonal trends are more cleanly discernible.

The method used to interpolate the CO2 measurements for the missing values is an averaging process that takes into consideration the month and year. The idea is to reflect both seasonal changes and the long term trend. This technique is described in greater detail in the documentation at the top of the data file.

These plots have shown the granularity of the data to be monthly measurements, but other granularity options are available to us. We discuss this next.

## 9.1.3. Reshaping the Data Table¶

The CO2 measurements taken at the Mauna Loa research station are also available both daily and hourly. The hourly data has a finer granularity than the daily data; and reciprocally, the daily data is coarser than the hourly data.

Why not always just use the data with the finest granularity available? On a computational level, fine-grained data can become quite large. The Mauna Loa Observatory started recording CO2 levels in 1958. Imagine how many rows the data table would contain if they provided measurements every single second! But more importantly, we want the granularity of the data to match our research question. Suppose we want to see whether CO2 levels have risen over the past 50+ years, consistent with global warming predictions. We don’t need a CO2 measurement every second. In fact, we might well be content with yearly averages where the seasonal patterns are smoothed away. We can aggregate the monthly measurements, changing its granularity to annual averages, and make a plot to display the general trend. We can use aggregation to go to a coarser granularity — in pandas, we use .groupby() and .agg().

fig = px.line(co2.groupby('Yr')['Int'].mean(),
height=250, width=350)
fig.update_layout(showlegend=False)
fig.show()

Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.

To recap, after reading the whitespace-separated, plain-text file into a data frame, we began to check its quality. We used the scope and context of the data to affirm that its shape matched the range of dates of collection. We confirmed the values and counts for the month were as expected. We ascertained the extent of missing values in the features, and looked for connections between missing values and other features. We considered three approaches to handling the missing data: drop records, work with NaN values, and impute values to have a full table. And, finally, we changed the granularity of the data frame by rolling it up from a monthly to an annual average. This change in granularity removed seasonal fluctuations and focused on the long term trend in the level of CO2 in the atmosphere. The next four sections of this chapter expand on these actions to wrangle data into a form suitable for analysis: quality checks, missing value treatments, transformations, and shape adjustments. We begin with quality checks.