# 11.2. Smoothing and Aggregating Data¶

When we have lots of data, we often don’t want to plot all of the individual data points. The scatter plot below shows data from the Cherry Blossom Run, an annual 10-mile race that takes place in April in Washington D.C. when the cherry trees are in bloom. These data were scraped from the run’s website[^1] and include official times and other information for all registered male runners from 1999 to 2012. We’ve plotted the runners’ ages on the x-axis and race times on the y-axis.

```python
px.scatter(runners, x='age', y='time', width=350, height=250)
```

This scatter plot contains over 70,000 points. With so many points, many of them overlap. This is a common problem called over-plotting. In this case, over-plotting prevents us from seeing how time and age are related. About the only thing that we can see in this plot is a group of very young runners, which points to possible issues with data quality. To address over-plotting, we use smoothing techniques that aggregate data before plotting.

## 11.2.1. Smoothing Techniques to Uncover Shape¶

The histogram is a familiar type of plot that uses smoothing. A histogram aggregates data values by putting points into bins and plotting one bar for each bin. Smoothing here means that we cannot differentiate the locations of individual points within a bin: the points are smoothly allocated across their bins. With histograms, the area of a bin corresponds to the percentage (or count or proportion) of points in the bin. (Often the bins are equal in width, and we take a shortcut to label the height of a bin as the proportion.)
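To make the aggregation concrete, here is a minimal sketch of the binning that a histogram performs, using `np.histogram` on made-up data (the values are synthetic, not the dog data used below):

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=12, scale=2, size=1_000)  # made-up "lifespans"

# Bin the values: one count per bin, plus the bin edges
counts, edges = np.histogram(values, bins=10)

# Convert counts to percentages; with density-style heights,
# each bar's area equals the percentage of points in its bin
percents = 100 * counts / counts.sum()
heights = percents / np.diff(edges)

print(counts.sum(), round(percents.sum()))  # → 1000 100
```

Once the counts are computed, the individual values inside a bin are indistinguishable, which is exactly the smoothing the text describes.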

The histogram below plots the distribution of lifespans for dog breeds. Above the histogram is a rug plot that draws a single line for every data value. We can see in the tallest bin that even a small amount of data can cause over-plotting in the rug plot. By smoothing out the points in the rug plot, the histogram reveals the general shape of the distribution. In this case, we see that many breeds have a longevity of about 12 years. For more on how to read and interpret histograms, see Chapter 10.

```python
fig = px.histogram(dogs, x="longevity", marginal="rug", nbins=20,
                   labels={"longevity": "years"},
                   histnorm='percent', width=350, height=250)
fig.data[0].marker.line = dict(color='black', width=1)
fig
```

Another common smoothing technique is kernel density estimation (KDE). A KDE plot shows the distribution using a smooth curve rather than bars. In the plot below, we show the same histogram of dog longevity with a KDE curve overlaid. The KDE curve has a similar shape to the histogram.

```python
import numpy as np
import plotly.graph_objects as go
from scipy.stats import gaussian_kde

fig = px.histogram(dogs, x="longevity", marginal="rug",
                   histnorm='probability density', nbins=20,
                   labels={"longevity": "years"},
                   width=450, height=250)

fig.update_traces(marker_color='rgba(76,114,176,0.3)',
                  selector=dict(type='histogram'))
fig.data[0].marker.line = dict(color='black', width=1)

# Overlay the KDE curve on the density-scaled histogram
bandwidth = 0.2
xs = np.linspace(min(dogs['longevity']), max(dogs['longevity']), 100)
ys = gaussian_kde(dogs['longevity'], bandwidth)(xs)
fig.add_trace(go.Scatter(x=xs, y=ys))

fig.update_traces(marker_color='rgb(76,114,176)',
                  selector=dict(type='scatter'))
fig.update_layout(showlegend=False)
fig
```

It might come as a surprise to think of a histogram as a smoothing method. Both the KDE and the histogram aim to help us see important features in the distribution of values. Similar smoothing techniques can be used with scatter plots. This is the topic of the next section.
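What `gaussian_kde` computes is simple to sketch by hand: at each evaluation point, it averages a bell curve centered at every data point. A minimal from-scratch version on synthetic data (the real SciPy implementation also handles bandwidth selection and multivariate data):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(data, xs, bandwidth):
    # At each x, average one kernel centered at every data point,
    # scaled by the bandwidth so the curve integrates to 1
    return np.array([
        gaussian_kernel((x - data) / bandwidth).mean() / bandwidth
        for x in xs
    ])

rng = np.random.default_rng(0)
data = rng.normal(12, 2, size=500)   # synthetic "longevity" values
xs = np.linspace(4, 20, 200)
density = kde(data, xs, bandwidth=0.5)

# Riemann-sum check: the estimated density integrates to roughly 1
print((density * (xs[1] - xs[0])).sum())
```

Averaging a kernel per point is what makes the curve smooth: each observation contributes a little bump rather than a spike.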

## 11.2.3. Smoothing Techniques Need Tuning¶

Now that we’ve seen how smoothing is useful for plotting, we turn to the issue of tuning. For histograms, the width of the bins (or, for equal-width bins, the number of bins) affects the look of the histogram. The left histogram of longevity below has a few wide bins, and the right histogram has many narrow bins. In both cases, it’s hard to see the shape of the distribution. With a few wide bins, we have over-smoothed the distribution, which makes it impossible to discern modes and tails. On the other hand, too many bins gives a plot that’s little better than a rug plot. KDE plots have a parameter called the *bandwidth* that works similarly to the bin width of a histogram.

```python
f1 = px.histogram(dogs, x="longevity", nbins=3,
                  histnorm='probability density',
                  width=350, height=250)
f1.update_traces(marker_color='rgba(76,114,176,0.3)',
                 selector=dict(type='histogram'))
f1.data[0].marker.line = dict(color='black', width=1)

f2 = px.histogram(dogs, x="longevity", nbins=100,
                  histnorm='probability density',
                  width=350, height=250)
f2.update_traces(marker_color='rgba(76,114,176,0.3)',
                 selector=dict(type='histogram'))
f2.data[0].marker.line = dict(color='black', width=1)

left_right(f1, f2, height=250)
```

Most histogram and KDE software automatically chooses the bin width for the histogram and the bandwidth for the kernel. However, these parameters often need a bit of fiddling to create the most useful plot. When you create visualizations that rely on tuning parameters, it’s important to try a few different values before settling on one.
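The same under- and over-smoothing happens with the KDE bandwidth, and we can check it numerically rather than visually. The sketch below (synthetic bimodal data, not the dog data) counts the local maxima of the estimated curve for three bandwidth settings; note that SciPy’s `bw_method` is a factor applied to the data’s standard deviation:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Bimodal synthetic data: two well-separated peaks
data = np.concatenate([rng.normal(8, 1, 500), rng.normal(14, 1, 500)])
xs = np.linspace(2, 20, 400)

def n_modes(ys):
    # Count local maxima of the estimated curve
    return int(np.sum((ys[1:-1] > ys[:-2]) & (ys[1:-1] > ys[2:])))

# Small factors under-smooth (spurious wiggles); large factors
# over-smooth (the two real peaks merge into one)
for bw in [0.05, 0.3, 2.0]:
    ys = gaussian_kde(data, bw_method=bw)(xs)
    print(bw, n_modes(ys))
```

A moderate bandwidth recovers the two true modes; the extremes either invent modes or erase one, which is the density-curve analogue of the too-many-bins and too-few-bins histograms above.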

A different approach to data reduction is to examine quantiles. This is the topic of the next section.

## 11.2.4. Reducing Distributions to Quantiles¶

We found in Chapter 10 that while box plots aren’t as informative as histograms, they can be useful when comparing the distributions of many groups at once. A box plot reduces the data to a few essential features based on the data quartiles. More generally, quantiles (the lower quartile, median, and upper quartile are the 0.25, 0.50, and 0.75 quantiles) can provide a useful reduction of the data when comparing distributions.
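NumPy computes quantiles directly. A quick sketch on made-up values, reading the quartiles off as the 0.25, 0.50, and 0.75 quantiles:

```python
import numpy as np

values = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21])

# The quartiles are just the 0.25, 0.50, and 0.75 quantiles
lower, median, upper = np.quantile(values, [0.25, 0.50, 0.75])
print(lower, median, upper)  # → 6.0 11.0 16.0
```

These three numbers (plus the whiskers) are all a box plot keeps from the data, which is exactly the reduction that makes box plots compact but lossy.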

When two distributions are roughly similar in shape, it can be hard to compare them with histograms. For instance, the histograms below show the price distributions for two- and four-bedroom houses in the SF housing data. The distributions look roughly similar in shape. But, a plot of their quantiles handily compares the distributions’ center, spread, and tails.

```python
px.histogram(sfh.query('br in [2, 4]'),
             x='price', log_x=True, facet_col='br',
             width=700, height=250)
```

We can compare quantiles with a quantile-quantile plot, called a q-q plot for short. To make this plot, we first compute percentiles (also called quantiles) for both distributions. Then we plot the matching percentiles on a scatter plot. We usually also show the reference line $y = x$ to help with the comparison.

```python
br2 = sfh.query('br == 2')
br4 = sfh.query('br == 4')
percs = np.arange(1, 100, 1)
perc2 = np.percentile(br2['price'], percs, interpolation='lower')
perc4 = np.percentile(br4['price'], percs, interpolation='lower')
perc_sfh = pd.DataFrame({'percentile': percs, 'br2': perc2, 'br4': perc4})
perc_sfh
```

|    | percentile | br2      | br4      |
|----|------------|----------|----------|
| 0  | 1          | 1.50e+05 | 2.05e+05 |
| 1  | 2          | 1.82e+05 | 2.50e+05 |
| 2  | 3          | 2.03e+05 | 2.75e+05 |
| ...| ...        | ...      | ...      |
| 96 | 97         | 1.04e+06 | 1.75e+06 |
| 97 | 98         | 1.20e+06 | 1.95e+06 |
| 98 | 99         | 1.44e+06 | 2.34e+06 |

99 rows × 3 columns

```python
fig = px.scatter(perc_sfh, x='br2', y='br4', log_x=True, log_y=True,
                 labels={'br2': 'Price of Two Bedroom Houses',
                         'br4': 'Price of Four Bedroom Houses'},
                 width=350, height=250)
fig.add_trace(
    go.Scatter(x=[1e5, 2e6], y=[1e5, 2e6],
               mode='lines', line=dict(dash='dash'))
)
fig.update_layout(showlegend=False)
fig
```

When the quantile points fall along a line, the variables have similarly shaped distributions. Lines parallel to the reference line indicate a difference in center, lines with slopes other than 1 indicate a difference in spread, and curvature indicates a difference in shape. From the q-q plot above, we see that the distribution of price for four-bedroom houses is similar in shape to the two-bedroom distribution, except for a shift of about \$100K in price and a slightly longer right tail (indicated by the upward bend for large values). Reading a q-q plot takes practice, but once you get the hang of it, it can be a handy way to compare distributions. Notice that the housing data have over 100,000 observations, and the q-q plot has reduced the data to 99 percentiles. This data reduction is quite useful. However, we don’t always want to use smoothers. This is the topic of the next section.
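These reading rules can be verified numerically. In the sketch below (synthetic data, not the housing data), one sample is an exact shifted-and-scaled copy of another, so its quantiles are a linear function of the first sample’s, with the slope giving the spread ratio and the intercept the shift:

```python
import numpy as np

rng = np.random.default_rng(7)
base = rng.normal(0, 1, size=10_000)
shifted_scaled = 2 * base + 5   # same shape: spread doubled, center moved by 5

percs = np.arange(1, 100)
q1 = np.percentile(base, percs)
q2 = np.percentile(shifted_scaled, percs)

# Matching quantiles fall exactly on the line y = 2x + 5
slope, intercept = np.polyfit(q1, q2, 1)
print(round(slope, 2), round(intercept, 2))  # → 2.0 5.0
```

A curved quantile pattern, by contrast, cannot be produced by any shift or rescaling, which is why curvature signals a genuine difference in shape.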

## 11.2.5. When Not to Smooth¶

Smoothing and aggregating can help us see important features and relationships, but when we have only a handful of observations, smoothing techniques can be misleading. With just a few observations, we prefer rug plots to histograms, box plots, and density curves, and we use scatter plots rather than smooth curves and density contours. This may seem obvious, but even when we have a large amount of data, the amount of data in a subgroup can quickly dwindle. This phenomenon is an example of the “curse of dimensionality”.

One of the most common misuses of smoothing happens with box plots. As an example, below is a collection of seven box plots of longevity, one for each group of dog breeds. Some of these box plots summarize as few as two or three observations.

```python
px.box(dogs, x='group', y='longevity', width=500, height=250)
```

The strip plot below is a preferable visualization. We can still compare the groups, but we also see the exact values in each group. Now we can tell that there are only three breeds in the non-sporting group; the impression of a skewed distribution based on the box plot above reads too much into the shape of the box.

```python
px.strip(dogs, x="group", y="longevity", width=400, height=250)
```

This section introduced the problem of over-plotting, where points overlap because of a large dataset. To address this issue, we introduced smoothing techniques that aggregate data together. We saw two common examples of smoothing, binning and kernel smoothing, and applied them in one- and two-dimensional settings. In one dimension, these give histograms and kernel density curves, respectively, and both help us see the shape of a distribution. In two dimensions, we found it useful to smooth y-values while keeping x-values fixed in order to visualize trends in the data. We addressed the need to tune the amount of smoothing to get more informative histograms and density curves, and we cautioned against smoothing when we have too few observations.

There are many other ways to reduce over-plotting in scatter plots. For instance, we can make the dots partially transparent so overlapping points appear darker. If many observations have the same values (e.g., when measurements are rounded to the nearest inch), then we can add a small amount of random noise to the values to reduce the amount of over-plotting. This procedure is called jittering, and it is used in the strip plot above. Transparency and jittering are convenient for medium-sized data. However, they don’t work very well for large datasets since they still plot all the points in the data.
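Jittering is easy to do by hand before plotting: add a small amount of uniform noise to the rounded values. A sketch on made-up height measurements (the noise scale is a judgment call and should stay well below the measurement resolution):

```python
import numpy as np

rng = np.random.default_rng(3)
# Heights recorded to the nearest inch: many exact duplicates
heights = rng.integers(60, 75, size=1_000).astype(float)

# Jitter: uniform noise within ±0.3 inches, well under the
# 1-inch rounding resolution, so points stay near their true value
jittered = heights + rng.uniform(-0.3, 0.3, size=heights.size)

print(np.unique(heights).size, np.unique(jittered).size)
```

After jittering, nearly every point gets a distinct plotting position, so ties spread out into visible clusters instead of stacking on one spot.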

The quantile-quantile plot we introduced in this section offers one way to compare distributions; another is to use side-by-side box plots, and yet another is to overlay KDE curves in the same plot. We often aim to compare distributions and relationships across subsets (or groups) of data, and in the next section, we discuss several design principles that facilitate meaningful comparisons for a variety of plot types.

[^1]: <https://www.cherryblossom.org/>