11.1. Choosing Scale to Reveal Structure

In Chapter 10, we explored prices for houses sold in the San Francisico Bay Area between 2003 and 2009. Let’s revisit that example and take a look at a histogram of sale prices.

px.histogram(sfh, x='price', nbins=100,
             labels={'price':"Sale Proce (USD)"},
             width=350, height=250)
../../_images/viz_scale_6_0.svg

While this plot accurately displays the data, most of the data are crammed into the left side of the plot. This makes it hard to understand the distribution of prices.

Through data visualization, we want to reveal important features of the data like the shape of a distribution and the relationship between two or more features. As this example shows, after we produce an initial plot there are still other aspects we need to consider. In this section, we cover principles of scale which help us decide how to adjust the axis limits, place tick marks, and apply transformations. We begin by examining when and how we might adjust a plot to reduce empty space; in other words we try to fill the data region of our plot with data.

11.1.1. Filling the Data Region

As we can see from the histogram of sale prices, it’s hard to read a distribution when most of the data appear in a small portion of the plotting region. When this happens, important features of the data can be obscured like multiple modes and skewness. A similar issue happens for scatter plots. When all the points are bunched together in the corner of a scatter plot, it’s hard to shape, such as the form of nonlinearity.

This issue can crop up when there are a few unusually large observations. In order to get a better view of the main portion of the data we can drop those observations from the plot by adjusting the x- or y-axis limits, or we can remove the outlier values from the data before plotting. In either case, we mention this exclusion in the caption or on the plot itself.

Let’s use the first idea to improve the histogram of sale prices. In the side-by-side plots below, we clip the data by changing the limits of the x-axis. On the left, we’ve excluded houses that cost over \(\$2,000,000\). The shape of the distribution for the bulk of the houses is much clearer in this plot. For instance, we can more easily observe the skewness and a smaller secondary mode. On the right, we separately show detail in the long right tail of the distribution.

right_hist = px.histogram(sfh, x='price')
left_hist = px.histogram(sfh, x='price')
fig = left_right(left_hist, right_hist, height=250)
fig.update_xaxes(range=[0, 2e6], row=1, col=1)
fig.update_xaxes(range=[2e6, 9e6], row=1, col=2)
fig.update_yaxes(range=[0, 10], row=1, col=2)
fig
../../_images/viz_scale_10_0.svg

Notice that the x-axis in the left plot includes 0, but the x-axis in the right plot begins at $2,000,000. We consider when to include or exclude 0 on an axis next.

11.1.2. Including Zero

We often don’t need to include 0 on an axis, especially if including it makes it difficult to fill the data region. For instance, the scatter plot below shows the average longevity plotted against average height for dog breeds. (This dataset was first introduced in Chapter 10; it includes several features for 172 breeds.)

fig = px.scatter(dogs, x='height', y='longevity',
                 labels={"height": "Height (in)",
                    "longevity": "Typical Livespan (yr)"},
                 width=350, height=250)
margin(fig, t=30)
fig.show()
../../_images/viz_scale_13_0.svg

The x-axis of the plot starts at 10 cm since all dogs are at least that tall, and, similarly, the y-axis begins at 5 years.

There are a few cases where we usually want to include 0. For bar charts, including 0 is important so the heights of the bars directly relate to the data values. As an example, below, we’ve created two bar charts that compare the longevity of dog breeds. The left plot includes 0, but the right plot doesn’t. It’s easy to incorrectly conclude from the right plot that medium-sized dogs live twice as long as large-sized dogs.

dogs_lon = dogs.groupby('size')['longevity'].mean().reset_index()
sml = {"size": ['small', 'medium', 'large']}

left = px.bar(dogs_lon, x='longevity', y='size', category_orders=sml)
right = px.bar(dogs_lon, x='longevity', y='size', category_orders=sml)
fig = left_right(left, right, height=250)
fig.update_xaxes(range=[7, 13], row=1, col=2)
fig.update_xaxes(title_text='Avg Longevity (yrs)')
fig.update_layout(yaxis_title="Size")
fig.show()
../../_images/viz_scale_16_0.svg

We also typically want to include zero when working with proportions, since proportions range from 0 to 1. The plot below shows the proportion of breeds in each type.

size_props = ((dogs['group'].value_counts() / len(dogs))
              .reset_index()
              .rename(columns={'index': 'group', 'group': 'proportion'}))

size_props
fig = px.scatter(size_props, x='proportion', y='group',
                 width=350, height=250)
fig.update_traces(marker_size=15)
fig.update_xaxes(range=[0, 0.5])
fig.update_yaxes(title_text='')
fig.show()
../../_images/viz_scale_18_0.svg

In both the bar and dot plots, by including 0, we make it easier for our reader to accurately compare the relative sizes of the categories.

Earlier, when we adjusted axes, we essentially dropped data from our plotting region. While this is a useful strategy when a handful of observations are unusually large (or small), it is less effective with skewed distributions. In this situation, we often need to transform the data to gain a better view of its shape.

11.1.3. Revealing Shape Through Transformations

Another common way to adjust scale is to transform the data or the plot’s axes. We use transformations for skewed data so that it is easier to inspect the distribution. And, when the transformation produces a symmetric distribution, the symmetry carries with it useful properties for modeling (see Chapter 15).

There are multiple ways to transform data, but the log-transform tends to be especially useful. For instance, we’ve reproduced two histograms of SF house sale prices below. The left histogram is the original data. On the right, we’ve taken the log (base 10) of the prices before plotting.

sfl = sfh.assign(log_price=np.log10(sfh['price']))

orig = px.histogram(sfl, x='price', nbins=100,
                    width=350, height=250)
logged = px.histogram(sfl, x='log_price',
                      nbins=50,
                      width=350, height=250)

fig = left_right(orig, logged)
fig.update_xaxes(title_text='Sale Price (USD)', row=1, col=1)
fig.update_xaxes(title_text='Sale Price (USD log10)', row=1, col=2)
fig.show()
../../_images/viz_scale_22_0.svg

The log transformation makes the distribution of prices more symmetric. Now, we can more easily see important features of the distribution, like the mode at around \(10^{5.85}\) which is about 700,000 and the secondary mode near \(10^{5.55}\) or 350,000.

The downside of using the log transform is that the actual values aren’t as intuitive—in this example, we needed to convert the values back to original dollars to understand the sale price. Instead, we often favor transforming the axis to a log scale. This way we can see the original values on the axis, as shown below.

fig = px.histogram(sfh, x='price',
                   log_x=True,
                   histnorm='probability density',
                   labels={"price": "Sale Price (USD)"},
                   width=350, height=250)

fig.update_traces(xbins_size=30_000)
fig.update_yaxes(title="density")
fig.show()
../../_images/viz_scale_24_0.svg

The above histogram with its log-scaled x-axis essentially shows the same shape as the histogram of the transformed data. But, since the axis is displayed in the original units, we can directly read off the location of the modes in dollars.

The log transform can also reveal shape in scatter plots. Below, we’ve plotted building size on the x-axis and the lot size on the y-axis. It’s hard to see the shape in this plot since many of the points are crammed along the bottom of the data region.

px.scatter(sfh, x='bsqft', y='lsqft', 
           labels={"bsqft": "Building Size (sq ft)",
                    "lsqft": "Lot Size (sq ft)"},
          width=350, height=250)
../../_images/viz_scale_27_0.svg

However, when we use a log scale for both x- and y-axes, the shape of the relationship is much easier to see.

px.scatter(sfh, x='bsqft', y='lsqft',
           log_x=True, log_y=True, 
           labels={"bsqft": "Building Size (sq ft)",
                    "lsqft": "Lot Size (sq ft)"},
           width=350, height=250)
../../_images/viz_scale_29_0.svg

With the transformed axes, we can see that the lot size increases roughly linearly with building size (on the log scale). The log transformation pulls large values–values that are orders of magnitude larger than others–in toward the center. This transformation can help fill the data region and uncover hidden structure as we saw for both the distribution of house price and the relationship between house size and lot size.

In addition to setting the limits of an axis and transforming an axis, we also want to consider the aspect ratio of the plot–the length compared to the width. Adjusting the aspect ratio is called banking, and in the next section, we show how banking can help reveal relationships between features.

11.1.4. Banking to Decipher Relationships

With scatterplots, we try to choose scales so that the relationship between the two features roughly follows a 45-degree line. This scaling is called banking to 45 degrees. It makes it easier to see shape and trends because our eyes can more easily pick up deviations from a line this way. Our eyes can better detect departures from a straight line when the data roughly fall along a 45-degree angle. For instance, we’ve reproduced the plot that shows longevity of dog breeds against height. The plot has been banked to 45 degrees, and we can more easily see how the data roughly follow a line and where they deviate a bit at the extremes.

px.scatter(dogs, x='height', y='longevity',
           labels={"height": "Height (in)",
                    "longevity": "Typical Lifespan (yr)"},
                 width=300, height=250)
../../_images/viz_scale_32_0.svg

While banking to 45 degrees helps us see whether or not the data follow a linear relationship, when there is clear curvature it can be hard to figure out the form of the relationship. When this happens, we try transformations to get the data to fall along a straight line. The log transformation can be useful in uncovering the general form of curvilinear relationships.

11.1.5. Revealing Relationships Through Straightening

We often use scatter plots to look at the relationship between two features. For instance, in the plot below we’ve plotted height against weight for the dog breeds. We see that taller dogs weigh more, but this relationship isn’t linear.

px.scatter(dogs, x='height', y='weight',
           labels={"height": "Height (in)",
                    "weight": "Weight (lb)"},
           width=350, height=250)
../../_images/viz_scale_35_0.svg

When it looks like two variables have a non-linear relationship, it’s useful to try applying a log scale to the x-axis, y-axis, or both. We look for a linear relationship in the scatter plot with transformed axes. For instance, in the plot below, we applied a log scale to the y-axis.

px.scatter(dogs, x='height', y='weight',
           labels={"height": "Height (in)",
                    "weight": "Weight (lb)"},
            log_y=True, width=300, height=300)
../../_images/viz_scale_37_0.svg

This plot shows a roughly linear relationship, and in this case, we say that there’s a log-linear relationship between dog weight and height.

In general, when we see a linear relationship after transforming one or both axes, we can use Table 11.1 to reveal what relationship the original variables have. We make these transformations because it is easier for us to see whether points fall along a line or not than to see if they follow a power law compared to an exponential.

Table 11.1 Relationships between two variables when transformations are applied. \(a\) and \(b\) are constants.

x-axis

y-axis

Relationship

AKA

No transform

No transform

Linear: \( y = ax + b \)

Linear

Log-scale

No transform

Log: \( y = a \log x + b \)

Linear-Log

No transform

Log-scale

Exponential: \( y = ba^x \)

Log-Linear

Log-scale

Log-scale

Power: \( y = bx^a \)

Log-Log

As Table 11.1 shows, the log transform can reveal several common types of relationships. Because of this, the log transform is considered the jackknife of transformations. As another, albeit artificial, example, the leftmost plot in Figure 11.1 reveals a curvilinear relationship between x and y. The middle plot shows a different curvilinear relationship between log(y) and x; this plot also appears nonlinear. A further log transformation, at the far right, displays a plot of log(y) against log(x). This plot confirms that the data have a log-log (or power) relationship because the transformed points fall along a line.

../../_images/example-transforms.png

Fig. 11.1 These scatter plots show how log transforms can “straighten” a curvilinear relationship between two variables. The same (x,y) pairs are used to make each plot, but the middle plot shows (x, log y) and the right plot (log x, log y).

Adjusting scale is an important practice in data visualization, and in this section, we showed several approaches and when each approach is useful. In the next section, we look at principles of smoothing which we use when we need to visualize lots of data.