10.3. What to Look For in a Relationship

When we investigate multiple variables, we examine the relationships between them, in addition to their distributions. In this section, we consider pairs of features and describe what to look for. Table %s provides guidelines for what kind of plot to make based on the feature types. For two features, the combination of types (both quantitative, both qualitative, or a mix) matters. We consider each combination in turn.

10.3.1. Two Quantitative Features

If both features are quantitative, then we often examine their relationship with a scatter plot. Each point in a scatter plot marks the position of a pair of values for an observation. So, we can think of a scatter plot as a two-dimensional rug plot.

With scatter plots, we look for linear and simple nonlinear relationships, and we examine the strength of the relationship. We also look to see if a transformation of one or the other or both features leads to a linear relationship.

The scatter plot below displays the weight and height of dog breeds (both are quantitative). We observe that dogs that are above average in height tend to be above average in weight. This relationship appears nonlinear: weight increases more quickly with height for taller dogs than for shorter dogs. Indeed, that makes sense if we think of a dog as basically shaped like a box: for similarly proportioned boxes, the weight of the contents of the box has a cubic relationship to its length.

../../_images/eda_relationships_7_0.svg
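One way to check whether a transformation leads to a linear relationship is to compare the correlation before and after transforming, and to fit a line on the transformed scale. The sketch below uses synthetic data, not the dog breed data: the variable names, the constant `1e-4`, and the noise level are assumptions chosen only to mimic a roughly cubic relationship.

```python
import numpy as np

# Synthetic (made-up) data: heights and weights that grow roughly as the
# cube of height, with multiplicative noise. This is NOT the breed data.
rng = np.random.default_rng(0)
height = rng.uniform(20, 80, size=200)
weight = 1e-4 * height**3 * np.exp(rng.normal(0, 0.1, size=200))

# Correlation measures the strength of the *linear* association.
r_raw = np.corrcoef(height, weight)[0, 1]
r_log = np.corrcoef(np.log(height), np.log(weight))[0, 1]

# On the log-log scale, a power law weight = c * height**k becomes a line
# with slope k, so a fitted slope near 3 is consistent with a cubic relation.
slope, intercept = np.polyfit(np.log(height), np.log(weight), deg=1)

print(f"correlation, raw scale:     {r_raw:.3f}")
print(f"correlation, log-log scale: {r_log:.3f}")
print(f"fitted log-log slope:       {slope:.2f}")
```

If the box analogy holds, we would expect the log-log scatter plot to look close to a straight line and the fitted slope to land near 3.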

Two Univariate Plots ≠ One Bivariate Plot. The histograms for two quantitative features do not contain enough information to create their scatter plot, so we must exercise caution when we read a pair of histograms. That is, the two histograms do not show how these features vary together. We need to use one of the plots listed in the appropriate row of Table %s (scatter plot, smooth curve, contour plot, heat map, quantile-quantile plot) to get a sense of the relationship between two quantitative features.
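A small simulation can make this point concrete. In the sketch below (synthetic data, with names chosen for illustration), shuffling one feature leaves both histograms unchanged but destroys the relationship between the pair.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.9 * x + rng.normal(scale=0.3, size=1000)  # y is strongly related to x

# Shuffling y keeps exactly the same set of y values (so the same histogram)
# but breaks the pairing between each x and its y.
y_shuffled = rng.permutation(y)

counts_original = np.histogram(y, bins=20)[0]
counts_shuffled = np.histogram(y_shuffled, bins=20)[0]
same_histogram = np.array_equal(counts_original, counts_shuffled)

r_paired = np.corrcoef(x, y)[0, 1]
r_shuffled = np.corrcoef(x, y_shuffled)[0, 1]

print("identical histograms of y:", same_histogram)
print(f"correlation with original pairing: {r_paired:.2f}")
print(f"correlation after shuffling:       {r_shuffled:.2f}")
```

Both versions of the data produce the same pair of histograms, yet their scatter plots would look entirely different.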

When one feature is quantitative and the other qualitative, Table %s makes different recommendations. We describe them next.

10.3.2. One Qualitative and One Quantitative Variable

To examine the relationship between a quantitative and a qualitative feature, we often use the qualitative feature to divide the data into groups and compare the distribution of the quantitative feature across these groups. For example, we can compare the distribution of height for small, medium, and large dog breeds (see the three overlaid density curves below). We see that the distributions of height for the small and medium breeds both appear bimodal, with the left mode the larger in each group. Also, the small and medium groups have a larger spread in height than the large group of breeds.

../../_images/eda_relationships_10_0.svg
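The same split-then-compare idea can be expressed numerically. Here is a minimal sketch with pandas, assuming a data frame with hypothetical columns `size` and `height`. The values are made up for illustration and are not the breed data.

```python
import pandas as pd

# Hypothetical toy data: the column names and values are invented.
dogs = pd.DataFrame({
    "size": ["small"] * 5 + ["medium"] * 5 + ["large"] * 5,
    "height": [15, 18, 22, 25, 28, 33, 36, 40, 45, 48, 55, 58, 60, 62, 65],
})

# Use the qualitative feature to form groups, then summarize the center
# and spread of the quantitative feature within each group.
summary = dogs.groupby("size")["height"].agg(["min", "median", "max", "std"])
print(summary)
```

Numeric summaries like these complement the plots, but they cannot reveal features of shape such as bimodality.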

Side-by-side box plots offer a similar comparison of distributions across groups. The box plot offers a simpler approach that can give a crude understanding of a distribution. The three box plots of height, one for each size of dog, make it clear that the size categorization is based on height because there is almost no overlap in height ranges for the groups. (This was not evident in the density curves due to the smoothing.) What we don’t see in these box plots is the bimodality in the small and medium groups, but we can still see that the large dogs have a narrower spread compared to the other two groups.

../../_images/eda_relationships_12_0.svg

Also, the plot on the right is a violin plot of height for each size category. The violin plots sketch density curves along an axis for each group. A flipped version of the density curve is added to create a symmetric “violin”. The violin plot aims to bridge the gap between the density curve and box plot.

Box plots (also known as box-and-whisker plots) give a visual summary of a few important statistics of a distribution. The box denotes the 25th percentile, median, and 75th percentile, the whiskers show the tails, and unusually large or small values are also plotted. Box plots cannot reveal as much of the shape of a distribution as a histogram or density curve. They primarily show symmetry and skew, long/short tails, and unusually large/small values.

Figure 10.2 is a visual explanation of the parts of a box plot. Asymmetry is evident when the median is not in the middle of the box, the size of the tails is shown by the length of the whiskers, and outliers appear as points beyond the whiskers.

../../_images/box_plot.svg

Fig. 10.2 Diagram of a box plot with the summary statistics labeled.
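The summary statistics behind a box plot can be computed directly. The sketch below assumes the common convention that whiskers extend to the most extreme values within 1.5 times the interquartile range of the box. That multiplier is an assumption here (plotting libraries may use other rules), and the function name `box_plot_stats` is hypothetical.

```python
import numpy as np

def box_plot_stats(values, whisker=1.5):
    """Compute box plot statistics, assuming a 1.5 * IQR whisker rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers stop at the most extreme observations inside the fences;
    # anything beyond the fences is plotted as an individual point.
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }

stats = box_plot_stats([20, 22, 23, 25, 26, 28, 30, 31, 60])
print(stats)
```

For these made-up values, the box spans 23 to 30 with the median at 26, the whiskers reach 20 and 31, and the value 60 is flagged as unusually large.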

When we examine the relationship between two qualitative features, our focus is on proportions, as we explain next.

10.3.3. Two Qualitative Features

With two qualitative features, we often compare the distribution of one feature across subgroups defined by the second. In effect, we hold one feature constant and plot the distribution of the second. To do this, we can use some of the same plots we used to display the distribution of one qualitative feature, such as a line plot or bar plot, but put multiple lines and sets of bars in one figure. As an example, let’s examine the relationship between the suitability of a breed for children and the size of the breed.

To examine the relationship between these two qualitative features, we need to calculate three sets of proportions (one each for low, medium, and high suitability). Within each suitability category, we find the proportion of small, medium, and large dogs. These proportions are displayed in the following table. Notice that each column sums to 1 (equivalent to 100%).

kids    high  medium  low
size
large   0.37    0.29  0.1
medium  0.36    0.34  0.2
small   0.27    0.37  0.7
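A table of this form can be computed with `pd.crosstab`. The sketch below uses a small made-up data frame (the column names `size` and `kids` and all values are assumptions, not the breed data) and normalizes within each suitability level, so that the proportions for each level of `kids` sum to 1.

```python
import pandas as pd

# Hypothetical toy data: one row per breed, with invented values.
dogs = pd.DataFrame({
    "size": ["small", "small", "small", "medium", "medium", "large",
             "large", "large", "medium", "small", "large", "medium"],
    "kids": ["low", "low", "medium", "medium", "high", "high",
             "high", "medium", "low", "high", "medium", "high"],
})

# Rows are sizes and columns are suitability levels. normalize="columns"
# divides each count by its column total, giving the breakdown of size
# within each suitability level.
props = pd.crosstab(dogs["size"], dogs["kids"], normalize="columns")
print(props.round(2))

# Each suitability level's proportions should add up to 1.
print(props.sum(axis=0))
```

Choosing `normalize="index"` instead would give the breakdown of suitability within each size, which answers a different question.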

The line plots below provide a visualization of these proportions. There is one “line” (set of connected dots) for each suitability level. Again, the connected dots give the breakdown of size within a suitability category. We see that breeds with low suitability for kids are primarily small.

../../_images/eda_relationships_22_0.svg

We can also present these proportions as a collection of side-by-side bar plots as shown here.

../../_images/eda_relationships_25_0.svg

We’ve covered exploratory visualizations that incorporate one or two features. In the next section, we discuss visualizations that incorporate more than two features.