17.6. Probability for Inference and Prediction

Hypothesis testing, confidence intervals, and prediction intervals rely on probability calculations computed from the sampling distribution and the data generation process. These probability frameworks also enable us to run simulation and bootstrap studies for a hypothetical survey, an experiment, or some other chance process in order to study its random behavior. For example, we found the sampling distribution for an average of ranks under the assumption that the treatment in a Wikipedia experiment was not effective. Using simulation, we quantified the typical deviations from the expected outcome and the distribution of the possible values for the summary statistic. The triptych in Figure 17.1 provided a diagram to guide us in the process; it helped keep straight the differences between the population, probability, and sample and also showed their connections. In this section, we bring more mathematical rigor to these concepts.

We formally introduce the notions of expected value, standard deviation, and random variable, and we connect them to the concepts we have been using in this chapter for testing hypotheses and making confidence and prediction intervals. We begin with the specific example from the Wikipedia experiment and then generalize. Along the way, we connect this formalism to the triptych that we have used as our guide throughout the chapter.


Fig. 17.3 This diagram shows the population, sampling, and sample distributions and their summaries from the Wikipedia example. In this example, the population is known to consist of the integers from 1 to 200, and the sample consists of the ranks of the observed post-productivity measurements for the treatment group. In the middle, the sampling distribution of the average rank is created from a simulation study. Notice it is normal in shape with a center that matches the population average.

17.6.1. Formalizing the theory for average rank statistics

Recall that in the Wikipedia experiment, we pooled the post-award productivity values from the treatment and control groups and converted them into ranks, \(1, 2, 3, \ldots, 200\), so the population is simply made up of the integers from \(1\) to \(200\). Figure 17.3 is a diagram that represents this specific situation. Notice that the population distribution is flat and ranges from \(1\) to \(200\) (see the left side of Figure 17.3). Also, the population summary (called the population parameter) we have used is the average rank:

\[\theta^* ~=~ Avg(pop) ~=~ \frac{1}{200} \Sigma_{k=1}^{200} k ~=~ 100.5. \]

Another relevant summary is the spread about \(\theta^*\), defined as the population standard deviation:

\[ SD(pop) ~=~ \sqrt{\frac {1}{200} \Sigma_{k=1}^{200} (k - \theta^*)^2} ~=~ \sqrt{\frac {1}{200} \Sigma_{k=1}^{200} (k - 100.5)^2} ~\approx~ 57.7 \]

The SD(pop) represents the typical deviation of a rank from the population average. Calculating SD(pop) by hand for this example takes some mathematical handiwork; to learn more, see Pitman.
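Both population summaries can also be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names `pop`, `avg_pop`, and `sd_pop` are ours.

```python
import numpy as np

# The population: the integer ranks 1 through 200.
pop = np.arange(1, 201)

avg_pop = pop.mean()   # population average
sd_pop = pop.std()     # population SD; np.std divides by N by default

print(avg_pop, round(sd_pop, 1))
```

Note that `np.std` uses the divisor \(N\) by default, which matches the definition of SD(pop) above.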

The observed sample consists of the integer ranks of the treatment group; we refer to these values as \(k_1, k_2, \ldots, k_{100}.\) The sample distribution appears on the right in Figure 17.3 (each of the 100 integers appears once).

A parallel to the population average is the sample average, which is our statistic of interest:

\[ Avg(sample) ~=~ \frac{1}{100} \Sigma_{i=1}^{100} k_i ~=~ \bar{k} ~=~113.7. \]

The \(Avg(sample)\) is the observed value for \(\hat{\theta}\). Similarly, the spread about \(Avg(sample)\), called the standard deviation of the sample, represents the typical deviation of a rank in the sample from the sample average:

\[ SD(sample) ~=~ \sqrt{\frac {1}{100} \Sigma_{i=1}^{100} (k_i - \bar{k})^2} ~=~ 55.3.\]

Notice the parallel between the definitions of the sample statistic and the population parameter, in the case where they are averages. The parallel between the two SDs is also noteworthy.

Next we turn to the data generation process: draw 100 marbles from the urn (with values \(1, 2,\ldots,200\)), without replacement, to create the treatment ranks. We represent the action of drawing the first marble from the urn and the integer that we get, by the capital letter \(Z_1\). This \(Z_1\) is called a random variable. It has a probability distribution determined by the urn model. That is, we can list all of the values that \(Z_1\) might take and the probability associated with each:

\[{\mathbb{P}}(Z_1 = k) ~=~ \frac{1}{200} ~~~~\textrm{ for } k=1, \ldots, 200.\]

In this example, the probability distribution of \(Z_1\) is determined by a simple formula because all of the integers are equally likely to be drawn from the urn. (Chapter 3 first introduces the notion of a probability distribution.)

We often summarize the distribution of a random variable by its expected value and standard deviation. As with the population and sample, these two quantities give us a sense of what to expect as an outcome and how far the actual value might be from what is expected.

For our example, the expected value of \(Z_1\) is simply,

\[\begin{split} \begin{aligned} \mathbb{E}[Z_1] &= 1 \mathbb{P}(Z_1 = 1) + 2 \mathbb{P}(Z_1 = 2) + \cdots + 200 \mathbb{P}(Z_1 = 200) \\ &= 1 \times \frac{1}{200} + 2 \times \frac{1}{200} + \cdots + 200 \times \frac{1}{200} \\ &= 100.5 \end{aligned} \end{split}\]

Notice that \(\mathbb{E}[Z_1] = \theta^*\), the population average from the urn. The average value in a population and the expected value of a random variable that represents one draw at random from an urn that contains the population are always the same. This is more easily seen by expressing the population average as a weighted average of the unique values in the population, where each value is weighted by the fraction of units that have that value. The expected value of a random draw from the population urn uses exactly the same weights because they match the chance of selecting each value.


The term expected value can be a bit confusing because it need not be a possible value of the random variable. For example, \(\mathbb{E}[Z_1] = 100.5\), but only integers are possible values for \(Z_1\).

Next, the variance of \(Z_1\) is

\[\begin{split} \begin{aligned} \mathbb{V}(Z_1) &= [1 - \mathbb{E}(Z_1)]^2 \mathbb{P}(Z_1 = 1) + \cdots + [200 - \mathbb{E}(Z_1)]^2 \mathbb{P}(Z_1 = 200) \\ &= (1 - 100.5)^2 \times \frac{1}{200} + \cdots + (200 - 100.5)^2 \times \frac{1}{200} \\ &= 3333.25 \end{aligned} \end{split}\]


\[ SD(Z_1) = \sqrt{\mathbb{V}(Z_1)} = 57.7 \]

We again point out that the standard deviation of \(Z_1 \) matches the \(SD(pop)\).
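The expected value and variance of \(Z_1\) can be computed directly from its probability distribution as probability-weighted sums. Here is a small sketch, assuming NumPy; the names `values` and `probs` are ours.

```python
import numpy as np

values = np.arange(1, 201)        # possible values of Z_1
probs = np.full(200, 1 / 200)     # each value is equally likely

# Expected value: sum of value times probability.
expected = np.sum(values * probs)

# Variance: probability-weighted squared deviations from the expected value.
variance = np.sum((values - expected) ** 2 * probs)
sd = np.sqrt(variance)

print(expected, variance, round(sd, 1))
```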

To describe the entire data generation process in Figure 17.3, we also define, \(Z_2 , Z_3, \ldots, Z_{100}\) as the result of the remaining 99 draws from the urn. By symmetry these random variables should all have the same probability distribution. That is, for any \(k = 1, \ldots, 200\),

\[\mathbb{P}(Z_1 = k) ~=~ \mathbb{P}(Z_2 = k) ~=~ \cdots ~=~ \mathbb{P}(Z_{100} = k) ~=~ \frac{1}{200}.\]

This implies that each \(Z_i\) has the same expected value, 100.5, and standard deviation, 57.7. However, these random variables are not independent. For example, if you know that \(Z_1 = 17\), then it is not possible for \(Z_2 = 17\).
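To make the symmetry and the dependence concrete, we can write out the joint chances for the first two draws. This sketch assumes that every ordered pair of distinct marbles is equally likely, which is what drawing without replacement from the urn implies; the array names are ours.

```python
import numpy as np

N = 200

# Joint distribution of the first two draws without replacement:
# each ordered pair (j, k) with j != k has chance 1 / (N * (N - 1)).
joint = np.full((N, N), 1 / (N * (N - 1)))
np.fill_diagonal(joint, 0.0)   # the same marble cannot be drawn twice

# Marginal distributions: sum the joint chances over the other draw.
p_z1 = joint.sum(axis=1)
p_z2 = joint.sum(axis=0)

# Rank 17 sits at index 16, and rank 18 at index 17.
p_both_17 = joint[16, 16]                   # P(Z_1 = 17, Z_2 = 17)
p_18_given_17 = joint[16, 17] / p_z1[16]    # P(Z_2 = 18 | Z_1 = 17)

print(np.allclose(p_z2, 1 / N))   # Z_2 has the same distribution as Z_1
print(p_both_17)                  # 17 cannot be drawn twice
print(p_18_given_17, 1 / (N - 1))
```

On its own, \(Z_2\) is equally likely to be any of the 200 integers, but once we know \(Z_1 = 17\), the chance that \(Z_2 = 18\) changes from \(1/200\) to \(1/199\).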

To complete the middle portion of Figure 17.3, which involves the sampling distribution of \(\hat{\theta}\), we express the average rank statistic as follows:

\[\hat{\theta} = \frac{1}{100} \Sigma_{i=1}^{100} Z_i\]

We can use the expected value and SD of \(Z_1\) and our knowledge of the data generation process to find the expected value and SD of \(\hat{\theta}\). However, we need some more information about how combinations of random variables behave so we first present the results and then circle back to explain why. We first find the expected value of \(\hat{\theta}\):

\[\begin{split} \begin{align} \mathbb{E}(\hat{\theta}) ~&=~ \mathbb{E}[\frac{1}{100} \Sigma_{i=1}^{100} Z_i]\\ ~&=~ \frac{1}{100} \Sigma_{i=1}^{100} \mathbb{E}[Z_i] \\ ~&=~ 100.5 \\ ~&=~ \theta^* \end{align} \end{split}\]

In other words, the expected value of the average of random draws from the population equals the population average. Below we provide formulas for the variance of the average in terms of the population variance, as well as the SD.

\[\begin{split} \begin{align} \mathbb{V}(\hat{\theta}) ~&=~ \mathbb{V}[\frac{1}{100} \Sigma_{i=1}^{100} Z_i]\\ ~&=~ \frac{200-100}{200-1} \times \frac{\mathbb{V}(Z_1)}{100} \\ ~&=~ 16.75 \\ ~&~\\ SD(\hat{\theta}) ~&=~ \sqrt{\frac{100}{199}} \frac{SD(Z_1)}{10} \\ ~&=~ 4.1 \end{align} \end{split}\]
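As a rough check, we can evaluate the variance formula with \(N = 200\) and \(n = 100\) and compare it to a simulation of the urn. The sketch below assumes NumPy; the seed and the number of repetitions (10,000) are arbitrary choices of ours.

```python
import numpy as np

N, n = 200, 100
pop = np.arange(1, N + 1)
var_pop = pop.var()                  # population variance (divides by N)

# Formula: correction (N - n) / (N - 1) times the population variance over n.
fpc = (N - n) / (N - 1)
var_theta_hat = fpc * var_pop / n
sd_theta_hat = np.sqrt(var_theta_hat)

# Simulation: repeatedly draw n marbles without replacement and average them.
rng = np.random.default_rng(42)
sim_avgs = np.array([
    rng.choice(pop, size=n, replace=False).mean()
    for _ in range(10_000)
])

print(var_theta_hat, round(sd_theta_hat, 1))
print(sim_avgs.mean(), sim_avgs.std())
```

The simulated average and SD of the 10,000 sample averages should land close to 100.5 and 4.1, though the exact values depend on the seed.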

These computations relied on several properties of expected value and variance of a random variable and sums of random variables. Next we describe several useful properties of sums and averages of random variables. These can be used to derive the formulas we just presented for the expected value and SD of the average from a population.

17.6.2. General properties of random variables

In general, a random variable represents a numeric outcome of a probabilistic event. In this book, we use capital letters like \(X\) or \(Y\) or \(Z\) to denote a random variable. The probability distribution for \(X\) is the specification, \(\mathbb{P}(X = x) = p_x\) for all values \(x\) that the random variable takes on.

Then, the expected value of \(X\) is defined as:

\[\mathbb{E}[X] = \sum_{x} x p_x,\]

the variance of \(X\) is defined as:

\[\begin{split} \begin{align} \mathbb{V}(X) ~&=~ \mathbb{E}[(X - \mathbb{E}[X])^2] \\ ~&=~ \sum_{x} [x - \mathbb{E}(X)]^2 p_x, \end{align} \end{split}\]

and, the \(SD(X)\) is the square-root of \(\mathbb{V}(X)\).


Although random variables can represent either discrete (such as, the number of children in a family drawn at random from a population) or continuous (such as, the air quality measured by an air monitor) quantities, we address only random variables with discrete outcomes in this book. Since most measurements are made to a certain degree of precision, this simplification doesn’t limit us too much.

Simple formulas provide the expected value, variance, and standard deviation when we make scale and shift changes to random variables, such as \(a + bX\) for constants \(a\) and \(b\).

\[\begin{split} \begin{aligned} \mathbb{E}(a + bX) ~&=~ a + b\mathbb{E}(X) \\ \mathbb{V}(a + bX) ~&=~ b^2\mathbb{V}(X) \\ SD(a + bX) ~&=~ |b|SD(X) \\ \end{aligned} \end{split}\]

To convince yourself that these formulas make sense, think about how a distribution changes if you added a constant \(a\) to each value: it would simply shift the distribution, which in turn would shift the expected value but not change the size of the deviations about the expected value. On the other hand, scaling the values by, say, 2 would spread the distribution out and essentially double the deviations from the expected value.
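The shift and scale rules can also be verified numerically for a particular distribution. This sketch reuses the distribution of one draw from the urn of ranks; the constants `a = 3` and `b = -2` and the helper functions are illustrative choices of ours.

```python
import numpy as np

values = np.arange(1, 201)
probs = np.full(200, 1 / 200)

def expected_value(x, p):
    """Expected value of a discrete random variable with values x and chances p."""
    return np.sum(x * p)

def variance(x, p):
    """Variance of a discrete random variable with values x and chances p."""
    return np.sum((x - expected_value(x, p)) ** 2 * p)

a, b = 3, -2
transformed = a + b * values   # values of a + bX; the chances are unchanged

e_x, v_x = expected_value(values, probs), variance(values, probs)
e_t, v_t = expected_value(transformed, probs), variance(transformed, probs)

print(e_t, a + b * e_x)                        # E(a + bX) vs a + b E(X)
print(v_t, b ** 2 * v_x)                       # V(a + bX) vs b^2 V(X)
print(np.sqrt(v_t), abs(b) * np.sqrt(v_x))     # SD(a + bX) vs |b| SD(X)
```

A negative \(b\) flips the distribution, which is why the SD rule uses \(|b|\) rather than \(b\).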

We are also interested in the properties of the sum of two or more random variables. Let’s consider two random variables, say \(X\) and \(Y\). Then,

\[ \mathbb{E}(a + bX + cY) ~=~ a + b\mathbb{E}(X) + c\mathbb{E}(Y). \]

But, to find the variance of \(a + bX + cY\), we need to know how \(X\) and \(Y\) vary together, which is called the joint distribution of \(X\) and \(Y\). The joint distribution of \(X\) and \(Y\) assigns probabilities to combinations of their outcomes,

\[ \mathbb{P}(X =x, Y=y) ~=~ p_{x,y} \]

A summary of how \(X\) and \(Y\) vary together, called the covariance, is defined as:

\[\begin{split} \begin{align} Cov(X, Y) ~&=~ \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \\ ~&=~ \mathbb{E}[(XY) - \mathbb{E}(X)\mathbb{E}(Y)] \\ ~&=~ \Sigma_{x,y}[(xy) - \mathbb{E}(X)\mathbb{E}(Y)]p_{x,y} \end{align} \end{split}\]

The covariance enters into the calculation of \(\mathbb{V}(a + bX + cY)\), as shown below:

\[ \mathbb{V}(a + bX + cY) ~=~ b^2\mathbb{V}(X) + 2bcCov(X,Y) + c^2\mathbb{V}(Y) \]

In the special case where \(X\) and \(Y\) are independent, their joint distribution is simplified to \(p_{x,y} = p_x p_y\). And, in this case, \(Cov(X,Y) = 0\) so

\[ \mathbb{V}(a + bX + cY) ~=~ b^2\mathbb{V}(X) + c^2\mathbb{V}(Y) \]
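To see the covariance term at work, the sketch below builds a small joint distribution and computes the variance of \(a + bX + cY\) two ways: directly from the joint distribution and from the formula that includes \(Cov(X, Y)\). The joint probabilities and the constants are made up for illustration.

```python
import numpy as np

x_vals = np.array([0, 1, 2])
y_vals = np.array([0, 1])

# Hypothetical joint distribution p_{x,y}: rows index x, columns index y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.10],
                  [0.05, 0.25]])

# Marginal distributions, expected values, and variances.
p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
e_x, e_y = x_vals @ p_x, y_vals @ p_y
v_x = ((x_vals - e_x) ** 2) @ p_x
v_y = ((y_vals - e_y) ** 2) @ p_y

# Covariance from the definition and from the E(XY) - E(X)E(Y) shortcut.
cov_xy = (x_vals - e_x) @ joint @ (y_vals - e_y)
cov_shortcut = x_vals @ joint @ y_vals - e_x * e_y

# Variance of W = a + bX + cY computed directly from the joint distribution.
a, b, c = 1.0, 2.0, -3.0
w = a + b * x_vals[:, None] + c * y_vals[None, :]
e_w = np.sum(w * joint)
v_w_direct = np.sum((w - e_w) ** 2 * joint)

# The same variance from the formula with the covariance term.
v_w_formula = b ** 2 * v_x + 2 * b * c * cov_xy + c ** 2 * v_y

print(cov_xy, cov_shortcut)
print(v_w_direct, v_w_formula)
```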

These properties can be used to show that for random variables, \(X_1, X_2, \ldots X_n\), that are independent with expected value \(\mu\) and standard deviation \(\sigma\), the average, \(\bar{X}\), has the following expected value, variance, and standard deviation.

\[\begin{split} \begin{align} \mathbb{E}(\bar{X}) ~&=~ \mu\\ \mathbb{V}(\bar{X}) ~&=~ \sigma^2 /n\\ SD(\bar{X}) ~&=~ \sigma/\sqrt{n} \end{align} \end{split}\]

This situation arises with the urn model where \(X_1, \ldots,X_n\) are the result of random draws with replacement. In this case, \(\mu\) represents the average of the urn and \(\sigma\) the standard deviation.

However, when we made random draws from the urn, they were made without replacement. In this situation, \(\bar{X}\) has the following expected value and variance:

\[\begin{split} \begin{align} \mathbb{E}(\bar{X}) ~&=~ \mu\\ \mathbb{V}(\bar{X}) ~&=~ \frac{N-n}{N-1} \times \frac{\sigma^2}{n}\\ \end{align} \end{split}\]

Notice that while the expected value is the same as when the draws are with replacement, the variance and SD are smaller. The variance is multiplied by \((N-n)/(N-1)\), which is called the finite population correction factor, and the SD by its square root. We used this formula earlier to compute the \(SD(\hat{\theta})\) in our Wikipedia example.
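One way to see where the correction factor comes from is sketched here. The sketch assumes two facts not shown in this section: the variance formula for a sum extends to more than two random variables by adding a covariance term for every ordered pair of distinct variables, and, by symmetry, every pair of distinct draws from the urn has the same covariance, which we call \(c\) (our notation). If we drew all \(N\) marbles, the sum of the draws would always equal the sum of the values in the urn, so its variance is 0:

\[ 0 ~=~ \mathbb{V}\left(\Sigma_{i=1}^{N} X_i\right) ~=~ N\sigma^2 + N(N-1)c, ~~~~\textrm{so}~~~~ c ~=~ -\frac{\sigma^2}{N-1}. \]

Substituting this covariance into the variance of the average of \(n\) draws gives:

\[\begin{split} \begin{aligned} \mathbb{V}(\bar{X}) ~&=~ \frac{1}{n^2}\left[n\sigma^2 + n(n-1)c\right] \\ ~&=~ \frac{\sigma^2}{n}\left[1 - \frac{n-1}{N-1}\right] \\ ~&=~ \frac{N-n}{N-1} \times \frac{\sigma^2}{n} \end{aligned} \end{split}\]

The negative covariance reflects that a large value on one draw leaves slightly smaller values, on average, for the other draws.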

Returning to Figure 17.3, we see that the sampling distribution for \(\bar{X}\) in the center of the diagram has an expectation that matches the population average; the SD decreases like \(1/\sqrt{n}\) but even faster because we are drawing without replacement; and the distribution is shaped like a normal curve. We saw these properties earlier in our simulation study, and you can read more about the probability theory behind these observations in XX.

Now that we have outlined the general properties of random variables and their sums, we connect these ideas to testing, confidence, and prediction intervals.

17.6.3. Probability behind testing and intervals

As mentioned at the beginning of this chapter, probability underpins conducting a hypothesis test and providing a confidence interval for an estimator or a prediction interval for a future observation.

We now have the technical machinery to explain these concepts, which we have carefully defined in this chapter without the use of formal technicalities. This time we present the results in terms of random variables and their distributions.

Recall that a hypothesis test relies on a null model which provides the probability distribution for the statistic, \(\hat{\theta}\). The tests we carried out were essentially computing (sometimes approximately) the following probability: given the assumptions of the null distribution,

\[ \mathbb{P}(\hat{\theta} \geq \textrm{observed statistic}) \]

Often, the random variable is normalized to make these computations easier and standard:

\[ \mathbb{P}\left( \frac{\hat{\theta} - {\theta}^*}{SD(\hat{\theta})} \geq \frac{\textrm{observed stat}- \theta^*}{SD(\hat{\theta})}\right)\]

When \(SD(\hat{\theta})\) is not known, we have approximated it via simulation or, when we have a formula for \(SD(\hat{\theta})\) in terms of \(SD(pop)\), we substitute \(SD(sample)\) in for \(SD(pop)\). This normalization is popular because it simplifies the null distribution. For example, if \(\hat{\theta}\) has an approximate normal distribution, then the normalized version will have an approximate standard normal distribution with center 0 and SD of 1. These approximations are useful if a lot of hypothesis tests are being carried out, such as with A/B testing, for there is no need to simulate for every statistic because we can just use the normal curve probabilities.
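For the Wikipedia example, the normalized statistic and its normal-curve probability can be computed from the values found earlier in this section (\(\theta^* = 100.5\), \(SD(\hat{\theta}) \approx 4.1\), and an observed average rank of 113.7). The sketch below uses only the Python standard library; the helper `normal_upper_tail` is our own name, and it relies on the normal approximation to the null distribution.

```python
import math

theta_star = 100.5      # expected value of the statistic under the null model
sd_theta_hat = 4.1      # SD of the statistic under the null model
observed = 113.7        # observed average rank of the treatment group

# Normalize the observed statistic.
z = (observed - theta_star) / sd_theta_hat

def normal_upper_tail(z):
    """Chance that a standard normal random variable is at least z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_value = normal_upper_tail(z)
print(round(z, 2), p_value)
```

The observed average is a bit more than 3 SDs above what the null model expects, so the normal-curve probability is small.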

The probability statement behind a confidence interval is quite similar to the probability calculations used in testing. In particular, to create a 95% confidence interval where the sampling distribution of the estimator is roughly normal, we standardize and use the probability,

\[\begin{split} \begin{aligned} \mathbb{P}\left( \frac{|\hat{\theta} - \theta^*|}{SD(\hat{\theta})} \leq 1.96 \right) &~=~ \mathbb{P}\left(\hat{\theta} - 1.96SD(\hat{\theta}) \leq \theta^* \leq \hat{\theta} + 1.96SD(\hat{\theta}) \right) \\ &~\approx~ 0.95 \end{aligned} \end{split}\]

Note that \(\hat{\theta}\) is a random variable in the above probability statement and \(\theta^*\) is considered a fixed unknown parameter value. The confidence interval is created by substituting the observed statistic in for \(\hat\theta\) and calling it a 95% confidence interval:

\[ \left[\textrm{observed stat} - 1.96SD(\hat{\theta}),~ \textrm{observed stat} + 1.96SD(\hat{\theta}) \right] \]

Once the observed statistic is substituted in for the random variable, then we say that we are 95% confident that the interval we have created contains the true value \(\theta^*\). In other words, in 100 cases where we compute an interval in this way, we expect 95 of them to cover the population parameter that we are estimating. We show how to simulate this scenario in the Exercises.
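The coverage interpretation can be illustrated with a short simulation. This is only a sketch under assumptions of our choosing: the population is the urn of ranks from 1 to 200, each sample is 50 draws with replacement, and \(SD(\hat{\theta})\) is treated as known and equal to \(\sigma/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(0)

pop = np.arange(1, 201)
theta_star = pop.mean()                  # the parameter the intervals aim to cover
n = 50
sd_theta_hat = pop.std() / np.sqrt(n)    # SD of the average for draws with replacement

# Each row is one sample of n draws; each sample yields one interval.
reps = 10_000
samples = rng.choice(pop, size=(reps, n), replace=True)
theta_hats = samples.mean(axis=1)

lower = theta_hats - 1.96 * sd_theta_hat
upper = theta_hats + 1.96 * sd_theta_hat

# Fraction of the intervals that contain the population average.
coverage = np.mean((lower <= theta_star) & (theta_star <= upper))
print(coverage)
```

The fraction of intervals that cover \(\theta^*\) should be close to 0.95, with some variation from one seed to another.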

Lastly, we consider prediction intervals. The basic notion is to provide an interval that denotes the expected variation of a future observation about the estimator. In the simple case, where the statistic is \(\bar{X}\) and we have a hypothetical new observation \(X_0\) that is independent of the \(X_i\) and has the same expected value, say \(\mu\), and standard deviation, say \(\sigma\), as each \(X_i\), we find the expected squared loss:

\[\begin{split} \begin{aligned} \mathbb{E}[(X_0 - \bar{X})^2] ~&=~ \mathbb{E}\{[(X_0 - \mu) - (\bar{X} - \mu)]^2\} \\ ~&=~ \mathbb{V}(X_0) + \mathbb{V}(\bar{X}) \\ ~&=~ \sigma^2 + \sigma^2/n \\ ~&=~ \sigma^2 (1 + 1/n) \end{aligned} \end{split}\]

Notice there are two parts to the variation: one due to the variation of \(X_0\) and the other due to the approximation of \(\mathbb{E}(X_0)\) by \(\bar{X}\).
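A simulation can confirm that the two parts add up. The sketch below compares the simulated average squared loss to \(\sigma^2 + \sigma^2/n\), assuming that all draws, including the new observation \(X_0\), are made independently and with replacement from the urn of ranks; the sample size, seed, and number of repetitions are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

pop = np.arange(1, 201)
sigma2 = pop.var()       # variance of a single draw
n = 25
reps = 50_000

# Each row holds one new observation X_0 followed by a sample of n draws.
draws = rng.choice(pop, size=(reps, n + 1), replace=True)
x0 = draws[:, 0]
xbar = draws[:, 1:].mean(axis=1)

sim_sq_loss = np.mean((x0 - xbar) ** 2)   # simulated expected squared loss
theory = sigma2 + sigma2 / n              # V(X_0) + V(X-bar)

print(sim_sq_loss, theory)
```

Most of the squared loss comes from the variation of \(X_0\) itself; the contribution from estimating \(\mu\) by \(\bar{X}\) shrinks as \(n\) grows.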

In the case of more complex models, the variation in prediction also breaks down into two components: the inherent variation in the data about the model plus the variation in the sampling distribution due to the estimation of the model. Assuming the model is roughly correct, we can express it as follows:

\[ \mathbf{Y} ~=~ \textbf{X}\boldsymbol{\theta}^{*} + \boldsymbol{\epsilon}, \]

where \(\boldsymbol{\theta}^*\) is a \((p+1) \times 1\) column vector, \(\textbf{X}\) is an \(n \times (p+1)\) design matrix, and \(\boldsymbol{\epsilon}\) consists of \(n\) independent random variables that each have expected value 0 and variance \(\sigma^2\). In this equation, \(\mathbf{Y}\) is a vector of random variables, where the expected value of each variable is determined by the design matrix and the variance is \(\sigma^2\). That is, the variation about the line is constant in that it does not change with \(\mathbf{x}\).

When we create prediction intervals in regression, we are given a \(1 \times (p+1)\) row vector of covariates, called \(\mathbf{x}_0\). The prediction is \(\mathbf{x}_0\boldsymbol{\hat{\theta}}\), where \(\boldsymbol{\hat{\theta}}\) is the estimated parameter vector based on the original \(\mathbf{y}\) and design matrix \(\textbf{X}\). The expected squared error in this prediction is

\[\begin{split} \begin{aligned} \mathbb{E}[(Y_0 - \mathbf{x_0} \boldsymbol{\hat{\theta}})^2] ~&=~ \mathbb{E}\{[(Y_0 - \mathbf{x_0\boldsymbol{\theta}^{*} }) - (\mathbf{x_0}\boldsymbol{\hat{\theta}} - \mathbf{x_0}\boldsymbol{\theta}^{*})]^2\} \\ ~&=~ \mathbb{V}(\epsilon) + \mathbb{V}(\mathbf{x_0}\boldsymbol{\hat{\theta}}) \\ ~&=~ \sigma^2 [1 + \mathbf{x}_0 (\textbf{X}^\top \textbf{X})^{-1} \mathbf{x}_0^\top] \\ \end{aligned} \end{split}\]

We approximate the variance of \(\epsilon\) with the variance of the residuals from the least squares fit.
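The pieces of this calculation can be sketched with NumPy on synthetic data. Everything about the data here is an assumption made for illustration: a single covariate, a true parameter vector of (2, 0.5), normal errors with SD 1, and a new covariate value of 4. The interval at the end also assumes the errors are roughly normal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a linear model (hypothetical parameter values).
n = 200
theta_star = np.array([2.0, 0.5])
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])     # n x (p + 1) design matrix with intercept
y = X @ theta_star + rng.normal(0, 1.0, size=n)

# Least squares fit and the variance of the residuals.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ theta_hat
sigma2_hat = residuals.var()

# Prediction at a new covariate vector x0 (includes the intercept term).
x0 = np.array([1.0, 4.0])
prediction = x0 @ theta_hat

# x0 (X^T X)^{-1} x0^T, computed with a linear solve rather than an inverse.
quad_form = x0 @ np.linalg.solve(X.T @ X, x0)
pred_var = sigma2_hat * (1 + quad_form)
pred_sd = np.sqrt(pred_var)

# Approximate 95% prediction interval, assuming roughly normal errors.
interval = (prediction - 1.96 * pred_sd, prediction + 1.96 * pred_sd)
print(prediction, pred_sd, interval)
```

With 200 observations, the term that accounts for estimating the parameters is small compared to 1, so nearly all of the prediction variance comes from the error variance \(\sigma^2\).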

The prediction intervals we create using the normal curve rely on the additional assumption that the distribution of the errors is approximately normal. This is a stronger assumption than we make for the confidence intervals. With confidence intervals, the probability distribution of \(X_i\) need not look normal for \(\bar{X}\) to have an approximate normal distribution. Similarly, the probability distribution of \(\boldsymbol{\epsilon}\) in the linear model need not look normal for the estimator \(\hat{\theta}\) to have an approximate normal distribution.

We also assume that the linear model is approximately correct when making these prediction intervals. In Chapter 16 we consider the case where the fitted model doesn’t match the model that has produced the data.

While we have covered a lot of theory in this chapter, we have attempted to tie it to the basics of the urn model and the three distributions: population, sample, and sampling. We wrap up the chapter with a few cautions to keep in mind when performing hypothesis tests and when making confidence or prediction intervals.