6  Sampling Distributions

This is perhaps the most difficult chapter in the whole book. That is because sampling distributions are a very abstract concept that many students struggle to grasp. And yet sampling distributions are the core of how we conduct statistical inference. You may need to reread this chapter a few times before it makes much sense, but doing so will be well worth your time if you want to understand applied statistics.

6.1 Introduction to Sampling Distributions1

Suppose you randomly sampled 10 people from the population of women in Houston, Texas, between the ages of 21 and 35 years and computed the mean height of your sample. You would not expect your sample mean to be equal to the mean of all women in Houston. It might be somewhat lower or it might be somewhat higher, but it would not equal the population mean exactly. Similarly, if you took a second sample of 10 people from the same population, you would not expect the mean of this second sample to equal the mean of the first sample.

Recall that inferential statistics concern generalizing from a sample to a population. A critical part of inferential statistics involves determining how far sample statistics are likely to vary from each other and from the population parameter. (In this example, the sample statistics are the sample means and the population parameter is the population mean.) As the later portions of this chapter show, these determinations are based on sampling distributions.

6.1.1 Discrete Distributions

We will illustrate the concept of sampling distributions with a simple example. Figure 6.1 shows three pool balls, each with a number on it. Two of the balls are selected randomly (with replacement) and the average of their numbers is computed.

Figure 6.1: The pool balls.

All possible outcomes are shown below in Table 6.1.

Table 6.1: All possible outcomes when two balls are sampled with replacement.
Outcome Ball 1 Ball 2 Mean
1 1 1 1.0
2 1 2 1.5
3 1 3 2.0
4 2 1 1.5
5 2 2 2.0
6 2 3 2.5
7 3 1 2.0
8 3 2 2.5
9 3 3 3.0

Notice that all the means are either 1.0, 1.5, 2.0, 2.5, or 3.0. The frequencies of these means are shown in Table 6.2. The relative frequencies are equal to the frequencies divided by nine because there are nine possible outcomes.

Table 6.2: Frequencies of means for n = 2.
Mean Frequency Relative Frequency
1.0 1 0.111
1.5 2 0.222
2.0 3 0.333
2.5 2 0.222
3.0 1 0.111

Figure 6.2 shows a relative frequency distribution of the means based on Table 6.2. This distribution is also a probability distribution since the Y-axis is the probability of obtaining a given mean from a sample of two balls in addition to being the relative frequency.

Figure 6.2: Distribution of means for n = 2.

The distribution shown in Figure 6.2 is called the sampling distribution of the mean. Specifically, it is the sampling distribution of the mean for a sample size of 2 (n = 2). For this simple example, the distribution of pool balls and the sampling distribution are both discrete distributions. The pool balls have only the values 1, 2, and 3, and a sample mean can have one of only five values shown in Table 6.2.

There is an alternative way of conceptualizing a sampling distribution that will be useful for more complex distributions. Imagine that two balls are sampled (with replacement) and the mean of the two balls is computed and recorded. Then this process is repeated for a second sample, a third sample, and eventually thousands of samples. After thousands of samples are taken and the mean computed for each, a relative frequency distribution is drawn. The more samples, the closer this relative frequency distribution will come to the sampling distribution shown in Figure 6.2; as the number of samples approaches infinity, the relative frequency distribution approaches the sampling distribution exactly. You can therefore conceive of a sampling distribution as the relative frequency distribution that would result from a very large number of samples (strictly, an infinite number).
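
For readers who want to verify Tables 6.1 and 6.2 and the repeated-sampling view just described, here is a minimal sketch in Python (Python is not used elsewhere in this chapter; it is simply one convenient tool). It enumerates the nine equally likely outcomes exactly and then approximates the same distribution by drawing many samples.

```python
import random
from collections import Counter
from itertools import product

balls = [1, 2, 3]

# Exact sampling distribution of the mean for n = 2 (Table 6.2):
# enumerate all 9 equally likely ordered outcomes and tabulate the means.
means = [(b1 + b2) / 2 for b1, b2 in product(balls, repeat=2)]
exact = {m: c / len(means) for m, c in sorted(Counter(means).items())}
print(exact)  # {1.0: 0.111, 1.5: 0.222, 2.0: 0.333, 2.5: 0.222, 3.0: 0.111}

# Repeated-sampling view: draw many samples of two balls (with replacement),
# record each sample mean, and look at the relative frequencies.
random.seed(1)
reps = 100_000
sim = Counter(sum(random.choices(balls, k=2)) / 2 for _ in range(reps))
print({m: round(c / reps, 3) for m, c in sorted(sim.items())})
```

The more replications used in the second half, the closer its relative frequencies come to the exact values in the first half.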

It is important to keep in mind that every statistic, not just the mean, has a sampling distribution. For example, Table 6.3 shows all possible outcomes for the range of two numbers (larger number minus the smaller number). Table 6.4 shows the frequencies for each of the possible ranges and Figure 6.3 shows the sampling distribution of the range.

Table 6.3: All possible outcomes when two balls are sampled with replacement.
Outcome Ball 1 Ball 2 Range
1 1 1 0
2 1 2 1
3 1 3 2
4 2 1 1
5 2 2 0
6 2 3 1
7 3 1 2
8 3 2 1
9 3 3 0

Table 6.4: Distribution of ranges for n = 2.
Range Frequency Relative Frequency
0 3 0.333
1 4 0.444
2 2 0.222

Figure 6.3: Distribution of ranges for n = 2.

It is also important to keep in mind that there is a different sampling distribution for each sample size. For simplicity, we have been using n = 2. The sampling distribution of the range for n = 3 is shown in Figure 6.4.

Figure 6.4: Distribution of ranges for n = 3.
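
As a sketch of how the distribution in Figure 6.4 could be produced, the following Python snippet enumerates all 27 equally likely ordered outcomes for n = 3 and tabulates the sampling distribution of the range.

```python
from collections import Counter
from itertools import product

balls = [1, 2, 3]

# All 27 ordered outcomes when three balls are drawn with replacement;
# the range of each sample is its maximum minus its minimum.
ranges = [max(s) - min(s) for s in product(balls, repeat=3)]
dist = {r: c / len(ranges) for r, c in sorted(Counter(ranges).items())}
print(dist)  # ranges 0, 1, 2 with relative frequencies 3/27, 12/27, 12/27
```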

6.1.2 Continuous Distributions

In the previous section, the population consisted of three pool balls. Now we will consider sampling distributions when the population distribution is continuous. What if we had a thousand pool balls with numbers ranging from 0.001 to 1.000 in equal steps? (Although this distribution is not really continuous, it is close enough to be considered continuous for practical purposes.) As before, we are interested in the distribution of means we would get if we sampled two balls and computed the mean of these two balls. In the previous example, we started by computing the mean for each of the nine possible outcomes. This would get a bit tedious for this example since there are 1,000,000 possible outcomes (1,000 for the first ball × 1,000 for the second). Therefore, it is more convenient to use our second conceptualization of sampling distributions, which conceives of them in terms of relative frequency distributions: specifically, the relative frequency distribution that would occur if samples of two balls were repeatedly taken and the mean of each sample computed.
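
A minimal simulation sketch of this idea, assuming a population of 1,000 equally spaced ball values as described above, might look like the following; a histogram of the simulated sample means approximates the sampling distribution of the mean for n = 2.

```python
import random

# The "population" of 1,000 balls numbered 0.001, 0.002, ..., 1.000.
population = [i / 1000 for i in range(1, 1001)]

# Repeated-sampling conceptualization: draw two balls with replacement,
# record the sample mean, and repeat many times.
random.seed(1)
means = [sum(random.choices(population, k=2)) / 2 for _ in range(100_000)]

# Crude text histogram with 10 bins of width 0.1. The shape is triangular:
# the mean of two (roughly) uniform values is most likely to fall near 0.5.
bins = [0] * 10
for m in means:
    bins[min(int(m * 10), 9)] += 1
for i, count in enumerate(bins):
    print(f"{i/10:.1f}-{(i+1)/10:.1f}: {count / len(means):.3f}")
```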

When we have a truly continuous distribution, it is not only impractical but actually impossible to enumerate all possible outcomes. Moreover, in continuous distributions, the probability of obtaining any single value is zero. Therefore, these values are called probability densities rather than probabilities.

6.1.3 Sampling Distributions and Inferential Statistics

As we stated in the beginning of this chapter, sampling distributions are important for inferential statistics. In the examples given so far, a population was specified and the sampling distribution of the mean and the range were determined. In practice, the process proceeds the other way: you collect sample data and from these data you estimate parameters of the sampling distribution. This knowledge of the sampling distribution can be very useful. For example, knowing the degree to which means from different samples would differ from each other and from the population mean would give you a sense of how close your particular sample mean is likely to be to the population mean. Fortunately, this information is directly available from a sampling distribution. The most common measure of how much sample means differ from each other is the standard deviation of the sampling distribution of the mean. This standard deviation is called the standard error of the mean. If all the sample means were very close to the population mean, then the standard error of the mean would be small. On the other hand, if the sample means varied considerably, then the standard error of the mean would be large.

To be specific, assume that your sample mean is 125 and that you have estimated the standard error of the mean to be 5 (using a method shown in a later section). If the sampling distribution is normal, then your sample mean is likely to be within 10 units of the population mean, since most of a normal distribution is within two standard deviations of the mean.

Keep in mind that all statistics have sampling distributions, not just the mean. For example, later in this chapter we will construct confidence intervals relying on the sampling distribution for a regression slope coefficient.

6.2 Sampling Distribution of the Mean2

As we learned in the prior section, the sampling distribution of the mean refers to the probability distribution describing all possible values of the sample mean we could obtain in repeated sampling. This section goes over some important properties of the sampling distribution of the mean.

6.2.1 Mean

The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled. Therefore, if a population has a mean \(\mu\), then the mean of the sampling distribution of the mean (\(\bar{X}\)) is also \(\mu\). The symbol \(\mu_{\bar{X}}\) is used to refer to the mean of the sampling distribution of the mean. Therefore, the formula for the mean of the sampling distribution of the mean can be written as:

\[ \mu_{\bar{X}} = \mu \]

6.2.2 Variance

The variance of the sampling distribution of the mean is computed as follows:

\[ \sigma^2_{\bar{X}} = \frac{\sigma^2}{n} \]

That is, the variance of the sampling distribution of the mean is the population variance divided by \(n\), the sample size (the number of scores used to compute a mean).3 Thus, the larger the sample size, the smaller the variance of the sampling distribution of the mean.

As noted previously, the standard error of the mean is the standard deviation of the sampling distribution of the mean. It is therefore the square root of the variance of the sampling distribution of the mean and can be written as:

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]

The standard error is represented by a \(\sigma\) because it is a standard deviation. The subscript (\(\bar{X}\)) indicates that the standard error in question is the standard error of the (sample) mean.
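
As a quick check of these formulas, the following sketch simulates sample means from the three-ball population of Section 6.1 (population mean 2, population variance 2/3) and compares the simulated standard deviation of the sample means with \(\sigma/\sqrt{n}\). The number of replications and the seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)
balls = [1, 2, 3]
n = 2

# Standard deviation of many simulated sample means for samples of size n.
sample_means = [statistics.mean(random.choices(balls, k=n)) for _ in range(200_000)]

print(statistics.pstdev(balls) / n ** 0.5)   # theoretical standard error, sigma / sqrt(n)
print(statistics.stdev(sample_means))        # simulated standard error (close to the above)
```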

6.2.3 Central Limit Theorem

The central limit theorem states that:

Given a population with a finite mean \(\mu\) and a finite non-zero variance \(\sigma^2\), the sampling distribution of the mean approaches a normal distribution with a mean of \(\mu\) and a variance of \(\sigma^2/n\) as \(n\), the sample size, increases.

The expressions for the mean and variance of the sampling distribution of the mean are not new or remarkable. What is remarkable is that regardless of the shape of the parent population, the sampling distribution of the mean approaches a normal distribution as n increases. Figure 6.5 shows the results of a simulation for n = 2 and n = 10. The parent population was a uniform distribution. You can see that the distribution for n = 2 is far from a normal distribution. Nonetheless, it does show that the scores are denser in the middle than in the tails. For n = 10 the distribution is quite close to a normal distribution. Notice that the means of the two distributions are the same, but that the spread of the distribution for n = 10 is smaller.

Figure 6.5: A simulation of a sampling distribution. The parent population is uniform. The blue line under “16” indicates that 16 is the mean. The red line extends from the mean plus and minus one standard deviation.

Figure 6.6 shows how closely the sampling distribution of the mean approximates a normal distribution even when the parent population is very non-normal. If you look closely you can see that the sampling distributions do have a slight positive skew. The larger the sample size, the closer the sampling distribution of the mean would be to a normal distribution.

Figure 6.6: A simulation of a sampling distribution. The parent population is very non-normal.
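
The simulations behind Figures 6.5 and 6.6 are not reproduced here, but the central limit theorem is easy to explore with a sketch like the one below. It uses an exponential parent distribution as an example of a strongly right-skewed population (an assumption made only for this illustration); as n grows, the simulated sample means concentrate around the population mean and their spread shrinks like \(1/\sqrt{n}\).

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    # Mean of n draws from an exponential population with mean 1.
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

for n in (2, 10, 50):
    means = [sample_mean(n) for _ in range(50_000)]
    # As n grows, the simulated means cluster around 1 (the parent mean)
    # and their standard deviation approaches 1 / sqrt(n).
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```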

6.3 Calculating Confidence Intervals

6.3.1 Confidence Intervals for the Mean4

When you compute a confidence interval on the mean, you compute the mean of a sample in order to estimate the mean of the population. Clearly, if you already knew the population mean, there would be no need for a confidence interval. However, to explain how confidence intervals are constructed, we are going to work backwards and begin by assuming characteristics of the population. Then we will show how sample data can be used to construct a confidence interval.

6.3.1.1 An Artificial Example: Using the Normal Distribution

Assume that the weights of 10-year-old children are normally distributed with a mean of 90 and a standard deviation of 36. What is the sampling distribution of the mean for a sample size of 9? Recall from Section 6.2 that the mean of the sampling distribution is \(\mu\) and the standard error of the mean is

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]

For the present example, the sampling distribution of the mean has a mean of 90 and a standard deviation of 36/3 = 12. Note that the standard deviation of a sampling distribution is its standard error. Figure 6.7 shows this distribution. The shaded area represents the middle 95% of the distribution and stretches from 66.48 to 113.52. These limits were computed by adding and subtracting 1.96 standard deviations to/from the mean of 90 as follows:

\[ 90 - (1.96)(12) = 66.48 \]

\[ 90 + (1.96)(12) = 113.52 \]

The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean; 12 is the standard error of the mean given our sample size of 9.

Figure 6.7: The sampling distribution of the mean for n=9. The middle 95% of the distribution is shaded.

Figure 6.7 shows that 95% of the means are no more than 23.52 units (1.96 standard deviations) from the mean of 90. Now consider the probability that a sample mean computed in a random sample is within 23.52 units of the population mean of 90. Since 95% of the distribution is within 23.52 of 90, the probability that the mean from any given sample will be within 23.52 of 90 is 0.95. This means that if we repeatedly compute the mean (\(\bar{X}\)) from a sample (drawing a new random sample each time), and create an interval ranging from \(\bar{X} - 23.52\) to \(\bar{X} + 23.52\), this interval will contain the population mean 95% of the time. In general, you compute the 95% confidence interval for the mean with the following formula:

\[ \text{Lower limit} = \bar{X} - Z_{.95}\times\sigma_{\bar{X}} \]\[ \text{Upper limit} = \bar{X} + Z_{.95}\times\sigma_{\bar{X}} \]

where \(Z_{.95}\) is the number of standard deviations extending from the mean of a normal distribution required to contain 0.95 of the area (always equal to 1.96) and \(\sigma_{\bar{X}}\) is the standard error of the mean.

If you look closely at this formula for a confidence interval, you will notice that you need to know the population standard deviation (\(\sigma\)) in order to compute the interval. This may sound unrealistic, and it is. However, computing a confidence interval when \(\sigma\) is known is easier than when \(\sigma\) has to be estimated, and it serves a pedagogical purpose. Later in this section we will show how to compute a confidence interval for the mean when \(\sigma\) has to be estimated.

Suppose the following five numbers were sampled from a normal distribution with a standard deviation of 2.5: 2, 3, 5, 6, and 9. To compute the 95% confidence interval, start by computing the mean and standard error:

\[ \bar{X} = (2 + 3 + 5 + 6 + 9)/5 = 5. \]

\[ \sigma_{\bar{X}}=\frac{2.5}{\sqrt5}=1.118. \]

\(Z_{.95}\) can be found using the normal distribution calculator5 and specifying that the shaded area is 0.95 and indicating that you want the area to be between the cutoff points. As shown in Figure 6.8, the value is 1.96. If you had wanted to compute the 99% confidence interval, you would have set the shaded area to 0.99 and the result would have been 2.58.

Figure 6.8: 95% of the area is between -1.96 and 1.96.

The confidence interval can then be computed as follows:

\[ \text{Lower limit} = 5 - (1.96)(1.118)= 2.81 \]\[ \text{Upper limit} = 5 + (1.96)(1.118)= 7.19 \]
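
These calculations can be reproduced with a few lines of Python using only the standard library (NormalDist supplies the 1.96 cutoff that the normal distribution calculator gives); this is just one way to check the arithmetic.

```python
from statistics import NormalDist, mean

# 95% confidence interval for the mean when sigma is known (sigma = 2.5).
data = [2, 3, 5, 6, 9]
sigma = 2.5
xbar = mean(data)                     # 5.0
se = sigma / len(data) ** 0.5         # 1.118
z = NormalDist().inv_cdf(0.975)       # 1.96, the value read from Figure 6.8

print(xbar - z * se, xbar + z * se)   # roughly 2.81 and 7.19
```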

6.3.1.2 The Realistic Case: Using the T Distribution

You should use the t distribution rather than the normal distribution when the variance is not known and has to be estimated from sample data. You will learn more about the t distribution in the next section. When the sample size is large, say 100 or above, the t distribution is very similar to the standard normal distribution. However, with smaller sample sizes, the t distribution has relatively more scores in its tails than does the normal distribution. As a result, you have to extend farther from the mean to contain a given proportion of the area. Recall that with a normal distribution, 95% of the distribution is within 1.96 standard deviations of the mean. Using the t distribution, if you have a sample size of only 5, 95% of the area is within 2.78 standard deviations of the mean. Therefore, the standard error of the mean would be multiplied by 2.78 rather than 1.96.

The values of t to be used in a confidence interval can be looked up in a table of the t distribution, a small version of which is provided in the following section. You can also use an “inverse t distribution” calculator6 to find the t values to use in confidence intervals. With either approach, the t values will vary depending upon the degrees of freedom (df). For confidence intervals on the mean, df is equal to n - 1, where n is the sample size.

Assume that the following five numbers are sampled from a normal distribution: 2, 3, 5, 6, and 9 and that the standard deviation is not known. The first steps are to compute the sample mean and variance:

\[ \bar{X} = 5 \]

\[ s^2 = 7.5 \]

The next step is to estimate the standard error of the mean. If we knew the population variance, we could use the following formula:

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]

Instead we compute an estimate of the standard error (\(s_{\bar{X}}\)). Note that since we previously computed the sample variance (\(s^2 = 7.5\)), we can take the square root of this to obtain the sample standard deviation (\(s\)):

\[ s_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{\sqrt{7.5}}{\sqrt{5}} = 1.225 \]

The next step is to find the value of t. As shown in Table 6.6 of the following section, the value for the 95% interval for df = n - 1 = 4 is 2.776. The confidence interval is then computed just as it is when \(\sigma_{\bar{X}}\) is known; the only differences are that \(s_{\bar{X}}\) and \(t\) are used rather than \(\sigma_{\bar{X}}\) and \(Z\).

\[ \text{Lower limit} = 5 - (2.776)(1.225) = 1.60 \]\[ \text{Upper limit} = 5 + (2.776)(1.225) = 8.40 \]

More generally, the formula for the 95% confidence interval on the mean is:

\[ \text{Lower limit} = \bar{X} - (t_{CL})(s_{\bar{X}}) \]\[ \text{Upper limit} = \bar{X} + (t_{CL})(s_{\bar{X}}) \]

where \(\bar{X}\) is the sample mean, \(t_{CL}\) is the t for the confidence level desired (0.95 in the above example), and \(s_{\bar{X}}\) is the estimated standard error of the mean.
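
The worked example above can also be checked programmatically. The sketch below assumes SciPy is available for the inverse t lookup; reading 2.776 from Table 6.6 works just as well.

```python
from statistics import mean, stdev
from scipy import stats  # assumes SciPy is available; Table 6.6 gives the same value

data = [2, 3, 5, 6, 9]
n = len(data)
xbar = mean(data)                       # 5.0
s_xbar = stdev(data) / n ** 0.5         # sqrt(7.5) / sqrt(5) = 1.225
t_crit = stats.t.ppf(0.975, df=n - 1)   # 2.776 for 4 degrees of freedom

print(xbar - t_crit * s_xbar, xbar + t_crit * s_xbar)  # roughly 1.60 and 8.40
```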

We will finish our discussion of confidence intervals for the mean with an analysis of the Stroop Data.7 Specifically, we will compute a confidence interval on the mean difference score. As mentioned in Section 1.4.2, the study involved 47 subjects naming the color of ink that words were written in. An additional detail that is now relevant to us is that subjects completed similar naming tasks multiple times under different conditions. In the “interference” condition, the names conflicted so that, for example, they would name the ink color of the word “blue” written in red ink. The correct response is to say “red” and ignore the fact that the word is “blue.” In a second condition, subjects named the ink color of colored rectangles.

Table 6.5: Response times in seconds for 10 subjects.
Naming Colored Rectangle Interference Difference
17 38 21
15 58 43
18 35 17
20 39 19
18 33 15
20 32 12
20 45 25
19 52 33
17 31 14
21 29 8

Table 6.5 shows the time difference between the interference and color-naming conditions for 10 of the 47 subjects. The mean time difference for all 47 subjects is 16.362 seconds and the standard deviation is 7.470 seconds. The standard error of the mean is 1.090. A t table shows that the critical value of t for 47 - 1 = 46 degrees of freedom is 2.013 (for a 95% confidence interval). Therefore the confidence interval is computed as follows:

\[ \text{Lower limit} = 16.362 - (2.013)(1.090) = 14.17 \]\[ \text{Upper limit} = 16.362 + (2.013)(1.090) = 18.56 \]

Therefore, the interference effect (difference) for the whole population is likely to be between 14.17 and 18.56 seconds.
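
Because only summary statistics are needed, this interval can be reproduced directly from the reported mean, standard deviation, and sample size. The sketch below again assumes SciPy for the critical t value.

```python
from scipy import stats  # assumes SciPy is available for the t critical value

# Summary statistics reported in the text: mean 16.362, sd 7.470, n = 47.
n, xbar, s = 47, 16.362, 7.470
s_xbar = s / n ** 0.5                    # about 1.090
t_crit = stats.t.ppf(0.975, df=n - 1)    # about 2.013 for 46 degrees of freedom

print(xbar - t_crit * s_xbar, xbar + t_crit * s_xbar)  # roughly 14.17 and 18.56
```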

6.3.2 More about the T Distribution8

In the introduction to normal distributions it was shown that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean. Therefore, if you randomly sampled a value from a normal distribution with a mean of 100, the probability it would be within 1.96\(\sigma\) of 100 is 0.95. Similarly, if you sample n values from the population, the probability that the sample mean (\(\bar{X}\)) will be within 1.96 \(\sigma_{\bar{X}}\) of 100 is 0.95.

Now consider the case in which you have a normal distribution but you do not know the standard deviation. You sample \(n\) values and compute the sample mean (\(\bar{X}\)) and estimate the standard error of the mean (\(\sigma_{\bar{X}}\)) with \(s_{\bar{X}}\). What is the probability that \(\bar{X}\) will be within 1.96 \(s_{\bar{X}}\) of the population mean (\(\mu\))? This is a difficult problem because there are two ways in which \(\bar{X}\) could be more than 1.96 \(s_{\bar{X}}\) from \(\mu\): (1) \(\bar{X}\) could, by chance, be either very high or very low and (2) \(s_{\bar{X}}\) could, by chance, be very low. Intuitively, it makes sense that the probability of being within 1.96 standard errors of the mean should be smaller than in the case when the standard deviation is known (and cannot be underestimated). But exactly how much smaller? Fortunately, this type of problem was solved in the early 20th century by W. S. Gosset, who determined the distribution of a mean divided by an estimate of its standard error. This distribution is called the Student’s t distribution or sometimes just the t distribution. Gosset worked out the t distribution and associated statistical tests while working for a brewery in Ireland. Because of a contractual agreement with the brewery, he published the article under the pseudonym “Student.” That is why the t test is called the “Student’s t test.”

The t distribution is very similar to the normal distribution when the estimate of variance is based on a large sample, but the t distribution has relatively more scores in its tails when there is a small sample. When working with the t distribution, sample size is expressed in what are called degrees of freedom. Degrees of freedom indicate the number of independent pieces of information on which an estimate is based; a more complete discussion of the concept is provided in Appendix I at the end of this chapter. As we noted in the prior section, when we are estimating the standard error for a sample mean, the degrees of freedom is simply equal to the sample size minus one (n-1).

Figure 6.9 shows t distributions with 2, 4, and 10 degrees of freedom and the standard normal distribution. Notice that the normal distribution has relatively more scores in the center of the distribution and the t distribution has relatively more in the tails. The t distribution approaches the normal distribution as the degrees of freedom increase.

Figure 6.9: A comparison of t distributions with 2, 4, and 10 df and the standard normal distribution. The distribution with the lowest peak is the 2 df distribution, the next lowest is 4 df, the lowest after that is 10 df, and the highest is the standard normal distribution.

Since the t distribution has more area in the tails, the percentage of the distribution within 1.96 standard deviations of the mean is less than the 95% for the normal distribution. Table 6.6 shows the number of standard deviations from the mean required to contain 95% and 99% of the area of the t distribution for various degrees of freedom. These are the values of t that you use in a confidence interval. The corresponding values for the normal distribution are 1.96 and 2.58, respectively. Notice that with few degrees of freedom the values of t are much higher than the corresponding values for a normal distribution, and that this difference shrinks as the degrees of freedom increase: the critical t value for 0.95 gets closer and closer to 1.96 (the critical value for 0.95 in the normal distribution). The values shown in Table 6.6 can be obtained from statistical software or an online calculator.9

Table 6.6: Abbreviated t table.
df 0.95 0.99
2 4.303 9.925
3 3.182 5.841
4 2.776 4.604
5 2.571 4.032
8 2.306 3.355
10 2.228 3.169
20 2.086 2.845
50 2.009 2.678
100 1.984 2.626
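
Table 6.6 can be regenerated with an inverse t function; the following sketch, which assumes SciPy is available, prints the same critical values.

```python
from scipy import stats  # assumes SciPy; an inverse t calculator gives the same values

# Reproduce Table 6.6: two-tailed critical t values for 95% and 99% confidence.
for df in (2, 3, 4, 5, 8, 10, 20, 50, 100):
    t95 = stats.t.ppf(0.975, df)   # leaves 2.5% in each tail
    t99 = stats.t.ppf(0.995, df)   # leaves 0.5% in each tail
    print(f"{df:>3}  {t95:.3f}  {t99:.3f}")
```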

Returning to the problem posed at the beginning of this section, suppose you sampled 9 values from a normal population and estimated the standard error of the mean (\(\sigma_{\bar{X}}\)) with \(s_{\bar{X}}\). What is the probability that \(\bar{X}\) would be within 1.96\(s_{\bar{X}}\) of \(\mu\)? Since the sample size is 9, there are n - 1 = 8 df. From Table 6.6, you can see that with 8 df the probability is 0.95 that the mean will be within 2.306 \(s_{\bar{X}}\) of \(\mu\). The probability that it will be within 1.96 \(s_{\bar{X}}\) of \(\mu\) is therefore lower than 0.95.

As shown in Figure 6.10, a t distribution calculator10 can be used to find that 0.086 of the area of a t distribution is more than 1.96 standard deviations from the mean, so the probability that \(\bar{X}\) would be less than 1.96\(s_{\bar{X}}\) from \(\mu\) is 1 - 0.086 = 0.914.

Figure 6.10: Area more than 1.96 standard deviations from the mean in a t distribution with 8 df. Note that the two-tailed button is selected so that the area in both tails will be included.

As expected, this probability is less than the 0.95 that would have been obtained if \(\sigma_{\bar{X}}\) had been known instead of estimated.
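
The 0.086 tail area used above can be obtained from a t distribution function as well as from the calculator; a short sketch assuming SciPy:

```python
from scipy import stats  # assumes SciPy; a t distribution calculator gives the same area

# Probability that the sample mean is within 1.96 estimated standard errors of mu
# when the standard error is estimated with 8 degrees of freedom.
tail_area = 2 * stats.t.sf(1.96, df=8)   # area beyond +/-1.96 in both tails, about 0.086
print(1 - tail_area)                     # about 0.914, less than 0.95
```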

6.3.3 Confidence Intervals for a Regression Slope Coefficient 11

The method for computing a confidence interval for the population slope in a simple linear regression is very similar to methods for computing other confidence intervals. For the 95% confidence interval, the formula is:

\[ \text{Lower limit} = \hat{\beta} - (t_{.95})(s_{\beta}) \]\[ \text{Upper limit} = \hat{\beta} + (t_{.95})(s_{\beta}) \]

where \(\hat{\beta}\) is the slope coefficient estimate, \(t_{.95}\) is the value of t for 95% (2-tailed) confidence, and \(s_{\beta}\) is the standard error for the slope estimate. As before, the t value can be found from a table or an inverse t distribution calculator based on the degrees of freedom.

We illustrate generating a confidence interval using the same data as in the example from Section 3.4, depicted again here as Figure 6.11.

Figure 6.11: A scatter plot of the example data. The black line consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent errors of prediction.

When conducting statistical inference for linear regression coefficients, the degrees of freedom is equal to the number of observations minus the number of coefficients being estimated (usually one for the intercept plus one for each independent variable). In the case of simple regression, there is just one independent variable plus a y-intercept, so the number of degrees of freedom is:

\[ df = n-2 \]

where \(n\) is the number of pairs of scores (number of observations in the sample).

As we saw in Section 3.4, the estimated regression slope coefficient is 0.425 with this data. An \(n\) of 5 yields 3 degrees of freedom (5 - 2 = 3), which means the critical t value (at 95% confidence) is 3.182. Finally, the estimated standard error for this slope coefficient (the calculation of which is shown in this chapter’s Appendix II) is 0.305.

Applying these formulas we obtain a confidence interval with the following lower and upper limits:

\[ \text{Lower limit} = 0.425 - (3.182)(0.305) = -0.55 \]\[ \text{Upper limit} = 0.425 + (3.182)(0.305) = 1.40 \]
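
This interval can also be reproduced with a regression routine that reports the slope and its standard error. The sketch below uses scipy.stats.linregress (an assumption about available software, not the method of the original text) with the data from Table 6.7.

```python
from scipy import stats  # assumes SciPy is available

# Data from Section 3.4 / Table 6.7.
x = [1, 2, 3, 4, 5]
y = [1.00, 2.00, 1.30, 3.75, 2.25]

fit = stats.linregress(x, y)                    # slope about 0.425, stderr about 0.305
t_crit = stats.t.ppf(0.975, df=len(x) - 2)      # 3.182 for 3 degrees of freedom

print(fit.slope - t_crit * fit.stderr,          # roughly -0.55
      fit.slope + t_crit * fit.stderr)          # roughly 1.40
```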

6.3.3.1 Assumptions for Statistical Inference from Regression

Although no assumptions were needed to determine the best-fitting straight line, assumptions are made in the calculation of inferential statistics. Naturally, these assumptions refer to the population, not the sample.

  1. Linearity: The relationship between the two variables is linear.

  2. Homoscedasticity: The variance around the regression line is the same for all values of X. A clear violation of this assumption is shown in Figure 6.12. Notice that the predictions for students with high high-school GPAs are very good, whereas the predictions for students with low high-school GPAs are not very good. In other words, the points for students with high high-school GPAs are close to the regression line, whereas the points for low high-school GPA students are not.

  3. The errors of prediction are distributed normally. This means that the deviations from the regression line are normally distributed. It does not mean that X or Y is normally distributed.

Figure 6.12: University GPA as a function of High School GPA.

We will return to the topic of model assumptions and the consequences of violating them in Chapter 11.

Chapter 6 Appendix I: Degrees of Freedom12

Some estimates are based on more information than others. For example, an estimate of the variance based on a sample size of 100 is based on more information than an estimate of the variance based on a sample size of 5. The degrees of freedom (df) of an estimate is the number of independent pieces of information on which the estimate is based.

As an example, let’s say that we know that the mean height of Martians is 6 and wish to estimate the variance of their heights. We randomly sample one Martian and find that its height is 8. Recall that the variance is defined as the mean squared deviation of the values from their population mean. We can compute the squared deviation of our value of 8 from the population mean of 6 to find a single squared deviation from the mean. This single squared deviation from the mean, \((8-6)^2 = 4\), is an estimate of the mean squared deviation for all Martians. Therefore, based on this sample of one, we would estimate that the population variance is 4. This estimate is based on a single piece of information and therefore has 1 df. If we sampled another Martian and obtained a height of 5, then we could compute a second estimate of the variance, \((5-6)^2 = 1\). We could then average our two estimates (4 and 1) to obtain an estimate of 2.5. Since this estimate is based on two independent pieces of information, it has two degrees of freedom. The two estimates are independent because they are based on two independently and randomly selected Martians. The estimates would not be independent if after sampling one Martian, we decided to choose its brother as our second Martian.

As you are probably thinking, it is pretty rare that we know the population mean when we are estimating the variance. Instead, we have to first estimate the population mean (\(\mu\)) with the sample mean (\(\bar{X}\)). The process of estimating the mean affects our degrees of freedom as shown below.

Returning to our problem of estimating the variance in Martian heights, let’s assume we do not know the population mean and therefore we have to estimate it from the sample. We have sampled two Martians and found that their heights are 8 and 5. Therefore \(\bar{X}\), our estimate of the population mean, is

\[ \bar{X} = (8+5)/2 = 6.5. \]

We can now compute two estimates of variance:

\[ \text{Estimate 1} = (8-6.5)^2 = 2.25 \]

\[ \text{Estimate 2} = (5-6.5)^2 = 2.25 \]

Now for the key question: Are these two estimates independent? The answer is no because each height contributed to the calculation of \(\bar{X}\). Since the first Martian’s height of 8 influenced \(\bar{X}\), it also influenced Estimate 2. If the first height had been, for example, 10, then \(\bar{X}\) would have been 7.5 and Estimate 2 would have been \((5-7.5)^2 = 6.25\) instead of 2.25. The important point is that the two estimates are not independent and therefore we do not have two degrees of freedom. Another way to think about the non-independence is to consider that if you knew the mean and one of the scores, you would know the other score. For example, if one score is 5 and the mean is 6.5, you can compute that the total of the two scores is 13 and therefore that the other score must be 13 - 5 = 8.

In general, the degrees of freedom for an estimate is equal to the number of values minus the number of parameters estimated en route to the estimate in question. In the Martians example, there are two values (8 and 5) and we had to estimate one parameter (\(\mu\)) on the way to estimating the parameter of interest (\(\sigma^2\)). Therefore, the estimate of variance has 2 - 1 = 1 degree of freedom. If we had sampled 12 Martians, then our estimate of variance would have had 11 degrees of freedom. Therefore, the degrees of freedom of an estimate of variance is equal to n - 1, where n is the number of observations.

Recall from Section 2.3.4 that the formula for estimating the variance in a sample is:

\[ s^2=\frac{\sum{(X-\bar{X})^2}}{n-1} \]

The denominator of this formula is the degrees of freedom.
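
The following sketch illustrates the point with the two Martian heights: with the population mean known, the two squared deviations are averaged over 2; with the mean estimated from the sample, the sum of squared deviations is divided by n - 1 = 1, which is exactly what the sample variance routine in Python's statistics module does.

```python
import statistics

heights = [8, 5]

# Population mean known (6): each squared deviation is an independent estimate,
# so the estimate based on both values averages over 2 pieces of information.
known_mu_estimate = sum((h - 6) ** 2 for h in heights) / 2   # 2.5

# Mean estimated from the sample (6.5): one degree of freedom is lost, so the
# sum of squared deviations is divided by n - 1 = 1.
print(statistics.variance(heights))                          # 4.5 = (2.25 + 2.25) / 1
print(known_mu_estimate)                                     # 2.5
```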

So far, we’ve mainly seen examples where the degrees of freedom is equal to n - 1. But as we saw in Section 6.3.3, the degrees of freedom can also be equal to other values such as n - 2. In the case of constructing a confidence interval for a coefficient from a multiple regression with four independent variables, the degrees of freedom will generally be equal to n - 5. The exact formula for degrees of freedom depends on what we are estimating.

Chapter 6 Appendix II: Estimating the Standard Error of a Regression Slope13

This appendix shows how to compute the estimated standard error for the slope of a simple linear regression. The estimated standard error of \(\beta\) is computed using the following formula:

\[ s_{\beta} = \frac{s_{est}}{\sqrt{SSX}} \]

where \(s_{\beta}\) is the estimated standard error of \(\beta\), \(s_{est}\) is the standard error of the estimate, and \(SSX\) is the sum of squared deviations of \(X\) from the mean of \(X\). \(SSX\) is calculated as:

\[ SSX = \sum{(X-\bar{X})^2} \]

where \(\bar{X}\) is the mean of \(X\). The standard error of the estimate can be calculated as:

\[ s_{est} = \sqrt{\frac{(1-r^2)SSY}{n-2}} \]

where \(r\) is the correlation between \(X\) and \(Y\), and \(SSY\) is the sum of squared deviations of \(Y\) from the mean of \(Y\).

These formulas are illustrated with the data shown in Table 6.7. These data are reproduced from Section 3.4. The column X has the values of the independent variable and the column Y has the values of the dependent variable. The third column, x, contains the differences between the values of column X and the mean of X. The fourth column, \(x^2\), is the square of the x column. The fifth column, y, contains the differences between the values of column Y and the mean of Y. The last column, \(y^2\), is simply the square of the y column.

Table 6.7: Example data.
X Y x x^2 y y^2
1.00 1.00 -2.00 4 -1.06 1.1236
2.00 2.00 -1.00 1 -0.06 0.0036
3.00 1.30 0.00 0 -0.76 0.5776
4.00 3.75 1.00 1 1.69 2.8561
5.00 2.25 2.00 4 0.19 0.0361
Sum 15.00 10.30 0.00 10.00 0.00 4.5970

\(SSY\) is the sum of squared deviations from the mean of Y. It is, therefore, equal to the sum of the \(y^2\) column and is equal to 4.597. The correlation (\(r\)) between \(X\) and \(Y\) is 0.6268, and there are 5 observations (n=5). Thus, the standard error of the estimate is:

\[ s_{est} = \sqrt{\frac{(1-(0.6268)^2)(4.597)}{5-2}} = \sqrt{\frac{2.791}{3}} = 0.964 \]

\(SSX\) can be found as the sum of the \(x^2\) column and is equal to 10.

We now have all the information to compute the standard error of \(\beta\):

\[ s_{\beta} = \frac{0.964}{\sqrt{10}} = 0.305 \]
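
The whole computation can be carried out step by step from the formulas above; the following sketch reproduces \(r\), \(s_{est}\), and \(s_{\beta}\) for the data in Table 6.7.

```python
# Step-by-step computation of the estimated standard error of the slope,
# following the formulas above with the data from Table 6.7.
x = [1.00, 2.00, 3.00, 4.00, 5.00]
y = [1.00, 2.00, 1.30, 3.75, 2.25]
n = len(x)

mx = sum(x) / n
my = sum(y) / n
ssx = sum((xi - mx) ** 2 for xi in x)                      # 10.0
ssy = sum((yi - my) ** 2 for yi in y)                      # 4.597
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
r = sxy / (ssx * ssy) ** 0.5                               # 0.6268

s_est = ((1 - r ** 2) * ssy / (n - 2)) ** 0.5              # 0.964
s_beta = s_est / ssx ** 0.5                                # 0.305
print(r, s_est, s_beta)
```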


  1. This subsection is adapted from David M. Lane. “Introduction to Sampling Distributions.” Online Statistics Education: A Multimedia Course of Study. https://onlinestatbook.com/2/sampling_distributions/intro_samp_dist.html↩︎

  2. This subsection is adapted from David M. Lane. “Sampling Distribution of the Mean.” Online Statistics Education: A Multimedia Course of Study. https://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html↩︎

  3. This expression can be derived very easily from the variance sum law. Let’s begin by computing the variance of the sampling distribution of the sum of three numbers sampled from a population with variance \(\sigma^2\). The variance of the sum would be \(\sigma^2 + \sigma^2 + \sigma^2\). For \(n\) numbers, the variance would be \(n\sigma^2\). Since the mean is \(1/n\) times the sum, the variance of the sampling distribution of the mean would be \(1/n^2\) times the variance of the sum, which equals \(\sigma^2/n\).↩︎

  4. This subsubsection is adapted from David M. Lane. “Confidence Interval on the Mean.” Online Statistics Education: A Multimedia Course of Study. https://onlinestatbook.com/2/estimation/mean.html↩︎

  5. https://onlinestatbook.com/2/calculators/normal_dist.html↩︎

  6. https://onlinestatbook.com/2/calculators/inverse_t_dist.html↩︎

  7. https://onlinestatbook.com/2/case_studies/stroop.html↩︎

  8. This section is adapted from David M. Lane. “t Distribution.” Online Statistics Education: A Multimedia Course of Study. https://onlinestatbook.com/2/estimation/t_distribution.html↩︎

  9. https://onlinestatbook.com/2/calculators/inverse_t_dist.html↩︎

  10. https://onlinestatbook.com/2/calculators/t_dist.html↩︎

  11. This section is adapted from David M. Lane. “Inferential Statistics for b and r.” Online Statistics Education: A Multimedia Course of Study. https://onlinestatbook.com/2/regression/inferent↩︎

  12. This section is adapted from David M. Lane. “Degrees of Freedom.” Online Statistics Education: A Multimedia Course of Study. https://onlinestatbook.com/2/estimation/df.html↩︎

  13. This section is adapted from David M. Lane. “Inferential Statistics for b and r.” Online Statistics Education: A Multimedia Course of Study. https://onlinestatbook.com/2/regression/inferent↩︎