Suppose now that for percentages the percentiles are determined for a specified population distribution whose plausibility is being investigated. If the sample was actually selected from the specified distribution, the sample percentiles (ordered sample observations) should be reasonably close to the corresponding population distribution percentiles. That is, for there should be reasonable agreement between the th smallest sample observation and the th percentile for the specified distribution. Let’s consider the (population percentile, sample percentile) pairs-that is, the pairs

for . Each such pair can be plotted as a point on a two-dimensional coordinate system. If the sample percentiles are close to the corresponding population distribution percentiles, the first number in each pair will be roughly equal to the second number. The plotted points will then fall close to a line. Substantial deviations of the plotted points from a line cast doubt on the assumption that the distribution under consideration is the correct one.

Example 4.29 The value of a certain physical constant is known to an experimenter. The experimenter makes independent measurements of this value using a particular measurement device and records the resulting measurement errors (error observed value - true value). These observations appear in the accompanying table.

Percentage515253545
( z ) percentile-1.645-1.037-0.675-0.385-0.126
Sample observation-1.91-1.25-0.75-0.530.20
Percentage5565758595
( z ) percentile0.1260.3850.6751.0371.645
Sample observation0.350.720.871.401.56

Is it plausible that the random variable measurement error has a standard normal distribution? The needed standard normal percentiles are also displayed in the table. Thus the points in the probability plot are , and . Figure 4.33 shows the resulting plot. Although the points deviate a bit from the line, the predominant impression is that this line fits the points very well. The plot suggests that the standard normal distribution is a reasonable probability model for measurement error.

Figure 4.33 Plot of pairs ( percentile, observed value) for the data of Example 4.29 01925166-48c0-7eca-9860-67f13d0848b1_45_708_481_747_655_0.jpg

Figure 4.34 shows a plot of pairs ( percentile, observation) for a second sample of ten observations. The line gives a good fit to the middle part of the sample but not to the extremes. The plot has a well-defined S-shaped appearance. The two smallest sample observations are considerably larger than the corresponding , percentiles (the points on the far left of the plot are well above the line). Similarly, the two largest sample observations are much smaller than the associated percentiles. This plot indicates that the standard normal distribution would not be a plausible choice for the probability model that gave rise to these observed measurement errors.

Figure 4.34 Plots of pairs ( percentile, observed value) for the scenario of Example 4.29: second sample 01925166-48c0-7eca-9860-67f13d0848b1_45_701_1546_757_544_0.jpg

An investigator is typically not interested in knowing just whether a particular probability distribution, such as the standard normal distribution (normal with and ) or the exponential distribution with , is a plausible model for the population distribution from which the sample was selected. Instead, the issue is whether some member of a family of probability distributions specifies a plausible model-the family of normal distributions, the family of exponential distributions, the family of Weibull distributions, and so on. The values of the parameters of a distribution are usually not specified at the outset. If the family of Weibull distributions is under consideration as a model for lifetime data, are there any values of the parameters and for which the corresponding Weibull distribution gives a good fit to the data? Fortunately, it is frequently the case that just one probability plot will suffice for assessing the plausibility of an entire family. If the plot deviates substantially from a straight line, no member of the family is plausible. When the plot is quite straight, further work is necessary to estimate values of the parameters that yield the most reasonable distribution of the specified type.

Let’s focus on a plot for checking normality. Such a plot is useful in applied work because many formal statistical procedures give accurate inferences only when the population distribution is at least approximately normal. These procedures should generally not be used if the normal probability plot shows a very pronounced departure from linearity. The key to constructing an omnibus normal probability plot is the relationship between standard normal percentiles and those for any other normal distribution:

Consider first the case . If each observation is exactly equal to the corresponding normal percentile for some value of , the pairs percentile], observation) fall on a line, which has slope 1 . This then implies that the ( percentile, observation) pairs fall on a line passing through (i.e., one with -intercept 0) but having slope rather than 1 . The effect of a nonzero value of is simply to change the -intercept from 0 to .

Definition

A plot of the pairs

is called a normal probability plot. If the sample observations are in fact drawn from a normal distribution with mean value and standard deviation , the points should fall close to a straight line with slope and intercept . Thus a plot for which the points fall close to some straight line suggests that the assumption of a normal population distribution is plausible.

EX 4.30

There is an alternative version of a normal probability plot in which the percentile axis is replaced by a nonlinear percentage axis. The scaling on this axis is constructed so that plotted points should again fall close to a line when the sampled distribution is normal. Figure 4.36 shows such a plot from Minitab for the ratio data of Example 4.30. (The last two numbers in the small box on the right will be explained in Chapter 14.)

Figure 4.36 Normal probability plot of the ratio data from Minitab 01925166-48c0-7eca-9860-67f13d0848b1_47_523_1581_1117_586_0.jpg

A nonnormal population distribution can often be placed in one of the following three categories:

  1. It is symmetric and has “lighter tails” than does a normal distribution; that is, the density curve declines more rapidly out in the tails than does a normal curve.
  2. It is symmetric and heavy-tailed compared to a normal distribution.
  3. It is skewed.

A uniform distribution is light-tailed, since its density function drops to zero outside a finite interval. The Cauchy density function for is heavy-tailed, since declines much less rapidly than does . Lognormal and Weibull distributions are among those that are skewed. When the points in a normal probability plot do not adhere to a straight line, the pattern will frequently suggest that the population distribution is in a particular one of these three categories.

The largest and smallest observations in a sample from a light-tailed distribution are usually not as extreme as would be expected from a normal random sample. Visualize a straight line drawn through the middle part of the plot; points on the far right tend to be below the line (observed value percentile), whereas points on the left end of the plot tend to fall above the straight line (observed value percentile). The result is an -shaped pattern of the type pictured in Figure 4.34.

A sample from a heavy-tailed distribution also tends to produce an S-shaped plot. However, in contrast to the light-tailed case, the left end of the plot curves downward (observed percentile), as shown in Figure 4.37(a). If the underlying distribution is positively skewed (a short left tail and a long right tail), the smallest sample observations will be larger than expected from a normal sample and so will the largest observations. In this case, points on both ends of the plot will fall above a straight line through the middle part, yielding a curved pattern, as illustrated in Figure 4.37(b). A sample from a lognormal distribution will usually produce such a pattern. A plot of pairs should then resemble a straight line.

Figure 4.37 Probability plots that suggest a nonnormal distribution: (a) a plot consistent with a heavy-tailed distribution; (b) a plot consistent with a positively skewed distribution 01925166-48c0-7eca-9860-67f13d0848b1_48_413_1397_1341_390_0.jpg

Even when the population distribution is normal, the sample percentiles will not coincide exactly with the theoretical percentiles because of sampling variability. How much can the points in the probability plot deviate from a straight-line pattern before the assumption of population normality is no longer plausible? This is not an easy question to answer. Generally speaking, a small sample from a normal distribution is more likely to yield a plot with a nonlinear pattern than is a large sample. The book Fitting Equations to Data (see the Chapter 13 bibliography) presents the results of a simulation study in which numerous samples of different sizes were selected from normal distributions. The authors concluded that there is typically greater variation in the appearance of the probability plot for sample sizes smaller than 30 , and only for much larger sample sizes does a linear pattern generally predominate. When a plot is based on a small sample size, only a very substantial departure from linearity should be taken as conclusive evidence of nonnormality. A similar comment applies to probability plots for checking the plausibility of other types of distributions.