4.6.1 Sample Percentiles

The details involved in constructing probability plots differ a bit from source to source. The basis for our construction is a comparison between percentiles of the sample data and the corresponding percentiles of the distribution under consideration. Recall that the $(100p)$th percentile of a continuous distribution with cdf $F(\cdot)$ is the number $\eta(p)$ that satisfies $F(\eta(p)) = p$. That is, $\eta(p)$ is the number on the measurement scale such that the area under the density curve to the left of $\eta(p)$ is $p$. Thus the 50th percentile $\eta(.5)$ satisfies $F(\eta(.5)) = .5$, and the 90th percentile satisfies $F(\eta(.9)) = .9$. Consider as an example the standard normal distribution, for which we have denoted the cdf by $\Phi(\cdot)$. From Appendix Table A.3, we find the 20th percentile by locating the row and column in which .2000 (or a number as close to it as possible) appears inside the table. Since .2005 appears at the intersection of the $-.8$ row and the .04 column, the 20th percentile is approximately $-.84$. Similarly, the 25th percentile of the standard normal distribution is (using linear interpolation) approximately $-.675$.
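The table lookup above can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` is an acceptable stand-in for Appendix Table A.3; the helper name `normal_percentile` is ours, not the text's.

```python
from statistics import NormalDist

# Standard normal distribution; its cdf plays the role of Phi in the text.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def normal_percentile(p):
    """Return eta(p), the number satisfying Phi(eta(p)) = p (hypothetical helper)."""
    return std_normal.inv_cdf(p)

eta_20 = normal_percentile(0.20)  # 20th percentile
eta_25 = normal_percentile(0.25)  # 25th percentile

# Check the defining property F(eta(p)) = p by pushing the values back through the cdf.
print(eta_20, std_normal.cdf(eta_20))
print(eta_25, std_normal.cdf(eta_25))
```

The computed values should agree with the table-based approximations $-.84$ and $-.675$ to within the precision that a four-decimal table and linear interpolation allow.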

Roughly speaking, sample percentiles are defined in the same way that percentiles of a population distribution are defined. The 50th sample percentile should separate the smallest $50\%$ of the sample from the largest $50\%$, the 90th percentile should be such that $90\%$ of the sample lies below that value and $10\%$ lies above, and so on. Unfortunately, we run into problems when we actually try to compute the sample percentiles for a particular sample of $n$ observations. If, for example, $n = 10$, we can split off $20\%$ or $30\%$ of the data, but there is no value that will split off exactly $23\%$ of these ten observations. To proceed further, we need an operational definition of sample percentiles (this is one place where different people do slightly different things). Recall that when $n$ is odd, the sample median or 50th sample percentile is the middle value in the ordered list, for example, the sixth-largest value when $n = 11$. This amounts to regarding the middle observation as being half in the lower half of the data and half in the upper half. Similarly, suppose $n = 10$. Then if we call the third-smallest value the 25th percentile, we are regarding that value as being half in the lower group (consisting of the two smallest observations) and half in the upper group (the seven largest observations). This leads to the following general definition of sample percentiles.

Definition

Order the $n$ sample observations from smallest to largest. Then the $i$th smallest observation in the list is taken to be the $[100(i - .5)/n]$th sample percentile.
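As an illustration of the definition, the sketch below sorts a sample and pairs each ordered observation with its percentage $100(i - .5)/n$. The function names and the four data values are made up for this example and do not come from the text.

```python
def sample_percentages(n):
    """Percentages 100(i - .5)/n for i = 1, ..., n."""
    return [100 * (i - 0.5) / n for i in range(1, n + 1)]

def ordered_with_percentages(sample):
    """Pair the i-th smallest observation with its sample percentage."""
    ordered = sorted(sample)
    return list(zip(sample_percentages(len(ordered)), ordered))

# Hypothetical observations, used only to show the bookkeeping.
observations = [27.3, 19.8, 31.0, 24.1]

for pct, x in ordered_with_percentages(observations):
    print(f"{x} is taken to be the {pct}th sample percentile")
```

With $n = 4$ the smallest observation is assigned the 12.5th percentile rather than the 25th, which reflects the convention of counting each observation as half below and half above itself.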

Once the percentage values $100(i - .5)/n$ $(i = 1, 2, \dots, n)$ have been calculated, sample percentiles corresponding to intermediate percentages can be obtained by linear interpolation. For example, if $n = 10$, the percentages corresponding to the ordered sample observations are $100(1 - .5)/10 = 5\%$, $100(2 - .5)/10 = 15\%$, $25\%, \dots$, and $100(10 - .5)/10 = 95\%$. The 10th percentile is then halfway between the 5th percentile (smallest sample observation) and the 15th percentile (second-smallest observation). For our purposes, such interpolation is not necessary because a probability plot will be based only on the percentages $100(i - .5)/n$ corresponding to the $n$ sample observations.
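Although interpolation is not needed for probability plots, the rule just described is easy to sketch. The function `sample_percentile` and the ten data values below are hypothetical. Rejecting percentages outside the range covered by the sample is our assumption, since the text does not say how to handle them.

```python
def sample_percentile(sample, pct):
    """Linearly interpolate between the percentages 100(i - .5)/n of the ordered sample."""
    ordered = sorted(sample)
    n = len(ordered)
    pcts = [100 * (i - 0.5) / n for i in range(1, n + 1)]

    # Assumption: no extrapolation below the smallest or above the largest percentage.
    if pct < pcts[0] or pct > pcts[-1]:
        raise ValueError("percentage lies outside the range covered by the sample")

    # Find the pair of adjacent percentages that brackets pct and interpolate.
    for k in range(n - 1):
        lo, hi = pcts[k], pcts[k + 1]
        if lo <= pct <= hi:
            frac = (pct - lo) / (hi - lo)
            return ordered[k] + frac * (ordered[k + 1] - ordered[k])

    # Only reached when n == 1, where the single observation is the 50th percentile.
    return ordered[-1]

# Hypothetical sample with n = 10, so the percentages are 5, 15, ..., 95.
data = [2.0, 4.0, 7.0, 8.0, 11.0, 13.0, 14.0, 18.0, 21.0, 25.0]

tenth = sample_percentile(data, 10)  # halfway between the two smallest observations
print(tenth)
```

For this sample the 10th percentile falls midway between the smallest observation (5th percentile) and the second-smallest (15th percentile), matching the worked example in the paragraph above.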

Youliang Zhong
