Percentile - WikiHQ

In statistics, a percentile or percentile score, also known as a centile (often denoted as <math>P_k</math> or Pk, for a given percentage k), is a score (data point value) that is greater than a given percentage k of all scores in a sample. That is, a score in the k-th percentile is above approximately k% of all scores in its sample. For example, the 97th percentile (P97) is the value such that 97% of the data points are less than it. Calculating percentiles requires sorting the scores.

Percentiles are a type of quantile, obtained by a subdivision into 100 groups. The 25th percentile (P25) is also known as the first quartile (Q1), the 50th percentile (P50) as the median or second quartile (Q2), and the 75th percentile (P75) as the third quartile (Q3). For example, the 50th percentile (median) is the score below (or at or below, depending on the definition) which 50% of the scores in the distribution are found.

Percentiles are expressed in the same unit of measurement as the input scores, not in percent; for example, if the scores refer to human weight, the corresponding percentiles will be expressed in kilograms or pounds.

In the limit of an infinite sample size, the sample percentile approaches the percentile function, which is the inverse of the cumulative distribution function.

A related quantity is the percentile rank of a given score, expressed as a percentage, which represents the fraction of scores in its distribution that are less than it, an exclusive definition.

Percentile scores and percentile ranks are often used in the reporting of test scores from norm-referenced tests, but, as just noted, they are not the same. For percentile ranks, a score is given and a percentage is computed. Percentile ranks are exclusive: if the percentile rank for a specified score is 90%, then 90% of the scores were lower. In contrast, for percentiles a percentage is given and a corresponding score is determined, which can be either exclusive or inclusive. The score for a specified percentage (e.g., 90th) indicates a score below which (exclusive definition) or at or below which (inclusive definition) other scores in the distribution fall.
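
The exclusive percentile rank described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `percentile_rank` and the sample scores are hypothetical and do not come from any particular library.

```python
def percentile_rank(scores, score):
    """Exclusive percentile rank: percentage of scores strictly below `score`."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Count only scores that are strictly lower (exclusive definition).
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)


# Hypothetical test scores, used only for illustration.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(scores, 95))  # 80.0: eight of the ten scores are lower
```

Here a score is given and a percentage is returned. Going the other way, from a percentage to a score, requires choosing one of the percentile definitions.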

Definitions

There is no standard definition of percentile; however, all definitions yield similar results when the number of observations is very large and the probability distribution is continuous. In the limit, as the sample size approaches infinity, the 100pth percentile (0<p<1) approximates the inverse of the cumulative distribution function (CDF) thus formed, evaluated at p, as p approximates the CDF. This can be seen as a consequence of the Glivenko–Cantelli theorem. Some methods for calculating the percentiles are given below.

In the normal distribution

Figure: Representation of the three-sigma rule (68–95–99.7 rule). The dark blue zone represents observations within one standard deviation (σ) to either side of the mean (μ), which accounts for about 68.3% of the population. Two standard deviations from the mean (dark and medium blue) account for about 95.4%, and three standard deviations (dark, medium, and light blue) for about 99.7%.

The methods given in the calculation methods section (below) are approximations for use in small-sample statistics. In general terms, for very large populations following a normal distribution, percentiles may often be represented by reference to a normal curve plot. The normal distribution is plotted along an axis scaled to standard deviations, or sigma (<math>\sigma</math>) units. Mathematically, the normal distribution extends to negative infinity on the left and positive infinity on the right. Note, however, that only a very small proportion of individuals in a population will fall outside the −3σ to +3σ range. For example, with human heights very few people are above the +3σ height level.

Percentiles represent the area under the normal curve, increasing from left to right. Each standard deviation represents a fixed percentile. Thus, rounding to two decimal places, −3σ is the 0.13th percentile, −2σ the 2.28th percentile, −1σ the 15.87th percentile, 0σ the 50th percentile (both the mean and median of the distribution), +1σ the 84.13th percentile, +2σ the 97.72nd percentile, and +3σ the 99.87th percentile. This is related to the 68–95–99.7 rule or the three-sigma rule. Note that in theory the 0th percentile falls at negative infinity and the 100th percentile at positive infinity, although in many practical applications, such as test results, natural lower and/or upper limits are enforced.
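
The correspondence between standard deviations and percentiles can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper names `sigma_to_percentile` and `percentile_to_sigma` are illustrative and not part of that module.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def sigma_to_percentile(z):
    """Percentile (0 to 100) reached at z standard deviations from the mean."""
    return 100.0 * standard_normal.cdf(z)


def percentile_to_sigma(percentile):
    """Inverse mapping; only defined for percentiles strictly between 0 and 100."""
    return standard_normal.inv_cdf(percentile / 100.0)


for z in (-3, -2, -1, 0, 1, 2, 3):
    print(f"{z:+d} sigma -> percentile {sigma_to_percentile(z):.2f}")
```

The inverse function rejects 0 and 100 because, for the normal distribution, those percentiles lie at negative and positive infinity.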

Applications

When Internet service providers bill "burstable" Internet bandwidth, the 95th or 98th percentile usually cuts off the top 5% or 2% of bandwidth peaks in each month, and the customer is then billed at the nearest rate. In this way, infrequent peaks are ignored, and the customer is charged in a fairer way. This statistic is useful for measuring data throughput because it gives a realistic picture of the cost of the bandwidth. The 95th percentile says that 95% of the time the usage is below this amount, so the remaining 5% of the time the usage is above it.
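
As a rough sketch of the burstable billing idea, the following Python snippet sorts a month of usage samples, discards the top peaks, and reports the highest remaining sample. The function name `billable_rate`, the sample data, and the exact cut-off rule (keeping the lowest floor(N × percentile / 100) samples) are assumptions made for illustration; actual providers may sample and round differently.

```python
import math


def billable_rate(samples_mbps, percentile=95):
    """Drop the top (100 - percentile)% of samples; return the highest one left."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Number of samples kept after discarding the peaks (always at least one).
    keep = max(1, math.floor(len(ordered) * percentile / 100))
    return ordered[keep - 1]


# Hypothetical month: 100 samples, mostly 10 Mbit/s with five short bursts.
samples = [10] * 95 + [200, 300, 400, 500, 600]
print(billable_rate(samples))      # 10: all five bursts fall in the top 5%
print(billable_rate(samples, 98))  # 400: only the top two bursts are ignored
```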

Physicians often use the weight and height of infants and children to assess their growth in comparison to national medians and other percentiles, which are found in growth charts.

The 85th percentile speed of traffic on a road is often used as a guideline in setting speed limits and assessing whether such a limit is too high or low.

In finance, value at risk is a standard measure to assess (in a model-dependent way) the quantity under which the value of the portfolio is not expected to sink within a given period of time and given a confidence value.

Calculation methods

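There are many formulas or algorithms for computing a percentile score. Hyndman and Fan identified nine, and most statistical and spreadsheet software uses one of the methods they describe.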

First variant, C = 1/2

: <math>x=f(p)=\begin{cases}
Np+\frac{1}{2}, & \forall p\in\left[p_1,p_N\right], \\
1, & \forall p\in\left[0,p_1\right], \\
N, & \forall p\in\left[p_N,1\right].
\end{cases}</math>

where

: <math>p_i=\frac{1}{N}\left(i-\frac{1}{2}\right),i\in[1,N]\cap\mathbb{N}</math>

: <math>\therefore p_1=\frac{1}{2N}, p_N=\frac{2N-1}{2N}.</math>

Furthermore, let

: <math>P_i=100p_i.</math>

The inverse relationship is restricted to a narrower region:

: <math>p=\frac{1}{N}\left(x-\frac{1}{2}\right),x\in(1,N)\cap\mathbb{R}.</math>
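
The mapping above can be sketched in Python as follows. The sketch treats x as a 1-based, possibly fractional rank into the sorted data and assumes that the percentile value is obtained by linear interpolation between the two neighbouring sorted values. That interpolation step, along with all function names and the sample data, is an assumption made for illustration and is not taken from the formulas above.

```python
import math


def rank_first_variant(p, n):
    """1-based fractional rank x = N*p + 1/2, clamped to the interval [1, N]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return min(max(n * p + 0.5, 1.0), float(n))


def p_from_rank(x, n):
    """Inverse relationship p = (x - 1/2) / N, intended for x strictly between 1 and N."""
    return (x - 0.5) / n


def percentile_first_variant(data, percent):
    """Percentile value, assuming linear interpolation between adjacent ranks."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("data must not be empty")
    x = rank_first_variant(percent / 100.0, n)
    lower = math.floor(x)      # integer part of the rank
    frac = x - lower           # fractional part of the rank
    if lower >= n:             # rank N: nothing above to interpolate with
        return float(ordered[-1])
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


data = [15, 20, 35, 40, 50]    # hypothetical sample, N = 5
print(percentile_first_variant(data, 40))  # 27.5, since x = 5 * 0.4 + 0.5 = 2.5
```

Clamping with `min` and `max` reproduces the three cases of the piecewise definition, because Np + 1/2 is at most 1 for p ≤ p_1 and at least N for p ≥ p_N.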

Second variant, C = 1

[Source: Some software packages, including NumPy and Microsoft Excel]

: <math>x = f(p,N) = p(N-1)+1 \text{, } p\in[0,1]</math>

: <math>\therefore p = \frac{x-1}{N-1} \text{, } x\in[1,N].</math>

Note that the <math>x\leftrightarrow p</math> relationship is one-to-one for <math>p\in[0,1]</math>, the only one of the three variants with this property; hence the "INC" suffix, for inclusive, on the Excel function.
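
A minimal Python sketch of this variant is given below. As before, x is read as a 1-based fractional rank, and the percentile value is assumed to come from linear interpolation between adjacent sorted values; the function name and sample data are illustrative only. For comparison, the standard library function `statistics.quantiles` with `method='inclusive'` should give the same quartiles on this data.

```python
import math
from statistics import quantiles


def percentile_second_variant(data, percent):
    """Percentile using rank x = p*(N - 1) + 1 with linear interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("data must not be empty")
    p = percent / 100.0
    if not 0.0 <= p <= 1.0:
        raise ValueError("percent must lie in [0, 100]")
    x = p * (n - 1) + 1        # 1-based fractional rank in [1, N]
    lower = math.floor(x)
    frac = x - lower
    if lower >= n:             # p = 1 maps exactly onto the largest value
        return float(ordered[-1])
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


data = [15, 20, 35, 40, 50]    # hypothetical sample, N = 5
print(f"{percentile_second_variant(data, 40):.2f}")  # 29.00, since x = 0.4 * 4 + 1 = 2.6

# Quartiles from this sketch next to the standard library's inclusive method.
print([percentile_second_variant(data, q) for q in (25, 50, 75)])
print(quantiles(data, n=4, method="inclusive"))
```

Because p = 0 maps to rank 1 and p = 1 maps to rank N, the smallest and largest observations are themselves the 0th and 100th percentiles under this variant.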

Third variant, C = 0

(The primary variant recommended by NIST.