thumb|upright=1.35|Box plot of data from the [[Michelson–Morley experiment#Michelson experiment (1881)|Michelson experiment]]
In descriptive statistics, a box plot or boxplot is a method for graphically demonstrating the locality, spread and skewness of groups of numerical data through their quartiles.
thumb|A box plot representing data|342x342px
In addition to the box, a box plot can have lines (called whiskers) extending from the box that indicate variability outside the upper and lower quartiles; the plot is therefore also called a box-and-whisker plot or a box-and-whisker diagram. Outliers that differ significantly from the rest of the dataset may be plotted as individual points beyond the whiskers. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution (though Tukey's box plot assumes symmetry for the whiskers and normality for their length).
The spacings in each subsection of the box plot indicate the degree of dispersion (spread) and skewness of the data, which are usually described using the five-number summary. In addition, the box plot allows one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically.
History
The range-bar method was first introduced by Mary Eleanor Spear in her book "Charting Statistics" in 1952 and again in her book "Practical Charting Techniques" in 1969. The box-and-whisker plot was first introduced in 1970 by John Tukey, who later published on the subject in his book "Exploratory Data Analysis" in 1977.
Elements
thumb|Box plot with whiskers from minimum to maximum
thumb|The same box plot with whiskers drawn within the 1.5 IQR value
A box plot is a standardized way of displaying the dataset based on the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.
- Minimum (Q<sub>0</sub> or 0th percentile): the lowest data point in the data set excluding any outliers
- Maximum (Q<sub>4</sub> or 100th percentile): the highest data point in the data set excluding any outliers
- Median (Q<sub>2</sub> or 50th percentile): the middle value in the data set
- First quartile (Q<sub>1</sub> or 25th percentile): also known as the lower quartile q<sub>n</sub>(0.25), it is the median of the lower half of the dataset
- Third quartile (Q<sub>3</sub> or 75th percentile): also known as the upper quartile q<sub>n</sub>(0.75), it is the median of the upper half of the dataset
Besides the five values above, another quantity commonly used to construct a box plot is the interquartile range (IQR):
- Interquartile range (IQR): the distance between the upper and lower quartiles
:: <math>\text{IQR} = Q_3 - Q_1 = q_n(0.75) - q_n(0.25)</math>
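The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the "median of the lower half / upper half" rule given in the list (other quartile conventions exist and give slightly different values). The function names `five_number_summary` and `interquartile_range` are hypothetical, and this sketch reports the raw minimum and maximum without excluding outliers.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = xs[:half]            # observations below the middle position
    upper = xs[half + n % 2:]    # observations above it (middle value skipped when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def interquartile_range(data):
    """IQR = Q3 - Q1."""
    _, q1, _, q3, _ = five_number_summary(data)
    return q3 - q1
```

For the data 1, 2, ..., 8 this gives Q<sub>1</sub> = 2.5, a median of 4.5, Q<sub>3</sub> = 6.5 and an IQR of 4.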
A box plot usually includes two parts, a box and a set of whiskers.
Box
The box is drawn from Q<sub>1</sub> to Q<sub>3</sub> with a line drawn inside it to denote the median (a horizontal line when the box plot is drawn vertically). Some box plots include an additional marker to represent the mean of the data.
Whiskers
The whiskers must end at an observed data point, but can be defined in various ways. Because of this variability, it is appropriate to state the convention used for the whiskers and outliers in the caption of the box plot. In the most straightforward method, the lower whisker ends at the minimum value of the data set and the upper whisker ends at the maximum value of the data set.
Another popular choice for the boundaries of the whiskers is based on the 1.5 IQR value. From above the upper quartile (Q<sub>3</sub>), a distance of 1.5 times the IQR is measured out and a whisker is drawn up to the largest observed data point from the dataset that falls within this distance. Similarly, a distance of 1.5 times the IQR is measured out below the lower quartile (Q<sub>1</sub>) and a whisker is drawn down to the lowest observed data point from the dataset that falls within this distance. Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides. All other observed data points outside the boundary of the whiskers are plotted as outliers. The outliers can be plotted on the box plot as a dot, a small circle, a star, etc. (see example below).
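The 1.5 IQR convention can be sketched as follows. This is an illustrative implementation only: the function name `tukey_whiskers` and the parameter `k` are hypothetical, and the quartiles are computed with the median-of-halves rule, which is one of several conventions in use.

```python
from statistics import median

def tukey_whiskers(data, k=1.5):
    """Return (lower_whisker_end, upper_whisker_end, outliers) for the k * IQR rule."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    q1 = median(xs[:half])
    q3 = median(xs[half + n % 2:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr     # furthest a lower whisker may reach
    high_fence = q3 + k * iqr    # furthest an upper whisker may reach
    # Whiskers end at the most extreme observed points inside the fences.
    inside = [x for x in xs if low_fence <= x <= high_fence]
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return inside[0], inside[-1], outliers
```

For the data 1, 2, ..., 8, 50 the quartiles are 2.5 and 7.5, so the fences lie at -5 and 15. The whiskers therefore end at the observed values 1 and 8, and 50 is plotted as an outlier. The whiskers are unequal in length (1.5 below the box, 0.5 above it) even though the fence distance is the same on both sides.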
There are other representations in which the whiskers can stand for several other things, such as:
- One standard deviation above and below the mean of the data set
- The 9th percentile and the 91st percentile of the data set
- The 2nd percentile and the 98th percentile of the data set
Rarely, a box plot can be plotted without the whiskers. This can be appropriate for sensitive information, to avoid whiskers (and outliers) disclosing actual values observed.
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to depict the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be approximately equally spaced. On some box plots, a cross-hatch is placed before the end of each whisker.
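The spacing claim for normally distributed data can be checked numerically. The sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library (available from Python 3.8) to locate the seven percentiles of a normal distribution. The names `PERCENTILES` and `normal_seven_number_marks` are chosen here for illustration.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary described in the text.
PERCENTILES = (0.02, 0.09, 0.25, 0.50, 0.75, 0.91, 0.98)

def normal_seven_number_marks(mu=0.0, sigma=1.0):
    """Positions of the seven marks for a normal distribution N(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf(p) for p in PERCENTILES]

marks = normal_seven_number_marks()
# Distances between consecutive marks, in units of sigma.
gaps = [b - a for a, b in zip(marks, marks[1:])]
```

For the standard normal distribution the six gaps all come out close to two thirds of a standard deviation, so the marks are nearly, though not exactly, evenly spaced.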
Variations
thumb|upright=1.3|Four box plots, with and without notches and variable width
Since the mathematician John W. Tukey first popularized this type of visual data display in 1969, several variations on the classical box plot have been developed, and the two most commonly found variations are the variable-width box plots and the notched box plots.
Variable-width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.
Notched box plots apply a "notch" or narrowing of the box around the median. Notches offer a rough guide to the significance of the difference of medians: if the notches of two boxes do not overlap, this provides evidence of a statistically significant difference between the medians. The height of the notches is proportional to the interquartile range (IQR) of the sample and inversely proportional to the square root of the size of the sample. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).
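A rough sketch of the notch computation follows, with the notch taken as the median plus or minus a multiplier times IQR divided by the square root of the sample size. The default multiplier of 1.58 is an assumption: it is one commonly used value (for example in R's `boxplot.stats`), not something the text prescribes, and the function names are hypothetical.

```python
from math import sqrt
from statistics import median

def notch_interval(data, multiplier=1.58):
    """Return (lower, upper) notch limits: median -/+ multiplier * IQR / sqrt(n)."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    # Quartiles by the median-of-halves rule (one of several conventions).
    iqr = median(xs[half + n % 2:]) - median(xs[:half])
    half_height = multiplier * iqr / sqrt(n)
    med = median(xs)
    return med - half_height, med + half_height

def notches_overlap(a, b, multiplier=1.58):
    """True if the notch intervals of samples a and b overlap."""
    lo_a, hi_a = notch_interval(a, multiplier)
    lo_b, hi_b = notch_interval(b, multiplier)
    return lo_a <= hi_b and lo_b <= hi_a
```

When `notches_overlap` returns `False`, that is the situation the text describes as evidence of a difference between the medians. It is a visual heuristic rather than a formal test.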
Adjusted box plots are intended to describe skew distributions, and they rely on the medcouple statistic of skewness. For a medcouple value of MC, the lengths of the upper and lower whiskers on the box plot are respectively defined to be:
:<math>\begin{matrix}
1.5 \text{ IQR} \cdot e^{3 \text{MC}}, & 1.5 \text{ IQR} \cdot e^{-4 \text{MC}} \text{ if } \text{MC} \geq 0, \\
1.5 \text{ IQR} \cdot e^{4 \text{MC}}, & 1.5 \text{ IQR} \cdot e^{-3 \text{MC}} \text{ if } \text{MC} \leq 0.
\end{matrix}
</math>
For a symmetrical data distribution, the medcouple will be zero, and this reduces the adjusted box plot to Tukey's box plot with equal whisker lengths of <math>1.5 \text{ IQR}</math> for both whiskers.
Other kinds of box plots, such as violin plots and bean plots, can show the difference between single-modal and multimodal distributions, which cannot be observed from the original classical box plot.
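The whisker lengths defined above translate directly into code. In this minimal sketch the medcouple is assumed to be computed elsewhere and passed in as `mc`; the function name `adjusted_whisker_lengths` is hypothetical.

```python
from math import exp

def adjusted_whisker_lengths(iqr, mc):
    """Return (upper, lower) whisker lengths of the adjusted box plot.

    iqr: interquartile range of the sample
    mc:  medcouple of the sample (assumed to be computed by the caller)
    """
    if mc >= 0:
        upper = 1.5 * iqr * exp(3 * mc)
        lower = 1.5 * iqr * exp(-4 * mc)
    else:
        upper = 1.5 * iqr * exp(4 * mc)
        lower = 1.5 * iqr * exp(-3 * mc)
    return upper, lower
```

With `mc = 0` both lengths equal 1.5 IQR. A positive medcouple (right skew) lengthens the upper whisker and shortens the lower one, and a negative medcouple does the reverse.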
