Figure: A plot of the frequency of each word as a function of its frequency rank for two English-language texts, Culpeper's Complete Herbal (1652) and H. G. Wells's The War of the Worlds (1898), on a log-log scale. The dashed line is the ideal law <math display="inline" alt="y is proportional to the inverse of x">y \propto \frac{1}{x}</math>.
Zipf's law is an empirical law stating that when a set of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.
The best-known instance of Zipf's law applies to the frequency distribution of words in a text or corpus of natural language:
<math display="block" alt="Word frequency is proportional to the inverse of the word rank">\ \mathsf{word\ frequency}\ \propto\ \frac{ 1 }{\ \mathsf{ word\ rank}\ } ~.</math>
It is usually found that the most common word occurs approximately twice as often as the next most common one, three times as often as the third most common, and so on. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852).
The same relation for frequencies of words in natural language texts was observed by E. Condon in 1928 and by George Zipf in 1932. The law has also been reported for other rankings, such as the number of people watching the same TV channel. In 1992, bioinformatician Wentian Li published a proof that randomly generated texts also exhibit a Zipf-like word frequency distribution.
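The rank-frequency relation described above can be checked on any text. The following is a minimal sketch, not a reference implementation: the function names `rank_frequency` and `zipf_prediction` and the simple tokenizer are assumptions made for illustration. It ranks words by count and compares the Brown Corpus counts quoted above with the ideal prediction of the top count divided by rank.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples sorted by decreasing count.

    The tokenizer is a deliberately crude assumption: lowercase runs of
    letters and apostrophes count as words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() orders by decreasing count; ranks start at 1
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


def zipf_prediction(top_count, rank):
    """Count expected at `rank` under the ideal law count ~ 1 / rank."""
    return top_count / rank


# Brown Corpus counts quoted in the text
brown = [("the", 69971), ("of", 36411), ("and", 28852)]
top = brown[0][1]
for rank, (word, count) in enumerate(brown, start=1):
    # print observed count next to the ideal 1/rank prediction
    print(word, count, round(zipf_prediction(top, rank)))
```

The observed counts for "of" and "and" lie somewhat above the ideal predictions, which is consistent with the law being approximate rather than exact.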
Formal definition
| pmf = <math>\frac{1/k^s}{H_{N,s}}</math> where H<sub>N,s</sub> is the Nth generalized harmonic number
| cdf = <math>\frac{H_{k,s}}{H_{N,s}}</math>
| mean = <math>\frac{H_{N,s-1}}{H_{N,s}}</math>
| median =
| mode = <math>1\,</math>
| variance = <math>\frac{H_{N,s-2}}{H_{N,s}}-\frac{H^2_{N,s-1}}{H^2_{N,s}}</math>
| skewness =
| kurtosis =
| entropy = <math>\frac{s}{H_{N,s}}\sum\limits_{k=1}^N\frac{\ln(k)}{k^s}+\ln(H_{N,s})</math>
| mgf = <math>\frac{1}{H_{N,s}}\sum\limits_{n=1}^N \frac{e^{nt}}{n^s}</math>
| char = <math>\frac{1}{H_{N,s}}\sum\limits_{n=1}^N \frac{e^{int}}{n^s}</math>
Formally, the Zipf distribution on N elements assigns to the element of rank k (counting from 1) the probability:
<math display="block">\ f(k;N) ~=~
\begin{cases}
\frac{ 1 }{\ H_N }\ \frac{1}{\ k\ }\ , &\ \mbox{ if }\ 1 \le k \le N ~, \\
{} \\
~~ 0 ~~ , &\ \mbox{ if }\ k < 1\ \mbox{ or }\ N < k ~.
\end{cases}
</math> where H<sub>N</sub> is a normalization constant, the N-th harmonic number:
<math display="block"> H_N \equiv \sum_{k=1}^N \frac{\ 1\ }{ k } ~.</math>
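The distribution is sometimes generalized to an inverse power law with exponent s instead of 1, in which case the normalization constant becomes the generalized harmonic number H<sub>N,s</sub>.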
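The definition above translates directly into code. The sketch below assumes integer ranks and uses illustrative function names (`generalized_harmonic`, `zipf_pmf`, `zipf_cdf`, `zipf_mean`) that are not taken from any library. It evaluates the harmonic sums naively, which is adequate for moderate N.

```python
def generalized_harmonic(n, s=1.0):
    """H_{n,s} = sum_{k=1}^{n} 1 / k**s (the ordinary H_n when s == 1)."""
    return sum(1.0 / k**s for k in range(1, n + 1))


def zipf_pmf(k, n, s=1.0):
    """Probability of rank k among n elements; zero outside 1..n."""
    if k < 1 or k > n:
        return 0.0
    return 1.0 / (k**s * generalized_harmonic(n, s))


def zipf_cdf(k, n, s=1.0):
    """P(rank <= k) = H_{k,s} / H_{n,s} for integer k."""
    if k < 1:
        return 0.0
    k = min(k, n)
    return generalized_harmonic(k, s) / generalized_harmonic(n, s)


def zipf_mean(n, s=1.0):
    """Mean rank, H_{n,s-1} / H_{n,s}."""
    return generalized_harmonic(n, s - 1) / generalized_harmonic(n, s)
```

For s = 1 the ratio `zipf_pmf(1, n) / zipf_pmf(2, n)` equals 2 for any n ≥ 2, matching the statement that the top-ranked item is twice as frequent as the second.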
Zipf's law can be visualized by plotting the item frequency data on a log-log graph, with the axes being the logarithm of rank order and the logarithm of frequency. The data conform to Zipf's law with exponent s to the extent that the plot approximates a linear (more precisely, affine) function with slope −s. For exponent s = 1, one can also plot the reciprocal of the frequency (mean interword interval) against rank, or the reciprocal of rank against frequency, and compare the result with the line through the origin with slope 1.
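The log-log check described above can also be done numerically. The following sketch fits a straight line to log(frequency) against log(rank) by ordinary least squares; the function name `loglog_slope` is an assumption for illustration. A least-squares fit on log-log axes is a quick diagnostic comparable to inspecting the plot by eye, not a rigorous estimator of the exponent.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope and intercept of log(frequency) vs log(rank).

    Frequencies are sorted in decreasing order to assign ranks 1, 2, ...
    Zero or negative entries are dropped because their log is undefined.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Applied to exact frequencies proportional to 1/rank, the fitted slope is −1 up to rounding; real data will scatter around the line, typically most visibly at the lowest ranks and in the long tail.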
Zipf's law in word frequencies may be partly explained by statistical analysis of randomly generated texts. Wentian Li has shown that in a document in which each character has been chosen randomly from a uniform distribution of all letters (plus a space character), the "words" of different lengths follow the macro-trend of Zipf's law (the most probable words are the shortest, and words of the same length have equal probability).
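The random-text argument can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of Li's analysis: the four-letter alphabet, the text length, the seed, and the function name `random_text_word_counts` are all choices made here for illustration.

```python
import random
from collections import Counter


def random_text_word_counts(n_chars, letters="abcd", seed=0):
    """Count "words" in a text of uniformly random letters and spaces.

    Each character is drawn independently and uniformly from `letters`
    plus the space character; words are the runs between spaces.
    """
    rng = random.Random(seed)
    alphabet = letters + " "
    text = "".join(rng.choice(alphabet) for _ in range(n_chars))
    # split() drops the empty strings produced by consecutive spaces
    return Counter(text.split())


counts = random_text_word_counts(200_000)
for rank, (word, count) in enumerate(counts.most_common(10), start=1):
    # print the ten most frequent random "words" with their counts
    print(rank, word, count)
```

With these settings the single-letter words occupy the top ranks with similar counts, followed by a plateau of two-letter words, and so on. The rank-frequency curve is therefore a staircase whose overall trend on log-log axes is roughly a straight line.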
Another possible cause for the Zipf distribution is a preferential attachment process, in which the value x of an item tends to grow at a rate proportional to x (intuitively, "the rich get richer" or "success breeds success"). Such a growth process results in the Yule–Simon distribution, which has been shown to fit word frequency versus rank in language.
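A preferential attachment process of this kind can be sketched as a simple text-generation model in the style of Simon's process. The parameter `new_word_prob`, the integer word labels, and the function name `simon_process` are assumptions made for illustration. At each step a brand-new word is added with a fixed probability; otherwise an earlier token is copied uniformly at random, so a word is chosen with probability proportional to its current count.

```python
import random
from collections import Counter


def simon_process(n_tokens, new_word_prob=0.1, seed=0):
    """Simulate a rich-get-richer token stream; return counts, largest first."""
    rng = random.Random(seed)
    tokens = [0]        # start with a single word, labelled 0
    next_label = 1
    while len(tokens) < n_tokens:
        if rng.random() < new_word_prob:
            # introduce a word that has never been used before
            tokens.append(next_label)
            next_label += 1
        else:
            # copying a uniformly chosen earlier token selects each word
            # with probability proportional to its current frequency
            tokens.append(rng.choice(tokens))
    return sorted(Counter(tokens).values(), reverse=True)


counts = simon_process(20_000)
# show the five largest word counts and the number of distinct words
print(counts[:5], len(counts))
```

The resulting counts are heavily skewed: a few early words accumulate most of the tokens while the majority of words appear only a handful of times.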
The frequency-rank word distribution is often characteristic of the author and changes little over time. This feature has been used in the analysis of texts for authorship attribution.
External links
- An article on Zipf's law applied to city populations
- Seeing Around Corners (Artificial societies turn up Zipf's law)
- PlanetMath article on Zipf's law
- Distributions de type "fractal parabolique" dans la Nature (French, with English summary)
- An analysis of income distribution
- Zipf List of French words
- Zipf list for English, French, Spanish, Italian, Swedish, Icelandic, Latin, Portuguese and Finnish from Gutenberg Project and online calculator to rank words in texts
- Zipf's Law examples and modelling (1985)
- Benford's law, Zipf's law, and the Pareto distribution by Terence Tao.
