Zipf's law

Figure: A plot of the frequency of each word as a function of its frequency rank for two English-language texts, Culpeper's Complete Herbal (1652) and H. G. Wells's The War of the Worlds (1898), on a log-log scale. The dashed line is the ideal law <math display="inline" alt="y is proportional to the inverse of x">y \propto \frac{1}{x}</math>.

Zipf's law is an empirical law stating that when a set of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.

The best-known instance of Zipf's law applies to the frequency distribution of words in a text or corpus of natural language:

<math display="block" alt="Word frequency is proportional to the inverse of the word rank">\ \mathsf{word\ frequency}\ \propto\ \frac{ 1 }{\ \mathsf{ word\ rank}\ } ~.</math>

It is usually found that the most common word occurs approximately twice as often as the next most common one, three times as often as the third most common, and so on. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852).

The relation had been noted before Zipf, including by E. Condon in 1928. The same relation for frequencies of words in natural language texts was observed by George Zipf in 1932. Similar rank-size patterns have been reported in other kinds of data, such as the number of people watching the same TV channel.

In 1992, bioinformatician Wentian Li published a proof that randomly generated texts also follow a Zipf-like word-frequency distribution.
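As a quick arithmetic check, the Brown Corpus counts quoted above can be compared with the ideal prediction that the word of rank r occurs 1/r times as often as the top word. The sketch below is illustrative only; the helper name `zipf_ratios` is an assumption introduced here, and only the three counts given in the text are used.

```python
def zipf_ratios(counts):
    """Return observed/ideal ratios for counts sorted by rank.

    The ideal count at rank r is taken to be (top count) / r, so a
    ratio of 1.0 means a perfect match with the 1/rank law.
    """
    top = counts[0]
    return [n / (top / rank) for rank, n in enumerate(counts, start=1)]


# Counts for "the", "of", "and" as quoted in the text.
brown_top3 = [69_971, 36_411, 28_852]

for word, ratio in zip(["the", "of", "and"], zipf_ratios(brown_top3)):
    print(f"{word}: observed / ideal = {ratio:.2f}")
```

The second and third words come out somewhat above the ideal 1/rank prediction, which is consistent with the law being described as approximate.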

Formal definition

| pmf = <math>\frac{1/k^s}{H_{N,s}}</math>, where <math>H_{N,s}</math> is the Nth generalized harmonic number
| cdf = <math>\frac{H_{k,s}}{H_{N,s}}</math>
| mean = <math>\frac{H_{N,s-1}}{H_{N,s}}</math>
| mode = <math>1\,</math>
| variance = <math>\frac{H_{N,s-2}}{H_{N,s}}-\frac{H^2_{N,s-1}}{H^2_{N,s}}</math>
| entropy = <math>\frac{s}{H_{N,s}}\sum\limits_{k=1}^N\frac{\ln(k)}{k^s}+\ln(H_{N,s})</math>
| mgf = <math>\frac{1}{H_{N,s}}\sum\limits_{n=1}^N \frac{e^{nt}}{n^s}</math>
| char = <math>\frac{1}{H_{N,s}}\sum\limits_{n=1}^N \frac{e^{int}}{n^s}</math>

Formally, the Zipf distribution on N elements assigns to the element of rank k (counting from 1) the probability:

<math display="block">f(k;N) =
\begin{cases}
\frac{1}{H_N}\,\frac{1}{k}, & \mbox{if } 1 \le k \le N, \\
0, & \mbox{if } k < 1 \mbox{ or } N < k,
\end{cases}
</math>

where <math>H_N</math> is a normalization constant, the Nth harmonic number:

<math display="block"> H_N \equiv \sum_{k=1}^N \frac{\ 1\ }{ k } ~.</math>

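The distribution is sometimes generalized to an inverse power law with exponent s instead of 1, in which case the normalization constant becomes the generalized harmonic number <math>H_{N,s} = \sum_{k=1}^N k^{-s}</math>.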
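The definition can be made concrete with a short numerical sketch. The function names below are assumptions introduced for illustration, and the exponent is called `s` here (with `s = 1` giving the basic form with the ordinary harmonic number as the normalization constant).

```python
def generalized_harmonic(n, s=1.0):
    """Sum of 1/k**s for k = 1..n (the ordinary harmonic number when s = 1)."""
    return sum(1.0 / k**s for k in range(1, n + 1))


def zipf_pmf(k, n, s=1.0):
    """Probability of rank k under a Zipf distribution on n elements."""
    if k < 1 or k > n:
        return 0.0  # ranks outside 1..n have zero probability
    return 1.0 / (k**s * generalized_harmonic(n, s))


# Probabilities over all ranks should sum to 1 (up to rounding).
n = 10
print([round(zipf_pmf(k, n), 4) for k in range(1, n + 1)])
print(sum(zipf_pmf(k, n) for k in range(1, n + 1)))
```

Recomputing the harmonic number on every call keeps the sketch short; for large n one would normally compute it once and reuse it.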

Zipf's law can be visualized by plotting the item frequency data on a log-log graph, with the axes being the logarithm of rank order and the logarithm of frequency. The data conform to Zipf's law with exponent s to the extent that the plot approximates a linear (more precisely, affine) function with slope −s. For exponent s = 1, one can also plot the reciprocal of the frequency (mean interword interval) against rank, or the reciprocal of rank against frequency, and compare the result with the line through the origin with slope 1.
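The log-log check described above can also be done numerically by fitting a straight line to log frequency against log rank. The sketch below uses ordinary least squares, which is a simple but crude estimator, and the function name `loglog_slope` is an assumption introduced here. A slope near -1 would correspond to the basic form of the law.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies are sorted in decreasing order to assign ranks 1, 2, ...;
    non-positive values are dropped because their logarithm is undefined.
    At least two positive frequencies are required.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data that follows the ideal 1/rank law exactly.
ideal = [1000.0 / rank for rank in range(1, 51)]
print(loglog_slope(ideal))
```

On real word counts the fit is usually dominated by the many low-frequency items in the tail, so the estimated slope should be read as a rough diagnostic rather than a precise exponent.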

Zipf's law for word frequencies may be partly explained by statistical analysis of randomly generated texts. Wentian Li has shown that in a document in which each character has been chosen randomly from a uniform distribution of all letters (plus a space character), the "words" of different lengths follow the macro-trend of Zipf's law: shorter words are more probable, and words of the same length are equally probable.
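A random-text experiment of this kind is easy to sketch. The code below is only an illustration, not a reproduction of Li's analysis: the two-letter alphabet, the text length, and the function name `random_text_word_counts` are all assumptions chosen to keep the example small.

```python
import random
from collections import Counter


def random_text_word_counts(n_chars, alphabet="ab", seed=0):
    """Count 'words' in a text of uniformly random letters and spaces."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    symbols = alphabet + " "   # every letter and the space are equally likely
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    # split() treats runs of spaces as one separator and drops empty strings.
    return Counter(text.split())


counts = random_text_word_counts(100_000)
for rank, (word, n) in enumerate(counts.most_common(8), start=1):
    print(rank, word, n)
```

In such a run the one-letter words occupy the top ranks with similar counts, followed by a plateau of two-letter words, and so on, giving a step-like curve whose overall trend decreases with rank.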

Another possible cause for the Zipf distribution is a preferential attachment process, in which the value x of an item tends to grow at a rate proportional to x (intuitively, "the rich get richer" or "success breeds success"). Such a growth process results in the Yule–Simon distribution, which has been shown to fit word frequency versus rank in language.
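The "rich get richer" mechanism can be illustrated with a small simulation. The sketch below follows a simple copying scheme in the spirit of Simon's model, chosen here as one way to realize preferential attachment. The parameter `p_new` (probability of introducing a new type) and the function name `simon_process` are assumptions made for this example.

```python
import random
from collections import Counter


def simon_process(n_tokens, p_new=0.1, seed=0):
    """Simulate a token stream with preferential attachment.

    With probability p_new the next token is a brand-new type; otherwise it
    copies a uniformly chosen earlier token, so each existing type is chosen
    with probability proportional to its current count.
    """
    rng = random.Random(seed)
    tokens = [0]   # start with a single token of type 0
    next_type = 1
    while len(tokens) < n_tokens:
        if rng.random() < p_new:
            tokens.append(next_type)
            next_type += 1
        else:
            tokens.append(rng.choice(tokens))
    return Counter(tokens)


counts = simon_process(20_000)
ranked = sorted(counts.values(), reverse=True)
print(ranked[:10])  # counts of the ten most frequent types
```

Copying a uniformly chosen earlier token is what makes the growth rate of each type proportional to its current size, without any explicit weighting step.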

The frequency-rank word distribution is often characteristic of the author and changes little over time. This feature has been used in the analysis of texts for authorship attribution.

