Divergence-from-randomness model

In the field of information retrieval, divergence from randomness (DFR) is a generalization of one of the very first models, Harter's 2-Poisson indexing model. It is a probabilistic approach used to measure the amount of information carried in documents. The 2-Poisson model is based on the hypothesis that the level of treatment of informative words is witnessed by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the documents. Strictly speaking, DFR is not a single model but a framework for weighting terms using probabilistic methods, and it bases term weighting on the notion of eliteness.

Term weights indicate whether a specific word belongs to that set. They are computed by measuring the divergence between the term distribution produced by a random process and the actual term distribution.

Divergence-from-randomness models are set up by instantiating the three main components of the framework: first selecting a basic randomness model, then applying the first normalization, and finally normalizing the term frequencies. The basic models are from the following tables.

Definition

The divergence from randomness is based on this idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in document d. In other words, the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness."
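
One standard way to make this inverse relation concrete, and the form used in the DFR literature, is to take the negative logarithm of the probability. As a sketch, writing Prob<sub>M</sub> for the probability assigned by the model M of randomness (the symbol w and the base-2 logarithm are notational assumptions here):

<math>w(t, d) = -\log_2 \mathrm{Prob}_M(t \in d \mid \text{Collection})</math>

Under this form, a term-frequency that is very unlikely under M yields a large weight, and one that M predicts almost surely yields a weight close to zero.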

Term frequency normalization

Before using the within-document frequency tf of a term, the document-length dl is normalized to a standard length sl. The term-frequencies tf are therefore recalculated with respect to the standard document-length, that is:

tf<sub>n</sub> = tf * log(1+ sl/dl) (normalization 1)

tf<sub>n</sub> represents the normalized term frequency. Another version of the normalization formula is the following:

tf<sub>n</sub> = tf * log(1 + c*(sl/dl)) (normalization 2)

Normalization 2 is usually considered to be more flexible, since c is a free parameter rather than a fixed value.

tf is the term-frequency of the term t in the document d.
dl is the document-length.
sl is the standard length.
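
The two normalizations above can be sketched in a few lines of Python. This is only an illustration: the function name `normalize_tf` is hypothetical, and the base-2 logarithm is an assumption, since the formulas above do not state the base. Normalization 1 is treated as the special case c = 1 of normalization 2.

```python
import math


def normalize_tf(tf, dl, sl, c=1.0):
    """Return the normalized term frequency tf_n.

    tf: term frequency of term t in document d
    dl: length of document d
    sl: standard length the document is normalized to
    c:  free parameter of normalization 2; c = 1 gives normalization 1

    Assumption: the logarithm is taken in base 2.
    """
    if tf < 0:
        raise ValueError("tf must be non-negative")
    if dl <= 0 or sl <= 0 or c <= 0:
        raise ValueError("dl, sl and c must be positive")
    return tf * math.log2(1.0 + c * (sl / dl))


# Normalization 1: a document of exactly standard length keeps its tf.
tfn_1 = normalize_tf(tf=3, dl=100, sl=100)

# Normalization 2: a document twice the standard length, with c = 2.
tfn_2 = normalize_tf(tf=3, dl=200, sl=100, c=2.0)
```

With these inputs both calls evaluate 3 · log<sub>2</sub>(2), so a longer document needs a larger c to keep the same normalized frequency.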

Mathematical and statistical tools

The probability space

Sampling space V

Utility-theoretic indexing, developed by Cooper and Maron, is a theory of indexing based on utility theory. Index terms are assigned to documents so as to reflect the value that users expect from them. Utility-theoretic indexing is also related to an "event space" in the statistical sense. There are several basic spaces Ω in information retrieval. A very simple basic space Ω is the set V of terms t, which is called the vocabulary of the document collection. Because Ω = V is the set of all mutually exclusive events, Ω is also the certain event, with probability:

<math>P(V) = \sum_{t \mathop \in V} P(t) = 1</math>

Thus P, the probability distribution, assigns probabilities to all sets of terms of the vocabulary. Notice that the basic problem of information retrieval is to find an estimate for P(t). Estimates are computed on the basis of sampling, and the experimental text collection furnishes the samples needed for the estimation. The main concern is then how to treat two arbitrary but heterogeneous pieces of text appropriately, for example a chapter in a science magazine and an article from a sports newspaper. They can be considered two different samples, since they target different populations.
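
As an illustration of estimating P(t) from a sample, the sketch below uses the relative frequency of each term in a small tokenized collection. The relative-frequency estimator, the function name `estimate_term_probabilities` and the toy documents are all assumptions made for the example; the text above only states that estimates are obtained by sampling.

```python
from collections import Counter


def estimate_term_probabilities(documents):
    """Estimate P(t) for every term t by its relative frequency.

    documents: iterable of tokenized documents (each a list of terms).
    Returns a dict mapping each term of the vocabulary V to its estimate.
    """
    counts = Counter(term for doc in documents for term in doc)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the collection contains no terms")
    return {term: n / total for term, n in counts.items()}


# Toy collection: two very short "documents" (hypothetical data).
collection = [["goal", "match", "goal"], ["match", "cell"]]
p = estimate_term_probabilities(collection)

# The estimates form a probability distribution over V: they sum to 1.
total_probability = sum(p.values())
```

Pooling heterogeneous texts in this way treats them as one sample, which is exactly the simplification that the concern about different populations calls into question.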

Sampling with a document

The relationship of the document to the experiments is determined by the way in which the sample space is chosen. In IR, the term experiment, or trial, is used with a technical meaning rather than its everyday sense. For example, a document could be an experiment, which means that the document is a sequence of outcomes t ∈ V, or just a sample of a population. We consider the event of observing a number X<sub>t</sub> = tf of occurrences of a given word t in a sequence of experiments. To introduce this event space, we first introduce the product of the probability spaces associated with the experiments of the sequence. The sample space then associates a point with each possible configuration of the outcomes. The one-to-one correspondence for the sample space can be defined as:

<math>\mathop \Omega = V^{l_d}</math>

where l<sub>d</sub> is the number of trials of the experiment or, in this example, the length of a document. Each outcome may or may not depend on the outcomes of the previous experiments. If the experiments are designed so that an outcome influences the following outcomes, then the probability distribution on V is different at each trial. More commonly, however, the term independence assumption is made in IR in order to obtain the simpler case in which the probability space is invariant. Therefore, all possible configurations of Ω = V<sup>l<sub>d</sub></sup> are considered equiprobable. Under this assumption, each document can be considered a Bernoulli process. The probability spaces of the product are invariant, and the probability of a given sequence is the product of the probabilities at each trial. Consequently, if p = P(t) is the prior probability that the outcome is t and the number of experiments is l<sub>d</sub>, the probability that <math>X_t=tf</math> is equal to:

<math>P(X_t=tf|p)=\binom{l_d}{tf}p^{tf}q^