Dempster–Shafer theory

thumb|[[Arthur P. Dempster at the Workshop on Theory of Belief Functions (Brest, 1 April 2010).]]

The theory of belief functions, also referred to as evidence theory or Dempster–Shafer theory (DST), is a general framework for reasoning with uncertainty, with understood connections to other frameworks such as probability, possibility and imprecise probability theories. Introduced by Arthur P. Dempster in the context of statistical inference, the theory was later developed by Glenn Shafer into a general framework for modeling epistemic uncertainty—a mathematical theory of evidence. The theory allows one to combine evidence from different sources and arrive at a degree of belief (represented by a mathematical object called belief function) that takes into account all the available evidence.

In a narrow sense, the term Dempster–Shafer theory refers to the original conception of the theory by Dempster and Shafer. However, it is more common to use the term in the wider sense of the same general approach, as adapted to specific kinds of situations. In particular, many authors have proposed different rules for combining evidence, often with a view to handling conflicts in evidence better. The early contributions have also been the starting points of many important developments, including the transferable belief model and the theory of hints. Put another way, it is a way of representing epistemic plausibilities, but it can yield answers that contradict those arrived at using probability theory.

Often used as a method of sensor fusion, Dempster–Shafer theory is based on two ideas: obtaining degrees of belief for one question from subjective probabilities for a related question, and Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. In essence, the degree of belief in a proposition depends primarily upon the number of answers (to the related questions) containing the proposition, and the subjective probability of each answer. Also contributing are the rules of combination that reflect general assumptions about the data.

In this formalism a degree of belief (also referred to as a mass) is represented as a belief function rather than a Bayesian probability distribution. Probability values are assigned to sets of possibilities rather than single events: their appeal rests on the fact they naturally encode evidence in favor of propositions.

Dempster–Shafer theory assigns its masses to all of the subsets of the set of states of a system—in set-theoretic terms, the power set of the states. For instance, assume a situation where there are two possible states of a system. For this system, any belief function assigns mass to the first state, the second, to both, and to neither.

Belief and plausibility

Shafer's formalism starts from a set of possibilities under consideration, for instance numerical values of a variable, or pairs of linguistic variables like "date and place of origin of a relic" (asking whether it is antique or a recent fake). A hypothesis is represented by a subset of this frame of discernment, like "(Ming dynasty, China)", or "(19th century, Germany)". that are dictated by independent belief sources, such as in the case of combining hints or combining preferences. Note that the probability masses from propositions that contradict each other can be used to obtain a measure of conflict between the independent belief sources. Other situations can be modeled with different fusion operators, such as cumulative fusion of beliefs from independent sources, which can be modeled with the cumulative fusion operator.

Dempster's rule of combination is sometimes interpreted as an approximate generalisation of Bayes' rule. In this interpretation the priors and conditionals need not be specified, unlike traditional Bayesian methods, which often use a symmetry (minimax error) argument to assign prior probabilities to random variables (e.g. assigning 0.5 to binary values for which no information is available about which is more likely). However, any information contained in the missing priors and conditionals is not used in Dempster's rule of combination unless it can be obtained indirectly—and arguably is then available for calculation using Bayes equations.

Dempster–Shafer theory allows one to specify a degree of ignorance in this situation instead of being forced to supply prior probabilities that add to unity. This sort of situation, and whether there is a real distinction between risk and ignorance, has been extensively discussed by statisticians and economists. See, for example, the contrasting views of Daniel Ellsberg, Howard Raiffa, Kenneth Arrow and Frank Knight.

Formal definition

Let X be the universe: the set representing all possible states of a system under consideration. The power set

:<math>2^X \,\!</math>

is the set of all subsets of X, including the empty set <math>\emptyset</math>. For example, if:

:<math>X = \left \{ a, b \right \} \,\!</math>

then

:<math>2^X = \left \{ \emptyset, \left \{ a \right \}, \left \{ b \right \}, X \right \}. \,</math>

The elements of the power set can be taken to represent propositions concerning the actual state of the system, by containing all and only the states in which the proposition is true.

The theory of evidence assigns a belief mass to each element of the power set. Formally, a function

:<math>m: 2^X \rightarrow [0,1] \,\!</math>

is called a basic belief assignment (BBA), when it has two properties. First, the mass of the empty set is zero:

:<math>m(\emptyset) = 0. \,\!</math>

Second, the masses of all the members of the power set add up to a total of 1:

:<math>\sum_{A \in 2^X} m(A) = 1.</math>

The mass m(A) of A, a given member of the power set, expresses the proportion of all relevant and available evidence that supports the claim that the actual state belongs to A but to no particular subset of A. The value of m(A) pertains only to the set A and makes no additional claims about any subsets of A, each of which have, by definition, their own mass.

From the mass assignments, the upper and lower bounds of a probability interval can be defined. This interval contains the precise probability of a set of interest (in the classical sense), and is bounded by two non-additive continuous measures called belief (or support) and plausibility:

:<math>\operatorname{bel}(A) \le P(A) \le \operatorname{pl}(A).</math>

The belief bel(A) for a set A is defined as the sum of all the masses of subsets of the set of interest:

:<math>\operatorname{bel}(A) = \sum_{B \mid B \subseteq A} m(B). \, </math>

The plausibility pl(A) is the sum of all the masses of the sets B that intersect the set of interest A:

:<math>\operatorname{pl}(A) = \sum_{B \mid B \cap A \ne \emptyset} m(B). \, </math>

The two measures are related to each other as follows:

:<math>\operatorname{pl}(A) = 1 - \operatorname{bel}(\overline{A}).\,</math>

And conversely, for finite A, given the belief measure bel(B) for all subsets B of A, we can find the masses m(A) with the following inverse function:

:<math>m(A) = \sum_{B \mid B \subseteq A} (-1)^{|A-B|}\operatorname{bel}(B) \, </math>

where |A − B| is the difference of the cardinalities of the two sets.

Random-set interpretation

Belief functions have a very compelling interpretation in terms of set-valued random variables, or random sets'.

Consider a die, as a very simple example of (discrete) random variable. The corresponding probability space rests on the sample space <math>\Omega = \{ 1, 2, 3, 4, 5, 6 \}</math>, with elements corresponding to the die's faces.

thumb|586x586px|The random set (set-valued random variable) associated with the cloaked die in which faces 1 and 2 are not distinguishable.

Now, imagine that two of the faces, e.g., 1 and 2, are cloaked. We can still roll the die: the probability space has not changed, and this remains a random experiment described by an underlying random variable. What has changed is the nature of the outcome: besides regular individual faces (3, 4, etcetera), we might at times observe a cloaked face. Assuming only the top side is visible after each run, this is equivalent to observing the set of faces <math>\{ 1, 2 \}</math>. Mathematically, this is a set-valued random variable, o random set (see figure).

This illustrates one of the simplest possible cases of missing data, situations in which part of the data I need to observe in order to make my inference are partly or totally missing. This is very common in science and engineering: think of occlusions in computer vision, for instance, or missing temperature values for certain days or locations in meteorological data. The bottom line is, whenever data are missing, observations are inherently set-valued.

According to Art Dempster's original formulation, a belief function is indeed a multi-valued (1 to many) mapping between a probability (source) space <math>\Omega</math> and a target space <math>X</math>, of the kind: <math>\Gamma : \Omega \rightarrow 2^X</math>, where <math>\omega \in \Omega \mapsto A \subseteq X</math>. In the above cloaked die example, <math>A = \{ 1, 2 \} \subseteq X = \{ 1, 2, 3, 4, 5, 6 \}.</math>

Dempster's rule of combination

The problem we now face is how to combine two independent sets of probability mass assignments in specific situations. In case different sources express their beliefs over the frame in terms of belief constraints such as in the case of giving hints or in the case of expressing preferences, then Dempster's rule of combination is the appropriate fusion operator. This rule derives common shared belief between multiple sources and ignores all the conflicting (non-shared) belief through a normalization factor. Use of that rule in other situations than that of combining belief constraints has come under serious criticism, such as in case of fusing separate belief estimates from multiple sources that are to be integrated in a cumulative manner, and not as constraints. Cumulative fusion means that all probability masses from the different sources are reflected in the derived belief, so no probability mass is ignored.

Specifically, the combination (called the joint mass) is calculated from the two sets of masses m<sub>1</sub> and m<sub>2</sub> in the following manner:

:<math>m_{1,2}(\emptyset) = 0 \, </math>

:<math>m_{1,2}(A) = (m_1 \oplus m_2) (A) = \frac 1 {1 - K} \sum_{B \cap C = A \ne \emptyset} m_1(B) m_2(C) \,\!</math>

where

:<math>K = \sum_{B \cap C = \emptyset} m_1(B) m_2(C). \, </math>

K is a measure of the amount of conflict between the two mass sets.

Effects of conflict

The normalization factor above, 1 − K, has the effect of completely ignoring conflict and attributing any mass associated with conflict to the empty set. This combination rule for evidence can therefore produce counterintuitive results, as we show next.

Example producing correct results in case of high conflict

The following example shows how Dempster's rule produces intuitive results when applied in a preference fusion situation, even when there is high conflict.

:Suppose that two friends, Alice and Bob, want to see a film at the cinema one evening, and that there are only three films showing: X, Y and Z. Alice expresses her preference for film X with probability 0.99, and her preference for film Y with a probability of only 0.01. Bob expresses his preference for film Z with probability 0.99, and his preference for film Y with a probability of only 0.01. When combining the preferences with Dempster's rule of combination it turns out that their combined preference results in probability 1.0 for film Y, because it is the only film that they both agree to see.

:Dempster's rule of combination produces intuitive results even in case of totally conflicting beliefs when interpreted in this way. Assume that Alice prefers film X with probability 1.0, and that Bob prefers film Z with probability 1.0. When trying to combine their preferences with Dempster's rule it turns out that it is undefined in this case, which means that there is no solution. This would mean that they can not agree on seeing any film together, so they do not go to the cinema together that evening. However, the semantics of interpreting preference as a probability is vague: if it is referring to the probability of seeing film X tonight, then we face the fallacy of the excluded middle: the event that actually occurs, seeing none of the films tonight, has a probability mass of 0.

Example producing counter-intuitive results in case of high conflict

An example with exactly the same numerical values was introduced by Lotfi Zadeh in 1979,

to point out counter-intuitive results generated by Dempster's rule when there is a high degree of conflict. The example goes as follows:

:Suppose that one has two equi-reliable doctors and one doctor believes a patient has either a brain tumor, with a probability (i.e. a basic belief assignment—bba's, or mass of belief) of 0.99; or meningitis, with a probability of only 0.01. A second doctor believes the patient has a concussion, with a probability of 0.99, and believes the patient suffers from meningitis, with a probability of only 0.01. Applying Dempster's rule to combine these two sets of masses of belief, one gets finally m(meningitis)=1 (the meningitis is diagnosed with 100 percent of confidence).

Such result goes against common sense since both doctors agree that there is a little chance that the patient has meningitis. This example has been the starting point of many research works for trying to find a solid justification for Dempster's rule and for foundations of Dempster–Shafer theory or to show the inconsistencies of this theory.

Example producing counter-intuitive results in case of low conflict

The following example shows where Dempster's rule produces a counter-intuitive result, even when there is low conflict.

:Suppose that one doctor believes a patient has either a brain tumor, with a probability of 0.99, or meningitis, with a probability of only 0.01. A second doctor also believes the patient has a brain tumor, with a probability of 0.99, and believes the patient suffers from concussion, with a probability of only 0.01. If we calculate m (brain tumor) with Dempster's rule, we obtain

::<math>m(\text{brain tumor}) = \operatorname{Bel}(\text{brain tumor}) = 1. \, </math>

This result implies complete support for the diagnosis of a brain tumor, which both doctors believed very likely. The agreement arises from the low degree of conflict between the two sets of evidence comprised by the two doctors' opinions.

In either case, it would be reasonable to expect that:

:<math>m(\text{brain tumor}) < 1\text{ and } \operatorname{Bel}(\text{brain tumor}) < 1,\,</math>

since the existence of non-zero belief probabilities for other diagnoses implies less than complete support for the brain tumour diagnosis.

Dempster–Shafer as a generalisation of Bayesian theory

As in Dempster–Shafer theory, a Bayesian belief function <math>\operatorname{bel}: 2^X \rightarrow [0,1] \,\!</math> has the properties <math>\operatorname{bel}(\emptyset) = 0</math> and <math>\operatorname{bel}(X) = 1</math>. The third condition, however, is subsumed by, but relaxed in DS theory: For example, DS theory violates the requirements for Cox's theorem, which implies that it cannot be considered a coherent (contradiction-free) generalization of classical logic—specifically, DS theory violates the requirement that a statement be either true or false (but not both). As a result, DS theory is subject to the Dutch Book argument, implying that any agent using DS theory would agree to a series of bets that result in a guaranteed loss.

Bayesian approximation

The Bayesian approximation reduces a given bpa <math>m</math> to a (discrete) probability distribution, i.e. only singleton subsets of the frame of discernment are allowed to be focal elements of the approximated version <math>\underline{m}</math> of <math>m</math>:

:<math>

\underline{m} (A) =

\left\{

\begin{aligned}

& \frac{\sum\limits_{B | A \subseteq B} m(B) }{ \sum\limits_C m(C) \cdot |C| }, &

|A| = 1 \\

& 0, & \text{otherwise}

\end{aligned}

\right.

</math>

It's useful for those who are only interested in the single state hypothesis.

We can perform it in the 'light' example.

{| class="wikitable"

! Hypothesis !! <math>m_1</math> !! <math>m_2</math> !! <math>m_{1,2}</math> !! <math>\underline{m}_1</math> !! <math>\underline{m}_2</math> !! <math>\underline{m}_{1,2}</math>

| None || 0 || 0 || 0 || 0 || 0 || 0

| Red || 0.35 || 0.11 || 0.32 || 0.41|| 0.30 || 0.37

| Yellow || 0.25 || 0.21 || 0.33 || 0.33|| 0.38 || 0.38

| Green || 0.15 || 0.33 || 0.24 || 0.25 || 0.32 || 0.25

| Red or Yellow || 0.06 || 0.21 || 0.07 || 0|| 0 || 0

| Red or Green || 0.05 || 0.01 || 0.01 || 0|| 0 || 0

| Yellow or Green || 0.04 || 0.03 || 0.01 || 0|| 0 || 0

| Any || 0.1 || 0.1 || 0.02 || 0|| 0 || 0

Relation with other uncertainty measures

Belief functions have strong links to various other measures of second-order, epistemic uncertainty.

In particular, a belief function uniquely identifies a credal set, i.e., a convex set of probability distributions on the same domain it is defined on. More precisely, the corresponding credal set is:. Uncertainty measure generalizing entropy in the context of evidence theory have also been studied.

Criticism

Judea Pearl (1988a, chapter 9; 1988b and 1990) has argued that it is misleading to interpret belief functions as representing either "probabilities of an event," or "the confidence one has in the probabilities assigned to various outcomes," or "degrees of belief (or confidence, or trust) in a proposition," or "degree of ignorance in a situation." Instead, belief

functions represent the probability that a given proposition is provable from a set of other propositions, to which probabilities are assigned. Confusing probabilities of truth with probabilities of provability may lead to counterintuitive results in reasoning tasks such as (1) representing incomplete knowledge, (2) belief-updating and (3) evidence pooling. He further demonstrated that, if partial knowledge is encoded and updated by belief function methods, the resulting beliefs cannot serve as a basis for rational decisions.

Kłopotek and Wierzchoń proposed to interpret the Dempster–Shafer theory in terms of statistics of decision tables (of the rough set theory), whereby the operator of combining evidence should be seen as relational joining of decision tables. In another interpretation M. A. Kłopotek and S. T. Wierzchoń propose to view this theory as describing destructive material processing (under loss of properties), e.g. like in some semiconductor production processes. Under both interpretations reasoning in DST gives correct results, contrary to the earlier probabilistic interpretations, criticized by Pearl in the cited papers and by other researchers.

Jøsang proved that Dempster's rule of combination actually is a method for fusing belief constraints., and later extensively investigated by Fabio Cuzzolin.

thumb|614x614px|Illustration of the concept of belief space.

Assuming the domain a belief function Bel is defined on has <math>N</math> elements, the vector of belief functions attached to all possible sets of elements of <math>X</math> has cardinality <math>|2^X| -1</math>. One can then define the belief space associated with <math>X</math> as the set of vectors of the <math>\mathbb{R}^{ \{ 2^X - 1 \} }</math> which correspond to valid belief functions, i.e., which satisfy the above additivity and non-negativity constraints mass values need to obey. It is easy to show that the belief space has the form of a simplex, with the set of all probability measures forming one of the sides of this simplex (see figure).

The belief space can also be studied within differential geometry: in particular, it appears to have a recursive fiber bundle structure.

Relational measures

In considering preferences one might use the partial order of a lattice instead of the total order of the real line as found in Dempster–Shafer theory. Indeed, Gunther Schmidt has proposed this modification and outlined the method.

Given a set of criteria C and a bounded lattice L with ordering ≤, Schmidt defines a relational measure to be a function μ from the power set of C into L that respects the order ⊆ on <math>\mathbb{P}</math>(C):

:<math>A \subseteq B \implies \mu(A) \leq \mu(B)</math>

and such that μ takes the empty subset of <math>\mathbb{P}</math>(C) to the least element of L, and takes C to the greatest element of L.

Schmidt compares μ with the belief function of Shafer, and he also considers a method of combining measures generalizing the approach of Dempster (when new evidence is combined with previously held evidence). He also introduces a relational integral and compares it to the Choquet integral and Sugeno integral. Any relation m between C and L may be introduced as a "direct valuation", then processed with the calculus of relations to obtain a possibility measure μ.

References

External links

BFAS: Belief Functions and Applications Society