Jaccard index - WikiHQ

The Jaccard index is a statistic used for gauging the similarity and diversity of sample sets.

It is defined in general taking the ratio of two sizes (areas or volumes), the intersection size divided by the union size, also called intersection over union (IoU).

The concept was first introduced by Grove Karl Gilbert in 1884 as the “ratio of verification” in contexts of geological prediction evaluation. and now is often called the critical success index in meteorology. It was later developed independently by Paul Jaccard, originally giving the French name (coefficient of community), and independently formulated again by Taffee Tadashi Tanimoto. Both the exact solution and approximation methods are available for hypothesis testing with the Jaccard index.

Both the exact solution and approximation methods are available for hypothesis testing with the Jaccard index. Jaccard similarity also applies to bags, i.e., multisets. This has a similar formula, but the symbols used represent bag intersection and bag sum (not union). The maximum value is 1/2.

: <math>J(A, B) = \frac{|A \cap B|}{|A \uplus B|} = \frac{|A \cap B|}{|A| + |B|}.</math>

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard index and is obtained by subtracting the Jaccard index from 1 or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

: <math>d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.</math>

An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference <math>A \mathbin\triangle B = (A \cup B) - (A \cap B)</math> to the union.

Jaccard distance is commonly used to calculate an matrix for clustering and multidimensional scaling of n sample sets. These distance measures are commonly used in cluster analysis for grouping similar observations.

This distance is a metric on the collection of all finite sets.

There is also a version of the Jaccard distance for measures, including probability measures. If <math>\mu</math> is a measure on a measurable space <math>X</math>, then we define the Jaccard index by

: <math>J_\mu(A, B) = \frac{\mu(A \cap B)}{\mu(A \cup B)},</math>

and the Jaccard distance by

: <math>d_\mu(A, B) = 1 - J_\mu(A,B) = \frac{\mu(A \mathbin\triangle B)}{\mu(A \cup B)}.</math>

The definition is not well-defined when <math>\mu(A \cup B) = 0</math> or <math>\mu(A \cup B) = \infty</math>.

The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity index of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.

The Jaccard index is particularly useful for analyzing large-scale and sparse datasets in modern data mining applications.

Similarity of asymmetric binary attributes

Given two objects, A and B, each with n binary attributes, the Jaccard index is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B are specified as follows:

:<math>M_{11}</math> represents the total number of attributes where A and B both have a value of 1.

:<math>M_{01}</math> represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.

:<math>M_{10}</math> represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.

:<math>M_{00}</math> represents the total number of attributes where A and B both have a value of 0.

{| class="wikitable" style="float:right;font-size:150%;"

! !! 0 !! 1

! 0

| <math>M_{00}</math>

| style="background:#ffcccc;border:solid;border-bottom:none;"|<math>M_{10}</math>

! 1

| style="background:#ffcccc;border:solid;border-right:none;"|<math>M_{01}</math>

| style="background:#66ff66;border:solid;border-top:none;border-left:none;"|<math>M_{11}</math>

Each attribute must fall into one of these four categories, meaning that

:<math>M_{11} + M_{01} + M_{10} + M_{00} = n.</math>

The Jaccard similarity index, J, is given as

:<math>J = {M_{11} \over M_{01} + M_{10} + M_{11.</math>

The Jaccard distance, d<sub>J</sub>, is given as

:<math>d_J = {M_{01} + M_{10} \over M_{01} + M_{10} + M_{11 = 1 - J.</math>

Statistical inference can be made based on the Jaccard similarity index, and consequently related metrics. The distinction between similarity measures that include or exclude joint absences has been widely discussed in numerical taxonomy and classification literature. The Jaccard index is defined as

where <math>M_{11}</math> is the number of attributes where both values are 1, <math>M_{10}</math> where only A is 1, and <math>M_{01}</math> where only B is 1.

Weighted Jaccard similarity and distance

If <math>\mathbf{x} = (x_1, x_2, \ldots, x_n)</math> and <math>\mathbf{y} = (y_1, y_2, \ldots, y_n)</math> are two vectors with all real <math>x_i, y_i \geq 0</math>, then their Jaccard similarity index (also known then as Ruzicka similarity) is defined as

:<math>J_\mathcal{W}(\mathbf{x}, \mathbf{y}) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)},</math>

and Jaccard distance (also known then as Soergel distance)

:<math>d_{J\mathcal{W(\mathbf{x}, \mathbf{y}) = 1 - J_\mathcal{W}(\mathbf{x}, \mathbf{y}).</math>

With even more generality, if <math>f</math> and <math>g</math> are two non-negative measurable functions on a measurable space <math>X</math> with measure <math>\mu</math>, then we can define

:<math>J_\mathcal{W}(f, g) = \frac{\int\min(f, g) d\mu}{\int \max(f, g) d\mu},</math>

where <math>\max</math> and <math>\min</math> are pointwise operators. Then Jaccard distance is

:<math>d_{J\mathcal{W(f, g) = 1 - J_\mathcal{W}(f, g).</math>

Then, for example, for two measurable sets <math>A, B \subseteq X</math>, we have <math>J_\mu(A,B) = J(\chi_A, \chi_B),</math> where <math>\chi_A</math> and <math>\chi_B</math> are the characteristic functions of the corresponding set.

thumb|Comparison of growth rates for algorithmic complexity. Standard Jaccard calculation follows <math>O(n)</math>, while a sparse optimization allows the processing time to scale relative to the number of non-zero attributes.

Calculating the weighted Jaccard similarity index for two vectors typically requires a single pass over the data, resulting in a linear computational complexity of <math>O(n)</math>, where <math>n</math> represents the number of dimensions. In data science applications, this is often optimized for sparse vectors by iterating only over non-zero elements, which significantly reduces processing time for high-dimensional datasets. This optimization shifts the complexity to <math>O(k)</math>, where <math>k</math> is the number of non-zero attributes.<syntaxhighlight lang="python"># Psuedocode for O(k) complexity

function jaccardIndex(vector_A, vector_B):

intersection_sum = 0

union_sum = 0

Get total unique keys

all_keys = unique_keys(vector_A.keys() + vector_B.keys())

for key in all_keys:

val_a = vector_A.get(key, 0)

val_b = vector_B.get(key, 0)

intersection_sum += min(val_a, val_b)

union_sum += max(val_a, val_b)

return intersection_sum / union_sum</syntaxhighlight>

To further scale processing time, techniques such as MinHashing and locality sensitive hashing are used to approximate the index using compact signatures. These algorithmic enhancements ensure that similarity measures remain scalable and efficient, despite an increase in data volume and dimensionality.

Probability Jaccard similarity and distance

The weighted Jaccard similarity described above generalizes the Jaccard Index to positive vectors, where a set corresponds to a binary vector given by the indicator function, i.e. <math>x_i \in \{0,1\}</math>. However, it does not generalize the Jaccard Index to probability distribution<nowiki/>s, where a set corresponds to a uniform probability distribution, i.e.

:<math>x_i = \begin{cases} \frac{1}{|X|} & i \in X \\ 0 & \text{otherwise} \end{cases}</math>

It is always less if the sets differ in size. If <math>|X| > |Y|</math>, and <math>x_i = \mathbf{1}_X(i)/|X|, y_i = \mathbf{1}_Y(i)/|Y|</math> then

:<math>J_\mathcal{W}(x,y) = \frac{|X\cap Y|}{|X\setminus Y| + |X|} < J(X,Y).</math>

[[File:Geometric interpretation of the Probability Jaccard Index as Simplices.png|thumb|upright=1.75|

The probability Jaccard index can be interpreted as intersections of simplices.]]

Instead, a generalization that is continuous between probability distributions and their corresponding support sets is

:<math>J_\mathcal{P}(x,y) = \sum_{x_i\neq 0, y_i \neq 0} \frac{1}{\sum_{j} \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}</math>

which is called the "Probability" Jaccard. It has the following bounds against the Weighted Jaccard on probability vectors.

:<math>J_\mathcal{W}(x,y) \leq J_\mathcal{P}(x,y) \leq \frac{2J_\mathcal{W}(x,y)}{1+J_\mathcal{W}(x,y)} </math>

Here the upper bound is the (weighted) Sørensen–Dice coefficient.

The corresponding distance, <math>1 - J_\mathcal{P}(x,y)</math>, is a metric over probability distributions, and a pseudo-metric over non-negative vectors.

The Probability Jaccard Index has a geometric interpretation as the area of an intersection of simplices. Every point on a unit <math>k</math>-simplex corresponds to a probability distribution on <math>k+1</math> elements, because the unit <math>k</math>-simplex is the set of points in <math>k+1</math> dimensions that sum to 1. To derive the Probability Jaccard Index geometrically, represent a probability distribution as the unit simplex divided into sub simplices according to the mass of each item. If you overlay two distributions represented in this way on top of each other, and intersect the simplices corresponding to each item, the area that remains is equal to the Probability Jaccard Index of the distributions.

Optimality of the Probability Jaccard Index

[[File:Visual proof of the optimality of the Probability Jaccard Index on Three element distributions.png|thumb|upright=1.75|A visual proof of the optimality of the Probability Jaccard Index on three element distributions.

]]

Consider the problem of constructing random variables such that they collide with each other as much as possible. That is, if <math>X\sim x</math> and <math>Y\sim y</math>, we would like to construct <math>X</math> and <math>Y</math> to maximize <math>\Pr[X=Y]</math>. If we look at just two distributions <math>x,y</math> in isolation, the highest <math>\Pr[X=Y]</math> we can achieve is given by <math>1 - \text{TV}(x,y)</math> where <math>\text{TV}</math> is the Total Variation distance. However, suppose we weren't just concerned with maximizing that particular pair, suppose we would like to maximize the collision probability of any arbitrary pair. One could construct an infinite number of random variables one for each distribution <math>x</math>, and seek to maximize <math>\Pr[X=Y]</math> for all pairs <math>x,y</math>. In a fairly strong sense described below, the Probability Jaccard Index is an optimal way to align these random variables.

For any sampling method <math>G</math> and discrete distributions <math>x,y</math>, if <math>\Pr[G(x) = G(y)] > J_\mathcal{P}(x,y)</math> then for some <math>z</math> where <math>J_\mathcal{P}(x,z)>J_\mathcal{P}(x,y)</math> and <math>J_\mathcal{P}(y,z)>J_\mathcal{P}(x,y)</math>, either <math>\Pr[G(x) = G(z)] < J_\mathcal{P}(x,z)</math> or <math>\Pr[G(y) = G(z)] < J_\mathcal{P}(y,z)</math>. cite an IBM Technical Report as the seminal reference.

In "A Computer Program for Classifying Plants", published in October 1960, a method of classification based on a similarity ratio, and a derived distance function, is given. It seems that this is the most authoritative source for the meaning of the terms "Tanimoto similarity" and "Tanimoto Distance". The similarity ratio is equivalent to Jaccard similarity, but the distance function is not the same as Jaccard distance.

Tanimoto's definitions of similarity and distance

In that paper, a "similarity ratio" is given over bitmaps, where each bit of a fixed-size array represents the presence or absence of a characteristic in the plant being modelled. The definition of the ratio is the number of common bits, divided by the number of bits set (i.e. nonzero) in either sample.

Presented in mathematical terms, if samples X and Y are bitmaps, <math>X_i</math> is the ith bit of X, and <math> \land , \lor </math> are bitwise and, or operators respectively, then the similarity ratio <math>T_s</math> is

: <math> T_s(X,Y) = \frac{\sum_i ( X_i \land Y_i)}{\sum_i ( X_i \lor Y_i)}</math>

If each sample is modelled instead as a set of attributes, this value is equal to the Jaccard index of the two sets. Jaccard is not cited in the paper, and it seems likely that the authors were not aware of it.

Tanimoto goes on to define a "distance" based on this ratio, defined for bitmaps with non-zero similarity:

: <math>T_d(X,Y) = -\log_2 ( T_s(X,Y) ) </math>

This coefficient is, deliberately, not a distance metric. It is chosen to allow the possibility of two specimens, which are quite different from each other, to both be similar to a third. It is easy to construct an example which disproves the property of triangle inequality.

Other definitions of Tanimoto distance

Tanimoto distance is often referred to as a synonym for Jaccard distance <math>1-T_s</math>. This function is a proper distance metric. In application, Tanimoto distance can be harmfully confused with Jaccard distance as a proper distance metric.

If Jaccard or Tanimoto similarity is expressed over a bit vector, then it can be written as

: <math>f(A,B) =\frac{ A \cdot B}{\|A\|^2 +\|B\|^2 - A \cdot B}</math>

where the same calculation is expressed in terms of vector scalar product and magnitude. This representation relies on the fact that, for a bit vector (where the value of each dimension is either 0 or 1) then

:<math>A \cdot B = \sum_i A_iB_i = \sum_i ( A_i \land B_i)</math>

and

:<math>\|A\|^2 = \sum_i A_i^2 = \sum_i A_i.</math>

This is a potentially confusing representation, because the function as expressed over vectors is more general, unless its domain is explicitly restricted. Properties of <math> T_s </math> do not necessarily extend to <math>f</math>. In particular, the difference function <math>1-f</math> does not preserve triangle inequality, and is not therefore a proper distance metric, whereas <math>1 - T_s </math> is.

There is a real danger that the combination of "Tanimoto Distance" being defined using this formula, along with the statement "Tanimoto Distance is a proper distance metric" will lead to the false conclusion that the function <math>1-f</math> is in fact a distance metric over vectors or multisets in general, whereas its use in similarity search or clustering algorithms may fail to produce correct results.

Lipkus

Application to computer science and graph theory

In computer science, the Jaccard Index can be used to measure similarity between vertices in a graph by comparing their adjacency sets. Given two vertices, their similarity is calculated by taking the size of the intersection of their neighborhoods divided by the size of their union. This measure is widely applied in Link prediction, community detection, and graph classification, where it serves as a guide to estimate the likelihood of an edge forming between two nodes in networks.

The computation of the Jaccard similarity between two vertices can be expressed using their adjacency sets, as shown below.