Double hashing

Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is a classical data structure on a table <math>T</math>.

The double hashing technique uses one hash value as an index into the table and then repeatedly steps forward by a fixed interval until the desired value is located, an empty location is reached, or the entire table has been searched; this interval is set by a second, independent hash function. Unlike the alternative collision-resolution methods of linear probing and quadratic probing, the interval depends on the data, so that values mapping to the same location have different bucket sequences; this reduces repeated collisions and the effects of clustering.

Given two random, uniform, and independent hash functions <math>h_1</math> and <math>h_2</math>, the <math>i</math>th location in the bucket sequence for value <math>x</math> in a hash table of <math>|T|</math> buckets is: <math>h(i,x)=(h_1(x) + i \cdot h_2(x))\bmod|T|.</math> The locations can be conveniently calculated by incrementing the previous hash by <math>h_2(x)</math>, i.e. <math>h(i+1,x)=(h(i,x) + h_2(x))\bmod|T|.</math>
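The incremental form of the bucket sequence can be sketched in C as follows. This is a minimal illustration, not a reference implementation: the function name <code>dh_find_slot</code>, the <code>EMPTY_SLOT</code> sentinel, and the convention that the caller passes in the already-computed values of <math>h_1(x)</math> and <math>h_2(x)</math> are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define EMPTY_SLOT (-1) /* hypothetical sentinel: stored keys are assumed non-negative */

/* Walks the bucket sequence h(i, x) = (h1 + i*h2) mod m incrementally.
 * h1v and h2v stand for the precomputed values h1(x) and h2(x).
 * Returns the index holding `key`, or the first empty bucket on its
 * sequence, or m if m probes found neither. */
size_t dh_find_slot(const int table[], size_t m, int key, size_t h1v, size_t h2v)
{
    assert(m > 0 && m <= (size_t)-1 / 2);
    size_t pos = h1v % m;
    size_t step = h2v % m;
    for (size_t i = 0; i < m; i++) {
        if (table[pos] == key || table[pos] == EMPTY_SLOT)
            return pos;
        pos += step;      /* h(i+1, x) = (h(i, x) + h2(x)) mod m */
        if (pos >= m)     /* pos and step are both below m, so one subtraction suffices */
            pos -= m;
    }
    return m;
}
```

For instance, in a table of 7 buckets where only bucket 3 is occupied, a different key with <math>h_1 = 3</math> and <math>h_2 = 2</math> collides at bucket 3 and is directed to bucket 5.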

Generally, <math>h_1</math> and <math>h_2</math> are selected from a set of universal hash functions; <math>h_1</math> is selected to have a range of <math>\{0,\ldots,|T|-1\}</math> and <math>h_2</math> to have a range of <math>\{1,\ldots,|T|-1\}</math>. Double hashing approximates a random distribution; more precisely, pairwise independent hash functions yield a probability of <math>(n/|T|)^2</math> that any pair of keys will follow the same bucket sequence.
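To make the two ranges concrete, the sketch below uses simple division hashing, a common textbook choice that is assumed here purely for illustration. The names <code>h1_div</code> and <code>h2_div</code> are hypothetical, the table size <code>m</code> is assumed to be prime, and these functions are not drawn from a universal family.

```c
#include <assert.h>
#include <stddef.h>

/* Primary hash with range 0 .. m-1. */
size_t h1_div(size_t key, size_t m)
{
    assert(m >= 2);
    return key % m;
}

/* Secondary hash with range 1 .. m-1, so the probe interval is never zero.
 * When m is prime, every value in this range is relatively prime to m. */
size_t h2_div(size_t key, size_t m)
{
    assert(m >= 2);
    return 1 + key % (m - 1);
}
```

With <code>m = 13</code>, the key 14 gets the starting bucket 1 and the interval 3.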

Selection of h<sub>2</sub>(x)

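The secondary hash function <math>h_2(x)</math> should have several characteristics. Two of them are numerical restrictions: first, it should never evaluate to zero, which the range given above for <math>h_2</math> already guarantees, because a zero interval would probe the same bucket repeatedly; second, its value should be relatively prime to <math>|T|</math>, so that the bucket sequence cycles through the whole table.

Let <math>\alpha = n/|T|</math> be the load factor of a table holding <math>n</math> keys. It was proved in 1978 that, if <math>h_1</math> and <math>h_2</math> are uniformly random, and <math>\alpha < 0.319</math>, then the expected time per operation is <math>O(1)</math>. Subsequent work by Lueker and Molodowitch proved a bound of <math>1/(1-\alpha)</math> for any <math>\alpha</math>, and established that the behavior of the hash table can be directly coupled to that of a standard random-probing based solution. More recently, in 2007, Bradford and Katehakis showed that using universal hash functions, rather than fully random ones, suffices to get a <math>1/(1-\alpha)</math> bound.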
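As a rough worked illustration of the <math>1/(1-\alpha)</math> bound, take <math>\alpha = n/|T|</math> to be the load factor and read the bound as an expected number of probes (an interpretation assumed here for the sake of the example):

<math display="block">\alpha = 0.5:\ \frac{1}{1-0.5} = 2, \qquad \alpha = 0.75:\ \frac{1}{1-0.75} = 4, \qquad \alpha = 0.9:\ \frac{1}{1-0.9} = 10.</math>

The bound grows without limit as <math>\alpha</math> approaches 1.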

Like all other forms of open addressing, double hashing degrades toward linear search time as the hash table approaches maximum capacity. The usual heuristic is to limit the table loading to 75% of capacity. Eventually, rehashing into a larger table becomes necessary, as with all other open addressing schemes.
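The 75% heuristic can be checked with integer arithmetic before each insertion. The helper below is a hypothetical sketch: the name <code>dh_should_grow</code> and the decision to test the load after the pending insertion are assumptions of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if storing one more key in a table of m buckets that already
 * holds n keys would push the load factor above 3/4, and 0 otherwise.
 * Uses (n + 1) / m > 3/4  <=>  4 * (n + 1) > 3 * m to avoid floating point;
 * n and m are assumed small enough that the products do not overflow. */
int dh_should_grow(size_t n, size_t m)
{
    assert(m > 0);
    return 4 * (n + 1) > 3 * m;
}
```

A caller would typically allocate a larger table and reinsert every key when this returns 1.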

Variants

Peter Dillinger's PhD thesis points out that double hashing produces unwanted equivalent hash functions when the hash functions are treated as a set, as in Bloom filters: if <math>h_2(y) = -h_2(x)</math> and <math>h_1(y) = h_1(x) + k\cdot h_2(x)</math>, then <math>h(i, y) = h(k - i, x)</math> and the sets of hashes <math>\left\{h(0, x), \ldots, h(k, x)\right\} = \left\{h(0, y), \ldots, h(k, y)\right\}</math> are identical. This makes a collision twice as likely as the hoped-for <math>1/|T|^2</math>.
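The mirrored sequence described above can be verified numerically. The following sketch builds the hash values of <math>y</math> from those of <math>x</math> exactly as in the stated condition and checks that <math>h(i, y) = h(k - i, x)</math> for every <math>i</math>. The function names are hypothetical, and the table size is assumed to fit in 32 bits so that the intermediate product cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* h(i, x) = (h1 + i*h2) mod m, with m assumed to fit in 32 bits. */
uint64_t dh_probe(uint64_t h1v, uint64_t h2v, uint64_t i, uint64_t m)
{
    assert(m > 0 && m <= UINT32_MAX);
    return (h1v % m + (i % m) * (h2v % m)) % m;
}

/* Constructs h1(y) = h1(x) + k*h2(x) and h2(y) = -h2(x), both mod m, and
 * returns 1 if h(i, y) == h(k - i, x) holds for all i in 0..k. */
int dh_mirror_holds(uint64_t h1x, uint64_t h2x, uint64_t k, uint64_t m)
{
    uint64_t h2y = (m - h2x % m) % m;        /* additive inverse of h2(x) mod m */
    uint64_t h1y = dh_probe(h1x, h2x, k, m); /* last element of x's sequence */
    for (uint64_t i = 0; i <= k; i++)
        if (dh_probe(h1y, h2y, i, m) != dh_probe(h1x, h2x, k - i, m))
            return 0;
    return 1;
}
```

For example, with <math>|T| = 17</math>, <math>h_1(x) = 5</math>, <math>h_2(x) = 3</math> and <math>k = 4</math>, the sequence for <math>x</math> is 5, 8, 11, 14, 0, and the constructed <math>y</math> visits 0, 14, 11, 8, 5.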

There are additionally a significant number of mostly-overlapping hash sets; if <math>h_2(y) = h_2(x)</math> and <math>h_1(y) = h_1(x) \pm h_2(x)</math>, then <math>h(i, y) = h(i\pm 1, x)</math>, and comparing additional hash values (expanding the range of <math>i</math>) is of no help.

Triple hashing

Adding a third hash as a quadratic term (triple hashing) makes the overlap much less likely, since equivalence classes now need to be generated by a collaboration of both <math>h_2(x)</math> and <math>h_3(x)</math>, at a cost of 50% more calculations due to the added hash function. Choices for the factor multiplying <math>h_3(x)</math> include <math>i^2</math> and the triangular numbers <math>i(i\pm1)/2</math>. The added hash function should obey the same requirements as <math>h_2(x)</math>.

Enhanced double hashing

Adding a cubic term such as the tetrahedral number <math>(i^3-i)/6</math> to the double hashing formula does solve the problem, a technique known as enhanced double hashing. The tetrahedral number can be computed efficiently by forward differencing:

<syntaxhighlight lang="c">
struct key;	/// Opaque
/// Replace "unsigned int" with other types as needed. (Must be unsigned for guaranteed wrapping.)
typedef unsigned int hashfunc(struct key const *);
extern hashfunc h1, h2;

/// Calculate n hash values from two underlying hash functions
/// h1() and h2() using enhanced double hashing.  On return,
/// hashes[i] = h1(x) + i*h2(x) + (i*i*i - i)/6.
/// Takes advantage of automatic wrapping (modular reduction)
/// of unsigned types in C.
void ext_dbl_hash(struct key const *x, unsigned int hashes[], unsigned int n)
{
	unsigned int a = h1(x), b = h2(x), i;

	if (n == 0)
		return;	// Nothing to fill; avoids writing hashes[0] out of bounds
	hashes[0] = a;
	for (i = 1; i < n; i++) {
		a += b;	// Add quadratic difference to get cubic
		b += i;	// Add linear difference to get quadratic
		       	// i++ adds constant difference to get linear
		hashes[i] = a;
	}
}
</syntaxhighlight>

In addition to rectifying the collision problem, enhanced double hashing also removes double-hashing's numerical restrictions on <math>h_2(x)</math>'s properties, allowing a hash function similar in property to (but still independent of) <math>h_1</math> to be used. (Using the numbering in § Selection, the first two requirements are removed.)
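The forward-differencing recurrence can be checked against the closed form <math>h_1(x) + i\cdot h_2(x) + (i^3-i)/6</math>. The sketch below is an illustration under stated assumptions: the names <code>edh_closed_form</code> and <code>edh_fill</code> are hypothetical, the two hash values are passed in directly rather than computed from a key, and the arithmetic is fixed at 32 bits so that wrapping is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Closed form a + i*b + (i^3 - i)/6, reduced mod 2^32 by unsigned wrapping.
 * i is assumed to be below 2^21 so that i^3 fits in 64 bits. */
uint32_t edh_closed_form(uint32_t a, uint32_t b, uint32_t i)
{
    assert(i < (1u << 21));
    uint64_t cubic = (uint64_t)i * i * i - i; /* (i-1)*i*(i+1), divisible by 6 */
    return (uint32_t)(a + i * b + (uint32_t)(cubic / 6));
}

/* Forward differencing: fills out[0..n-1] using additions only. */
void edh_fill(uint32_t a, uint32_t b, uint32_t out[], uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        out[i] = a;
        a += b;     /* running sum now holds the value for index i + 1 */
        b += i + 1; /* difference grows by the next triangular increment */
    }
}
```

With <math>h_1(x) = 7</math> and <math>h_2(x) = 3</math>, both routes give 7, 10, 14, 20 for the first four values.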