In computer programming, primary clustering is a phenomenon that causes performance degradation in linear-probing hash tables. The phenomenon states that, as elements are added to a linear probing hash table, they have a tendency to cluster together into long runs (i.e., long contiguous regions of the hash table that contain no free slots). If the hash table is at a load factor of <math>1 - 1/x</math> for some parameter <math>x \ge 2
</math>, then the expected length of the run containing a given element <math>u</math> is <math>\Theta(x^2)</math>. This causes insertions and negative queries to take expected time <math>\Theta(x^2)</math> in a linear-probing hash table.
Causes of primary clustering
Primary clustering has two causes:
- Winner keeps winning: The longer that a run becomes, the more likely it is to accrue additional elements. This causes a positive feedback loop that contributes to the clustering effect. However, this alone would not cause the quadratic blowup.
- Joining of runs: A single insertion may not only increase the length of the run that it is in by one, but may instead connect together two runs that were already relatively long. This is what causes the quadratic blowup in expected run length.
Effect on performance
Primary clustering causes performance degradation for both insertions and queries in a linear probing hash table. Insertions must travel to the end of a run, and therefore take expected time <math>\Theta(x^2)</math>. However, as noted by Knuth, typically citing Knuth. (often referred to as Robin Hood hashing) is a technique for reducing the effects of primary clustering on queries. Ordered linear probing sorts the elements within each run by their hash. Thus, a query can terminate as soon as it encounters any element whose hash is larger than that of the element being queried. This results in both positive and negative queries taking expected time <math>O(x)
</math>.
Graveyard hashing is a variant of ordered linear probing that eliminates the asymptotic effects of primary clustering for all operations.
