Hashing

From MusiKiwi

Jump to: navigation, search

Given a key k, a good hash function h(k) should map to a unique index in table T. The hash function should exhibit uniform distribution regardless of the domain of k, therefore, the determination of where the next value will go in T should be as random as possible. If h(k) does not posess this property, the likelihood that two unique keys k1 and k2 will hash to the same index is severely increased and a collision might occur. For example, if we try to add k2 (with the same hash as k1) to T but k1 has already been added, a decision needs to be made as to where the offending key should be placed. The typical scheme to is to probe around in a methodical way to locate a vacancy in T. This probing must be methodical because it needs to be repeatable for the hashing function to re-locate this item. For this reason, once an element has been added to a hash table, it cannot be deleted unless the entire table is rehashed.

Double hashing

The double hashing function given by Cormen et al produces a somewhat pseudo-random search for vacant spots in T should a collision occur:

h(k,i)=\left(h_1(k)+ih_2(k)\right)\text{mod}\;m

where


\begin{align}
h_1(k)&=k\;\text{mod}\;m \\
h_2(k)&=1 + \left(k\;\text{mod}\;m^\prime \right)
\end{align}

and {m}\,\! is prime.

The value {m^\prime}\,\! should be set to {m-1}\,\!, which will produce {m}\left(m-1\right)\,\! unique probe sequences. This means that any given pair \left(h_1(k), h_2(k)\right) will only repeat on the order of \Theta(m^2)\,\!. Additonally, each probe sequence is a unique permutation of \{0, 1, 2, ..., m-1\}\,\!. So while this guarantees that every possible vacancy can be located, it also minimizes primary clustering typical of linear probing methods. Because double hashing uses two hash functions, it also does not suffer from secondary clustering to the extent that quadratic probing does.

On the initial hash of k, the value h1(k) is computed. If there is a collision, then the next probe location is calculated by adding h2(k) to the previous position, and taking the (modulo m) of this sum. The process of calculating (previous_pos + h2(k))(mod m) is repeated until a vacant spot is located. If the calculation is performed m-1 times, then the probe sequence given by this function guarantees that the table is full.

The load factor is given by \alpha=\frac{n}m\,\!, where n is the number of elements currently in T. If this is kept under 0.5, then the average maximum number of probes required will be 2, assuming the hash function is uniform. In this case, the running time for typical hash operations is still O(n).

References

  • T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein. Introduction to Algorithms, 2nd ed. Cambridge, Mass: MIT Press, 2001. Print.
  • M.A. Weiss. Data Structures and Algorithm Analysis in Java, 2nd ed. Boston: Peason Addison-Wesley, 2007. Print.
Personal tools