# Hashing

### From MusiKiwi

Given a key *k*, a good hash function *h(k)* should map to a unique index in table **T**. The hash function should exhibit uniform distribution regardless of the domain of *k*, therefore, the determination of where the next value will go in **T** should be as random as possible. If *h(k)* does not posess this property, the likelihood that two unique keys *k1* and *k2* will hash to the same index is severely increased and a collision might occur. For example, if we try to add *k2* (with the same hash as *k1*) to **T** but *k1* has already been added, a decision needs to be made as to where the offending key should be placed. The typical scheme to is to probe around in a methodical way to locate a vacancy in **T**. This probing must be *methodical* because it needs to be *repeatable* for the hashing function to re-locate this item. For this reason, once an element has been added to a hash table, it cannot be deleted unless the entire table is rehashed.

## Double hashing

The double hashing function given by Cormen et al produces a somewhat pseudo-random search for vacant spots in **T** should a collision occur:

where

and is **prime**.

The value should be set to , which will produce unique probe sequences. This means that any given pair will only repeat on the order of . Additonally, each probe sequence is a unique permutation of . So while this guarantees that every possible vacancy can be located, it also minimizes **primary clustering** typical of **linear probing** methods. Because **double hashing** uses two hash functions, it also does not suffer from **secondary clustering** to the extent that **quadratic probing** does.

On the initial hash of *k*, the value *h1(k)* is computed. If there is a collision, then the next probe location is calculated by adding *h2(k)* to the previous position, and taking the (modulo **m**) of this sum. The process of calculating (previous_pos + *h2(k)*)(*mod* m) is repeated until a vacant spot is located. If the calculation is performed *m-1* times, then the probe sequence given by this function guarantees that the table is full.

The **load factor** is given by , where *n* is the number of elements currently in **T**. If this is kept under 0.5, then the average maximum number of probes required will be 2, assuming the hash function is uniform. In this case, the running time for typical hash operations is still *O(n)*.

## References

- T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein. Introduction to Algorithms, 2nd ed. Cambridge, Mass: MIT Press, 2001. Print.
- M.A. Weiss. Data Structures and Algorithm Analysis in Java, 2nd ed. Boston: Peason Addison-Wesley, 2007. Print.