hashing collision resolution types
Normally, there are 2 ways to implement map:
1 HashMap : uses hash table to implement map, which make constant operation time
2 TreeMap : uses balance search tree to implement map, which make logarithmic operation time ( in C++ STL , red-black tree is employed to implement map and set )
What is Collision ?
Since all hash functions are not perfect hashes, there will be few inputs which are mapped to same hash values. This is called Collision.
Ways to resolve collision ?
2 ways – Open Hashing & Closed Hashing.
Open Hashing (Separate chaining)- The simplest form of open hashing defines each slot in the hash table to be the head of a linked list. All records that hash to a particular slot are placed on that slot’s linked list.
Records within a slot’s list can be ordered in several ways: by insertion order, by key value order, or by frequency-of-access order. Ordering the list by key value provides an advantage in the case of an unsuccessful search, because we know to stop searching the list once we encounter a key that is greater than the one being searched for. If records on the list are unordered or ordered by frequency, then an unsuccessful search will need to visit every record on the list.
In the case where a list is empty or has only one record, a search requires only one access to the list. Thus, the average cost for hashing should be Θ(1). However, if clustering causes many records to hash to only a few of the slots, then the cost to access a record will be much higher because many elements on the linked list must be searched.
Open hashing is most appropriate when the hash table is kept in main memory, with the lists implemented by a standard in-memory linked list. Storing an open hash table on disk in an efficient way is difficult, because members of a given linked list might be stored on different disk blocks. This would result in multiple disk accesses when searching for a particular key value, which defeats the purpose of using hashing.
Closed hashing(Open addressing) – finds the next free spot in the Hash table.A closed hashing implementation is one in which the elements stay in the array rather than being placed in an auxiliary collision set, such as a linked list.The implication of closed hashing is that collisions must be resolved by finding alternative array locations. These alternative locations are obtained by “probing”
The difference between the two is whether collisions are stored outside the table (open hashing), or whether collisions result in storing one of the records at another slot in the table (closed hashing).
The ideal behavior for a collision resolution mechanism is that each empty slot in the table will have equal probability of receiving the next record inserted (assuming that every slot in the table has equal probability of being hashed to initially). In this example, assume that the hash function gives each slot (roughly) equal probability of being the home position for the next key. However, consider what happens to the next record if its key has its home position at slot 0. Linear probing will send the record to slot 2. The same will happen to records whose home position is at slot 1. A record with home position at slot~2 will remain in slot 2. Thus, the probability is 3/10 that the next record inserted will end up in slot 2.
However, only records hashing to slot 3 will be stored in slot 3, yielding one chance in ten of this happening.Thus, the resulting probabilities are not equal.
To make matters worse, if the next record ends up in slot 9 (which already has a higher than normal chance of happening), then the following record will end up in slot 2 with probability 6/10. This is illustrated in the right side of the figure. This tendency of linear probing to cluster items together is known as primary clustering. Small clusters tend to merge into big clusters, making the problem worse. The objection to primary clustering is that it leads to long probe sequences.
Improved Collision Resolution Methods:
One possible improvement might be to use linear probing, but to skip slots by some constant c other than 1.Constant c must be relatively prime to M to generate a linear probing sequence that visits all slots in the table (that is, c and M must share no factors). For a hash table of size M = 10, if c is any one of 1, 3, 7, or 9, then the probe sequence will visit all slots for any key. When M = 11, any value for c between 1 and 10 generates a probe sequence that visits all slots for every key.
Another probe function that eliminates primary clustering is called quadratic probing. Here the probe function is some quadratic function p(K,i)=c1i2+c2i+c3 for some choice of constants c1, c2, and c3.
There is one problem with quadratic probing: Its probe sequence typically will not visit all slots in the hash table.For many hash table sizes, this probe function will cycle through a relatively small number of slots. If all slots on that cycle happens to be full, this means that the record cannot be inserted at all! Consider an example of a table with 105 slots. The probe sequence starting from any given slot will only visit 23 other slots in the table. If all 24 of these slots should happen to be full, even if other slots in the table are empty, then the record cannot be inserted because the probe sequence will continually hit only those same 24 slots.
Select hashtable so that hash table never gets above about half full.
If the hash function generates a cluster at a particular home position, then the cluster remains under pseudo-random and quadratic probing. This problem is called secondary clustering.
A simple technique for doing this is to return to linear probing by a constant step size for the probe function, but to have that constant be determined by a second hash function, h2. Thus, the probe sequence would be of the form p(K, i) = i * h2(K). This method is called double hashing
In summary, a properly tuned hashing system will return records with an average cost of less than two record accesses. This makes it the most effective way known to store a database of records to support exact-match queries.
Unfortunately, hashing is not effective when implementing range queries, or answering questions like “Which record in the collection has the smallest key value?”
Dynamic hashing is a hash table that grows to handle more items. The associated hash function must change as the table grows. Some schemes may shrink the table to save space when items are deleted.
BST over hash ?
If you want operations involving any ordering among elements – min, max, predecessor or successor, you would use a BST. If not you can go for hash.
Hashtable can also help in reducing memory as it will be avoiding the use of pointers(assuming limited collisions).