Data Structures

Direct Access Table

Hash Table is improvement over DAT
Direct Access Table DAT? Take huge array and use phone numbers as index in the array. if phone number is not present entry is empty, else the array entry stores pointer to records corresponding to phone number.
Searching in DA: O(1) but HUGE Extra Space required.

What is Hash Table

Data structure that associates keys with values. <key, value>
Supports constant time lookups. Average Case: O(1), Worst case: O(n)
Hash Function? Hash function maps a big number or string to a small integer that can be used as index in hash table.


key -> |Hash Function| -> index of array/table

Advantages of Hash Table: Search, Delete, Insert: O(1)=Average. O(n)=During resize
Disadvantages:
1. Elements are not sorted
2. Rehashing: Once all HT entries are filled it needs resized/rehashing which is a time-consuming operation. Let HT size = 100, we want to insert 101st element. Not only the size of hash table is enlarged to 150, all element in hash table have to be rehashed. This insertion operation takes O(n).

Internal Implementation of Hash Table

Let use consider simple example to insert following key,value pairs: {"abc",1}, {"bac",2}, {"cab",3}
Keys("abc", "bac", "cab") are passed to hashfunction and indexes are returned in bucket which acts as hash table
Collision: if same index is returned, chaining is done, ie element is inserted into next position in vector

Hash table


key value
abc   1
bac   2
zac   3


struct Entry {
    std::string key;
    int value;
};
class HashTable {
    static const int CAP = 4;
    std::vector<Entry> buckets[CAP];
    int hashFunction (const std::string &s) {
        // Very simple hash
        unsigned long h = 0;
        for (char c : s) {
            h = h * 31 + (unsigned char)c;
        }
        return (int)(h % CAP);      //index into bucket
    }
public:
    void insertIntoHashTable(const std::string &key, int value) {
        int idx = hashFunction (key);
        auto &bucket = buckets[idx];

        // update if key exists
        for (auto &e : bucket) {
            if (e.key == key) {
                e.value = value;
                return;
            }
        }
        // otherwise insert
        bucket.push_back(Entry{key, value});
    }
};

How hashFunction() works


    s = parameter = "abc"
i   value           h 
0   a=97        0x31 + 97 = 97
1   b=98        97x31 + 98 = 3105
2   c=99        3105x31 + 99 = 96354
return 96354 % 4 = 2  //index

Hash Table
| , | , |"abc",1| , |
  0   1     2     3  index

How insertIntoHashTable() works


key     value  index(generated by hashFunction())
"abc"     1     2
"bac"     2     0
"cab"     3     0

Hash Table
index 0: [("bac",2), ("cab",3)]   //Chaining on Collision
index 1: []
index 2: [("abc",1)]
index 3: []

Hash Collision

This means hash function provides same index for 2 different keys. See example provided above in Internal Implementation of Hash Table
2 methods to solve Hash Collission problem:

1. Open Addressing

All elements are stored in HT itself. Once same hash is derived, insert element in hash table itself no seperate chains. Types of Open addressing
1a. Linear/Sequential probing: Once same hash is derived, inserts the new item in the next open spot in the table ie next to already existent element with same hash.
Deletion: is tricky here removing one element might break a chain of insertions, making some elements inaccessible. We need to reinsert all the items into new holes.
1b. Quadratic Probing: look for i²th slot in i’th iteration. New hash function = (xmod7 + i²)%hash_table_size
1c. Double Hashing: use double hash function to re-calculate the hash if collision occurs. In case of collision: hash1(x) = (hash1(x) + i*hash2(x))%hash_table_size

2. Seperate Chaining

Each cell of HT point to a DOUBLY LINKED LIST of records that have same hash function value. This requires additional memory outside the table.
Insert(x): O(1). Insert at head of LL
Search(x): O(n). Need to search complete list
Delete(x): O(1). Assumed we are at element in chain, we delete x. copy prev to next pointer. In singly LL we need O(n) time to search prev node.
Advantages: Simple implementation, Space will never exhaust, Less sensitive to hash function
Disadvantages: Once LL/chain grows long, performance will degrade search Time=O(n) Space wastage, some parts of hash table may never be used

Rehashing

Let us consider unordered_map<int,string> storing unique keys. At start of program sizeof hash table=3


  Key | Value
  ----------
  01  | amit
  02  | never
  03  | give

Now, (4, up) need to be stored, but hash table has no space so size of hash table is increased to 6. (old Hash function = %3) we can only goto index number=2.
But we want to reach 5. Hence Hash function is changed (new Hash function = %6). So hash is again calculated for existing values.

Implementations

2-left hashing

2 equal sized hash tables having 2 seperate hash functions are used. Seperate chanining is used in both. Table1(hash1), Table2(hash2).
Cuckoo Hashing is a type of 2-left hashing
Inserting key into table: Always keep Table1 loaded, key is inserted into Table2 only when too many collisions happen on Table1.
Advantages? 1. On parallel systems. 2 cores can query 2 hash tables, ie Thread-1 can query Table1 and Thread2 Table2.

d-left Hashing

Split 1 hash table into d blocks. Similar to 2-left hashing, left most hashtable should be loaded.
d buckets, n entries. Each bucket has n/d entries

Code

1. Student Hash table


Hash table stores Student data (Key:Value)
(1,20)  (2,70)  (42,80)  (4,25)  (12,44)  (13,78)  (14,32) (17,11)  (37,97)

2. Leetcode Solved. Design a HashMap without using any built-in hash table libraries.