COSC 2007 TABLES AND HASHING 1. Tables are different from sequences in that they are dependent upon the index. Given the value of index you return the record, i.e. we have a function with index set as the domain and set of records as the co-domain. Example: Suppose we have 35 students numbered from 1..35 we can access record for student with the number i by just indexing into the A: array [1..35] of item, where item is the student record type. Access time is constant, i.e. O(1) less than O(log n). 2. What happens when the key can take very large number of values, but total number of records is relatively small (sparse table)? Trade off between memory (wastage) and search time The trade off may not be reasonable Can we still use some kind of function? The answer is yes, a hash function. Hash function will map a key to smaller set of indices. more than one key will map to the same index. but if the table is sparse such a collision will not be frequent and will cost relatively small amount of time in further searching. 3. Requirement of Table ADT. Constructor void insert(const hmmm& item) void erase(const hmmm& item) void search(hmmm& item, int& found) Destructor Declare one dimensional array to hold hash table H. Initialize hash table to Null values to indicate empty location (a value that will definitely not be part of a valid record, such as 0 or -1) Insertion: calculate hash function value for the key Hash(Key) and check the location H[Hash(key)] if it is not empty then if the key of existing record, H[Hash(key)].key, is the same as the new record key, quit. Because a key cannot be duplicated. If the keys are different, we have a collision which will have to be resolved. Search: Go to the location H[Hash(Key)] if H[Hash(key)].key matches key then that is the record we want other wise take the steps taken in collision resolution to find the actual record. 4. Choosing hash function Truncate (take first, third and seventh digit example: 975498017 -> 705) Fold (add groups of three, three, two digits, example: 37289938 -> 372+899+38) modulus: key % hashsize, you may have to combine it with other hash function QUIZ: Write a descendant of string called key_type which has an additional function called hash() that returns the sum of the ascii values of all the characters in the string. Assume that string has a member function called length(); class key_type : public string { public: int hash(); }; int key_type::hash() { int sum=0; for(int j = 0; j < length(); j++) sum += int((*this)[j]); return sum; } struct table_item { key_type key; double value; friend int operator==(const table_item& l, const table_item& r) {return (l.key == r.key);} }; Collision Resolution Since more than one key can map to the same index value, we will definitely get collisions, i.e., for insertion the location H[r.key.hash()] is already occupied and for retrieval H[searchitem.key.hash()].key does not match searchitem.key. There are two different approaches. 1. Open addressing 2. Chaining 1. Open addressing: Look for another open location. The open location can be found by doing a linear search (linear probing), use wrap around and come to the first location in the hash table in case there is no location ahead of the current index. See table1ma.pas. using second hash function (rehashing), if there is a collision use third hash function and so on. quadratic probing keep on looking at current := current + i2 index, where i is the number of attempts to find the open location. use an increment dependent on the key such as increment := int(k[1]); and check the current := current + increment location. Use pseudo-random number to generate the increment. The seed will depend upon the key. If we use open addressing, deletion is a problem, because we cannot just locate the record r and make its key-type equal to nil. There may be other records, whose key may have collided with the r.key and hence it may be in some other location. Making H[r.key.hash()].key nil will indicate that there are no more records with the same hash value which is not true. 2. Chaining: Use an array of lists to records draw a figure There are several advantages to chaining. If there is no record for a given hash value, we only store a pointer which takes a lot smaller space than the entire record. collision resolution for insert is easy, just add the new record to the linked list. overflow is not a problem deletion is easy, just delete the record from the linked list. One of the disadvantages is that we have to allocate extra memory to the links. template class table_type { ordered_list_type H[size]; // could have been list_type public: void insert(const hmmm& item) {H[item.key.hash()%size].insert(item);} void erase(const hmmm& item) {H[item.key.hash()%size].erase(item);} void search(hmmm& item, int& found) {H[item.key.hash()%size].search(item,found);} }; Assignment #6 Prefix calculator from assignment #5 Create a descendant of bexprtree called newexprtree. Let us not make it template. evaluate and evaluator accept one additional parameter called table_type by constant reference. class newexprtree : bexprtree { double evaluator(bnode*& root, const table_type& t); public: double evaluate(const table_type& t); }; The evaluate double newexprtree::evaluate(const table_type& t) { return evaluator(root,t); } Note that the previous definition is not a template The code for the evaluator is same as before except change all occurrences of type to string since it is not a template anymore. Replace else return double(item) as follows: else if(`0' <= item[0] <= `9') return double(item); else { table_item titem; titem.key = item; t.search(titem,found); if(found) return titem.value; return 0; } Put the calculations in a loop which continues until someone types # by itself on the line getexpression(prefix); while(prefix[1] != "#") { if(prefix[2] == "=") { table_item variable; variable.key = prefix[1]; variable.value = double(prefix[3]); tab.insert(variable); } else { newexprtree x; // why declared here instead of at the beginning? x.buildtree(prefix); cout << x.evaluate(tab) << endl; // note evaluate is different } } Comparison of different searching methods We have considered various methods of searching: sequential search, binary search, table lookup, hashing. sequential search … linear time O(n) binary search … logarithmic time O(log n) Table lookup … Constant time O(1) Hashing … In the worst case, same as sequential search. But average time requirement depends upon the probability of collision. Which is most appropriate depends upon the application If we have to deal with lists (tables), we are restricted to sequential and binary search (table lookup or hashing). In many cases, we have a choice between list and table, then we can use any of the four methods. Table lookup is fast but not practical for sparse tables. sequential search is slowest but requires less preparation. binary search needs the data in sorted order. hashing requires peculiar ordering of keys so that the retrieval can be faster. Such ordering may not be useful for any other purpose.