COSC 2007
TABLES AND HASHING

1.	Tables are different from sequences in that they are dependent 
upon the index. Given the value of index you return the record, 
i.e. we have a function with index set as the domain and set of 
records as the co-domain.
  Example: Suppose we have 35 students numbered from 1..35 we can 
access record for student with the number i by just indexing 
into the A: array [1..35] of item, where item is the student 
record type.
  Access time is constant, i.e. O(1) less than O(log n).

2.	What happens when the key can take very large number of values, 
but total number of records is relatively small (sparse table)?
  Trade off between memory (wastage) and search time
  The trade off may not be reasonable
  Can we still use some kind of function?
  The answer is yes, a hash function.
  Hash function will map a key to smaller set of indices.
  more than one key will map to the same index.
  but if the table is sparse such a collision will not be frequent 
and will cost relatively small amount of time in further 
searching.

3.	Requirement of Table ADT.
	
	Constructor
	void insert(const hmmm& item)
	void erase(const hmmm& item)
	void search(hmmm& item, int& found)
	Destructor

  Declare one dimensional array to hold hash table H.
  Initialize hash table to Null values to indicate empty location 
(a value that will definitely not be part of a valid record, 
such as 0 or -1)
  Insertion: calculate hash function value for the key Hash(Key) 
and check the location H[Hash(key)] if it is not empty then if 
the key of existing record, H[Hash(key)].key, is the same as the 
new record key, quit. Because a key cannot be duplicated. If the 
keys are different, we have a collision which will have to be 
resolved.
  Search: Go to the location H[Hash(Key)] if H[Hash(key)].key 
matches key then that is the record we want other wise take the 
steps taken in collision resolution to find the actual record.

4.	Choosing hash function
  Truncate (take first, third and seventh digit example: 975498017 
-> 705)
  Fold (add groups of three, three, two digits, example: 37289938 
-> 372+899+38)
  modulus: key % hashsize, you may have to combine it with other 
hash function
QUIZ:	Write a descendant of string called key_type which has an 
additional function called hash() that returns the sum of the 
ascii values of all the characters in the string. Assume that 
string has a member function called length();
class key_type : public string
{
	public:
	int hash();
};
int key_type::hash()
{
	int sum=0;
	for(int j = 0; j < length(); j++)
		sum += int((*this)[j]);
	return sum;
}


struct table_item
{
	key_type key;
	double value;
	friend int operator==(const table_item& l, const table_item& r)
	{return (l.key == r.key);}
};
	
Collision Resolution

  Since more than one key can map to the same index value, we will 
definitely get collisions, i.e., 
  for insertion the location H[r.key.hash()] is already occupied 
and 
  for retrieval H[searchitem.key.hash()].key does not match 
searchitem.key.
  There are two different approaches.

1.	Open addressing
2.	Chaining

1.	Open addressing: Look for another open location. The open 
location can be found by
  doing a linear search (linear probing), use wrap around and come 
to the first location in the hash table in case there is no 
location ahead of the current index. See table1ma.pas.
  using second hash function (rehashing), if there is a collision 
use third hash function and so on.
  quadratic probing keep on looking at current := current + i2 
index, where i is the number of attempts to find the open 
location.
  use an increment dependent on the key such as
  increment := int(k[1]);
  and check the current := current + increment location.
  Use pseudo-random number to generate the increment. The seed 
will depend upon the key.
  If we use open addressing, deletion is a problem, because we 
cannot just locate the record r and make its key-type equal to 
nil. There may be other records, whose key may have collided 
with the r.key and hence it may be in some other location. 
Making H[r.key.hash()].key nil will indicate that there are no 
more records with the same hash value which is not true.

2.	Chaining: Use an array of lists to records draw a figure
  There are several advantages to chaining.
  If there is no record for a given hash value, we only store a 
pointer which takes a lot smaller space than the entire record.
  collision resolution for insert is easy, just add the new record 
to the linked list.
  overflow is not a problem
  deletion is easy, just delete the record from the linked list.
  One of the disadvantages is that we have to allocate extra 
memory to the links.


template <class hmmm, int size>
class table_type
{
	ordered_list_type<hmmm> H[size]; // could have been list_type
	public:
	void insert(const hmmm& item)
	{H[item.key.hash()%size].insert(item);}
	void erase(const hmmm& item)
	{H[item.key.hash()%size].erase(item);}
	void search(hmmm& item, int& found)
	{H[item.key.hash()%size].search(item,found);}
};

  Assignment #6
  Prefix calculator from assignment #5
  Create a descendant of bexprtree<string> called newexprtree. Let 
us not make it template.
  evaluate and evaluator accept one additional parameter called 
table_type<table_item > by constant reference.
 class newexprtree : bexprtree<string>
 {
 	double evaluator(bnode<string>*& root, const 
table_type<table_item>& t);
 	public:
 	double evaluate(const table_type<table_item>& t);
 };
 
 The evaluate
 
 double newexprtree::evaluate(const table_type<table_item>& t)
 {
 	return evaluator(root,t);
 }
 
  Note that the previous definition is not a template
  The code for the evaluator is same as before except change all 
occurrences of type to string since it is not a template 
anymore. Replace else return double(item) as follows:
 
 else if(`0' <= item[0] <= `9')
 	return double(item);
 else
 {
 	table_item titem;
 	titem.key = item;
 	t.search(titem,found);
 	if(found)
 		return titem.value;
 	return 0;
 }	 
  Put the calculations in a loop which continues until someone 
types # by itself on the line
 getexpression(prefix);
 while(prefix[1] != "#")
 {
 	if(prefix[2] == "=")
 	{
 		table_item variable;
 		variable.key = prefix[1];
 		variable.value = double(prefix[3]);
 		tab.insert(variable);
 	}
 	else
 	{
 		newexprtree<string> x; // why declared here instead of 
at the beginning?
 		x.buildtree(prefix);
 		cout << x.evaluate(tab) << endl;  // note evaluate is 
different
 	}
 }
Comparison of different searching methods

  We have considered various methods of searching: sequential 
search, binary search, table lookup, hashing.
  sequential search	…	linear time O(n)
  binary search		…	logarithmic time O(log n)
  Table lookup		…	Constant time O(1)
  Hashing		…	In the worst case, same as sequential 
search. But average time requirement depends upon the 
probability of collision.
  Which is most appropriate depends upon the application
  If we have to deal with lists (tables), we are restricted to 
sequential and binary search (table lookup or hashing).
  In many cases, we have a choice between list and table, then we 
can use any of the four methods.
  Table lookup is fast but not practical for sparse tables.
  sequential search is slowest but requires less preparation.
  binary search needs the data in sorted order.
  hashing requires peculiar ordering of keys so that the retrieval 
can be faster. Such ordering may not be useful for any other 
purpose.