CSC153 2004S, Class 52: Hash Tables
Admin:
* Questions on the homework?
Overview:
* Reminder: Philosophy of hash tables
* Designing hash functions
* Methods, revisited
* Dealing with conflicts
/Philosophy/
* Hash tables implement the dictionary/Map ADT
* Key operations are set(key,value), get(key)
* "Arrays indexed by objects"
* Hash tables are implemented by converting objects into integers and using them as array indices
* Potential problem: Conflicts in which two different objects have the same integer
/Hash Functions/
* Designing good hash functions.
* As an example: Strings
* Criteria for a good hash function?
* Efficient. O(1) to compute.
* When two objects are different, it should be "unlikely" that the two objects have the same hash value. (Hmmm ... how do we evaluate this characteristic?)
* Given the same object, always returns the same value.
* Two equal objects have the same hash values (e.g., a.equals(b) implies a.hashCode() == b.hashCode())
* Unlikely to "miss" any hash values.
/Hashing Exercise/
Hash function for strings: Convert each letter to a number, sum the numbers
* (mod by the table size to figure out where it goes)
A: 1 F: 6 K: 11 P: 16 U: 21 Z: 26
B: 2 G: 7 L: 12 Q: 17 V: 22
C: 3 H: 8 M: 13 R: 18 W: 23
D: 4 I: 9 N: 14 S: 19 X: 24
E: 5 J: 10 O: 15 T: 20 Y: 25
Compute:
* The hash value of your first name.
* That value mod 16 (big hash table for 8 people)
Samuel: 19 + 1 + 13 + 21 + 5 + 12 = 71 ; 7
Erik: 43; 11
Saul: 53; 5
Andrew: 44; 12
Joe: 30; 14
Brian: 44; 12
Ilan: 36; 4
Chase: 36; 4
Observations/Comments:
* This worked out particularly badly for our class.
* Erik didn't like it to start with.
* Any time you just "add up" things, it's more likely you'll get a "middling" sum (e.g., ten-letter names should be close to 135)
* Does not cover many of the integers, except for very long strings.
* For a really long string, you might overload the integers. (Such strings are unlikely.) (Good computation can help avoid this problem.)
* The cost of computing the hash code depends on the length of the string.
How do we improve this function? (Or come up with a better one?)
* Make each character a digit (or a pair of digits) and string 'em together.
* 1 * (char 1) + 32 * (char 2) + 32*32 * char (3)
* Use primes to grow quickly
* prime^(char 1) + prime^(char 2) + ...
* Use only the first N characters/pieces
* Rely on other hash functions
How would you test a hash function?
* Do it a whole lot and look at the distribution.
/Methods related to hash tables/
* See java.whatever
* What is a load factor?
* Observation: Hash tables perform less well as they get more full; When a hash table is sufficiently full, you expand the array and rehash everything; The load factor determines what "sufficiently full" means.
* Note: Expansion can slow down the running time of add/set. How do we deal with the possibility that add/set are no longer constant time operations? Claim: The "amortized" running time of set/add is still O(1) provided you don't have too many collisions.
/Conflicts/
* Looks up in array by hash value, but only returns if the key matches
* Problem (conflict) if two keys have the same "moded" hash value.
Two standard techniques for dealing with conflicts between keys
* Add another search structure at each entry in the array (list or vector is probably the best bet, since the stuff will be relatively small)
* Move to the next "open" position in the hash table and put it there
* For put: If the current space is full, keep looking at the next space until nothing is there
* For get: If the first space does not match, keep looking at the enxt space until you find the desired key (success) or null (failure)
Observe: If the load factor is 1/2, most of the time you'll only look one or two spaces
* Removal is a pain
The first technique is usually better.