CSC152 2006S, Class 40: Dictionaries (4): Hash Tables Admin: * Monday: Discussion of this exam. * Late exams will be graded very late. * No reading or homework for Monday. * What do you think of the structure of Wednesday's class? Overview: * The Idea of Hashing. * Hash Functions. * Hashing in Java. * Handling Hashing Conflicts. Context: We are discussing dictionaries * Kind of like an array with indices that aren't necessarily numeric * What is an array? An abstract data type that collects values * Values are accessed by numeric indices * Values have predecessors and successors (consequence of numeric indices) * Often fixed size * Constant access time (O(1) put and get) * When we implement dictionaries using unordered vectors of key/value pairs * get: O(n) * put: O(n) * When we implement dictionaries using binary search trees. * If we're really lucky * get: O(1): it's at the root * put: O(1): the tree is empty * If we arrange the tree well * get: O(log_2(n)) * put: O(log_2(n)) * Arranging the tree well? * Take CSC301 * Put all of the values into a vector, sort the vector, build the tree from the vector (assumes that put is rare) * Randomize the order of values before putting * Can we do O(1) put and get for a dictionary implementation? * Two techniques to help * Use extra memory * Use "randomness" + luck * Key idea: Vectors/arrays provide fast access * But require numeric indices * We don't have numbers, we have (strings, points, ...) * We can write functions that convert our indices to integers * Bad function: (lambda (val) 1) * Good function: Likely to give different integers for different values * If we have such a function, call it hash put(k,v) = contents.set(hash(k),new KeyedValue(k,v)) get(k,v) = contents.get(hash(k)).value; * Problem: What if the hash value of the key is really large? Won't our arrays/vectors be too big? put(k,v) = contents.set(hash(k) % contents.size(), new KeyedValue(k,v)) get(k,v) = contents.get(hash(k) % contents.size()).value; * Problem: Won't we randomly get the same number more than once? * Yes! We'll have to solve that problem. * Statistics: If the hash function distributes the data uniformly and the array is pretty big (e.g., twice as big as the number of values we store), we are unlikely to get too many collisions Common strategy! Java assumes that every object that might serve as a key provides a hashValue method. Writing good hashvalue functions is *hard*. * String -> int by "convert each char to corresponding int and sum" is not good If we assume A -> 0, B -> 1, ... Z -> 25, strings of length N cluster around 12.5*n Back to the problem: Suppose we end up hashing two things to the same location in the Vector. What should we do? * Give up, try another strategy for implementing dictionaries. * Example will use a stupid hash function: Hash by first letter of first name (mod 10) 0: Kennon=>Landis 1: Ben=>Mendoza [Lindsay=>Helmrich (CONFLICT)] 2: 3: 4: 5: 6: 7: 8: SamR 9: * Strategy one (HAL9000Dictionary): Only permit one key/value pair throw new SorryICantPermitThatKeyException HAL is a one letter shift of IBM * Strategy two: Use a bigger array (whoops, SamR and SamTape woul still conflict based on our DUMB HASH FUNCTION) * One common solution: Permit more than one thing in each cell (e.g., make a vector of vectors) * Few conflicts suggests that get and put are still fast * Another common solution: Shift to the next cell 0: Kennon 1: Ben 2: Lindsay (belongs in 1, but full) 3: 4: 5: 6: 7: 8: SamR 9: * Statistics suggest that we will generally be safe, provided the table is big enough * Names: * Strategy one: "Bucket" * Strategy two: "Shift" * Implementing the shift strategy Vector> contents; public void put(K key, V value) { // Figure out the likely cell to put the value int pos = Math.abs(k.hashValue()) % this.contents.size(); KeyedValue elt = this.contents.get(pos); // Three possibilities: // Nothing in the vector at that position while (true) { if (elt == null) { // Put the key value pair in the vector this.contents.set(pos, new KeyedValue(key, value)); return; } // Thing in the vector at that position has same key else if (elt.key.equals(key)) { // Replace the old value with the new value. this.contents.set(pos, new KeyedValue(key, value)); // or elt.value = value; return; } // Thing in the vector at that position has a different key else { // Need to shift pos = (pos + 1) % this.contents.size(); } // else } // while } // void put(K, V) public V get(K key) { // Figure out the likely cell to put the value int pos = Math.abs(k.hashValue()) % this.contents.size(); KeyedValue elt = this.contents.get(pos); // Three possibilities: // Nothing in the vector at that position while (true) { if (elt == null) { throw new NoSuchKeyException(key); } // Thing in the vector at that position has same key else if (elt.key.equals(key)) { return elt.value; } // Thing in the vector at that position has a different key else { // Need to shift pos = (pos + 1) % this.contents.size(); } // else } // while } // void get(K)