Algorithms and OOD (CSC 207 2014F) : Readings
Summary: We consider hash tables, one of the most popular (and perhaps most interesting) implementations of dictionaries. Hash tables have expected O(1) put and get methods.
Prerequisites: Dictionaries. Arrays.
As we've seen, dictionaries are powerful and useful abstractions that support a wide range of use cases, from simple databases to representing objects. In fact, dictionaries are such powerful tools that many languages now include something like a dictionary as a built-in type. Hence, most programmers don't think much about the implementation of dictionaries, or even about their efficiency.
In terms of programmer efficiency, that failure to consider the implementation isn't so bad. But in terms of program efficiency, and perhaps even depth of knowledge, it's important to find really good ways to implement dictionaries. And, so far, the best we've seen are structures that have likely O(log n) running times for get and put. For binary trees, that means we either have to hope that the trees stay balanced or learn how to keep the trees balanced. If we've learned skip lists, we need to trust that the random number generator works well.
But can we do better? When we use arrays, we get constant-time get and put operations. Wouldn't we hope for the same for dictionaries?
It turns out that if you're willing to allow some overhead, rely on some luck, and deal with an occasional slowdown to rebuild the data structure, there is a way to get dictionaries that are nearly as fast as arrays. The data structures that embody this technique are called hash tables.
There are very few useful algorithms we know that require only constant time, and perhaps only one data structure that gives us constant-time access to elements. That one data structure is the array. So, if we want constant-time get and put, perhaps we need to focus on using arrays.
Here's a strategy: Store key/value pairs in an array. Where does the key/value pair go? We'll compute an index into the array based only on the key. In effect, we hope to number the keys in such a way that (1) all keys have numbers between 0 and the length of the array - 1, and (2) no two keys have the same number. Computing those numbers is called hashing a key.
Using that strategy, it's fairly easy to build put and get. In approximate Java:
public void put(K key, V value) {
  this.values[hash(key)] = new KeyValuePair<K,V>(key, value);
} // put(K,V)

public V get(K key) throws Exception {
  KeyValuePair<K,V> pair = this.values[hash(key)];
  if (key.equals(pair.key))
    return pair.value;
  else
    throw new Exception("Invalid key " + key);
} // get(K)
Isn't that nice? Unfortunately, it's also a bit of a fiction. Without knowing the keys in advance, it is not possible to number the keys in such a way as to meet the two criteria above and still find the number for a key in constant time.
In general, we make the hash function return some integer, and make the computation of that integer fast and context independent - that is, we want the same return value each time we hash the same key, no matter what size table we're using and what other keys we have used or will use. (Taking those other pieces of information into account would make it unlikely that the function runs in constant time.) How do we then decide where in the array the key/value pair goes? We take the hash value of the key modulo the size of the array.
For example, a hash function on strings might sum the ASCII values of the letters in the string. (That's not a good hash function, but we're going to use it anyway.) The string cat would give us 99 + 97 + 116, or 312. The string sat would give us 115 + 97 + 116, or 328. If our array is sized 25, an element with a key of cat would go at position 12 (312 % 25 is 12), and an element with a key of sat would go in position 3 (328 % 25 is 3).
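The arithmetic above can be sketched in Java. The class and method names here (AsciiHash, asciiSum, indexFor) are illustrative, not from any standard library.

```java
public class AsciiHash {
  // Sum the ASCII (char) values of the string -- a deliberately weak hash.
  public static int asciiSum(String key) {
    int sum = 0;
    for (int i = 0; i < key.length(); i++) {
      sum += key.charAt(i);
    }
    return sum;
  } // asciiSum(String)

  // Reduce a hash value to a valid index for a table of the given size.
  public static int indexFor(int hash, int tableSize) {
    return Math.floorMod(hash, tableSize);
  } // indexFor(int, int)

  public static void main(String[] args) {
    System.out.println(asciiSum("cat"));               // 312
    System.out.println(indexFor(asciiSum("cat"), 25)); // 12
    System.out.println(asciiSum("sat"));               // 328
    System.out.println(indexFor(asciiSum("sat"), 25)); // 3
  } // main(String[])
} // class AsciiHash
```

Note the use of Math.floorMod rather than %: for a hash function that can return negative integers, % would produce a negative (and thus invalid) index.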
Since the number of valid integers is much smaller than the number of possible values in other data types (e.g., strings, doubles), we're pretty much guaranteed that it's possible that two keys will have the same hash value. And once we've reduced by the size of the array, we're even more likely to have two keys ending up in the same location. For example, using the hash function above, the string pro would give us 337 (112 + 114 + 111), which is different from the 312 for cat or the 328 for sat. But when we mod by a table size of 25, we end up at 12, the same as the location for cat. When two keys have the same location, we call that a collision.
So, what do we do when we have a collision? There are two approaches: one is called chaining and one is called probing. In chaining, instead of storing key/value pairs in the array, we store linked lists of key/value pairs. (That is, we store simple association lists.) Now, when we put a key/value pair into the array, we find the right cell and just use the put method of the association list in that cell. Similarly, to get an element we just call the get method of the underlying association list.
public void put(K key, V value) {
  AssociationList<K,V> lst = values[hash(key)];
  lst.put(key, value);
} // put(K,V)

public V get(K key) throws Exception {
  AssociationList<K,V> lst = values[hash(key)];
  return lst.get(key);
} // get(K)
Since we're throwing a bunch of values into one location in the array, some programmers refer to the elements of the array in chained hash tables as buckets.
Of course, there's a bit of overhead involved in putting an association list in each cell. Hence, many programmers prefer a different approach: If there's a collision in one cell, you just go on to another cell. What cell? There are a variety of options. The simplest option is linear probing - you add some specified value to the computed index for the next place to look. If that place also collides, you add the value again. (And again and again until you find an open space.) For example, suppose we had 25 slots in the hash table and an offset of 7. If we have a collision in position 5, we look in 12. If we have a collision in 12, we look in 19. If we have a collision in 19, we look in 1 (19+7 = 26, 26 % 25 = 1). And so on and so forth.
If you're not careful, there's a significant danger in using linear probing: The indices can cycle before you've looked at every possible position. For example, suppose we used an offset of 10 rather than 7. If we have a collision in position 7, we look in 17. If we have a collision in 17, we look in 2 (27 % 25). If we have a collision in 2, we look in 12. If we have a collision in 12, we look in 22. If we have a collision in 22, we look in 7 (32 % 25). And we're back to where we started, missing about 4/5 of the possible indices. The solution to the problem is easy: We just make sure that the offset and the table size are relatively prime. Ask a friendly mathematician to explain why.
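A small sketch can confirm that cycling behavior: the probe sequence reaches every cell exactly when the offset and the table size are relatively prime. The names here (ProbeCycle, cellsReached) are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class ProbeCycle {
  // Count how many distinct cells a linear probe with the given offset
  // can reach, starting from start, before the sequence cycles.
  public static int cellsReached(int tableSize, int offset, int start) {
    Set<Integer> seen = new HashSet<>();
    int pos = start % tableSize;
    // Set.add returns false once we revisit a cell, ending the loop.
    while (seen.add(pos)) {
      pos = (pos + offset) % tableSize;
    }
    return seen.size();
  } // cellsReached(int, int, int)

  public static void main(String[] args) {
    // Offset 7 is relatively prime to 25, so we reach all 25 cells.
    System.out.println(cellsReached(25, 7, 5));  // 25
    // gcd(10, 25) = 5, so we cycle after visiting only 5 cells.
    System.out.println(cellsReached(25, 10, 7)); // 5
  } // main(String[])
} // class ProbeCycle
```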
One can also try variants of linear probing. Some programmers use quadratic probing: first try an offset of 1, then an offset of 4 (from the previous location), then an offset of 9, and so on and so forth. Others use a different computed offset for different values.
Computer science professors often like probing not only because it provides all these interesting design options in the probing technique, but also because it leads to some interesting design questions for the remove operation.
So, how fast are hash tables? Well, if we have no collisions, the cost of putting an element in the table is the cost of computing the hash function and the cost of putting something into an array. We've said the cost of computing the hash function should be O(1). The cost of putting something into an array is O(1). So, putting something into a collision-free hash table is O(1). Similarly, getting something from a collision-free hash table is also O(1).
But some collisions are likely to happen because of the pigeonhole principle. So, what do we do? Here's the traditional hand-waving argument: If we have a good hash function and reasonably more space in the array than key/value pairs we expect to store, then collisions are unlikely and repeated collisions are even less likely.
You'll note that our analysis suggests that we need “reasonably more space in the array than key/value pairs we expect to store”. How do we guarantee that? One strategy is to stop allowing clients to add elements once the hash table is relatively full. Another is to accept that hash tables slow down as they fill up. But the most typical strategy is to expand the hash table once it is more than N% full. To expand the table, you build a new array and move each element from the original to the new array.
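A minimal sketch of that strategy, using chaining and string keys. The class name, the initial size of 4, and the 50%-full threshold are illustrative choices, not from java.util.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal chained hash table that roughly doubles its array
// once it is more than 50% full.
public class SimpleHashTable {
  private List<String[]>[] buckets; // each pair is {key, value}
  private int size = 0;

  @SuppressWarnings("unchecked")
  public SimpleHashTable() {
    this.buckets = new List[4];
  } // SimpleHashTable()

  private int indexFor(String key, int length) {
    return Math.floorMod(key.hashCode(), length);
  } // indexFor(String, int)

  public void put(String key, String value) {
    if (this.size * 2 > this.buckets.length) {
      this.expand();
    }
    int i = indexFor(key, this.buckets.length);
    if (this.buckets[i] == null) {
      this.buckets[i] = new ArrayList<>();
    }
    for (String[] pair : this.buckets[i]) {
      if (pair[0].equals(key)) {
        pair[1] = value; // replace the value for an existing key
        return;
      }
    }
    this.buckets[i].add(new String[] { key, value });
    this.size++;
  } // put(String, String)

  public String get(String key) {
    List<String[]> bucket = this.buckets[indexFor(key, this.buckets.length)];
    if (bucket != null) {
      for (String[] pair : bucket) {
        if (pair[0].equals(key)) {
          return pair[1];
        }
      }
    }
    return null;
  } // get(String)

  // Build a new array twice as large and rehash every pair; elements
  // generally land in new positions because the modulus has changed.
  @SuppressWarnings("unchecked")
  private void expand() {
    List<String[]>[] old = this.buckets;
    this.buckets = new List[old.length * 2];
    for (List<String[]> bucket : old) {
      if (bucket == null) continue;
      for (String[] pair : bucket) {
        int i = indexFor(pair[0], this.buckets.length);
        if (this.buckets[i] == null) {
          this.buckets[i] = new ArrayList<>();
        }
        this.buckets[i].add(pair);
      }
    }
  } // expand()
} // class SimpleHashTable
```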
Note that after expanding the table, elements are likely to be in new locations, which might reduce collisions. For example, if we expand our old table from 25 to 30 elements, cat will now be at position 12 (312 % 30) and pro will now be at position 7 (337 % 30), so the two no longer collide.
How much should you expand the hash table? Expansion is likely to be an expensive operation (it's clearly O(n)), so you want to make it rare. I make it a practice to approximately double the size of the table each time.
In order for your hash tables to work well, your hash functions have to distribute keys well. Hence, it's worth a lot of effort to do analysis on your likely input data and the efficacy of your hash function with those data.
Consider, for example, the “sum the ASCII values” approach for hashing strings. Suppose our input strings are three-letter sequences of lowercase letters. There are 17,576 (26x26x26) such sequences. Our hash function, on the other hand, returns only values from 291 (for aaa) to 366 (for zzz), a range of just 76 different values. Even if we made our hash table size 1000, collisions are incredibly likely.
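A short experiment confirms how poorly this hash spreads three-letter strings. The class name SpreadCheck is illustrative.

```java
import java.util.HashSet;
import java.util.Set;

public class SpreadCheck {
  // Count the distinct ASCII-sum hash values across all 17,576
  // lowercase three-letter strings.
  public static int distinctSums() {
    Set<Integer> sums = new HashSet<>();
    for (char a = 'a'; a <= 'z'; a++) {
      for (char b = 'a'; b <= 'z'; b++) {
        for (char c = 'a'; c <= 'z'; c++) {
          sums.add(a + b + c);
        }
      }
    }
    return sums.size();
  } // distinctSums()

  public static void main(String[] args) {
    // Every sum from 291 (aaa) to 366 (zzz) occurs: 76 values in all.
    System.out.println(distinctSums()); // 76
  } // main(String[])
} // class SpreadCheck
```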
So, when possible, think about how well your hash function “spreads out” values and deals with similar values. You may even want to run some experiments to see just how well you do (or don't do).
In Java, we use an object-oriented model for hashing. That is, each class is responsible for providing its own hash function, called hashCode. If you don't provide a hashCode method, you get a very strange one (the one associated with generic objects; remember that two generic objects are equal only if they occupy the same area of memory). Java programs generally assume that hashCode returns the same value when given two equal objects (even if they are not the same object). If you think about it for a moment, you'll see that the requirement is essential.
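A minimal sketch of that contract: a class that overrides equals should also override hashCode so that equal objects hash alike. The Point class here is illustrative.

```java
import java.util.Objects;

// Two distinct but equal Point objects must report the same hash code,
// or hash-based structures will fail to find them.
public class Point {
  private final int x;
  private final int y;

  public Point(int x, int y) {
    this.x = x;
    this.y = y;
  } // Point(int, int)

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof Point)) {
      return false;
    }
    Point p = (Point) other;
    return this.x == p.x && this.y == p.y;
  } // equals(Object)

  @Override
  public int hashCode() {
    // Objects.hash guarantees equal fields give equal hash codes.
    return Objects.hash(this.x, this.y);
  } // hashCode()
} // class Point
```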
Java provides two basic implementations of hash tables: the older java.util.Hashtable and the newer java.util.HashMap.