Algorithms and OOD (CSC 207 2013F) : Outlines

# Outline 46: Implementing Dictionaries with Hash Tables

Held: Monday, 25 November 2013

Summary

We consider an essential implemntation of the Dictionary ADT - Hash tables. Hash tables provide such a common implementation that many programmers often refer to dictionaries as "hashes".

Related Pages

Overview

• An introduction to hash tables.
• Hash functions.
• An exercise.
• Hashing in Java.
• Handling collisions.
• Handling removal.

• I'll reserve time for questions on the exam.
• I've returned the exam redos. I'm not sure that it was worth allowing you to redo the exam. It looks like those who did so spent a lot of time on it, and it didn't make a big difference in grades. It also took a lot of my time and mental energy.
• Upcoming extra credit opportunities:
• "Data Sovereignty: The Challenge of Geolocating Data in the Cloud", November 25, 4:15 JRC 101
• "Gold Fever" by Andrew Sherburne '01 or so, 7:00 p.m., Monday, November 25, ARH 302
• Tuesday, November 26, 4:15 p.m., JRC 209 a gaming event with the game [d0x3d!]
• Any self-care week activity (after Turkey break)

## Hash Tables

• As you may have noticed, dictionaries are "a lot like arrays" (except that you index dictionaries by object and arrays by number).
• However, in terms of implementation, that claim hasn't held.
• The unordered list implementation has O(n) set and O(n) get.
• An ordered array implementation has O(n) set and O(log2n) get.
• The binary search tree implementation has O(logn) set and O(logn) get, if we're lucky (or if we learn how to keep the tree balanced).
• The corresponding array operations are O(1).
• Is there a dictionary implementation with O(1) `get` and `set`?
• Surprisingly, if you're willing to sacrifice some space and increase your constant, it is possible to build a dictionary that is likely to have O(1) `get` and `set`.
• How? Well, we know that arrays provide O(1) get and set, so use arrays (or vectors).
• How do we use an array? We "number" the keys in such a way that
• all numbers are between 0 and array.length-1
• no two keys have the same number (or at least few have the same number).
• If there are no collisions (keys with the same number), the system is simple
• To set a value, determine the number corresponding to the key and add it in that place of the array. This is O(1+cost of computing that number).
• To get a value, determine the number corresponding to the key and look in the appropriate cell. This is O(1+cost of computing that number).
• Implementations of dictionaries using this strategy are called hash tables.
• The function used to convert an object to a number is the hash function.
• To better understand hash tables, we need to consider
• The hash functions we might develop.
• What to do about collisions (when two keys map to the same index).

## Hash Functions

• The goal in developing a hash function is to come up with a function that is unlikely to map two objects to the same position.
• Now, it isn't possible to guarantee that no two objects have the same position (particularly if we have more objects than positions).
• We'll discuss what to do about two objects mapping to the same position later.
• Hence, we sometimes accept a situation in which the hash function distributes the objects more or less uniformly.
• It is worth some experimentation to come up with such a function.
• In addition, we should consider the cost of computing the hash function. We'd like something that is relatively low cost (not just constant time, but not too many steps within that constant).
• We'd also like a function that does (or can) give us a relatively large range of numbers, so that we can get fewer collisions by increasing the size of the hash table.
• We might want to make the size of the table a parameter to the hash function.
• We might strive for a hash function that uses the range of positive integers, and mod it by the size of the table.
• What are some hash functions you might use for strings?
• Sum the ASCII values in the string
• constant1firstLetter + constant2secondLetter
• ...

## An Exercise in Hashing

• Let's try an exercise. We'll come up with a hash value for everybody's first name. We'll then put things in the hash table.
• We'll use "sum the values of the letters in the name".
• We'll use the following table:
```A: 1   F: 6   K: 11  P: 16  U: 21  Z: 26
B: 2   G: 7   L: 12  Q: 17  V: 22
C: 3   H: 8   M: 13  R: 18  W: 23
D: 4   I: 9   N: 14  S: 19  X: 24
E: 5   J: 10  O: 15  T: 20  Y: 25
```
• There are about 19 of you in the class. Typically, our hash tables are somewhat bigger than the size of the collection we're working with, so we'll use a hash table of size 30.
• Once you've computed your result, mod it by 30 (compute the remainder after dividing by 30)
• For my first name (Samuel),
• The sum is 19 (S) + 1 (A) + 13 (M) + 21 (U) + 5 (E) + 12 (L) = 71
• The index is therefore 11.
• Is this a good hash function?
• Why?
• Why not?

## Hashing in Java

• Hash tables are so useful that Java includes them as a standard library class, `java.util.Hashtable`.
• Let's look over the documentation for `java.util.Hashtable`
• Why are there three constructors?
• What methods are there other than `get` and `set`?
• Where's the hash function?
• Answer: Every class is expected to provide its own.

## Handling Conflicting Keys

• Since we typically have many more possible keys than places in the array, we will eventually hit a point in which two keys hash to the same value.
• What should we do when this conflict happens?
• It is not reasonable to disallow the second key, since it's a different key.
• We could put a list of key/value pairs, rather than a singleton, at every position. This technique is sometimes called chaining.
• We could step through the array until we find an empty space. This technique is sometimes called linear probing. (You don't need to offset by 1; you can offset by a constant amount.)
• I thought people tended to use linear probing, but I see that Java uses chaining.

## Removing Elements from Hash Tables

• Our analysis of Hash Tables to date has been based on two simple operations: `get` and `set`.
• What happens if we want to remove elements? This operation can significantly complicate matters.
• If we've chosen linear probing to resolve collisions, what do we do when it comes time to remove elements?
• Do we shift everything back? If so, think about how far we may have to look.
• Do we leave the thing there as a blank? We might then remove it later when it's convenient to do so (e.g., when we grow the table).
• Do we do something totally different?
• Note also that there are different ways of specifying "remove". We might remove the element with a particular key. We might instead remove elements based on their value. The second is obviously a much slower operation than the first (unless we've developed a special way to handle that problem - see if you can think of one).

Copyright (c) 2013 Samuel A. Rebelsky.

This work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit `http://creativecommons.org/licenses/by/3.0/` or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.