As you’ve seen, we often find it useful to collect data into structures. We’ve collected values in lists, which are useful for processing sequentially, and XML trees, which are useful for processing hierarchically. Since different structures might support different kinds of operations, the design of structured types like lists and trees is a core part of algorithm design. Computer scientists have defined a wide range of data types designed to collect data.
Consider, for example, the problem of representing a simple phone book. A phone book associates phone numbers with names. When you look up a name in a phone book, you should either get the corresponding phone number or a warning that no phone number is available for that name. There are many other instances in which you want to associate information with a “name” or something equivalent. For example, a physical dictionary associates definitions with words and each XML entity has a collection of attribute values that are associated with attribute names.
In referring to these kinds of structures, computer scientists often call the thing used to look up information the key and the associated information the value. In a phone book, the name is the key and the phone number is a value. In a traditional dictionary, the word is the key and the definition is the value. In an attribute list, the attribute name is the key and the attribute value is the value.
Computer scientists use a variety of names for structures that associate keys and values, including Dictionary, Map, Keyed Table, and even Hash. Generally, you can think of these structures as collections of key/value pairs that make it easy to reference values by key. What operations would you expect such structures to provide? It should be easy to create a new structure, to add or update a key/value pair, and to look up a value by key. It might also be useful to remove a key/value pair and to check whether a key is in the structure.
Over the years, computer scientists have come up with a large number of ways to represent dictionaries, including association lists, binary search trees, skip lists, and hash tables. Don’t worry; you don’t need to know the underlying details of any of them, at least not at this stage of your career. You do, however, need to understand how and why to use such structures.
For now, we will focus on how you use one of the most common implementations of dictionaries, known as Hash Tables. In many situations, hash tables are one of the most efficient implementations of dictionaries. Most modern languages include hash tables as a basic type. And, because hash tables are so popular, many programmers don’t distinguish hash tables from the broader concept of dictionaries. (That’s why some people refer to the general concept as a “hash”.) But you should remember that hash tables are just one way to implement dictionaries.
We’re exploring a new type. At this point, the five questions you ask about a new type should be second nature: What is its name? What is its purpose? How do I express values in the type? How does DrRacket display values in the type? What procedures are available for working with the type?
We’ve already answered the first two questions: The type is named “hash table” and its purpose is to store key/value pairs. We rarely write hash tables directly and we rarely ask DrRacket to display them (after all, they are often fairly large), so we’ll leave those questions to after our coverage of the procedures associated with hash tables.
You may recall that we said that there are three basic operations we use with hash tables: we create them, we add key/value pairs, and we look up values by keys.
You create a new hash table with (make-hash)
. You look up a value in
a hash table with (hash-ref
table
key)
. You add a key/value pair
to the table with (hash-set!
table
key
value)
. Note that
hash-set!
ends with an exclamation point. That reminds us that the
procedure actually modifies the underlying hash table.
Let’s explore those basic operations, creating a map of book titles to authors. For now, we’ll assume that each book has a single author; in a subsequent section, we’ll consider how to deal with multiple authors.
> (define book-authors (make-hash))
> (hash-set! book-authors "The Princess Bride" "William Goldman")
> (hash-set! book-authors "Homegoing" "Yaa Gasi")
> (hash-set! book-authors "Moo" "Jane Smiley")
> (hash-set! book-authors "Moo, Baa, La La La!" "Sanda Boynton")
> (hash-ref book-authors "Homegoing")
"Yaa Gasi"
> (hash-ref book-authors "Moo")
"Jane Smiley"
What happens when the book title is not in the table? Let’s check.
> (hash-ref book-authors "FunDHum")
hash-ref: no value found for key
key: "FunDHum"
It appears that the hash table issues an error. That seems like a
reasonable choice. Racket needs to signal to the user that the value is
not available. But an error message is not something many of us want to
have happen in the middle of our program. Hence, the Racket hash
library also provides a (hash-has-key?
table
key)
procedure that
checks whether the hash table contains a particular key.
> (hash-has-key? book-authors "Homegoing")
#t
> (hash-has-key? book-authors "FunDHum")
#f
At times, you may realize that you want to update a key/value pair that
you have stored in the hash table. You can use the hash-set!
operation to update the table.
> (hash-ref book-authors "The Princess Bride")
"William Goldman"
> (hash-set! book-authors "The Princess Bride" "S. Morgenstern")
> (hash-ref book-authors "The Princess Bride")
"S. Morgenstern"
You’ve learned four basic hash table operations: make-hash
,
hash-set!
, hash-ref
, and hash-has-key?
. While the Racket hash
table implementation provides a host of other operations, those four
basic operations should be all that you need at this point. In the next
few sections, we’ll explore some other design issues in making and using
hash tables.
As hash tables exist primarily for you to use to extract information, You will rarely ask DrRacket to display hash tables. Also, hash tables are generally large enough that you won’t want to see all of the information on your screen. But DrRacket will display a hash table if you ask it nicely.
> book-authors
'#hash(("Homegoing" . "Yaa Gasi") ("Moo, Baa, La La La!" . "Sanda
Boynton") ("The Princess Bride" . "S. Morgenstern") ("Moo" . "Jane
Smiley"))
Not very pretty, is it? But, if we look carefully, it’s fairly
straightforward. It begins with a tick, an octothorpe, and the word
hash
. Following that is a list of key/value pairs. And they are
literally pairs, as indicated by the dot between the two values.
Can we create a hash in the same way? Let’s try.
> (define animals '#hash(("A" . "Aaardvark")
("B" . "Badger")
("C" . "Chinchilla")
("D" . "Dingo")
("E" . "Emu")
("F" . "Fennec Fox")))
> animals
'#hash(("B" . "Badger") ("F" . "Fennec Fox") ("D" . "Dingo") ("A" .
"Aaardvark") ("E" . "Emu") ("C" . "Chinchilla"))
> (hash-ref animals "B")
"Badger"
> (hash-has-key? animals "Z")
#f
It looks like this does give us a hash table, but with the key/value pairs in a slightly different order than we started with. That’s okay: We should not care about the ordering because our focus is on using hash tables rather than understanding the underlying technology. (It turns out that the ordering is one of the things that helps keep hash tables efficient. You’ll study how to construct hash tables in a subsequent course.) Let’s add something to the hash table.
> (hash-set! animals "Z" "Zebra")
hash-set!: contract violation
expected: (and/c hash? (not/c immutable?))
given: '#hash(("A" . "Aaardvark") ("B" . "Badger") ("C" .
"Chinchilla") ("D" . "Dingo") ("E" . "Emu") ("F" . "Fennec
Fox"))
argument position: 1st
other arguments...:
"Z"
"Zebra"
Hmmm. That didn’t work. Why not? When we present a hash table to
Racket using the '#hash
notation, it treats is as an immutable
table. You can use it, but you cannot modify it. That makes sense for
some cases. For example, you might have assumptions about the contents
of the hash table and you would not want another part of the program
(perhaps written by another programmer who did not understand those
expectations) to “mess with” the table.
But what if you want to create a hash table with a sequence of key/value
pairs and you want it to be mutable? In that case, you can use a
variant of the make-hash
operation in which you present it with a list
of key/value pairs.
> (define animals
(make-hash (list (cons "A" "Anteater")
(cons "B" "Baboon")
(cons "C" "Civet")
(cons "D" "Dormouse")
(cons "E" "Echidna")
(cons "F" "Flamingo"))))
> animals
'#hash(("B" . "Baboon") ("F" . "Flamingo") ("D" . "Dormouse") ("A" .
"Anteater") ("E" . "Echidna") ("C" . "Civet"))
> (hash-ref animals "B")
"Baboon"
> (hash-has-key? animals "Z")
#f
> (hash-set! animals "Z" "Zebra")
> animals
'#hash(("B" . "Baboon") ("F" . "Flamingo") ("D" . "Dormouse") ("Z" .
"Zebra") ("A" . "Anteater") ("E" . "Echidna") ("C" . "Civet"))
> (hash-ref animals "Z")
"Zebra"
What happens if we try to use something that’s close to an existing key, but not quite the same? Let’s try and find out.
> (hash-has-key? book-authors "Homegoing")
#t
> (hash-has-key? book-authors "Home going")
#f
> (hash-has-key? book-authors "homegoing")
#f
As these two examples suggest, hash tables are not particularly smart. They expect exact matches of the hash key. Programmers who want more accommodating hash tables may want to write procedures that convert keys to a common form before hashing. Let’s consider how we might do that with strings.
First, let’s agree on a common form. Say, for example, that we decide
that the common form should be all lowercase and should contain only
letters and digits. We can use string-downcase
to convert the string
to lowercase and regexp-replace*
to drop all the non-word characters.
(define common-form
(lambda (str)
(if (string? str)
(regexp-replace* #px"\\W" (string-downcase str) "")
str)))
> (common-form "Homegoing")
"homegoing"
> (common-form "Home Going")
"homegoing"
> (common-form "Home going!")
"homegoing"
> (common-form "Going home?")
"goinghome"
> (common-form (list "Home" "Going"))
'("Home" "Going")
Now that we have this helper procedure, we can write our own versions of the basic hash table operations.
(define my-make-hash
(lambda (pairs)
(make-hash (map (lambda (pair)
(cons (common-form (car pair))
(cdr pair)))
pairs))))
(define my-hash-set!
(lambda (hash key value)
(hash-set! hash (common-form key) value)))
(define my-hash-ref
(lambda (hash key)
(hash-ref hash (common-form key))))
(define my-hash-has-key?
(lambda (hash key)
(hash-has-key? hash (common-form key))))
> (define book-authors
(my-make-hash
(list (cons "The Princess Bride" "William Goldman")
(cons "Homegoing" "Yaa Gasi")
(cons "Moo" "Jane Smiley")
(cons "Moo, Baa, La La La!" "Sanda Boynton"))))
> (my-hash-ref book-authors "Homegoing")
"Yaa Gasi"
> (my-hash-ref book-authors "Home Going")
"Yaa Gasi"
> (my-hash-ref book-authors "moo!")
"Jane Smiley"
> (my-hash-set! book-authors "Friday Black" "Nana Kwame Adjei-Brenyah")
> (my-hash-ref book-authors "fri day black")
"Nana Kwame Adjei-Brenyah"
> book-authors
'#hash(("moobaalalala" . "Sanda Boynton") ("homegoing" . "Yaa Gasi")
("theprincessbride" . "William Goldman") ("fridayblack" . "Nana Kwame
Adjei-Brenyah") ("moo" . "Jane Smiley"))
What happens if we fail to use our methods? Let’s see.
> (hash-ref book-authors "Homegoing")
hash-ref: no value found for key
key: "Homegoing"
> (hash-set! book-authors "The Princess Bride" "S. Morgenstern")
> (my-hash-ref book-authors "The Princess Bride")
"William Goldman"
> (hash-ref book-authors "The Princess Bride")
"S. Morgenstern"
> book-authors
'#hash(("moobaalalala" . "Sanda Boynton") ("homegoing" . "Yaa Gasi")
("theprincessbride" . "William Goldman") ("The Princess Bride" . "S.
Morgenstern") ("fridayblack" . "Nana Kwame Adjei-Brenyah") ("moo" .
"Jane Smiley"))
As these examples suggest, if we decide to use our alternate procedures, we need to be uniform in our use of those procedures.
What happens if we want to store more than one value with a key? For example, some books have more than one author. For example, _How to Design Programs_ is by Matthew Felleisen, Robert Bruce Findler, Matthew Flatt, and Shriram Krishnamurthi. One possibility, at least in this case, is to represent the authors not as a string, but as a list.
> (my-hash-set! book-authors
"How to Design Programs"
(list "Matthias Felleisen" "Robert Bruce Findler"
"Matthew Flatt" "Shriram Krishnamurthi"))
> (my-hash-ref book-authors "How To Design Programs")
'("Matthias Felleisen" "Robert Bruce Findler" "Matthew Flatt" "Shriram
Krishnamurthi")
> (my-hash-ref book-authors "Friday Black")
"Nana Kwame Adjei-Brenyah"
What happens when we want to associated different kinds of values with a single key? For example, a college phone directory might contain not only a phone number, but also a username and an address.
Consider the following table.
**name** **phone** **username** **address**
Avery 555-555-1212 aviary 54 Main Hall
Riley 555-555-8888 lifeof 12 James B. Mary
Reese 555-555-1234 pbcups N. Dibble
Jordan 555-555-4321 air 23 Center Dorm
How might we represent that table?
One possibility is to set up multiple hash tables, one that maps the name to the phone number, a second of which maps the name to the username, the third of which maps to an address, and so on and so forth.
(define phones (make-hash))
(define usernames (make-hash))
(define addresses (make-hash))
(define add-entry!
(lambda (name phone username address)
(my-hash-set! phones name phone)
(my-hash-set! usernames name username)
(my-hash-set! addresses name address)))
(define student-exists?
(lambda (name)
(and (hash-has-key? phones name)
(hash-has-key? usernames name)
(hash-has-key? addresses name))))
(define lookup-phone
(lambda (name)
(and (my-hash-has-key? phones name)
(my-hash-ref phones name))))
(define lookup-username
(lambda (name)
(and (my-hash-has-key? usernames name)
(my-hash-ref usernames name))))
(define lookup-address
(lambda (name)
(and (my-hash-has-key? addresses name)
(my-hash-ref addresses name))))
> (add-entry! "Avery" "555-555-1212" "aviary" "54 Main Hall")
> (add-entry! "Riley" "555-555-8888" "lifeof" "12 James B. Mary")
> (add-entry! "Reese" "555-555-1234" "pbcups" "N. Dibble")
> (add-entry! "Jordan" "555-555-4321" "air" "23 Center Dorm")
> (lookup-phone "Reese")
"555-555-1234"
> (lookup-address "Avery")
"54 Main Hall"
> (lookup-username "Quinn")
#f
> (add-entry! "Quinn" "555-555-5555" "eskimo" "11Q")
> (lookup-username "Quinn")
"eskimo"
Unfortunately, the design means that we have only one set of phones, usernames, and addresses. It’s our custom to make the table the parameter to a procedure like ‘lookup-phone’; we want to look up someone’s phone number in a particular directory, rather than in a global directory.
An alternative is to group data together into a single structure. At this point, your best bet would probably be a list in which you include the values in a fixed order: First the phone number, then the username, then the address.
(define add-entry!
(lambda (dir name phone username address)
(my-hash-set! dir name (list phone username address))))
(define lookup-phone
(lambda (dir name)
(and (my-hash-has-key? dir name)
(list-ref (my-hash-ref dir name) 0))))
(define lookup-username
(lambda (dir name)
(and (my-hash-has-key? dir name)
(list-ref (my-hash-ref dir name) 1))))
(define lookup-address
(lambda (dir name)
(and (my-hash-has-key? dir name)
(list-ref (my-hash-ref dir name) 2))))
> (define egelloc (make-hash))
> (define ytisevinu (make-hash))
> (add-entry! egelloc "Avery" "555-555-1212" "aviary" "54 Main Hall")
> (add-entry! ytisevinu "Riley" "555-555-8888" "lifeof" "12 James B.
Mary")
> (add-entry! egelloc "Reese" "555-555-1234" "pbcups" "N. Dibble")
> (add-entry! ytisevinu "Jordan" "555-555-4321" "air" "23 Center Dorm")
> (lookup-address egelloc "Reese")
"N. Dibble"
> (lookup-address ytisevinu "Reese")
#f
> (add-entry! ytisevinu "Reese" "555-555-0000" "seer" "Off campus")
> (lookup-address egelloc "Reese")
"N. Dibble"
> (lookup-address ytisevinu "Reese")
"Off campus"
We’ve listed three uses of hash tables: dictionaries, phone books, and information about books. Come up with two others.
Explain why we created two versions of add-entry!
.
Suppose we wanted to be able to change someone’s phone number. Indicate
how you might write set-phone!
for each of the two models. In the
first case, it will likely take the form (set-phone! name phone)
.
In the second case, it is likely to take the case (set-phone!
directory name phone)
.
In writing this section, we drew upon the discussion of hash tables in the Racket language documentation and the discussion of hash tables in the Racket Guide.
This section also draws upon a reading entitled “Association Lists” from Grinnell College’s CSC 151.