# Exam 3: Advanced Data Structures and Algorithms

Distributed: Friday, April 28, 2006
Due: 11:00 a.m., Monday, May 8, 2006
Extensions in extreme circumstances only.

This page may be found online at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS152/2006S/Exams/exam.03.html`.

## Preliminaries

The instructions on this exam are slightly different from the instructions on the previous exam. Those who correctly summarize the differences on the cover page of their exam will earn two extra points on this exam.

There are four problems on the exam. Some problems have subproblems. Those who correctly or mostly-correctly answer four problems will earn an A. Those who correctly or mostly-correctly answer three problems will earn a B. Those who correctly or mostly-correctly answer two problems will earn a C. Those who correctly or mostly-correctly answer one problem will earn a D. Those who fail to answer any problems will earn an F.

Experience shows that different people find different problems complex. Hence, if you get stuck on an early problem, try moving on to another problem and let your subconscious work on the early problem. (You'll also get a better sense of accomplishment if you can find at least one problem that you can solve early.)

This examination is open book, open notes, open mind, open computer, open Web. However, it is closed person. That means you may not talk to other people about the exam. Other than as restricted by that limitation, you should feel free to use all reasonable resources available to you. As always, you are expected to turn in your own work. If you find ideas in a book or on the Web, be sure to cite them appropriately.

Although you may use the Web for this exam, you may not post your answers to this examination on the Web (at least not until after I return exams to you). And, in case it's not clear, you may not ask others (in person, via email, via IM, by posting a "please help" message, or in any other way) to put answers on the Web.

This is a take-home examination. You may use any time or times you deem appropriate to complete the exam, provided you return it to me by the due date. Experience from the first two exams suggests that you should begin this exam early, or at least look at the problems early.

I expect that someone who has mastered the material and works at a moderate rate should have little trouble completing the exam in a reasonable amount of time. In particular, this exam is likely to take you about four to six hours, depending on how well you've learned topics and how fast you work. You should not work more than eight hours on this exam. Please stop at eight hours. I would recommend that after you have spent about three hours on the examination, you pick the two problems you are most likely to be able to solve, and come to speak to me about them. I am fairly confident that, with my help, you will be able to solve at least those two problems in the remaining five hours.

I would also appreciate it if you would write down the amount of time each problem takes. Each person who does so will earn two points of extra credit. Since I worry about the amount of time my exams take, I will give two points of extra credit to the first two people who honestly report that they've spent at least five hours on the exam or completed the exam. (At that point, I may then change the exam.)

You must include both of the following statements on the cover sheet of the examination. Please sign and date each statement. Note that the statements must be true; if you are unable to sign either statement, please talk to me at your earliest convenience. You need not reveal the particulars of the dishonesty, simply that it happened. Note that inappropriate assistance is primarily assistance from anyone other than Professor Rebelsky (that's me). Inappropriate assistance also includes assistance given to another member of the class.

1. I have neither received nor given inappropriate assistance on this examination.
2. I am not aware of any other students who have given or received inappropriate assistance on this examination.

Because different students may be taking the exam at different times, you are not permitted to discuss the exam with anyone until after I have returned it. If you must say something about the exam, you are allowed to say "This is among the hardest exams I have ever taken. If you don't start it early, you will have no chance of finishing the exam." You may also summarize these policies (but not the changes since the previous exam). You may not tell other students which problems you've finished. You may not tell other students how long you've spent on the exam.

In many problems, I ask you to write code. Unless I specify otherwise in a problem, you should write working code and include examples that show that you've tested the code. Unless I specify otherwise, you should document your code (using javadoc-style comments for classes, fields, and methods and slash-slash comments for particular algorithm details and end braces).

Just as you should be careful and precise when you write code and documentation, so should you be careful and precise when you write prose. Please check your spelling and grammar.

I will give partial credit for partially correct answers. You ensure the best possible grade for yourself by emphasizing your answer and including a clear set of work that you used to derive the answer.

I may not be available at the time you take the exam. If you feel that a question is badly worded or impossible to answer, note the problem you have observed and attempt to reword the question in such a way that it is answerable. If it's a reasonable hour (before 10 p.m. and after 8 a.m.), feel free to try to call me in the office (269-4410) or at home (236-7445). I also respond well to email questions.

I will also reserve time at the start of classes next week to discuss any general questions you have on the exam.

## Preparation

In this examination, you will use a project named `Exam3` with a host of packages. To set up the project:

a. In a terminal window, type

```
/home/rebelsky/bin/exam3
```

You should see messages about files being copied.

b. Start Eclipse.

c. In Eclipse, build a project named Exam3 from `/home/username/CSC152/Exam3`.

d. You are now ready to begin the examination.

## Problems

### Problem 1: Removing Values from Binary Search Trees

Topics: Binary search trees; Recursion.

Ren Remove has reprimanded me for emphasizing the process of insertion into binary search trees, rather than discussing removal. Nonetheless, in class, we devised a strategy for removing nodes from binary search trees, based on the key.

• Find the node that contains the key.
• If that node is the only node in the tree, make the tree empty.
• Otherwise, if that node contains no right subtree, replace it by its left subtree.
• Otherwise, replace the node containing the key by the leftmost node in the right subtree and replace that node by its own right subtree.

Implement that strategy in `BST.java`.
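
The removal steps above can be sketched roughly as follows. This is only an illustration under assumptions: the `Node` class and the `remove` helper are hypothetical names, not the contents of `BST.java`, and the sketch moves the leftmost node's key into the removed node rather than relinking the node itself. A real dictionary-style tree would also need to move the associated value.

```java
import java.util.Comparator;

public class RemoveSketch {
    /** Hypothetical node type; the exam's BST.java may look different. */
    public static class Node<K> {
        public K key;
        public Node<K> left;
        public Node<K> right;

        public Node(K key) {
            this.key = key;
        }
    }

    /** Remove key from the subtree rooted at node; return the new subtree root. */
    public static <K> Node<K> remove(Node<K> node, K key, Comparator<K> order) {
        if (node == null) {
            return null; // key not found; nothing to remove
        }
        int cmp = order.compare(key, node.key);
        if (cmp < 0) {
            node.left = remove(node.left, key, order);
            return node;
        }
        if (cmp > 0) {
            node.right = remove(node.right, key, order);
            return node;
        }
        // Found the node. With no right subtree, the left subtree replaces it.
        // (A lone node has a null left subtree, so the tree becomes empty.)
        if (node.right == null) {
            return node.left;
        }
        // Otherwise, find the leftmost node in the right subtree...
        Node<K> parent = node;
        Node<K> leftmost = node.right;
        while (leftmost.left != null) {
            parent = leftmost;
            leftmost = leftmost.left;
        }
        // ...replace that node by its own right subtree...
        if (parent == node) {
            parent.right = leftmost.right;
        } else {
            parent.left = leftmost.right;
        }
        // ...and move its key into the node being removed.
        node.key = leftmost.key;
        return node;
    }
}
```

Returning the new subtree root from each call is one design choice among several. It lets the caller write `root = remove(root, key, order)` without treating the root as a special case.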

### Problem 2: Text Analysis, Revisited

Topics: Dictionaries; Sorting

Anna and Andy Analyst have argued about the homework I gave you regarding text analysis. They note that, although they appreciate the use of text analysis in that assignment, they are concerned that I asked you to use a binary search tree to store the word/counter pairs. They note that for larger documents, it probably makes sense to use a hash table, rather than a binary search tree, since the difference between expected-constant time and logarithmic time becomes significant. They've written something that fills in the hash table, but they have not yet finished the part that gets the twenty most frequent words.

a. Finish writing the utility class, `Analyst`, that takes a `BufferedReader` as a parameter and returns an array of `WordFrequency` pairs of the twenty most common words. You can test `Analyst` with `AnalyzeFile`.

I would recommend that you use a technique like insertion sort or selection sort to create the sorted array.

b. You can find twenty sample files as `/home/rebelsky/Web/Courses/CS152/2006S/Examples/Exam3/Texts/##.txt` (where `##` is a number between `00` and `19`). Determine the frequencies of the most common words for each and make some observations (texts likely to be by the same author, other interesting patterns you noted, etc.).

### Problem 3: Finding the Median

Topics: Divide-and-conquer algorithms; Searching and sorting; Quicksort.

Minnie and Mickie Middle also recall our discussion of binary search trees. They particularly remember that we can build better binary search trees if we make the median value the root. In class, we noted that one way to find the median value is to sort the set of values we want to put in the tree. However, that strategy is not very efficient.

Can we do better? Yes, we can use a divide-and-conquer strategy. How do we decide how to divide? We use a key idea from Quicksort: When you want to divide and conquer, but don't know how to divide equally, pick some element (the pivot) and use it to divide the collection into smaller and larger elements. As in all divide-and-conquer algorithms, we will then recurse.

Of course, it's not quite that simple. Once we've guessed a pivot and partitioned the collection, how do we recurse? It turns out that the best way to answer that question is to solve a variant of the median problem: Instead of finding the median, find the ith smallest value.

Here is a header for a method that might just do that.

```
/**
 * Find the ith-smallest value in a vector.  The ith-smallest
 * value is one for which there are i smaller values.
 *
 * @param vec
 *   The vector
 * @param i
 *   The "position" of the element to find.
 * @param c
 *   A comparator used to determine ordering.
 * @return ith
 *   The ith smallest value.
 * @pre
 *   The vector contains at least one value.
 *   No two values in the vector are equal.
 * @post
 *   There are exactly i values for which
 *     c.compare(vec.get(j),ith) < 0
 */
public static <T> T ithSmallest(Vector<T> vec, int i, Comparator<T> c)
```

How do we implement the method? We return to the variation of Quicksort (divide-and-conquer using a randomly selected pivot).

• Pick the pivot randomly from the collection.
• Separate the collection into elements smaller than the pivot and elements larger than the pivot.
• If there are exactly i smaller elements, return the pivot.
• If there are more than i smaller elements, find the ith smallest of the smaller elements.
• If there are fewer than i smaller elements, we'll need to find some element of the collection of larger elements. However, since we are discarding some smaller elements (as well as the pivot), we no longer need to find the ith smallest.

a. Use this strategy to implement `ithSmallest`. You can find the header for `ithSmallest` in `Median.java`.
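
One possible reading of the strategy is sketched below. It is not the code in `Median.java`. The class name `SelectSketch` is made up for illustration, the sketch assumes `0 <= i < vec.size()` in addition to the stated preconditions, and the index used in the final recursive call (`i` minus the number of discarded values) is my interpretation of the last step.

```java
import java.util.Comparator;
import java.util.Random;
import java.util.Vector;

public class SelectSketch {
    private static final Random RANDOM = new Random();

    /**
     * Find the value with exactly i smaller values in vec.
     * Assumes 0 <= i < vec.size() and that no two values are equal.
     */
    public static <T> T ithSmallest(Vector<T> vec, int i, Comparator<T> c) {
        // Pick the pivot randomly from the collection.
        T pivot = vec.get(RANDOM.nextInt(vec.size()));

        // Separate the other values into smaller and larger collections.
        Vector<T> smaller = new Vector<T>();
        Vector<T> larger = new Vector<T>();
        for (T value : vec) {
            int cmp = c.compare(value, pivot);
            if (cmp < 0) {
                smaller.add(value);
            } else if (cmp > 0) {
                larger.add(value);
            } // the pivot itself goes in neither collection
        }

        if (smaller.size() == i) {
            return pivot;
        } else if (smaller.size() > i) {
            return ithSmallest(smaller, i, c);
        } else {
            // Assumption: we discarded smaller.size() values plus the pivot,
            // so the target has that many fewer smaller values among larger.
            return ithSmallest(larger, i - smaller.size() - 1, c);
        }
    }
}
```

The sketch copies values into new vectors at each step, which keeps the logic close to the prose. An in-place partition, as in many Quicksort implementations, would avoid the extra allocation.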

b. Use your implementation of `ithSmallest` to implement a median method with the following signature

```
public static <T> T median(Vector<T> vec, Comparator<T> c)
```

You may assume that `vec` contains no duplicates. Suppose there are n elements in `vec`. When n is odd, the median is the value for which there are (n-1)/2 smaller elements and (n-1)/2 larger elements. When n is even, the median is the value for which there are n/2 smaller elements and n/2-1 larger elements.
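
As a quick check on that arithmetic, the small sketch below computes how many values should be smaller than the median for a given n. The class and method names are hypothetical. It relies on the fact that Java's integer division makes `n / 2` equal to (n-1)/2 when n is odd and n/2 when n is even.

```java
public class MedianIndexSketch {
    /** Number of values smaller than the median among n distinct values. */
    public static int smallerCount(int n) {
        // Integer division truncates: 5 / 2 is 2 and 6 / 2 is 3.
        return n / 2;
    }

    /** Number of values larger than the median among n distinct values. */
    public static int largerCount(int n) {
        // Everything that is neither smaller nor the median itself.
        return n - 1 - smallerCount(n);
    }
}
```

Under that observation, `median` could plausibly reduce to a single call such as `ithSmallest(vec, vec.size() / 2, c)`.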

You may find `TestMedian.java` helpful in testing your code.

c. Carefully document the `median` method, including preconditions and postconditions.

d. A divide-and-conquer algorithm that discards half of the data set at each step should be O(n), since the total work is roughly n + n/2 + n/4 + ..., which is less than 2n. However, since there's no guarantee that the pivot splits the data in half, this algorithm may not take O(n). Gather data on the number of comparisons this algorithm takes and see whether it supports the assertion that the algorithm is O(n) in most cases.
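
One way to gather such data, offered only as a sketch, is to wrap the comparator in a class that counts how often `compare` is called. `CountingComparator` is a hypothetical helper, not part of the exam's support code.

```java
import java.util.Comparator;

public class CountingComparator<T> implements Comparator<T> {
    private final Comparator<T> base; // the comparator that does the real work
    private long count = 0;           // comparisons made since the last reset

    public CountingComparator(Comparator<T> base) {
        this.base = base;
    }

    @Override
    public int compare(T a, T b) {
        count++;
        return base.compare(a, b);
    }

    /** Number of calls to compare since construction or the last reset. */
    public long getCount() {
        return count;
    }

    /** Set the count back to zero, e.g. between experiments. */
    public void reset() {
        count = 0;
    }
}
```

Passing such a wrapper as the `c` argument, then reading `getCount()` for vectors of size n, 2n, 4n, and so on, would show whether the number of comparisons roughly doubles each time, as linear growth predicts.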

### Problem 4: Genetic Matching

Topics: Dynamic Programming, String Matching, Polymorphism

Gene and Gena Geneticist note that they like the dynamic-programming string-matching algorithm, but that it has a few significant problems:

• It treats replacement as deletion+insertion. In their experience, it is often more likely (and therefore cheaper) to have one value replaced by another than to have one deleted or inserted.
• It treats every insertion or deletion (or, one expects, replacement) as having the same cost. They have found that some deletions and insertions are more likely.

They propose that you rewrite the `ec` method to take a cost metric function as a parameter, rather than a simple insertion cost and deletion cost. They have even written the `CostMetric` interface and two implementations, `SimpleMetric`, a simple cost metric, and `SampleMetric`, a more interesting cost metric.

Rewrite `Editor.ec` to take a `CostMetric` as a parameter.
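
To illustrate the general idea, here is a minimal sketch of an edit-cost computation driven by a cost metric. It is written under heavy assumptions: the methods of the real `CostMetric` interface are not shown in this exam, so the nested `Metric` interface and its three methods are invented stand-ins, and `editCost` is not the existing `Editor.ec`.

```java
public class EditCostSketch {
    /** Hypothetical stand-in for CostMetric; the real interface may differ. */
    public interface Metric {
        int insertCost(char c);
        int deleteCost(char c);
        int replaceCost(char from, char to);
    }

    /** Cheapest cost of editing source into target under the given metric. */
    public static int editCost(String source, String target, Metric m) {
        int rows = source.length();
        int cols = target.length();
        // cost[r][c] is the cheapest way to turn the first r characters
        // of source into the first c characters of target.
        int[][] cost = new int[rows + 1][cols + 1];
        for (int r = 1; r <= rows; r++) {
            cost[r][0] = cost[r - 1][0] + m.deleteCost(source.charAt(r - 1));
        }
        for (int c = 1; c <= cols; c++) {
            cost[0][c] = cost[0][c - 1] + m.insertCost(target.charAt(c - 1));
        }
        for (int r = 1; r <= rows; r++) {
            for (int c = 1; c <= cols; c++) {
                char from = source.charAt(r - 1);
                char to = target.charAt(c - 1);
                // Matching characters cost nothing; otherwise pay for a replacement.
                int keepOrReplace = cost[r - 1][c - 1]
                    + (from == to ? 0 : m.replaceCost(from, to));
                int delete = cost[r - 1][c] + m.deleteCost(from);
                int insert = cost[r][c - 1] + m.insertCost(to);
                cost[r][c] = Math.min(keepOrReplace, Math.min(delete, insert));
            }
        }
        return cost[rows][cols];
    }
}
```

Because the table takes the minimum of all three options, a metric whose replacement cost exceeds deletion plus insertion still behaves sensibly: the algorithm simply prefers the cheaper pair of operations.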

The remainder of this problem is optional.

Gene and Gena also note that insertion or removal of triplets is generally much cheaper than insertion or removal of singletons. For five points of extra credit, update `Editor.ec` and the remaining files to accommodate this change.

## Questions and Answers

These are some of the questions students have asked about the exam and my answers to those questions.

### General Questions

## Errors

Here you will find errors of spelling, grammar, and design that students have noted. These errors carry no credit, but remind all of us to be more careful.

## History

Thursday, 20 April 2006 [Samuel A. Rebelsky]

• Sketched all four problems.

Wednesday, 26 April 2006 [Samuel A. Rebelsky]

• Converted the median problem from a sketch to a full problem statement.
• Wrote the tester for that problem.
• Added characters for the removal and text analysis problems.
• Wrote partial code for the pattern matching problem.
• Wrote support code for genetic problem.

Thursday, 27 April 2006 [Samuel A. Rebelsky]

• Significant updates to the text analysis problem.
• Wrote supporting code for the text analysis problem.
• Finished writing genetic problem.

Friday, 28 April 2006 [Samuel A. Rebelsky]

Monday, 1 May 2006 [Samuel A. Rebelsky]

• Corrected unclear step in removal policy.
• Fixed an odd/even error [thanks EJ].

Tuesday, 2 May 2006 [Samuel A. Rebelsky]

• Changed "ten" to "twenty".

Thursday, 4 May 2006 [Samuel A. Rebelsky]

• Changed the suggestion for text analysis from "insertion sort" to "selection sort or insertion sort", as many students are having more success with selection sort.

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Tue May 9 08:30:57 2006.
The source to the document was last modified on Thu May 4 09:54:06 2006.

Samuel A. Rebelsky, rebelsky@grinnell.edu