Experiments in Java : Session A2: Searching

In previous experiments, we've worked with a variety of collections of elements. Why? Because a wide variety of problems in life (and therefore in computing) have to do with collections. For example, the phone book is a collection of name/number entries; most people have file cabinets that collect papers; many people collect other objects, such as books, records, beer cans, and autographs.

One of the most common problems relating to collections is that of searching. Given a collection, determine whether the collection contains a particular object. For example,

Since arrays provide one of the simplest and most universal mechanisms for collecting elements, we will primarily consider search algorithms as they pertain to arrays. In particular, we build search algorithms that determine the position of an integer in an array of integers. (You should be able to generalize these algorithms for other cases.)

Sequences

Rather than working directly with arrays, we will build a new class IntSeq for ``sequences'' of integers. How do these sequences differ from arrays?

Detour: Random sequences

If we are going to search through arrays of numbers, generating ``random'' sequences of numbers will be useful. What do we mean by ``random''? Typically, random means that each sequence of a particular size is equally likely or that at each point in the sequence, each number is equally likely as the next element. How can we generate such sequences? Fortunately, Java provides a standard utility class, java.util.Random. This class includes a nextInt method that gives that next ``random'' number in the sequence. In truth, this number is not random, in that it is generated by an algorithm. However, it is close enough to random for our purposes.

Hence, to fill the array elements with a random sequence of 100 integers, we might write

import java.util.Random;
  ...
    int i;
    Random generator = new Random();
    elements = new int[100];
    for (i = 0; i < 100; ++i) {
      elements[i] = generator.nextInt();
    } // for

However, when we're comparing two algorithms, to have the same input to both algorithms is helpful. Fortunately, Java's random number generator can take a seed that uniquely determines the random sequence. You can think of a seed as being a number for the sequence. If you use the same seed, you end up with the same sequence. For example, to get the first sequence, you would write

    Random generator = new Random(1);

Note that random sequences are not always the best test cases for your algorithms. For example, when testing an algorithm on sequences, you should also test sequences of varying lengths, sequences which contain all the same value, sorted sequences, and backwards sorted sequences (in which the numbers are organized largest to smallest). Nonetheless, random sequences still serve many purposes and are often a good starting point.

Sequential search

Sequential search is a straightforward attempt to solve the problem. In sequential search, we look at the elements of the array, one by one, until we find the element or run off of the array. In the former case, we return the index of the matching item. In the latter case, we return an error value (negative one is a good choice).

Here is a simple sequential search method written in Java. It is one of the methods supplied by IntSeq, which uses this.elements as the array to search through.

  /**
   * Search through the sequence of integers for a particular value.
   * Return the index of the value if it is found.
   * Return -1 otherwise.
   */
  public int sequentialSearch(int val) {
    // An index into the sequence.
    int i; 
    // Check each position of the array
    for (i = 0; i < this.elements.length; ++i) {
      // If the value is at position i, we're done.
      
        return i;
      } // if
    } // for
    // Unable to find the element.  Return -1.
    return -1;
  } // sequentialSearch(int)

Sequential search is useful and correct. However, it may also be slow. In particular, sequential search looks at all of the elements in the array. For example, if the array has one million elements, it may be necessary to look at one million elements. Even if each look takes 1/1000 of a second, this will take 1000 seconds or sixteen minutes. Can we do better? As we'll soon see, the answer is yes, but only in certain cases.

Detour: The number game, revisited

In the session on recursion, you encountered a simple game (guess a number) and a simple but fast strategy for guessing the number. This game illustrates one type of search: We are searching for a matching number in the set of numbers; we just don't know what number we're matching. The strategy is

guess the midpoint of the range of valid answers
if it's correct, we're done
if it's too small, shrink the range and try again
if it's too large, shrink the range and try again

The following class implements the guessing strategy and reports on the steps it uses.


/**
 * An ``intelligent'' number guesser that can guess numbers within a 
 * particular range, provided an oracle can tell it a little about its
 * guesses.
 *
 * @author Samuel A. Rebelsky
 * @version 1.0 of January 1999
 */
public class NumberGameGuesser {
  // +---------+-----------------------------------------------------------
  // | Methods |
  // +---------+

  /**
   * Try to guess a number in the range [lower ... upper].  Only the
   * oracle knows the correct answer.  The upper end of the range cannot
   * be larger than Long.MAX_VALUE/2, and the lower end of the range
   * cannot be smaller than Long.MAX_VALUE/2 (see if you can figure out
   * why).
   */
  void guessNumber(long lower, long upper, 
                   SimpleOutput out, 
                   NumberGameOracle oracle) {
    // Make sure there are possibilities.
    if (upper < lower) {
      out.println("No possible answers!");
      return;
    } // if the range is empty.
    if (upper > Long.MAX_VALUE/2) {
      out.println("The upper bound of the range cannot be higher than " +
                  Long.MAX_VALUE/2);
      return;
    } // If the upper bound is too high.
    if (lower < Long.MIN_VALUE/2) {
      out.println("The lower bound of the range cannot be smaller than " +
                  Long.MIN_VALUE/2);
      return;
    }
    // Make a guess directly between the two.
    long guess = (lower+upper) / 2;
    out.println("I guess: " + guess);
    // Is it correct?
    if (oracle.isCorrect(guess)) {
      out.println("Wa hoo!  The answer is " + guess);
    }
    // Is it too small?  If so, we know it has to be in the range
    // [guess+1 ... upper].
    else if (oracle.isSmall(guess)) {
      out.println("  Oh well, that guess is too small.");
      guessNumber(guess+1,upper,out,oracle);
    }
    // It's not correct.  It's not too small.  It must be too large.
    // Hence, the number must be in the range [lower ... guess-1].
    else {
      out.println("  Hmmm.  That guess is too large.");
      guessNumber(lower,guess-1,out,oracle);
    }
  } // guessNumber(long,long,SimpleOutput,Oracle)
} // class NumberGameGuesser

What does the oracle that this class uses look like? Basically, it has a field that contains the answer. Each of the methods compares the guess to that answer. The following is the code for that class.


/**
 * An oracle for numeric guessing games.  You tell the oracle the number,
 * and the oracle can answer questions about guesses.
 *
 * @author Samuel A. Rebelsky
 * @version 1.0 of January 1999
 */
public class NumberGameOracle {
  // +--------+------------------------------------------------------------
  // | Fields |
  // +--------+

  /** The number we're trying to guess. */
  long solution;
  
  
  // +--------------+------------------------------------------------------
  // | Constructors |
  // +--------------+

  /**
   * Create a new oracle who knows that the solution is the given value.
   */
  public NumberGameOracle(long solution) {
    this.solution = solution;
  } // NumberGameOracle
 
 
  // +---------+-----------------------------------------------------------
  // | Methods |
  // +---------+

  /** 
   * Check if a guess is correct.
   */
  public boolean isCorrect(long guess) {
    return guess == solution;
  } // isCorrect(long)

  /**
   * Check if a guess is too small.
   */
  public boolean isSmall(long guess) {
    return guess < solution;
  } // isSmall(long)
  
  /**
   * Check if a guess is too large.
   */
  public boolean isLarge(long guess) {
    return guess > solution;
  } // isLarge(long)
} // class NumberGameOracle


import SimpleOutput;
import SimpleInput;
import NumberGameOracle;	// The oracle knows the answer.
import NumberGameGuesser;	// The guesser does not.

/**
 * Let the computer play the number game.  The user tells the computer
 * which number, and then the guesser tries to guess.
 *
 * @author Samuel A. Rebelsky
 * @version 1.0 of January 1999
 */
public class NumberGame {
  /**
   * Play the game.
   */
  public static void main(String[] args) {
    // A game with no output is fairly boring.
    SimpleOutput out = new SimpleOutput();
    // We'll need to read a number to guess.
    SimpleInput in = new SimpleInput();
    // Our intelligent guesser.
    NumberGameGuesser guesser = new NumberGameGuesser();
    // And our even more intelligent oracle.  We'll create the oracle once
    // we know the solution.
    NumberGameOracle oracle;
    // The number to guess.
    long solution;
    // The lower bound of the range.
    long lower;
    // The upper bound of the range.
    long upper;
    // The minimum lower and maximum upper bounds (required by the guesser).
    long min_lower = Long.MIN_VALUE/2;
    long max_upper = Long.MAX_VALUE/2;
    // Describe the game.
    out.println("Welcome to the amazing number game.  You will give me");
    out.println("a range of numbers and a number to guess.  I will tell");
    out.println("my friend to guess the number, and we'll see how it does.");
    out.println();
    // Ask for the range.
    out.print("Enter the smallest possible answer " + 
              "(at least " + min_lower + "): ");
    lower = in.readLong();
    if (in.hadError()) { lower = min_lower; }
    out.print("Enter the largest possible answer " + 
              "(at most " + max_upper + "): ");
    upper = in.readLong();
    if (in.hadError()) { upper = max_upper; }
    // Validate the range.
    if (lower < min_lower) {
      out.println("Your smallest number is too small.");
      return;
    }
    if (upper > max_upper) {
      out.println("Your largest number is too large.");
      return;
    }
    if (upper < lower) {
      out.println("The smallest number is larger than the largest number.");
      return;
    }
    // Ask for input.
    out.print("Please enter the number to guess: " +
              "(between " + lower + " and " + upper + "): ");
    solution = in.readLong();
    // Validate the solution.
    if (solution < lower) {
      out.println("Your solution is less than the smallest number.");
      return;
    }
    if (solution > upper) {
      out.println("Your solution is larger than the largest number.");
      return;
    }
    // Create the oracle.
    oracle = new NumberGameOracle(solution);
    // Okay, let's play the game.
    guesser.guessNumber(lower,upper,out,oracle);  
    // We're done.
  } // main(String[])
} // class NumberGame

Why do we limit the lower and upper bounds? You'll learn why in our discussion of binary search below. You will investigate the problem further in the post-laboratory problems.

Binary search

We can turn this strategy into a searching algorithm, as long as the list we are searching is sorted. How could you take advantage of it being sorted? If the sequence is sorted, then you learn a lot about where the name belongs by comparing the name you're looking for to the middle element. For example, if the name you are looking for is smaller than the middle element, then you can throw away everything after the middle element. In effect, you have eliminated half the sequence in one step!

What do we do with the remaining half of the sequence? We search through it again, using the same technique. (Yes, that's recursion in action.) We keep splitting the sequence, again and again, until we find the element we are searching for or end up with no remaining elements.

  /**
   * Determine whether a name appears in a sequence.
   */
  public boolean search(String name, Sequence names) {
    // A name cannot appear in the empty sequence
    if (names.length() == 0) {
      return false;
    } // empty sequence
    // Nonempty sequences: check the middle and recurse
    else {
      // Is the element at the middle?
      if (name.equals(names.middle())) {
        return true;
      } // at the middle
      // Is the element before the middle?
      else if (name.precedes(names.middle())) {
      return search(name, names.firstHalf());
    } // In the first half
    // Otherwise, look in the second half
    else { 
      return search(name, names.secondHalf());
    } // In the first half
  } // search(String,Sequence)

Note that this is not a working method. In particular, there is no precedes method for strings, nor a built-in Sequence class. In addition, binary search requires some care at the extremes (e.g., single-element lists) to ensure that it does not recurse indefinitely. However, this code should give you a sense that recursion provides an elegant way to solve a variety of problems. In fact, the problem-solving technique used in binary search is so common that it has its own name, divide-and-conquer.

Working code for the preceding method follows. The code is part of the IntSeq class, which uses this.elements as the array it searches through. This implementation of the method also returns the index of the element, rather than a boolean value indicating whether or not it is found.

  /**
   * Search through the elements for a particular value.
   * Return the index of the value if it is found.
   * Return -1 otherwise.
   */
  public int binarySearch(int val) {
    return binarySearch(val, arr, 0, this.elements.length-1);
  } // binarySearch(int)

  /**
   * Search for a value in a subrange of the array.
   * The subarray is given by lb .. ub.
   * Return the index of the value if it is found.
   * Return -1 otherwise.
   */
  protected int binarySearch(int val, int lb, int ub) {
    // The index of the middle value
    int mid = (lb + ub) / 2;
    // Is the subarray empty?
    if (lb > ub) {
      return -1;
    } // empty subarray
    // Is the value at the middle?
    else if (val == this.elements[mid]) {
      return mid;
    } // Found it!
    // Should the value be in the left half?
    else if (val < this.elements[mid]) {
      return binarySearch(val, lb, mid-1);
    } // left half
    // The value must be in the right half.
      return binarySearch(val, mid+1, ub);
    } // left half
  } // binarySearch(int,int,int)

Problems with binary search

Although binary search is a popular and useful algorithm, many programmers end up making mistakes when they implement it, even if they've been warned that they are likely to make mistakes. What kinds of mistakes? Most are relatively subtle, but almost all can lead to significant problems in practice.

Some programmers forget to ensure that the array they're searching is nonempty. This means that when they look at the ``middle'' element, they are looking at a nonexistent element. In Java, this oversight will lead to an error at the time the attempt to access the element happens. In other languages, no error may be noticed until later.

Some programmers include the midpoint in the recursive call (using [small .. midpoint] or [midpoint .. large] rather than [small .. midpoint-1] or [midpoint+1 .. large]. While this usually works, in some cases it can lead to problems. For example, consider the case in which the element is at position 3, small is 2, and large is 3. The midpoint is 2 (Java rounds down when dividing integers). The element at position 2 is too small, so we try again, using small=middle and large=large. So, small is 2 and large is 3. The midpoint is 2. The element at position 2 is too small, so we try again. And so on and so forth.

Some programmers try to reduce the number of tests so that, for example, there are not separate tests for less-than and equals. While this can be done successfully, it may have the problem mentioned with including the midpoint in the recursive call.

A more subtle problem comes in the computation of the midpoint. Recall that the computation uses (lb+ub)/2. Since Java limits the size of integers, it is possible that lb+ub will be greater than the largest integer. In fact, this is why NumberGameGuesser limits the range to half of the two bounds. This problem is rarely encountered in practice, but is worth considering as data sets get larger and larger.

Finding the smaller elements of a sequence

In the previous examples, we've known what we were searching for (e.g., Is this number in this sequence?). There are related problems in which we have less information about what we are searching for. For example, consider the problem of searching for the smallest element of a sequence. How might you find such an element? A first guess might be the first element, but that's often an incorrect guess. You might then refine your guess by then stepping through the remaining elements, updating your estimate of the smallest whenever you found a smaller element. In pseudocode,

guess = the first element of the sequence
for each remaining element of the sequence, e
  if (e < guess) then
    guess = e;
  end if
end for

  /** 
   * Compute the smallest element in the sequence.
   */
  public int smallest() {
    // Our guess as to the smallest element
    int guess = this.elements[0];
    // A counter variable
    int i;
    // Look through all subsequent elements
    for (i = 1; i < this.elements.length; ++i) {
      // If the element is smaller than our guess, then
      //   update the guess
      if (this.elements[i] < guess) {
        guess = this.elements[i];
      } // if
    } // for
    // That's it, we're done
    return guess;
  } // smallest()

As a variation, we might write an indexOfSmallest that returns the index of the smallest element in a subsequence? Why would we want such a method? As you've seen, whenever your write a method for a sequence, it is helpful to write a similar method for a subsequence. Why return an index rather than the actual value? Because it will be helpful for the subsequent experiments. The following is the method as it appears in the IntSeq class.

  /** 
   * Compute the index of the smallest element in the subsequence
   * given by lower bound lb and upper bound ub.
   */
  public int indexOfSmallest(int lb, int ub) {
    // Make sure the upper bound and lower bound are reasonable.
    if (lb < 0) {
       lb = 0;
    }
    if (ub >= this.elements.length) {
       ub = this.elements.length - 1;
    }
    // Our guess as to the index of the smallest element
    int guess = lb;
    // A counter variable
    int i;
    // Look through all subsequent elements
    for (i = lb + 1; i <= ub; ++i) {
      // If the element is smaller than our guess, then
      //   update the guess
      if (this.elements[i] < this.elements[guess]) {
        guess = i;
      } // if
    } // for
    // That's it, we're done
    return guess;
  } // indexOfSmallest()