Assignment 7: Broader analysis

Warning
One of the data sets in this assignment pertains to breast cancer. Breast cancer has impacted the lives of many people. If you are uncomfortable working with the data set or discussing the related issues, please speak with your instructor about using another data set.
Summary
For this assignment, you will develop one of two large-scale mini-projects. One involves automated text generation based on statistical properties of text; in that mini-project, you will gather data on a text and then attempt to generate new texts based on that text. The other involves some basic techniques of “learning”; in that mini-project, you will build one or more simple classifiers.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answers to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151 01] Assignment 7 and should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.
Warning
So that this assignment is a learning experience for everyone, we may spend class time publicly critiquing your work.

Option one: Simplified text generation

Background

What distinguishes different writers? Many things. Word choice. Sentence length. Grammar. Style. You can probably add others. In writing programs that analyze texts, some programmers explore statistical properties of the text: What are the more common words? What words most likely follow each other?

At some point, someone realized that you can also use these statistical properties to generate text as well as analyze text. How? Suppose we knew that in one corpus, the word “blue” is followed by the word “world” 50% of the time, “mood” 30% of the time, and “jeans” 20% of the time. In generating a text, if we generate the word “blue”, we could then pick a random percentage. If it’s less than 50%, we add the word “world”. If it’s less than 80% (50% + 30%), we add the word “mood”. And the rest of the time, we add “jeans”. What next? We use similar data for the next word (whichever one we just chose), and the next, and the next.
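
To make that concrete, here is a minimal sketch of that selection step, using a hypothetical table of follower probabilities for “blue”. (This helper is not part of the starter code below, which works with letters rather than words.)

;;; A hypothetical table: each entry pairs a follower of "blue"
;;; with the fraction of the time it appears.
(define blue-followers
  (list (cons "world" 0.5)
        (cons "mood" 0.3)
        (cons "jeans" 0.2)))

;;; Pick a follower by comparing a random number in [0,1) against
;;; the running total of the probabilities.
(define pick-follower
  (lambda (followers)
    (let kernel ([remaining followers]
                 [ran (random)])
      (if (or (null? (cdr remaining))
              (< ran (cdr (car remaining))))
          (car (car remaining))
          (kernel (cdr remaining)
                  (- ran (cdr (car remaining))))))))

Repeated calls to (pick-follower blue-followers) should produce “world” about 50% of the time, “mood” about 30% of the time, and “jeans” about 20% of the time.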

Of course, this kind of text generation isn’t quite that simple. We generally don’t want to base the next word on only one preceding word; the resulting text is too messy. Rather, we compute follow rates for groups of two, three, or even four words.

To do statistical text generation, we need a way to represent the information efficiently and a way to get the data from a text file into that representation. What are the representation issues? We need to figure out how to represent facts like “we need to” is followed by “figure” 10% of the time, “call” 20% of the time, and so on and so forth, and that “need to figure” is followed by “out” 90% of the time and “this” 10% of the time.

Simplifying the problem

Rather than deal with words, we will deal with individual letters. Letter-based statistical text processing is generally less successful than word-based statistical text processing, but it’s a bit easier to do with the knowledge you have.

We’ll start by looking only at the frequency of each letter (alphabetical and space, nothing else). To keep track of frequencies, we’ll build a vector of length 28. Position 0 will store the total number of letters we’ve seen. Position 1 will store the number of #\a’s. Position 2 will store the number of #\b’s. And so on and so forth. Position 27 will store the number of spaces.

;;; Procedure:
;;;   charstats
;;; Parameters:
;;;   [None]
;;; Purpose:
;;;   Generate a new "thing" we can use to store statistics on
;;;   character frequency.
;;; Produces:
;;;   stats, something we can use to store statistics on character
;;;     frequency.  (That is, something we can use with `record-char!`
;;;     and `random-char`.)
;;; Philosophy:
;;;   The client need not know how we store the statistics.  But
;;;   we will likely implement the statistics as a vector of length
;;;   (+ 1 "number of chars").  Element 0 of that vector gives the
;;;   total count of chars stored.  Element i (for i > 0) gives the
;;;   number of times we've seen the character whose number is i.
(define charstats
  (lambda ()
    (make-vector 28 0)))

Now, if we’re going to store letters in that vector, we need some helper functions. First, it will be useful to have a procedure that determines whether a character is something we want to process. (Remember, we’re skipping anything that’s not a letter or a space. So we’re skipping periods, apostrophes, numbers, and much, much more.) We’ll call that procedure valid?. Next, we’ll need a procedure that takes a valid character as input and converts it to an integer in the range 1 .. 27. (We’re using 1 .. 26 for the letters and 27 for space.) We’ll call that valid->num. It will also be useful to do the inverse of that conversion, converting from a number back to a character. We’ll call that num->valid. Here are those three procedures.

;;; Procedure:
;;;   valid?
;;; Parameters:
;;;   char, a character
;;; Purpose:
;;;   Determine if char is one of the valid characters used in our
;;;   text generation algorithm (space, #\a through #\z, #\A through #\Z).
;;; Produces:
;;;   ok, a boolean
(define valid?
  (lambda (char)
    (or (char=? char #\space)
        (char-ci<=? #\a char #\z))))

;;; Procedure:
;;;   valid->num
;;; Parameters:
;;;   char, a character
;;; Purpose:
;;;   Convert char to a representative number
;;; Produces:
;;;   number, an integer
;;; Preconditions:
;;;   (valid? char)
;;; Postconditions:
;;;   * space is 27
;;;   * #\a is 1
;;;   * ...
(define valid->num
  (let [(anum (char->integer #\a))]
    (lambda (char)
      (cond
        [(char=? char #\space)
         27]
        [else
         (+ 1 (- (char->integer (char-downcase char)) anum))]))))

;;; Procedure:
;;;   num->valid
;;; Parameters:
;;;   num, an integer
;;; Purpose:
;;;   Convert a number back to a character.
;;; Produces:
;;;   char, a character
;;; Preconditions:
;;;   num is a value returned by valid->num.
;;; Postconditions:
;;;   (num->valid (valid->num char)) = char
(define num->valid
  (let [(anum (char->integer #\a))]
    (lambda (num)
      (cond
        [(= num 27)
         #\space]
        [else
         (integer->char (+ anum -1 num))]))))

Let’s check on them quickly.

> (valid? #\E)
#t
> (valid? #\space)
#t
> (valid? #\3)
#f
> (valid->num #\space)
27
> (valid->num #\a)
1
> (valid->num #\A)
1
> (num->valid 26)
#\z
> (num->valid 5)
#\e

We’re now ready to put information into the vector. Each time we see a letter, we’ll convert it to an appropriate index and then increment the count at that index. We’ll also increment the count at index 0, which represents the total number of characters.

;;; Procedure:
;;;   record-char!
;;; Parameters:
;;;   stats, a value returned by `charstats`
;;;   char, a character
;;; Purpose:
;;;   Record information that we've seen another copy of char.
;;; Produces:
;;;   [Nothing; called for the side effect.]
(define record-char!
  (lambda (stats char)
    (vector-increment! stats 0)
    (vector-increment! stats (valid->num char))))
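
Note that record-char! relies on vector-increment!, a small helper from our course library. In case your environment does not provide it, here is a minimal sketch of what we assume it does: add one to the value stored at the given position.

;;; Increment the value at position pos of vector vec.
;;; (Assumed definition; use the library version if you have it.)
(define vector-increment!
  (lambda (vec pos)
    (vector-set! vec pos (+ 1 (vector-ref vec pos)))))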

Let’s try this procedure, too.

> (define sample (charstats))
> (record-char! sample #\a)
> (record-char! sample #\a)
> (record-char! sample #\m)
> (record-char! sample #\s)
> (record-char! sample #\space)
> (record-char! sample #\s)
> (record-char! sample #\s)
> sample
'#(7 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 3 0 0 0 0 0 0 0 1)

Our next goal is to use these data to “randomly” select a character. How do we do that? In this case, we’ve seen seven characters, so we call (random 7). If we get 0 or 1, we return #\a. If we get a 2, we return #\m. If we get a 3, 4, or 5, we return #\s. Otherwise, we return #\space. Here’s a procedure that achieves that goal through a slightly different process.

;;; Procedure:
;;;   random-char
;;; Parameters:
;;;   stats, a value returned by `charstats`
;;; Purpose:
;;;   Get an unpredictable character from `stats` based on the
;;;   distribution of characters represented by `stats`.
;;; Produces:
;;;   char, a character
;;; Preconditions:
;;;   * At least one char has been added to stats using `record-char!`
;;; Postconditions:
;;;   * The distribution of characters returned by this procedure is similar
;;;     to the distribution of characters in stats.
;;; Problems:
;;;   If no characters have been added, returns the space character.
(define random-char
  (lambda (stats)
    (let ([count (vector-ref stats 0)])
      (if (zero? count)
          #\space
          (let kernel ([ran (random count)]
                       [pos 1])
            (let ([freq (vector-ref stats pos)])
              (if (< ran freq)
                  (num->valid pos)
                  (kernel (- ran freq)
                          (+ pos 1)))))))))

Let’s try it.

> (random-char sample)
#\space
> (random-char sample)
#\s
> (random-char sample)
#\s
> (random-char sample)
#\a
> (random-char sample)
#\s
> (random-char sample)
#\space
> (random-char sample)
#\s
> (random-char sample)
#\a
> (random-char sample)
#\m

Not exactly the same distribution, but close. What if we want to generate strings, rather than individual characters? That should be straightforward. We just keep generating an appropriate number of random characters and then string ‘em all together.

;;; Procedure:
;;;   random-string
;;; Parameters:
;;;   stats, a value returned by `charstats`
;;;   len, a positive integer
;;; Purpose:
;;;   Generate a "random" string whose statistical properties match
;;;   those of stats.
;;; Produces:
;;;   str, a string
;;; Preconditions:
;;;   [See the preconditions of random-char]
;;; Postconditions:
;;;   * (string-length str) = len
;;;   * The contents of the string are hard to predict.
;;;   * For large enough len, the distribution of characters in str match
;;;     those in stats.
(define random-string
  (lambda (stats len)
    (let kernel ([remaining len]
                 [chars null])
      (if (zero? remaining)
          (list->string (reverse chars))
          (kernel (decrement remaining)
                  (cons (random-char stats) chars))))))
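
The kernel above also uses decrement, another small helper from our course library. If you need to define it yourself, we assume it simply subtracts one.

;;; Subtract one from n.
;;; (Assumed definition; use the library version if you have it.)
(define decrement
  (lambda (n)
    (- n 1)))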

Let’s see how it works.

> (random-string sample 10)
"asaamssssa"
> (random-string sample 10)
"a ssaasaaa"
> (random-string sample 10)
"mssm sms a"
> (random-string sample 10)
" saass sam"

Yeah, that looks okay, too.

But we want better statistics. In particular, we probably want to gather statistics from a file. For that, we can use a standard pattern of file recursion.

;;; Procedure:
;;;   file->stats
;;; Parameters:
;;;   fname, a string
;;; Purpose:
;;;   Convert a file to a set of statistics as given by
;;;   `charstats` and `record-char!`.
;;; Produces:
;;;   stats, a vector of statistics.
;;; Preconditions:
;;;   fname names a valid text file.
;;; Postconditions:
;;;   * `stats` is a set of statistics that can be used by
;;;     `random-char`.
;;;   * The frequencies in `stats` correspond to the frequencies
;;;     in the file named by `fname`.
(define file->stats
  (lambda (fname)
    (let ([stats (charstats)]
          [port (open-input-file fname)])
      (let kernel ()
        (let ([char (read-char port)])
          (cond
            [(eof-object? char)
             (close-input-port port)
             stats]
            [(valid? char)
             (record-char! stats char)
             (kernel)]
            [else
             (kernel)]))))))

Let’s see how it does with the Project Gutenberg version of Jane Eyre.

> (define jane-eyre (file->stats "pg1260.txt"))
> jane-eyre
'#(981056
   63887
   11440
   19374
   37989
   102518
   17270
   15524
   46422
   57189
   1328
   6166
   32971
   22643
   55321
   61925
   12614
   958
   48590
   50882
   68646
   24008
   7743
   18986
   1306
   17634
   331
   177391)
> (random-string jane-eyre 100)
"bute xfhfvoyttshn hr  ee hsmad wteuyetii  i ontte  iy asa  m f anf s    ggyohai tahnti le eahnsn fws"
> (random-string jane-eyre 100)
"vuhtr glb edmigosno  tnh vemeiarakas rni ed tlreltnneluritf lfdettity  sdeol    htdve  e oe lote eoe"

As you can see, this approach does not work particularly well at the letter level. Let’s build a vector of vectors. The vector at position i stores the letters that follow the character whose number is i. For example, position 1 gives the frequency distribution of letters that follow #\a.

;;; Procedure:
;;;   follow-one-stats
;;; Parameters:
;;;   [None]
;;; Purpose:
;;;   Create a "thing" that allows us to keep track of statistics
;;;   for what characters follow each character.
;;; Produces:
;;;   stats, a thing
(define follow-one-stats
  (lambda ()
    (vector-map! (lambda (val) (charstats))
                 (make-vector 28))))

Just as we needed a way to record letters in the basic vector of statistics, we need ways to record “follow information” in this vector of vectors. That will be a two-step process. First we index into the main vector to get one of those charstats things. Then we just use the procedure we wrote previously.

;;; Procedure:
;;;   record-follow-one!
;;; Parameters:
;;;   stats, a value returned by follow-one-stats
;;;   prev, a valid character
;;;   char, a valid character
;;; Purpose:
;;;   Record the information that char follows prev
;;; Produces:
;;;   [Nothing; called for the side effects]
(define record-follow-one!
  (lambda (stats prev char)
    (record-char! (vector-ref stats (valid->num prev))
                  char)))
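
As a quick sanity check (assuming the definitions above), recording that #\h follows #\t should bump two counts in the sub-vector for #\t: the total at position 0 and the count for #\h at position 8.

> (define fstats (follow-one-stats))
> (record-follow-one! fstats #\t #\h)
> (vector-ref fstats (valid->num #\t))
'#(1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)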

Reading the data from a file is a bit harder. Now we have to keep track of not just the current character, but also the previous character. And we need separate conditions for when the previous character is invalid and for when the current character is invalid. Here’s an attempt at that.

;;; Procedure:
;;;   file->follow-one-stats
;;; Parameters:
;;;   fname, a string
;;; Purpose:
;;;   Convert a file to a set of statistics as given by `follow-one-stats`
;;;   and `record-follow-one!`.
;;; Produces:
;;;   stats, the thing returned by follow-one-stats
;;; Preconditions:
;;;   fname names a valid text file.
;;; Postconditions:
;;;   * `stats` is a set of statistics that can be used wherever we use
;;;     the values from `follow-one-stats`.
;;;   * The frequencies in `stats` correspond to the frequencies
;;;     in the file named by `fname`.
(define file->follow-one-stats
  (lambda (fname)
    (let ([stats (follow-one-stats)]
          [port (open-input-file fname)])
      (let kernel ([prev (read-char port)])
        (let ([char (read-char port)])
          (cond
            ; When we reach the end, clean up and stop
            [(eof-object? char)
             (close-input-port port)
             stats]
            ; If the previous character is not valid, skip over it
            [(not (valid? prev))
             (kernel char)]
            ; If the current character is not valid, don't record
            ; anything and read over the character.
            [(not (valid? char))
             (kernel (read-char port))]
            ; Otherwise, we record that char follows prev and continue,
            ; making the current character the new previous character.
            [else
             (record-follow-one! stats prev char)
             (kernel char)]))))))

Let’s see how that works.

> (define jane-eyre (file->follow-one-stats "pg1260.txt"))
> (vector-ref jane-eyre 26)
'#(324 41 1 0 0 152 0 0 0 24 1 0 30 0 0 15 0 0 0 0 0 7 0 0 0 8 36 9)

Let’s see … of the 331 letter z’s in Jane Eyre, 324 are followed by another letter or a space. 41 are followed by #\a. 1 is followed by #\b. (That’s the word “Spitzbergen”.) 152 are followed by #\e. And so on and so forth.

Generating strings will be a bit harder. We’ll need a starting character as well as the statistics. What about the case in which we’ve never found a character that follows the current character? Fortunately, random-char handles that case.

;;; Procedure:
;;;   random-string-one
;;; Parameters:
;;;   stats, a value returned by `follow-one-stats`
;;;   prev, a valid character
;;;   len, a positive integer
;;; Purpose:
;;;   Generate a "random" string whose statistical properties match
;;;   those of stats.
;;; Produces:
;;;   str, a string
(define random-string-one
  (lambda (stats prev len)
    ; Randomly select among the successors of prev
    (let ([random-next (lambda (prev)
                         (let ([substats (vector-ref stats (valid->num prev))])
                           (random-char substats)))])
      ; Build a list of successors then convert to a string
      (let kernel ([remaining len]
                   [prev prev]
                   [chars null])
        (if (zero? remaining)
            (list->string (reverse chars))
            (let ([char (random-next prev)])
              (kernel (decrement remaining)
                      char
                      (cons char chars))))))))

Is our output from this procedure any better?

> (random-string-one jane-eyre #\space 100)
"erores shes ttre phe t oun commas at as s atheved wnd enedibusoffres  l adon atr e scompe be angh t "
> (random-string-one jane-eyre #\space 100)
"kinghey an tostertowed t delasiechaininoreved tt se ir hepithiceswe as berng adeyoroutermy wininache"
> (random-string-one jane-eyre #\space 100)
"s lofe heas coleeng ale t andelitwit tousp ut bl t dyostasabee rpe ma rimpaswale houraroven ictr i s"
> (random-string-one jane-eyre #\space 100)
"re nsthe wis dsen oun mpinavin m shef y pt s waknd bloul d mewazed onourthis a wadoo fis an tove d n"

It’s not yet English. But we’re starting to see word patterns in this output. That’s a good sign. Let’s try two characters. We’ll leave you to interpret the code.

;;; Procedure:
;;;   follow-two-stats
;;; Parameters:
;;;   [None]
;;; Purpose:
;;;   Create a "thing" that allows us to keep track of statistics
;;;   for what characters follow each pair of characters.
;;; Produces:
;;;   stats, a thing
(define follow-two-stats
  (lambda ()
    (vector-map! (lambda (val) (charstats))
                 (make-vector (+ 1 (* 27 28))))))

;;; Procedure:
;;;   index-two
;;; Parameters:
;;;   prevprev, a valid character
;;;   prev, a valid character
;;; Purpose:
;;;   Find the position in a follow-two-stats vector that we should
;;;   use for the pair (prevprev,prev).
;;; Produces:
;;;   index, an integer
(define index-two
  (lambda (prevprev prev)
    (+ (* 27 (valid->num prevprev))
       (valid->num prev))))
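
For example, assuming the definitions above, we would expect interactions like the following. (The numbers come from working through valid->num by hand: #\t is 20, #\h is 8, #\space is 27, and #\a is 1.)

> (index-two #\t #\h)
548
> (index-two #\space #\a)
730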

;;; Procedure:
;;;   record-follow-two!
;;; Parameters:
;;;   stats, a value returned by follow-two-stats
;;;   prevprev, a valid character
;;;   prev, a valid character
;;;   char, a valid character
;;; Purpose:
;;;   Record the information that char follows prevprev and prev
;;; Produces:
;;;   [Nothing; called for the side effects]
(define record-follow-two!
  (lambda (stats prevprev prev char)
    (record-char! (vector-ref stats (index-two prevprev prev))
                  char)))

;;; Procedure:
;;;   file->follow-two-stats
;;; Parameters:
;;;   fname, a string
;;; Purpose:
;;;   Convert a file to a set of statistics as given by `follow-two-stats`
;;;   and `record-follow-two!`.
;;; Produces:
;;;   stats, the thing returned by follow-two-stats
;;; Preconditions:
;;;   fname names a valid text file.
;;; Postconditions:
;;;   * `stats` is a set of statistics that can be used wherever we use
;;;     the values from `follow-two-stats`.
;;;   * The frequencies in `stats` correspond to the frequencies
;;;     in the file named by `fname`.
(define file->follow-two-stats
  (lambda (fname)
    (let ([stats (follow-two-stats)]
          [port (open-input-file fname)])
      (let kernel ([prevprev (read-char port)]
                   [prev (read-char port)])
        (let ([char (read-char port)])
          (cond
            ; When we reach the end, clean up and stop
            [(eof-object? char)
             (close-input-port port)
             stats]
            ; If any of the characters is not valid, skip
            [(or (not (valid? prevprev))
                 (not (valid? prev))
                 (not (valid? char)))
             (kernel prev char)]
            ; Otherwise, we record that char follows prevprev and prev.
            ; We then continue, updating appropriately
            [else
             (record-follow-two! stats prevprev prev char)
             (kernel prev char)]))))))

;;; Procedure:
;;;   random-string-two
;;; Parameters:
;;;   stats, a value returned by `follow-two-stats`
;;;   prevprev, a valid character
;;;   prev, a valid character
;;;   len, a positive integer
;;; Purpose:
;;;   Generate a "random" string whose statistical properties match
;;;   those of stats.
;;; Produces:
;;;   str, a string
(define random-string-two
  (lambda (stats prevprev prev len)
    ; Randomly select among the successors of prevprev and prev
    (let ([random-next (lambda (prevprev prev)
                         (let ([substats
                                (vector-ref stats
                                            (index-two prevprev prev))])
                           (random-char substats)))])
      ; Build a list of successors then convert to a string
      (let kernel ([remaining (- len 2)]
                   [prevprev prevprev]
                   [prev prev]
                   [chars (list prev prevprev)])
        (if (zero? remaining)
            (list->string (reverse chars))
            (let ([char (random-next prevprev prev)])
              (kernel (decrement remaining)
                      prev
                      char
                      (cons char chars))))))))

What do we get with these two characters of history? Let’s see …

> (define jane-eyre (file->follow-two-stats "pg1260.txt"))
> (random-string-two jane-eyre #\t #\h 100)
"thersen had comet stior you like whaps ans a loseenerwastaid now colet con the rear nou sh thand maree"
> (random-string-two jane-eyre #\t #\h 100)
"the she yourritin yous ing grive heat clunch waskin whower imigin bothrojeckereivaption shown no brow "
> (random-string-two jane-eyre #\w #\h 100)
"whistiand somme pat agally of hairy coloul saing voinbe ove its thadeen antansfied prothat sery at usb"

It’s not English. But it’s getting closer to natural language, isn’t it?

Does a different author’s text produce different output? Let’s try.

> (define cat-in-the-hat (file->follow-two-stats "cat-in-the-hat.txt"))
> (random-string-two cat-in-the-hat #\t #\h 100)
"the slood hour the sirs gooey clay swill ho up i don mot backs them clay think what lack hat ored not "
> (random-string-two cat-in-the-hat #\t #\h 100)
"th thave baget ar mrs on the snamings eat cat ings clowled his of come wishem he me slike a hicks thic"
> (random-string-two cat-in-the-hat #\t #\h 100)
"the hould thim i like did she thew some sin they do nook ch wast like down we started cat abox sinds b"
> (define yertle (file->follow-two-stats "yertle.txt"))
> (random-string-two yertle #\y #\e 100)
"yerees seed of ittled us ris sand high alaide up see the birdecid turtle pland ond the wayine thed and"
> (random-string-two yertle #\y #\e 100)
"yere turtill hiseed i sh up yon that hiles rous alanothe turped he of theat ste shourtle th amed up th"
> (random-string-two yertle #\t #\u 100)
"turtle but i say seeing of they the be swittles of the burtle grom the bettler and even therflittoon f"
> (random-string-two yertle #\t #\u 100)
"turtle the was theaturtles swittle plaings the stall the king grone encle be the tur overees his a thr"
> (random-string-two yertle #\t #\u 100)
"turtled of a com the cat muds bell turte pereat plad up of thereas rules just i dond her a pill mighe "
> (random-string-two yertle #\t #\u 100)
"tur mousan ove limmack aftepped of turtle ne king ore stack that land a thight on the beyou sone whoo "

Awesome, isn’t it? I think we’ve found the secret to Olde English.

Just in case you were wondering, we really do see different output for different texts. Here’s what we get for the Spanish-language version of one of my favorite short stories, “Tlön, Uqbar, Orbis Tertius”.

> (define tuot (file->follow-two-stats "tuot-spanish.txt"))
> (random-string-two tuot #\u #\q 100)
"uqbara de pulazonjetomor hable en prios o probanza also lasor el ar he ta yo de pre no tivo dechipieza"
> (random-string-two tuot #\u #\q 100)
"uqbatilla faciraliblesapante otron dina de de ba oton undadijorrasilo rescurambirrio ind obje heradque"
> (random-string-two tuot #\u #\q 100)
"uqba mar mun crumente las arlas explibencurda suseconcerres undo teme lade cuchiblitiusta ber que el h"

The assignment

Your goal for this option is to improve our text generation process. Make the following changes to the starter code.

  1. Add support for apostrophes (as in “can’t”) and for commas. You will need to figure out what procedures need changing. Certainly, you need to change valid?, valid->num, and num->valid. But you’ll also need to change any code that assumes that we have only 27 characters.

  2. Treat all end-of-sentence punctuation (periods, question marks, and exclamation points) as a period. You’ll need to make similar changes to those in the previous step. But you’ll also need to have valid->num change all end-of-sentence punctuation to the same number and have num->valid change that number back to a period.

  3. Treat all whitespace characters (space, newline, and tab) as a space.

  4. There are many characters that don’t fit into the classes of characters we’ve chosen. Treat all those other characters as an underscore. (This may be a dangerous approach as we can possibly end up with long sequences of underscores.)

  5. Create a series of procedures that will do prediction based on three characters, rather than two.

  6. Make one more change that you think would be valuable. Make sure to explain the change clearly for the grader.

Option two: Simplified neural networks

Background

One of the important contributions of computer science to the broader field of data science comes through so-called “machine learning” techniques. Broadly speaking, these techniques involve taking a process for classifying pieces of data and repeatedly refining that process: we run it on known inputs/outputs and adjust the classification scheme based on the relationship of the inputs to the outputs. The set of known inputs/outputs is called the “training set”.

In this assignment, we will consider a relatively straightforward classification approach in which we identify a variety of numeric characteristics of our input, multiply each characteristic by a computed “weight”, add the results, and then compare the result to some threshold.
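
For instance (with made-up numbers), if the characteristics of one input were #(0.8 0.2 0.5) and our current weights were #(0.4 0.1 0.3), the weighted sum would be (+ (* 0.8 0.4) (* 0.2 0.1) (* 0.5 0.3)) = 0.49; with a threshold of 0.5, that sum falls below the threshold, so we would answer “no” for that input.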

For example, suppose we were trying to predict whether first-year students were going to declare a computer science major. We might have information on (a) whether or not they listed “computer science” on their list of prospective fields on their Common Application; (b) how many prospective fields they listed on their Common Application; (c) their SAT or ACT Math Scores; (d) the semester in which they first took computer science at Grinnell; (e) their grade in that class; and (f) their grade in their first Mathematics course. Which of these is most predictive? We don’t know at first, so we make guesses. That gives us a heuristic for predicting majors. We then compare our prediction to the actual state for each element in our training set (e.g., students in the class of 2018). When our predictions fail to match, we look for the components that most likely contributed to that failure and update them. For example, we might have thought that the listing of CS on the Common App is predictive, but we might discover that it fails to distinguish students well.

We typically use a variety of data in determining how much to update each of the weights. These include

  • the input; we often update the weights of larger inputs more than we update the weights of smaller inputs.
  • the output; we make larger updates when the output is far from the expected value than when it is close to the expected value.
  • an adjustment factor; we don’t want to make very large updates at any point. Rather, we nudge the values closer and closer to the expected value.

An example

Let’s start with a simple three-parameter function that is relatively easy to understand. The majority function takes only 0 or 1 values as input and returns whichever value appears most frequently. We can represent those data as a table of the following form. In the table, each line has four items, representing the three inputs and the one output.

0,0,0,0
0,0,1,0
0,1,0,0
0,1,1,1
1,0,0,0
1,0,1,1
1,1,0,1
1,1,1,1

We can’t design a perfect set of weights for this set of data. But we can come close. Think about what weights you would choose if you had the option.

But our goal is not to choose weights. Our goal is to get the computer to figure them out. So, let’s pick a starting set of weights. Suppose we pick the set #(0 0.5 1). We’ve represented the weights in a vector for convenience. We’ll use a relatively high adjustment factor of 0.1.

We pick a random row from the table. Let’s say we pick “0,0,1,0”.

  • inputs are 0 0 1
  • weights are 0 0.5 1
  • expected 0
  • output is 1.0, which is off by -1.0
  • adjustments
    • change 0 by -1.0(0/1.0)0.1 => 0
    • change 0.5 by -1.0(0/1.0)0.1 => 0.5
    • change 1 by -1.0(1/1.0)0.1 => 0.9

It’s not a big change, but I think we’ve made progress. Let’s try giving the same input and the new set of weights.

  • inputs are 0 0 1
  • weights are 0 0.5 0.9
  • expected 0
  • output is 0.9, which is off by -0.9
  • adjustments
    • change 0 by -0.9(0/1.0)0.1 => 0
    • change 0.5 by -0.9(0/1.0)0.1 => 0.5
    • change 0.9 by -0.9(1/1.0)0.1 => 0.81

Yup, it’s slightly closer. Now we’re only off by -0.9 rather than -1. Note that because the first two inputs are zero, we won’t change their weights based on this input. We might choose a different approach to updating. But if we’re sampling the inputs randomly, it’s okay if not everything gets updated on every step. In addition, most sample data do not provide only 0’s and 1’s as inputs.

Let’s try another input. Suppose we pick “1,0,1,1”.

  • inputs are 1 0 1
  • weights are 0 0.5 0.9
  • expected 1
  • output is 0.9, which is off by 0.1
  • adjustments
    • change 0 by 0.1(1/2.0)0.1 => 0.005
    • change 0.5 by 0.1(0/2.0)0.1 => 0.5
    • change 0.9 by 0.1(1/2.0)0.1 => 0.905

This change improved the output for this input. But it likely made our answers slightly worse for some other inputs. However, if we do the same thing again and again and again, we will eventually end up with a reasonably good set of weights.

For example, if I start with those weights (0, 0.5, and 1) and run the algorithm for 20000 iterations with an adjustment factor of .001, I get the weights '#(0.31436463842814283 0.33315929468375294 0.3494344849572705). How does that relate to what we might have guessed in the first step? Well, I would have guessed that each weight should be about 1/3, since we want the three values to count equally. The computed value is not perfect, but it’s reasonable.

Is this better? Let’s check. We’ll look at the differences in each case and see if they go up or down.

First, we’ll explore what happens when the weights are '#(0 0.5 1).

Inputs: #(0 0 0)        Predicted: 0       Actual: 0      Difference: 0   
Inputs: #(0 0 1)        Predicted: 1       Actual: 0      Difference: -1  
Inputs: #(0 1 0)        Predicted: 0.5     Actual: 0      Difference: -0.5        
Inputs: #(0 1 1)        Predicted: 1.5     Actual: 1      Difference: -0.5        
Inputs: #(1 0 0)        Predicted: 0       Actual: 0      Difference: 0   
Inputs: #(1 0 1)        Predicted: 1       Actual: 1      Difference: 0   
Inputs: #(1 1 0)        Predicted: 0.5     Actual: 1      Difference: 0.5 
Inputs: #(1 1 1)        Predicted: 1.5     Actual: 1      Difference: -0.5        
Sum of squared errors: 2.0

You’ll note that we have a “sum of squared errors” description. That suggests how good or bad the weights are on a particular data set.

Now, let’s try it with weights of '#(0.31 0.33 0.35), those given by repeated adjustment.

Inputs: #(0 0 0)        Predicted: 0       Actual: 0      Difference: 0   
Inputs: #(0 0 1)        Predicted: 0.34    Actual: 0      Difference: -0.34       
Inputs: #(0 1 0)        Predicted: 0.33    Actual: 0      Difference: -0.33       
Inputs: #(0 1 1)        Predicted: 0.67    Actual: 1      Difference: 0.33
Inputs: #(1 0 0)        Predicted: 0.31    Actual: 0      Difference: -0.31       
Inputs: #(1 0 1)        Predicted: 0.65    Actual: 1      Difference: 0.35        
Inputs: #(1 1 0)        Predicted: 0.64    Actual: 1      Difference: 0.36        
Inputs: #(1 1 1)        Predicted: 0.98    Actual: 1      Difference: 0.02
Sum of squared errors: 0.682

It looks better, doesn’t it?

Another example

Suppose we are trying to build a classifier for which we have only the following training data. Each line represents three inputs and one output.

0.29,0.44,0,1
0.42,0.99,0.85,1
0.39,0.47,0.05,1
0.7,0.45,0.98,0
0.17,0.42,0.04,1
0.58,0.91,0.44,1
0.21,0.02,0.48,0
0.38,0.01,0.86,0
0.17,0.44,0.93,0
0.66,0.14,0.75,0
0.92,0.43,0.28,1
0.59,0.13,0.75,0
0.52,0.02,0.65,0
0.33,0.33,0.27,1
0.88,0.38,0.56,1
0.8,0.01,0.99,0
0.58,0.67,0.52,1
0.88,0.38,0.67,1
0.87,0.53,0.11,1
0.69,0.64,0.19,1

In this case, we have no idea what the function is supposed to be. But we can explore the error terms. We’ll start with #(0.5 0.6 0.7), which seems as good as any set of weights. How good are these weights? Let’s see.

Inputs: #(0.29 0.44 0)  Predicted: 0.336   Actual: 1      Difference: 0.664
Inputs: #(0.42 0.99 0.85)       Predicted: 1.173   Actual: 1      Difference: -0.173
Inputs: #(0.39 0.47 0.05)       Predicted: 0.421   Actual: 1      Difference: 0.579
Inputs: #(0.7 0.45 0.98)        Predicted: 1.093   Actual: 0      Difference: -1.093
Inputs: #(0.17 0.42 0.04)       Predicted: 0.302   Actual: 1      Difference: 0.698
Inputs: #(0.58 0.91 0.44)       Predicted: 0.951   Actual: 1      Difference: 0.049
Inputs: #(0.21 0.02 0.48)       Predicted: 0.382   Actual: 0      Difference: -0.382
Inputs: #(0.38 0.01 0.86)       Predicted: 0.673   Actual: 0      Difference: -0.673
Inputs: #(0.17 0.44 0.93)       Predicted: 0.846   Actual: 0      Difference: -0.846
Inputs: #(0.66 0.14 0.75)       Predicted: 0.784   Actual: 0      Difference: -0.784
Inputs: #(0.92 0.43 0.28)       Predicted: 0.751   Actual: 1      Difference: 0.249
Inputs: #(0.59 0.13 0.75)       Predicted: 0.759   Actual: 0      Difference: -0.751
Inputs: #(0.52 0.02 0.65)       Predicted: 0.608   Actual: 0      Difference: -0.608
Inputs: #(0.33 0.33 0.27)       Predicted: 0.459   Actual: 1      Difference: 0.541
Inputs: #(0.88 0.38 0.56)       Predicted: 0.878   Actual: 1      Difference: 0.122
Inputs: #(0.8 0.01 0.99)        Predicted: 0.919   Actual: 0      Difference: -0.919
Inputs: #(0.58 0.67 0.52)       Predicted: 0.879   Actual: 1      Difference: 0.121
Inputs: #(0.88 0.38 0.67)       Predicted: 0.944   Actual: 1      Difference: 0.056
Inputs: #(0.87 0.53 0.11)       Predicted: 0.679   Actual: 1      Difference: 0.321
Inputs: #(0.69 0.64 0.19)       Predicted: 0.71    Actual: 1      Difference: 0.29
Sum of squared errors: 6.772255

Now, let’s repeatedly adjust using the approach described earlier. That is, we add something like (* difference (/ input sum-of-inputs) adjustment) to each weight, where input is the corresponding input. After 20,000 repetitions with an adjustment of .01, I end up with #(0.6838130212100684 1.3819690813246843 -0.6488317339705226), which I’ll simplify to #(0.68 1.38 -0.65). Note that this is an interesting set of weights. One of the weights is greater than 1. One of the weights is negative. But that’s how this approach works sometimes. Are these weights better? Let’s see.

Inputs: #(0.29 0.44 0)  Predicted: 0.8044  Actual: 1      Difference: 0.1956
Inputs: #(0.42 0.99 0.85)       Predicted: 1.0993  Actual: 1      Difference: -0.0993
Inputs: #(0.39 0.47 0.05)       Predicted: 0.8813  Actual: 1      Difference: 0.1187
Inputs: #(0.7 0.45 0.98)        Predicted: 0.46    Actual: 0      Difference: -0.46
Inputs: #(0.17 0.42 0.04)       Predicted: 0.6692  Actual: 1      Difference: 0.3308
Inputs: #(0.58 0.91 0.44)       Predicted: 1.3642  Actual: 1      Difference: -0.3642
Inputs: #(0.21 0.02 0.48)       Predicted: -0.1416 Actual: 0      Difference: 0.1416
Inputs: #(0.38 0.01 0.86)       Predicted: -0.2868 Actual: 0      Difference: 0.2868
Inputs: #(0.17 0.44 0.93)       Predicted: 0.1183  Actual: 0      Difference: -0.1183
Inputs: #(0.66 0.14 0.75)       Predicted: 0.1545  Actual: 0      Difference: -0.1545
Inputs: #(0.92 0.43 0.28)       Predicted: 1.037   Actual: 1      Difference: -0.037
Inputs: #(0.59 0.13 0.75)       Predicted: 0.0931  Actual: 0      Difference: -0.0931
Inputs: #(0.52 0.02 0.65)       Predicted: -0.0413 Actual: 0      Difference: 0.0413
Inputs: #(0.33 0.33 0.27)       Predicted: 0.5043  Actual: 1      Difference: 0.4957
Inputs: #(0.88 0.38 0.56)       Predicted: 0.7588  Actual: 1      Difference: 0.2412
Inputs: #(0.8 0.01 0.99)        Predicted: -0.0857 Actual: 0      Difference: 0.0857
Inputs: #(0.58 0.67 0.52)       Predicted: 0.981   Actual: 1      Difference: 0.019
Inputs: #(0.88 0.38 0.67)       Predicted: 0.6873  Actual: 1      Difference: 0.3127
Inputs: #(0.87 0.53 0.11)       Predicted: 1.2515  Actual: 1      Difference: -0.2515
Inputs: #(0.69 0.64 0.19)       Predicted: 1.2289  Actual: 1      Difference: -0.2289
Sum of squared errors: 1.19282223

Yup, that’s definitely a big improvement. Importantly, all but one of the differences are now clearly less than 1/2. If we rounded, all of the answers would be correct.

Getting started

Your first goals are naturally to compute predictions and to figure out ways to improve weights.

a. Write a procedure, (predict inputs weights), that takes equal-length vectors of inputs and weights and produces the output that the weights predict. Your procedure should multiply corresponding elements of inputs and weights and then add the products together.

b. Write a procedure, (improve inputs weights actual adjustment), that takes as input one set of input values, one set of weights, the actual (or expected) output for those inputs, and the adjustment factor. The procedure should produce a new set of “improved” weights. For example, if we mimicked the first two examples from above, we might see

> (improve #(0 0 1) #(0 0.5 1) 0 0.1)
'#(0 0.5 0.9)
> (improve #(1 0 1) #(0 0.5 0.9) 1 0.1)
'#(0.005 0.5 0.905)

c. Write a procedure, (refine samples weights adjustment steps) that takes a vector of samples, a vector of weights, an adjustment, and a number of steps as inputs. Each sample in samples should contain the input values and the output. For example, we would have #(0 0 1 0) and #(1 0 1 1) in the samples for the majority function. Your procedure should repeatedly improve the set of weights by randomly selecting one of the samples and then calling improve appropriately. How many times do you improve the weights? steps times, of course.

d. Test your procedure on the majority function. Here is some useful code to help with that.

;;; Value:
;;;   majority-samples
;;; Type:
;;;   vector of vectors of reals
;;; Summary:
;;;   All the sample data for the majority function.
(define majority-samples
  (vector
   (vector 0 0 0 0)
   (vector 0 0 1 0)
   (vector 0 1 0 0)
   (vector 0 1 1 1)
   (vector 1 0 0 0)
   (vector 1 0 1 1)
   (vector 1 1 0 1)
   (vector 1 1 1 1)))
;;; Procedure:
;;;   analyze-solution
;;; Parameters:
;;;   samples, a vector of vectors of real numbers
;;;   weights, a vector of real numbers
;;; Purpose:
;;;   Conduct and print out an analysis of the given set of weights on the
;;;   given set of samples.
;;; Produces:
;;;   total-error, a real number
;;; Preconditions:
;;;   * Each element of samples is of the form
;;;     `#(input1 ... inputn desired-output)`
;;;   * Each element of samples has length one greater than the
;;;     length of weights.
;;; Postconditions:
;;;   total-error represents the sum of the squares of errors.
(define analyze-solution
  (lambda (samples weights)
    (let kernel ([pos 0]
                 [total-error 0])
      (cond
        [(< pos (vector-length samples))
         (let* ([sample (vector-ref samples pos)]
                [len (vector-length sample)]
                [inputs (vector-take sample (- len 1))]
                [actual (vector-ref sample (- len 1))]
                [prediction (predict inputs weights)]
                [difference (- actual prediction)])
           (println "Inputs: " inputs #\tab
                    "Predicted: " prediction #\tab
                    "Actual: " actual #\tab
                    "Difference: " difference)
           (kernel (+ pos 1)
                   (+ total-error (square difference))))]
        [else
         (println "Sum of squared errors: " total-error)
         total-error]))))

;;; Procedure:
;;;   println
;;; Parameters:
;;;   val1, a Scheme value
;;;   val2, a Scheme value
;;;   ...
;;;   valn, a Scheme value
;;; Purpose:
;;;   Print all of the values followed by a newline
;;; Produces:
;;;   [Nothing; called for the side effect.]
(define println
  (lambda vals
    (map display vals)
    (newline)))
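
Once you have written predict (part a), you should be able to reproduce the earlier analysis of the weights #(0 0.5 1) with a call like the following; it should print the table shown above and return 2.0.

> (analyze-solution majority-samples (vector 0 0.5 1))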

Working with real data

You’re now ready to move on to a real data set. For this assignment, we will be working with a set of data on cell characteristics along with an associated diagnosis.

http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

Here are the first few lines of the data file.

1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4

The first column represents the identifier of the sample. The last column represents the diagnosis (2 for benign, 4 for malignant). The remaining columns represent the different attributes.

We’ve put all of the data in the file /home/rebelsky/Desktop/breast-cancer-data.csv.

a. Arrange to read all of the data into a variable called raw-cell-data.

b. As you know from experience, large data sets sometimes have missing columns. Write an instruction or instructions to filter out any data that do not fit the form (e.g., that have fewer than 11 values in a row, that have middle values that are not integers in the range 1-10, that have a final value that is not 2 or 4). Call that result clean-cell-data.

c. It turns out that the standard learning algorithms do better if you have one extra column that is identical for all data. Add a column which has a 1 for every row.

d. You may find it useful to change the last column to 0 and 1, rather than 2 and 4. (That’s up to you.)

e. All of our examples used data in the range [0..1]. These data are clearly not in that range. You may find it useful to write a procedure to convert from the current to the [0..1] range.

f. The first 400 rows will be our training data. Arrange to put all of those rows into a new vector of vectors called training-samples. Arrange to put the remaining rows into a vector of vectors called test-samples.

g. Using the refine procedure you wrote earlier, develop a set of weights that seems to successfully predict whether a cell is benign or malignant.

h. Determine how well your procedure does on the test samples. Indicate the percentage of false positives and false negatives.

i. Write an English-language interpretation of the set of weights.

Notes

There are a few subtleties to the algorithm.

You may notice that if all of the inputs are 0, the weights won’t change. In that case, I would suggest that you change each weight equally. (You can tell that you have all zero inputs if their sum is zero.)

In computing the sum, you should take the sum of the absolute values of the inputs. However, in computing the change, you should use the actual value of the input. (I don’t think we have any negative inputs in our model, but the general case does have negative inputs.)

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly).