Assignment 7: Broader analysis
 Warning
 One of the data sets in this assignment pertains to breast cancer. Breast cancer has impacted the lives of many people. If you are uncomfortable working with the data set or discussing the related issues, please speak with your instructor about using another data set.
 Summary
 For this assignment, you will develop one of two large-scale mini-projects. One involves automated text generation based on statistical properties of text; in that mini-project, you will gather data on a text and then attempt to generate new texts based on that text. The other involves some basic techniques of “learning”; in that mini-project, you will build one or more simple classifiers.
 Collaboration
 You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
 Submitting
 Email your answers to csc15101grader@grinnell.edu. The subject of your email should be [CSC151 01] Assignment 7 and should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.
 Warning
 So that this assignment is a learning experience for everyone, we may spend class time publicly critiquing your work.
Option one: Simplified text generation
Background
What distinguishes different writers? Many things. Word choice. Sentence length. Grammar. Style. You can probably add others. In writing programs that analyze texts, some programmers explore statistical properties of the text: What are the more common words? What words most likely follow each other?
At some point, someone realized that you can also use these statistical properties to generate text as well as analyze it. How? Suppose we knew that in one corpus, the word “blue” is followed by the word “world” 50% of the time, “mood” 30% of the time, and “jeans” 20% of the time. In generating a text, if we generate the word “blue”, we could then pick a random percentage. If it’s less than 50%, we add the word “world”. If it’s less than 80% (50% + 30%), we add the word “mood”. And the rest of the time, we add “jeans”. What next? We use similar data for the next word (whichever word we just chose), and the next, and the next.
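To make the percentage scheme concrete, here is a tiny sketch in Scheme. It is illustrative only, not part of the assignment; the word list and the percentages are the hypothetical ones from the paragraph above.

```scheme
;;; Pick a follower of "blue" using the example distribution:
;;; "world" 50% of the time, "mood" 30%, "jeans" 20%.
;;; (Illustrative only; not part of the assignment code.)
(define next-after-blue
  (lambda ()
    ; A random integer in the range 0..99 plays the role of
    ; the random percentage.
    (let ([roll (random 100)])
      (cond
        [(< roll 50) "world"]
        [(< roll 80) "mood"]
        [else "jeans"]))))
```

Calling (next-after-blue) repeatedly should produce “world” about half the time, “mood” about 30% of the time, and “jeans” the rest.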
Of course, this kind of text generation isn’t that simple. We generally don’t want to consider only the word that follows a single word; that text is too messy. Rather, we compute follow rates for groups of two or three or even four words.
To do statistical text generation, we need to find a way to represent the information efficiently, and we need to find a way to get the data from the text file into that representation. What are the representation issues? We need to figure out how to represent facts like “we need to” is followed by “figure” 10% of the time, “call” 20% of the time, and so on; and that “need to figure” is followed by “out” 90% of the time and “this” 10% of the time.
Simplifying the problem
Rather than deal with words, we will deal with individual letters. Letter-based statistical text processing is generally less successful than word-based statistical text processing, but it’s a bit easier to do with the knowledge you have.
We’ll start by looking only at the frequency of each letter (alphabetical and space, nothing else). To keep track of frequencies, we’ll build a vector of length 28. Position 0 will store the total number of letters we’ve seen. Position 1 will store the number of #\a’s. Position 2 will store the number of #\b’s. And so on and so forth. Position 27 will store the number of spaces.
;;; Procedure:
;;;   char-stats
;;; Parameters:
;;;   [None]
;;; Purpose:
;;;   Generate a new "thing" we can use to store statistics on
;;;   character frequency.
;;; Produces:
;;;   stats, something we can use to store statistics on character
;;;   frequency.  (That is, something we can use with `record-char!`
;;;   and `random-char`.)
;;; Philosophy:
;;;   The client need not know how we store the statistics.  But
;;;   we will likely implement the statistics as a vector of length
;;;   (+ 1 "number of chars").  Element 0 of that vector gives the
;;;   total count of chars stored.  Element i gives the number of
;;;   times we've seen the character whose number is i.
(define char-stats
  (lambda ()
    (make-vector 28 0)))
Now, if we’re going to store letters in that vector, we need some helper functions. First, it will be useful to have a procedure that determines whether a character is something we want to process. (Remember, we’re skipping anything that’s not a letter or a space. So we’re skipping periods, apostrophes, numbers, and much, much, more.) We’ll call that procedure valid?. Next, we’ll need a procedure that takes a valid character as input and converts it to an integer in the range 1 .. 27. (We’re using 1 .. 26 for the letters and 27 for space.) We’ll call that valid->num. It will also be useful to do the inverse of that conversion, converting from a number back to a character. We’ll call that num->valid. Here are those three procedures.
;;; Procedure:
;;;   valid?
;;; Parameters:
;;;   char, a character
;;; Purpose:
;;;   Determine if char is one of the valid characters used in our
;;;   text generation algorithm (space, #\a through #\z, #\A through #\Z).
;;; Produces:
;;;   ok, a boolean
(define valid?
  (lambda (char)
    (or (char=? char #\space)
        (char-ci<=? #\a char #\z))))
;;; Procedure:
;;;   valid->num
;;; Parameters:
;;;   char, a character
;;; Purpose:
;;;   Convert char to a representative number
;;; Produces:
;;;   number, an integer
;;; Preconditions:
;;;   (valid? char)
;;; Postconditions:
;;;   * space is 27
;;;   * #\a is 1
;;;   * ...
(define valid->num
  (let [(a-num (char->integer #\a))]
    (lambda (char)
      (cond
        [(char=? char #\space)
         27]
        [else
         (+ 1 (- (char->integer (char-downcase char)) a-num))]))))
;;; Procedure:
;;;   num->valid
;;; Parameters:
;;;   num, an integer
;;; Purpose:
;;;   Convert a number back to a character.
;;; Produces:
;;;   char, a character
;;; Preconditions:
;;;   num is a value returned by valid->num.
;;; Postconditions:
;;;   (num->valid (valid->num char)) = char
(define num->valid
  (let [(a-num (char->integer #\a))]
    (lambda (num)
      (cond
        [(= num 27)
         #\space]
        [else
         (integer->char (+ a-num -1 num))]))))
Let’s check on them quickly.
> (valid? #\E)
#t
> (valid? #\space)
#t
> (valid? #\3)
#f
> (valid->num #\space)
27
> (valid->num #\a)
1
> (valid->num #\A)
1
> (num->valid 26)
#\z
> (num->valid 5)
#\e
We’re now ready to put information into the vector. Each time we see a letter, we’ll convert it to an appropriate index and then increment the count at that index. We’ll also increment the count at index 0, which represents the total number of characters.
;;; Procedure:
;;;   record-char!
;;; Parameters:
;;;   stats, a value returned by `char-stats`
;;;   char, a character
;;; Purpose:
;;;   Record information that we've seen another copy of char.
;;; Produces:
;;;   [Nothing; called for the side effect.]
(define record-char!
  (lambda (stats char)
    (vector-increment! stats 0)
    (vector-increment! stats (valid->num char))))
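This procedure relies on a vector-increment! helper that is not defined in this writeup; presumably it comes from earlier course code. A minimal version, assuming it simply adds one to the element at a given index, might look like this.

```scheme
;;; Add 1 to the element of vec at position index.  (A guess at the
;;; helper assumed by record-char!; not part of the given code.)
(define vector-increment!
  (lambda (vec index)
    (vector-set! vec index (+ 1 (vector-ref vec index)))))
```

With this definition, (vector-increment! (vector 0 0) 1) leaves the vector as #(0 1).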
Let’s try this procedure, too.
> (define sample (char-stats))
> (record-char! sample #\a)
> (record-char! sample #\a)
> (record-char! sample #\m)
> (record-char! sample #\s)
> (record-char! sample #\space)
> (record-char! sample #\s)
> (record-char! sample #\s)
> sample
'#(7 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 3 0 0 0 0 0 0 0 1)
Our next goal is to use these data to “randomly” select a character. How do we do that? In this case, we’ve seen seven characters, so we call (random 7). If we get 0 or 1, we return #\a. If we get a 2, we return #\m. If we get a 3, 4, or 5, we return #\s. Otherwise, we return a space. Here’s a procedure that achieves that goal through a slightly different process.
;;; Procedure:
;;;   random-char
;;; Parameters:
;;;   stats, a value returned by `char-stats`
;;; Purpose:
;;;   Get an unpredictable character from `stats` based on the
;;;   distribution of characters represented by `stats`.
;;; Produces:
;;;   char, a character
;;; Preconditions:
;;;   * At least one char has been added to stats using `record-char!`
;;; Postconditions:
;;;   * The distribution of characters returned by this procedure is similar
;;;     to the distribution of characters in stats.
;;; Problems:
;;;   If no characters have been added, returns the space character.
(define random-char
  (lambda (stats)
    (let ([count (vector-ref stats 0)])
      (if (zero? count)
          #\space
          (let kernel ([ran (random count)]
                       [pos 1])
            (let ([freq (vector-ref stats pos)])
              (if (< ran freq)
                  (num->valid pos)
                  (kernel (- ran freq)
                          (+ pos 1)))))))))
Let’s try it.
> (random-char sample)
#\space
> (random-char sample)
#\s
> (random-char sample)
#\s
> (random-char sample)
#\a
> (random-char sample)
#\s
> (random-char sample)
#\space
> (random-char sample)
#\s
> (random-char sample)
#\a
> (random-char sample)
#\m
Not exactly the same distribution, but close. What if we want to generate strings, rather than individual characters? That should be straightforward. We just keep generating an appropriate number of random characters and then string ‘em all together.
;;; Procedure:
;;;   random-string
;;; Parameters:
;;;   stats, a value returned by `char-stats`
;;;   len, a positive integer
;;; Purpose:
;;;   Generate a "random" string whose statistical properties match
;;;   those of stats.
;;; Produces:
;;;   str, a string
;;; Preconditions:
;;;   [See the preconditions of random-char]
;;; Postconditions:
;;;   * (string-length str) = len
;;;   * The contents of the string are hard to predict.
;;;   * For large enough len, the distribution of characters in str matches
;;;     that in stats.
(define random-string
  (lambda (stats len)
    (let kernel ([remaining len]
                 [chars null])
      (if (zero? remaining)
          (list->string (reverse chars))
          (kernel (decrement remaining)
                  (cons (random-char stats) chars))))))
Let’s see how it works.
> (random-string sample 10)
"asaamssssa"
> (random-string sample 10)
"a ssaasaaa"
> (random-string sample 10)
"mssm sms a"
> (random-string sample 10)
" saass sam"
Yeah, that looks okay, too.
But we want better statistics. In particular, we probably want to gather statistics from a file. For that, we can use a standard pattern of file recursion.
;;; Procedure:
;;;   file->stats
;;; Parameters:
;;;   fname, a string
;;; Purpose:
;;;   Convert a file to a set of statistics as given by
;;;   `char-stats` and `record-char!`.
;;; Produces:
;;;   stats, a vector of statistics.
;;; Preconditions:
;;;   fname names a valid text file.
;;; Postconditions:
;;;   * `stats` is a set of statistics that can be used by
;;;     `random-char`.
;;;   * The frequencies in `stats` correspond to the frequencies
;;;     in the file named by `fname`.
(define file->stats
  (lambda (fname)
    (let ([stats (char-stats)]
          [port (open-input-file fname)])
      (let kernel ()
        (let ([char (read-char port)])
          (cond
            [(eof-object? char)
             (close-input-port port)
             stats]
            [(valid? char)
             (record-char! stats char)
             (kernel)]
            [else
             (kernel)]))))))
Let’s see how it does with the Project Gutenberg version of Jane Eyre.
> (define jane-eyre (file->stats "pg1260.txt"))
> jane-eyre
'#(981056
63887
11440
19374
37989
102518
17270
15524
46422
57189
1328
6166
32971
22643
55321
61925
12614
958
48590
50882
68646
24008
7743
18986
1306
17634
331
177391)
> (random-string jane-eyre 100)
"bute xfhfvoyttshn hr ee hsmad wteuyetii i ontte iy asa m f anf s ggyohai tahnti le eahnsn fws"
> (random-string jane-eyre 100)
"vuhtr glb edmigosno tnh vemeiarakas rni ed tlreltnneluritf lfdettity sdeol htdve e oe lote eoe"
As you can see, this approach does not work particularly well at the letter level. Let’s build a vector of vectors. The vector at position i stores the letters that follow the character whose number is i. For example, position 1 gives the frequency distribution of letters that follow #\a.
;;; Procedure:
;;;   follow-one-stats
;;; Parameters:
;;;   [None]
;;; Purpose:
;;;   Create a "thing" that allows us to keep track of statistics
;;;   for what characters follow each character.
;;; Produces:
;;;   stats, a thing
(define follow-one-stats
  (lambda ()
    (vector-map! (lambda (val) (char-stats))
                 (make-vector 28))))
Just as we needed a way to record letters in the basic vector of statistics, we need ways to record “follow information” in this vector of vectors. That will be a two-step process. First we index into the main vector to get one of those char-stats things. Then we just use the procedure we wrote previously.
;;; Procedure:
;;;   record-follow-one!
;;; Parameters:
;;;   stats, a value returned by follow-one-stats
;;;   prev, a valid character
;;;   char, a valid character
;;; Purpose:
;;;   Record the information that char follows prev
;;; Produces:
;;;   [Nothing; called for the side effects]
(define record-follow-one!
  (lambda (stats prev char)
    (record-char! (vector-ref stats (valid->num prev))
                  char)))
Reading the data from a file is a bit harder. Now, we have to keep track of not just the current character, but also the previous character. And we need to handle the cases in which the current character is invalid and in which the previous character is invalid. Here’s an attempt at that.
;;; Procedure:
;;;   file->follow-one-stats
;;; Parameters:
;;;   fname, a string
;;; Purpose:
;;;   Convert a file to a set of statistics as given by `follow-one-stats`
;;;   and `record-follow-one!`.
;;; Produces:
;;;   stats, the thing returned by follow-one-stats
;;; Preconditions:
;;;   fname names a valid text file.
;;; Postconditions:
;;;   * `stats` is a set of statistics that can be used wherever we use
;;;     the values from `follow-one-stats`.
;;;   * The frequencies in `stats` correspond to the frequencies
;;;     in the file named by `fname`.
(define file->follow-one-stats
  (lambda (fname)
    (let ([stats (follow-one-stats)]
          [port (open-input-file fname)])
      (let kernel ([prev (read-char port)])
        (let ([char (read-char port)])
          (cond
            ; When we reach the end, clean up and stop
            [(eof-object? char)
             (close-input-port port)
             stats]
            ; If the previous character is not valid, skip over it
            [(not (valid? prev))
             (kernel char)]
            ; If the current character is not valid, don't record
            ; anything and read past it.
            [(not (valid? char))
             (kernel (read-char port))]
            ; Otherwise, we record that char follows prev and continue,
            ; with char as the new previous character.
            [else
             (record-follow-one! stats prev char)
             (kernel char)]))))))
Let’s see how that works.
> (define jane-eyre (file->follow-one-stats "pg1260.txt"))
> (vector-ref jane-eyre 26)
'#(324 41 1 0 0 152 0 0 0 24 1 0 30 0 0 15 0 0 0 0 0 7 0 0 0 8 36 9)
Let’s see … of the 331 letter z’s in Jane Eyre, 324 are followed by another letter or a space. 41 are followed by #\a. 1 is followed by #\b. (That’s the word “Spitzbergen”.) 152 are followed by #\e. And so on and so forth.
Generating strings will be a bit harder. We’ll need a starting character as well as the statistics. What about the case in which we’ve never found a character that follows the current character? Fortunately, random-char handles that case.
;;; Procedure:
;;;   random-string-one
;;; Parameters:
;;;   stats, a value returned by `follow-one-stats`
;;;   prev, a valid character
;;;   len, a positive integer
;;; Purpose:
;;;   Generate a "random" string whose statistical properties match
;;;   those of stats.
;;; Produces:
;;;   str, a string
(define random-string-one
  (lambda (stats prev len)
    ; Randomly select among the successors of prev
    (let ([random-next (lambda (prev)
                         (let ([substats (vector-ref stats (valid->num prev))])
                           (random-char substats)))])
      ; Build a list of successors then convert to a string
      (let kernel ([remaining len]
                   [prev prev]
                   [chars null])
        (if (zero? remaining)
            (list->string (reverse chars))
            (let ([char (random-next prev)])
              (kernel (decrement remaining)
                      char
                      (cons char chars))))))))
Is our output from this procedure any better?
> (random-string-one jane-eyre #\space 100)
"erores shes ttre phe t oun commas at as s atheved wnd enedibusoffres l adon atr e scompe be angh t "
> (random-string-one jane-eyre #\space 100)
"kinghey an tostertowed t delasiechaininoreved tt se ir hepithiceswe as berng adeyoroutermy wininache"
> (random-string-one jane-eyre #\space 100)
"s lofe heas coleeng ale t andelitwit tousp ut bl t dyostasabee rpe ma rimpaswale houraroven ictr i s"
> (random-string-one jane-eyre #\space 100)
"re nsthe wis dsen oun mpinavin m shef y pt s waknd bloul d mewazed onourthis a wadoo fis an tove d n"
It’s not yet English. But we’re starting to see word patterns in this output. That’s a good sign. Let’s try two characters. We’ll leave you to interpret the code.
;;; Procedure:
;;;   follow-two-stats
;;; Parameters:
;;;   [None]
;;; Purpose:
;;;   Create a "thing" that allows us to keep track of statistics
;;;   for what characters follow each pair of characters.
;;; Produces:
;;;   stats, a thing
(define follow-two-stats
  (lambda ()
    (vector-map! (lambda (val) (char-stats))
                 (make-vector (+ 1 (* 27 28))))))
;;; Procedure:
;;;   index-two
;;; Parameters:
;;;   prev-prev, a valid character
;;;   prev, a valid character
;;; Purpose:
;;;   Find the position in a follow-two-stats vector that we should
;;;   use for the pair (prev-prev,prev).
;;; Produces:
;;;   index, an integer
(define index-two
  (lambda (prev-prev prev)
    (+ (* 27 (valid->num prev-prev))
       (valid->num prev))))
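As a quick sanity check (values computed from the definitions above, under which #\t has number 20 and #\h has number 8), the pair (#\t, #\h) lands at index 27 * 20 + 8:

```
> (index-two #\t #\h)
548
```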
;;; Procedure:
;;;   record-follow-two!
;;; Parameters:
;;;   stats, a value returned by follow-two-stats
;;;   prev-prev, a valid character
;;;   prev, a valid character
;;;   char, a valid character
;;; Purpose:
;;;   Record the information that char follows prev-prev and prev
;;; Produces:
;;;   [Nothing; called for the side effects]
(define record-follow-two!
  (lambda (stats prev-prev prev char)
    (record-char! (vector-ref stats (index-two prev-prev prev))
                  char)))
;;; Procedure:
;;;   file->follow-two-stats
;;; Parameters:
;;;   fname, a string
;;; Purpose:
;;;   Convert a file to a set of statistics as given by `follow-two-stats`
;;;   and `record-follow-two!`.
;;; Produces:
;;;   stats, the thing returned by follow-two-stats
;;; Preconditions:
;;;   fname names a valid text file.
;;; Postconditions:
;;;   * `stats` is a set of statistics that can be used wherever we use
;;;     the values from `follow-two-stats`.
;;;   * The frequencies in `stats` correspond to the frequencies
;;;     in the file named by `fname`.
(define file->follow-two-stats
  (lambda (fname)
    (let ([stats (follow-two-stats)]
          [port (open-input-file fname)])
      (let kernel ([prev-prev (read-char port)]
                   [prev (read-char port)])
        (let ([char (read-char port)])
          (cond
            ; When we reach the end, clean up and stop
            [(eof-object? char)
             (close-input-port port)
             stats]
            ; If any of the characters is not valid, skip
            [(or (not (valid? prev-prev))
                 (not (valid? prev))
                 (not (valid? char)))
             (kernel prev char)]
            ; Otherwise, we record that char follows prev-prev and prev.
            ; We then continue, updating appropriately
            [else
             (record-follow-two! stats prev-prev prev char)
             (kernel prev char)]))))))
;;; Procedure:
;;;   random-string-two
;;; Parameters:
;;;   stats, a value returned by `follow-two-stats`
;;;   prev-prev, a valid character
;;;   prev, a valid character
;;;   len, a positive integer
;;; Purpose:
;;;   Generate a "random" string whose statistical properties match
;;;   those of stats.
;;; Produces:
;;;   str, a string
(define random-string-two
  (lambda (stats prev-prev prev len)
    ; Randomly select among the successors of prev-prev and prev
    (let ([random-next (lambda (prev-prev prev)
                         (let ([substats
                                (vector-ref stats
                                            (index-two prev-prev prev))])
                           (random-char substats)))])
      ; Build a list of successors then convert to a string
      (let kernel ([remaining (- len 2)]
                   [prev-prev prev-prev]
                   [prev prev]
                   [chars (list prev prev-prev)])
        (if (zero? remaining)
            (list->string (reverse chars))
            (let ([char (random-next prev-prev prev)])
              (kernel (decrement remaining)
                      prev
                      char
                      (cons char chars))))))))
What do we get with these two characters of history? Let’s see …
> (define jane-eyre (file->follow-two-stats "pg1260.txt"))
> (random-string-two jane-eyre #\t #\h 100)
"thersen had comet stior you like whaps ans a loseenerwastaid now colet con the rear nou sh thand maree"
> (random-string-two jane-eyre #\t #\h 100)
"the she yourritin yous ing grive heat clunch waskin whower imigin bothrojeckereivaption shown no brow "
> (random-string-two jane-eyre #\w #\h 100)
"whistiand somme pat agally of hairy coloul saing voinbe ove its thadeen antansfied prothat sery at usb"
It’s not English. But it’s getting closer to natural language, isn’t it?
Does a different author’s text produce different output? Let’s try.
> (define cat-in-the-hat (file->follow-two-stats "catinthehat.txt"))
> (random-string-two cat-in-the-hat #\t #\h 100)
"the slood hour the sirs gooey clay swill ho up i don mot backs them clay think what lack hat ored not "
> (random-string-two cat-in-the-hat #\t #\h 100)
"th thave baget ar mrs on the snamings eat cat ings clowled his of come wishem he me slike a hicks thic"
> (random-string-two cat-in-the-hat #\t #\h 100)
"the hould thim i like did she thew some sin they do nook ch wast like down we started cat abox sinds b"
> (define yertle (file->follow-two-stats "yertle.txt"))
> (random-string-two yertle #\y #\e 100)
"yerees seed of ittled us ris sand high alaide up see the birdecid turtle pland ond the wayine thed and"
> (random-string-two yertle #\y #\e 100)
"yere turtill hiseed i sh up yon that hiles rous alanothe turped he of theat ste shourtle th amed up th"
> (random-string-two yertle #\t #\u 100)
"turtle but i say seeing of they the be swittles of the burtle grom the bettler and even therflittoon f"
> (random-string-two yertle #\t #\u 100)
"turtle the was theaturtles swittle plaings the stall the king grone encle be the tur overees his a thr"
> (random-string-two yertle #\t #\u 100)
"turtled of a com the cat muds bell turte pereat plad up of thereas rules just i dond her a pill mighe "
> (random-string-two yertle #\t #\u 100)
"tur mousan ove limmack aftepped of turtle ne king ore stack that land a thight on the beyou sone whoo "
Awesome, isn’t it? I think we’ve found the secret to Olde English.
Just in case you were wondering, we really do see different output for different texts. Here’s what we get for the Spanish-language version of one of my favorite short stories, “Tlön, Uqbar, Orbis Tertius”.
> (define tuot (file->follow-two-stats "tuotspanish.txt"))
> (random-string-two tuot #\u #\q 100)
"uqbara de pulazonjetomor hable en prios o probanza also lasor el ar he ta yo de pre no tivo dechipieza"
> (random-string-two tuot #\u #\q 100)
"uqbatilla faciraliblesapante otron dina de de ba oton undadijorrasilo rescurambirrio ind obje heradque"
> (random-string-two tuot #\u #\q 100)
"uqba mar mun crumente las arlas explibencurda suseconcerres undo teme lade cuchiblitiusta ber que el h"
The assignment
Your goal for this option is to improve our text generation process. Make the following changes to the starter code.

1. Add support for apostrophes (as in “can’t”) and for commas. You will need to figure out which procedures need changing. Certainly, you need to change valid?, valid->num, and num->valid. But you’ll also need to change any code that assumes that we have only 27 characters.
2. Treat all end-of-sentence punctuation (periods, question marks, and exclamation points) as a period. You’ll need to make changes similar to those in the previous step. But you’ll also need to have valid->num convert all end-of-sentence punctuation to the same number and have num->valid convert that number back to a period.
3. Treat all whitespace characters (space, newline, and tab) as a space.
4. There are many characters that don’t fit into the classes of characters we’ve chosen. Treat all those other characters as an underscore. (This may be a dangerous approach, as we can end up with long sequences of underscores.)
5. Create a series of procedures that do prediction based on three characters, rather than two.
6. Make one more change that you think would be valuable. Make sure to explain the change clearly for the grader.
Option two: Simplified neural networks
Background
One of the important contributions of computer science to the broader field of data science comes through so-called “machine learning” techniques. Broadly speaking, these techniques involve taking a process for classifying pieces of data and repeatedly refining it: we run the process on known inputs/outputs and adjust the classification scheme based on the relationship of the inputs to the outputs. The set of known inputs/outputs is called the “training set”.
In this assignment, we will consider a relatively straightforward classification approach in which we identify a variety of numeric characteristics of our input, multiply each characteristic by a computed “weight”, add the results, and then compare the result to some threshold.
For example, suppose we were trying to predict whether first-year students were going to declare a computer science major. We might have information on (a) whether or not they listed “computer science” on their list of prospective fields on their Common Application; (b) how many prospective fields they listed on their Common Application; (c) their SAT or ACT Math Scores; (d) the semester in which they first took computer science at Grinnell; (e) their grade in that class; and (f) their grade in their first Mathematics course. Which of these is most predictive? We don’t know at first, so we make guesses. That gives us a heuristic for predicting majors. We then compare our prediction to the actual state for each element in our training set (e.g., students in the class of 2018). When our predictions fail to match, we look for the components that most likely contributed to that failure and update them. For example, we might have thought that the listing of CS on the Common App is predictive, but we might discover that it fails to distinguish students well.
We typically use a variety of data in determining how much to update each of the weights. These include
* the input; we often update the weights of larger inputs more than we update the weights of smaller inputs.
* the output; we make larger updates if the output is far from the expected value than if it is close to the expected value.
* an adjustment factor; we don’t want to make very large updates at any point. Rather, we nudge the values closer and closer to the expected value.
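One way to render this update rule in Scheme is sketched below. These procedures are not part of the assignment’s starter code; the names predict and adjust-weights! and the exact formula, difference times (input / sum-of-inputs) times adjustment, are assumptions based on the description above.

```scheme
;;; Compute the weighted sum of the inputs.  (A sketch, not starter code.)
(define predict
  (lambda (weights inputs)
    (apply + (map * (vector->list weights) (vector->list inputs)))))

;;; Nudge each weight toward the expected output, scaling each update
;;; by the relative size of the corresponding input.
(define adjust-weights!
  (lambda (weights inputs expected adjustment)
    (let ([difference (- expected (predict weights inputs))]
          [total (apply + (vector->list inputs))])
      (when (not (zero? total))
        (let kernel ([pos 0])
          (when (< pos (vector-length weights))
            (vector-set! weights pos
                         (+ (vector-ref weights pos)
                            (* difference
                               (/ (vector-ref inputs pos) total)
                               adjustment)))
            (kernel (+ pos 1))))))))
```

For example, given (define ws (vector 0 0.5 1)), evaluating (adjust-weights! ws (vector 0 0 1) 0 0.1) changes ws to #(0 0.5 0.9), matching the first worked example in the next section.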
An example
Let’s start with a simple three-parameter function that is relatively easy to understand. The majority function takes only 0 or 1 values as input and returns whichever value appears most frequently. We can represent those data as a table of the following form. In the table, each line has four items, representing the three inputs and the one output.
0,0,0,0
0,0,1,0
0,1,0,0
0,1,1,1
1,0,0,0
1,0,1,1
1,1,0,1
1,1,1,1
We can’t design a perfect set of weights for this set of data. But we can come close. Think about what weights you would choose if you had the option.
But our goal is not to choose weights. Our goal is to get the computer to figure them out. So, let’s pick a starting set of weights. Suppose we pick the set #(0 0.5 1). We’ve represented the weights in a vector for convenience. We’ll use a relatively high adjustment factor of 0.1.
We pick a random row of the table. Let’s say we pick “0,0,1,0”.
* inputs are 0 0 1
* weights are 0 0.5 1
* expected 0
* output is 1.0, which is off by 1.0
* adjustments
  * change 0 by 1.0 * (0/1.0) * 0.1 => 0
  * change 0.5 by 1.0 * (0/1.0) * 0.1 => 0.5
  * change 1 by 1.0 * (1/1.0) * 0.1 => 0.9
It’s not a big change, but I think we’ve made progress. Let’s try giving the same input and the new set of weights.
* inputs are 0 0 1
* weights are 0 0.5 0.9
* expected 0
* output is 0.9, which is off by 0.9
* adjustments
  * change 0 by 0.9 * (0/1.0) * 0.1 => 0
  * change 0.5 by 0.9 * (0/1.0) * 0.1 => 0.5
  * change 0.9 by 0.9 * (1/1.0) * 0.1 => 0.81
Yup, it’s slightly closer. Now we’re only off by 0.9 rather than 1. Note that because the first two inputs are zero, it seems unlikely we’ll change their weights based on this input. We might choose a different approach to updating. But if we’re sampling the input randomly, it’s okay if not everything gets updated. In addition, most sample data does not provide only 0’s and 1’s as inputs.
Let’s try another input. Suppose we pick “1,0,1,1”.
* inputs are 1 0 1
* weights are 0 0.5 0.9
* expected 1
* output is 0.9, which is off by 0.1
* adjustments
  * change 0 by 0.1 * (1/2.0) * 0.1 => 0.005
  * change 0.5 by 0.1 * (0/2.0) * 0.1 => 0.5
  * change 0.9 by 0.1 * (1/2.0) * 0.1 => 0.905
This change improved the output for this input. But it likely made our answer less good for the other input. However, if we do the same thing again and again and again, we will eventually end up with a reasonably good selection of weights.
For example, if I start with those weights (0, 0.5, and 1) and run the algorithm for 20,000 iterations with an adjustment factor of .001, I get the weights '#(0.31436463842814283 0.33315929468375294 0.3494344849572705). How does that relate to what we might have guessed in the first step? Well, I would have guessed that each weight should be about 1/3, since we want the three values to count equally. The computed value is not perfect, but it’s reasonable.
Is this better? Let’s check. We’ll look at the differences in each case and see if they go up or down.
First, we’ll explore what happens when the weights are '#(0 0.5 1).
Inputs: #(0 0 0) Predicted: 0 Actual: 0 Difference: 0
Inputs: #(0 0 1) Predicted: 1 Actual: 0 Difference: 1
Inputs: #(0 1 0) Predicted: 0.5 Actual: 0 Difference: 0.5
Inputs: #(0 1 1) Predicted: 1.5 Actual: 1 Difference: 0.5
Inputs: #(1 0 0) Predicted: 0 Actual: 0 Difference: 0
Inputs: #(1 0 1) Predicted: 1 Actual: 1 Difference: 0
Inputs: #(1 1 0) Predicted: 0.5 Actual: 1 Difference: 0.5
Inputs: #(1 1 1) Predicted: 1.5 Actual: 1 Difference: 0.5
Sum of squared errors: 2.0
You’ll note that we have a “sum of squared errors” description. That suggests how good or bad the weights are on a particular data set.
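A report like the one above can be generated by a short procedure that sums the squared differences. This is a sketch, not part of the assignment; it assumes the data set is represented as a list of pairs, each pairing an input vector with the expected output.

```scheme
;;; Compute the sum of squared errors of weights over a data set.
;;; (A sketch; data is assumed to be a list of pairs of the form
;;; (input-vector . expected-output).)
(define sum-squared-errors
  (lambda (weights data)
    (apply +
           (map (lambda (datum)
                  (let* ([inputs (vector->list (car datum))]
                         [predicted (apply + (map * (vector->list weights)
                                                  inputs))]
                         [difference (- (cdr datum) predicted)])
                    (* difference difference)))
                data))))
```

For the majority-function data and the weights #(0 0.5 1), this produces 2.0, agreeing with the report above.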
Now, let’s try it with weights of '#(0.31 0.33 0.35), those given by repeated adjustment.
Inputs: #(0 0 0) Predicted: 0 Actual: 0 Difference: 0
Inputs: #(0 0 1) Predicted: 0.34 Actual: 0 Difference: 0.34
Inputs: #(0 1 0) Predicted: 0.33 Actual: 0 Difference: 0.33
Inputs: #(0 1 1) Predicted: 0.67 Actual: 1 Difference: 0.33
Inputs: #(1 0 0) Predicted: 0.31 Actual: 0 Difference: 0.31
Inputs: #(1 0 1) Predicted: 0.65 Actual: 1 Difference: 0.35
Inputs: #(1 1 0) Predicted: 0.64 Actual: 1 Difference: 0.36
Inputs: #(1 1 1) Predicted: 0.98 Actual: 1 Difference: 0.02
Sum of squared errors: 0.682
It looks better, doesn’t it?
Another example
Suppose we are trying to build a classifier for which we have the following starting information. Each line represents three inputs and one output.
0.29,0.44,0,1
0.42,0.99,0.85,1
0.39,0.47,0.05,1
0.7,0.45,0.98,0
0.17,0.42,0.04,1
0.58,0.91,0.44,1
0.21,0.02,0.48,0
0.38,0.01,0.86,0
0.17,0.44,0.93,0
0.66,0.14,0.75,0
0.92,0.43,0.28,1
0.59,0.13,0.75,0
0.52,0.02,0.65,0
0.33,0.33,0.27,1
0.88,0.38,0.56,1
0.8,0.01,0.99,0
0.58,0.67,0.52,1
0.88,0.38,0.67,1
0.87,0.53,0.11,1
0.69,0.64,0.19,1
In this case, we have no idea what the function is supposed to be. But we can explore the error terms. We’ll start with '#(0.4 0.5 0.6), which seems as good as any set of weights. How good are these weights? Let’s see.
Inputs: #(0.29 0.44 0) Predicted: 0.336 Actual: 1 Difference: 0.664
Inputs: #(0.42 0.99 0.85) Predicted: 1.173 Actual: 1 Difference: 0.173
Inputs: #(0.39 0.47 0.05) Predicted: 0.421 Actual: 1 Difference: 0.579
Inputs: #(0.7 0.45 0.98) Predicted: 1.093 Actual: 0 Difference: 1.093
Inputs: #(0.17 0.42 0.04) Predicted: 0.302 Actual: 1 Difference: 0.698
Inputs: #(0.58 0.91 0.44) Predicted: 0.951 Actual: 1 Difference: 0.049
Inputs: #(0.21 0.02 0.48) Predicted: 0.382 Actual: 0 Difference: 0.382
Inputs: #(0.38 0.01 0.86) Predicted: 0.673 Actual: 0 Difference: 0.673
Inputs: #(0.17 0.44 0.93) Predicted: 0.846 Actual: 0 Difference: 0.846
Inputs: #(0.66 0.14 0.75) Predicted: 0.784 Actual: 0 Difference: 0.784
Inputs: #(0.92 0.43 0.28) Predicted: 0.751 Actual: 1 Difference: 0.249
Inputs: #(0.59 0.13 0.75) Predicted: 0.751 Actual: 0 Difference: 0.751
Inputs: #(0.52 0.02 0.65) Predicted: 0.608 Actual: 0 Difference: 0.608
Inputs: #(0.33 0.33 0.27) Predicted: 0.459 Actual: 1 Difference: 0.541
Inputs: #(0.88 0.38 0.56) Predicted: 0.878 Actual: 1 Difference: 0.122
Inputs: #(0.8 0.01 0.99) Predicted: 0.919 Actual: 0 Difference: 0.919
Inputs: #(0.58 0.67 0.52) Predicted: 0.879 Actual: 1 Difference: 0.121
Inputs: #(0.88 0.38 0.67) Predicted: 0.944 Actual: 1 Difference: 0.056
Inputs: #(0.87 0.53 0.11) Predicted: 0.679 Actual: 1 Difference: 0.321
Inputs: #(0.69 0.64 0.19) Predicted: 0.71 Actual: 1 Difference: 0.29
Sum of squared errors: 6.772255
Now, let’s repeatedly adjust using the approach described earlier. That is, we add something like (* difference (/ input sum-of-inputs) adjustment) to each weight, where input is the corresponding input. After 20,000 repetitions with an adjustment of .01, I end up with #(0.6838130212100684 1.3819690813246843 -0.6488317339705226), which I’ll simplify to #(0.68 1.38 -0.65). Note that this is an interesting set of weights. One of the weights is greater than 1. One of the weights is negative. But that’s how this approach works sometimes. Are these weights better? Let’s see.
Inputs: #(0.29 0.44 0) Predicted: 0.8044 Actual: 1 Difference: 0.1956
Inputs: #(0.42 0.99 0.85) Predicted: 1.0993 Actual: 1 Difference: 0.0993
Inputs: #(0.39 0.47 0.05) Predicted: 0.8813 Actual: 1 Difference: 0.1187
Inputs: #(0.7 0.45 0.98) Predicted: 0.46 Actual: 0 Difference: 0.46
Inputs: #(0.17 0.42 0.04) Predicted: 0.6692 Actual: 1 Difference: 0.3308
Inputs: #(0.58 0.91 0.44) Predicted: 1.3642 Actual: 1 Difference: 0.3642
Inputs: #(0.21 0.02 0.48) Predicted: -0.1416 Actual: 0 Difference: 0.1416
Inputs: #(0.38 0.01 0.86) Predicted: 0.2868 Actual: 0 Difference: 0.2868
Inputs: #(0.17 0.44 0.93) Predicted: 0.1183 Actual: 0 Difference: 0.1183
Inputs: #(0.66 0.14 0.75) Predicted: 0.1545 Actual: 0 Difference: 0.1545
Inputs: #(0.92 0.43 0.28) Predicted: 1.037 Actual: 1 Difference: 0.037
Inputs: #(0.59 0.13 0.75) Predicted: 0.0931 Actual: 0 Difference: 0.0931
Inputs: #(0.52 0.02 0.65) Predicted: -0.0413 Actual: 0 Difference: 0.0413
Inputs: #(0.33 0.33 0.27) Predicted: 0.5043 Actual: 1 Difference: 0.4957
Inputs: #(0.88 0.38 0.56) Predicted: 0.7588 Actual: 1 Difference: 0.2412
Inputs: #(0.8 0.01 0.99) Predicted: -0.0857 Actual: 0 Difference: 0.0857
Inputs: #(0.58 0.67 0.52) Predicted: 0.981 Actual: 1 Difference: 0.019
Inputs: #(0.88 0.38 0.67) Predicted: 0.6873 Actual: 1 Difference: 0.3127
Inputs: #(0.87 0.53 0.11) Predicted: 1.2515 Actual: 1 Difference: 0.2515
Inputs: #(0.69 0.64 0.19) Predicted: 1.2289 Actual: 1 Difference: 0.2289
Sum of squared errors: 1.19282223
Yup, that’s definitely a big improvement. Importantly, all but one of the differences are now clearly less than 1/2. If we rounded, all of the answers would be correct.
Getting started
Your first goals are naturally to compute predictions and to figure out ways to improve weights.
a. Write a procedure, (predict inputs weights), that takes equal-length vectors of inputs and weights as input and produces the output that the weights predict. Your procedure should multiply corresponding elements of inputs and weights and then add the results together.
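For reference, here is one possible sketch of such a procedure (a dot product computed with a named let); yours may certainly take a different shape.

```scheme
;; One possible sketch of predict: multiply corresponding elements of
;; inputs and weights and sum the products (a dot product).
(define predict
  (lambda (inputs weights)
    (let kernel ([pos 0]
                 [sum 0])
      (if (< pos (vector-length inputs))
          (kernel (+ pos 1)
                  (+ sum (* (vector-ref inputs pos)
                            (vector-ref weights pos))))
          sum))))
```

For instance, (predict #(1 0 1) #(1/2 1/2 1/2)) evaluates to 1.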
b. Write a procedure, (improve inputs weights actual adjustment), that takes as input one set of input values, one set of weights, the actual (or expected) output for those inputs, and the adjustment factor. The procedure should produce a new set of “improved” weights. For example, if we mimicked the first example from above, we might see
> (improve #(0 0 1) #(0 0.5 1) 0 0.1)
'#(0 0.5 0.9)
> (improve #(1 0 1) #(0 0.5 0.9) 1 0.1)
'#(0.005 0.5 0.905)
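Here is one hedged sketch of how improve might work, using the adjustment rule described above and the special cases from the Notes section; a compact predict is included only so the sketch stands alone (use your own version from part a).

```scheme
;; A compact predict, included only so this sketch is self-contained.
(define predict
  (lambda (inputs weights)
    (for/sum ([i inputs] [w weights]) (* i w))))

;; One possible sketch of improve: nudge each weight by
;; (* difference (/ input sum-of-inputs) adjustment), where
;; sum-of-inputs is the sum of the absolute values of the inputs.
(define improve
  (lambda (inputs weights actual adjustment)
    (let ([difference (- actual (predict inputs weights))]
          [sum-of-inputs (for/sum ([i inputs]) (abs i))])
      (if (zero? sum-of-inputs)
          ;; If all inputs are 0, change each weight equally.
          (vector-map (lambda (w) (+ w (* difference adjustment)))
                      weights)
          (vector-map (lambda (w i)
                        (+ w (* difference (/ i sum-of-inputs) adjustment)))
                      weights
                      inputs)))))
```

With exact rationals, (improve #(0 0 1) #(0 1/2 1) 0 1/10) evaluates to '#(0 1/2 9/10), matching the first example above.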
c. Write a procedure, (refine samples weights adjustment steps), that takes a vector of samples, a vector of weights, an adjustment, and a number of steps as inputs. Each sample in samples should contain the input values and the output. For example, we would have #(0 0 1 0) and #(1 0 1 1) in the samples for the majority function. Your procedure should repeatedly improve the set of weights by randomly selecting one of the samples and then calling improve appropriately. How many times do you improve the weights? steps times, of course.
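One possible shape for refine, assuming the improve procedure from part b and samples laid out as inputs followed by the output:

```scheme
;; One possible sketch of refine: steps times, pick a random sample,
;; split it into its inputs and its output, and improve the weights.
;; Assumes the improve procedure from part b.
(define refine
  (lambda (samples weights adjustment steps)
    (if (zero? steps)
        weights
        (let* ([sample (vector-ref samples (random (vector-length samples)))]
               [len (vector-length sample)]
               [inputs (vector-take sample (- len 1))]
               [actual (vector-ref sample (- len 1))])
          (refine samples
                  (improve inputs weights actual adjustment)
                  adjustment
                  (- steps 1))))))
```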
d. Test your procedure on the majority function. Here is some useful code to help with that.
;;; Value
;;;   majority-samples
;;; Type:
;;;   vector of vectors of reals
;;; Summary:
;;;   All the sample data for the majority function.
(define majority-samples
  (vector
   (vector 0 0 0 0)
   (vector 0 0 1 0)
   (vector 0 1 0 0)
   (vector 0 1 1 1)
   (vector 1 0 0 0)
   (vector 1 0 1 1)
   (vector 1 1 0 1)
   (vector 1 1 1 1)))
;;; Procedure:
;;;   analyze-solution
;;; Parameters:
;;;   samples, a vector of vectors of real numbers
;;;   weights, a vector of real numbers
;;; Purpose:
;;;   Conduct and print out an analysis of the given set of weights on the
;;;   given set of samples.
;;; Produces:
;;;   total-error, a real number
;;; Preconditions:
;;; * Each element of samples is of the form
;;;   `#(input1 ... inputn desired-output)`
;;; * Each element of samples has length one greater than the
;;;   length of weights.
;;; Postconditions:
;;;   total-error represents the sum of the squares of errors.
(define analyze-solution
  (lambda (samples weights)
    (let kernel ([pos 0]
                 [total-error 0])
      (cond
        [(< pos (vector-length samples))
         (let* ([sample (vector-ref samples pos)]
                [len (vector-length sample)]
                [inputs (vector-take sample (- len 1))]
                [actual (vector-ref sample (- len 1))]
                [prediction (predict inputs weights)]
                [difference (- actual prediction)])
           (println "Inputs: " inputs #\tab
                    "Predicted: " prediction #\tab
                    "Actual: " actual #\tab
                    "Difference: " difference)
           (kernel (+ pos 1)
                   (+ total-error (square difference))))]
        [else
         (println "Sum of squared errors: " total-error)
         total-error]))))
;;; Procedure:
;;;   println
;;; Parameters:
;;;   val1, a Scheme value
;;;   val2, a Scheme value
;;;   ...
;;;   valn, a Scheme value
;;; Purpose:
;;;   Print all of the values followed by a newline.
;;; Produces:
;;;   [Nothing; called for the side effect.]
(define println
  (lambda vals
    (for-each display vals)
    (newline)))
Working with real data
You’re now ready to move on to a real data set. For this assignment, we will be working with a set of data on cell characteristics along with an associated diagnosis.
 The data are described more fully at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.
 The CSV file for the data set is available at http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data.
 An explanation of the underlying meaning of the data is available at
Here are the first few lines of the data file.
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4
The first column represents the identifier of the sample. The last column represents the diagnosis (2 for benign, 4 for malignant). The remaining columns represent the different attributes.
We’ve put all of the data in the file /home/rebelsky/Desktop/breast-cancer-data.csv.
a. Arrange to read all of the data into a variable called raw-cell-data.
b. As you know from experience, large data sets sometimes have missing columns. Write an instruction or instructions to filter out any data that do not fit the form (e.g., that have fewer than 11 values in a row, that have middle values that are not integers in the range 1-10, that have a final value that is not 2 or 4). Call that result clean-cell-data.
c. It turns out that the standard learning algorithms do better if you have one extra column that is identical for all data. Add a column which has a 1 for every row.
d. You may find it useful to change the last column to 0 and 1, rather than 2 and 4. (That’s up to you.)
e. All of our examples used data in the range [0..1]. These data are clearly not in that range. You may find it useful to write a procedure to convert values from their current range to the range [0..1].
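For example, a hypothetical helper along these lines (the name rescale and its parameters are ours, not part of the assignment) maps a value from a known range onto [0..1]:

```scheme
;; Hypothetical helper: map value from the range [lo..hi] onto [0..1].
(define rescale
  (lambda (value lo hi)
    (/ (- value lo) (- hi lo))))
```

For the attribute columns here, which run from 1 to 10, (rescale 7 1 10) evaluates to 2/3.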
f. The first 400 rows will be our training data. Arrange to put all of those rows into a new vector of vectors called training-samples. Arrange to put the remaining rows into a vector of vectors called test-samples.
g. Using the refine procedure you wrote earlier, develop a set of weights that seems to successfully predict whether the cell is benign or malignant.
h. Determine how well your procedure does on the test samples. Indicate the percent of false positives and false negatives.
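One way to organize that count (a sketch under some assumptions: outputs converted to 0/1 as in step d, predictions of 1/2 or more rounded to 1, and the helper name tally-errors is ours; a compact predict is included so the sketch stands alone):

```scheme
;; A compact predict, included only so this sketch is self-contained.
(define predict
  (lambda (inputs weights)
    (for/sum ([i inputs] [w weights]) (* i w))))

;; Hypothetical helper: count false positives and false negatives on a
;; set of samples, treating predictions of 1/2 or more as 1.
(define tally-errors
  (lambda (samples weights)
    (let kernel ([pos 0]
                 [false-positives 0]
                 [false-negatives 0])
      (if (< pos (vector-length samples))
          (let* ([sample (vector-ref samples pos)]
                 [len (vector-length sample)]
                 [inputs (vector-take sample (- len 1))]
                 [actual (vector-ref sample (- len 1))]
                 [guess (if (< (predict inputs weights) 1/2) 0 1)])
            (kernel (+ pos 1)
                    (+ false-positives (if (and (= guess 1) (= actual 0)) 1 0))
                    (+ false-negatives (if (and (= guess 0) (= actual 1)) 1 0))))
          (list false-positives false-negatives)))))
```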
i. Write an English-language interpretation of the set of weights.
Notes
There are a few subtleties to the algorithm.
You may notice that if all of the inputs are 0, the weights won’t change. In that case, I would suggest that you change each weight equally. (You can tell that you have all zero inputs if their sum is zero.)
In computing the sum, you should take the sum of the absolute values of the inputs. However, in computing the change, you should use the actual value of the input. (I don’t think we have any negative inputs in our model, but the general case does have negative inputs.)
Evaluation
We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly).