Assignment 7: Perceptrons and Learning

This assignment is still being developed. Stay tuned!

Warning
The data set in this assignment pertains to breast cancer. Breast cancer has impacted the lives of many people. If you are uncomfortable working with the data set or discussing the related issues, please speak with your instructor about using another data set.
Summary
For this assignment, you will explore some basic techniques of “learning” and build one or more simple classifiers.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answers to csc151-03-grader@grinnell.edu. The subject of your email should be “[CSC151 03] Assignment 7: Perceptrons and Learning”, and the body should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.
Warning
So that this assignment is a learning experience for everyone, we may spend class time publicly critiquing your work.

Background

One important way computer science contributes to the broader field of data science is through so-called “machine learning” techniques. Broadly speaking, these techniques take a process for classifying pieces of data and repeatedly refine it: they run the process on known inputs/outputs and adjust the classification scheme based on the relationship between those inputs and outputs. The set of known inputs/outputs is called the “training set”.

In this assignment, we will consider a relatively straightforward classification approach in which we identify a variety of numeric characteristics of our input, multiply each characteristic by a computed “weight”, add the results, and then compare the result to some threshold.
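The weighted-sum idea can be sketched in a few lines of Scheme. This is only an illustration under stated assumptions: the names weighted-sum and above-threshold? are hypothetical, the characteristics and weights are assumed to be lists of equal length, the numbers are made up, and the choice of >= for the comparison is a guess rather than something the assignment specifies.

```scheme
;; Hypothetical helper: multiply each characteristic by its weight
;; and add the results. Both arguments are lists of equal length.
(define (weighted-sum features weights)
  (apply + (map * features weights)))

;; Hypothetical helper: #t when the weighted sum reaches the threshold.
;; Using >= (rather than >) is an assumption made for this sketch.
(define (above-threshold? features weights threshold)
  (>= (weighted-sum features weights) threshold))

;; Made-up numbers, for illustration only.
(weighted-sum '(1 0 3) '(0.5 2 1))            ; 0.5 + 0 + 3 = 3.5
(above-threshold? '(1 0 3) '(0.5 2 1) 3)      ; 3.5 >= 3, so #t
```

Changing a weight changes how much the matching characteristic counts toward the final decision, which is what the learning process adjusts.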

For example, suppose we were trying to predict whether first-year students were going to declare a computer science major. We might have information on (a) whether or not they listed “computer science” on their list of prospective fields on their Common Application; (b) how many prospective fields they listed on their Common Application; (c) their SAT or ACT math scores; (d) the semester in which they first took computer science at Grinnell; (e) their grade in that class; and (f) their grade in their first mathematics course. Which of these is most predictive? We don’t know at first, so we make guesses. That gives us a heuristic for predicting majors. We then compare our prediction to the actual state for each element in our training set (e.g., students in the class of 2018). When our predictions fail to match, we look for the components that most likely contributed to that failure and update them. For example, we might have thought that the listing of CS on the Common App is predictive, but we might discover that it fails to distinguish students well.
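One common way to “update the components that contributed to a failure” is the classic perceptron update rule, sketched below. This is an assumption about what such an update might look like, not necessarily the approach this assignment will ask for. The name update-weights and the rate parameter are hypothetical, and outcomes are assumed to be encoded as 0 or 1.

```scheme
;; Assumed rule (classic perceptron update), for illustration only:
;;   new weight = old weight + rate * (actual - predicted) * feature
;; When the prediction is correct, actual - predicted is 0 and the
;; weights are unchanged. When it is wrong, the weights on larger
;; features move the most.
(define (update-weights weights features actual predicted rate)
  (let ((err (- actual predicted)))
    (map (lambda (w x) (+ w (* rate err x)))
         weights
         features)))

;; Made-up numbers: we predicted 0, but the actual outcome was 1.
(update-weights '(0 0 0) '(1 2 3) 1 0 1)      ; => (1 2 3)

;; A correct prediction leaves the weights alone.
(update-weights '(1 2) '(5 5) 1 1 1)          ; => (1 2)
```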

Preparation

For this assignment, we will be working with a set of data on cell characteristics along with an associated diagnosis. The data set is described at the following address.

http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

Here are the first few lines of the data file.

1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4

The first column represents the identifier of the sample. The last column represents the diagnosis (2 for benign, 4 for malignant). The remaining columns represent the different attributes.
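To make the layout concrete, here is a small sketch that pulls one row apart. It assumes a row has already been read into a list of eleven numbers; the helper names (row-id, row-diagnosis, row-attributes, malignant?) are hypothetical and are not part of the assignment.

```scheme
;; One row from the sample above, written as a list of numbers.
(define sample-row '(1044572 8 7 5 10 7 9 5 5 4 4))

;; Hypothetical helper: the first column is the sample identifier.
(define (row-id row)
  (car row))

;; Hypothetical helper: the last column is the diagnosis (2 or 4).
(define (row-diagnosis row)
  (car (reverse row)))

;; Hypothetical helper: everything between the first and last columns.
(define (row-attributes row)
  (reverse (cdr (reverse (cdr row)))))

;; Hypothetical helper: 4 means malignant, 2 means benign.
(define (malignant? row)
  (= 4 (row-diagnosis row)))

(row-id sample-row)           ; => 1044572
(row-attributes sample-row)   ; => (8 7 5 10 7 9 5 5 4)
(malignant? sample-row)       ; => #t
```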

We’ve put all of the data in the file /home/rebelsky/Desktop/breast-cancer-data.csv.

a. Arrange to read all of the data into a variable called raw-cell-data.

b. As you know from experience, large data sets sometimes have missing columns. Write an instruction or instructions to filter out any data that do not fit the form (e.g., that have fewer than 11 values in a row, that have middle values that are not integers in the range 1-10, that have a final value that is not 2 or 4). Call that result clean-cell-data.

c. It turns out that the standard learning algorithms do better if you have one extra column that is identical for all data. Add that column.

d. The first 400 rows will be our training data. Arrange to put the benign elements in a list of vectors called benign-diagnosis and the malignant ones into a list of vectors called malignant-diagnosis.

e. Create a vector of weights.

f. Write a procedure, (predict row weights) that multiplies each element of row by the corresponding

A naive approach

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly).