Assignment 5 - Distant Reading
- Summary
- For this assignment, you will put your data science skills to work by exploring ways to gather more complex information about text files.
- Collaboration
- You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
- Submitting
- Email your answers to csc151-03-grader@grinnell.edu. The subject of your email should be [CSC151 03] Assignment 5 - Distant Reading, and the body should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.
- Warning
- So that this assignment is a learning experience for everyone, we may spend class time publicly critiquing your work.
- Preparation
- In preparation for this assignment, pick a dozen or so moderate-length texts (at least fifty and no more than a few hundred pages) from Project Gutenberg. At least three but no more than six of the texts should be by the same author.
Background
A number of scholars in the humanities have begun exploring computer-based approaches to uncovering ideas or themes in texts, an approach that some term “distant reading” (to contrast it with the “close reading” that is so central to the understanding of works). In this assignment, we will explore a variety of simple techniques for examining texts with computers. (Traditional distant reading uses much more sophisticated techniques.)
Problem 1: Word lengths
Topics: files, strings, displaying data with histograms
In this problem, you will analyze texts based on the lengths of the words in the text.
Document and write a procedure, (explore-lengths fname), that takes
the name of a text file as input and produces a graph of the frequencies
of the word lengths in the file. You will likely want to follow a
series of steps similar to the following.
a. Read all of the words from the file.
b. Convert the list of words to a list of lengths.
c. Tally the lengths using tally-all.
d. Scale the tallies by dividing each by the total number of words. (That gives us a frequency between 0 and 1.)
e. Put the tallies in order using sort.
f. Display the data using plot with discrete-histogram.
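The middle steps (converting to lengths, tallying, scaling, and sorting) can be sketched on a list of words that is already in memory. This is only one possible approach, not the required solution. It uses plain Racket rather than tally-all, and the name length-frequencies is hypothetical. It assumes the list of words is non-empty, and it leaves out reading the file and plotting.

```racket
#lang racket

;; Hypothetical helper (not part of the course library).
;; Given a non-empty list of words, produce a list of
;; (length frequency) pairs, ordered from shortest to longest length.
(define (length-frequencies words)
  (let* ([lengths (map string-length words)]              ; words -> lengths
         [total (length lengths)]                         ; total number of words
         [distinct (sort (remove-duplicates lengths) <)]) ; each length once, in order
    (map (lambda (len)
           ;; tally this length, then scale by the total
           (list len
                 (/ (count (lambda (l) (= l len)) lengths) total)))
         distinct)))

(length-frequencies '("the" "cat" "chased" "the" "dog"))
; => '((3 4/5) (6 1/5))
```

The frequencies come out as exact fractions. Whether your plotting code needs them converted with exact->inexact is something to check for yourself.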
Using this procedure, create histograms for the books you chose. Write a short note about any similarities or differences you see.
In turning in the assignment, do not submit the histograms themselves.
Rather, submit the documentation and code for explore-lengths and
the instructions for building the histograms.
Problem 2: Word lengths, revisited
Topics: files, strings, displaying data with dot plots
Another characteristic of books we might explore is the relative proportion of long words to short words.
Document and write a procedure, (compare-word-lengths files), that takes
a list of file names as input and produces a scatterplot (a points plot)
with one point for each file in which the x coordinate of the point is
the percentage of words of seven characters or more in the file and the
y coordinate is the percentage of words of four characters or fewer.
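As a starting point, here is a sketch of how the coordinates for a single point might be computed from a list of words. The names percent-satisfying and length-point are hypothetical, the list is assumed to be non-empty, and reading the files and drawing the plot are left to you.

```racket
#lang racket

;; Hypothetical helper: the percentage (0 to 100) of words
;; in a non-empty list for which pred? holds.
(define (percent-satisfying pred? words)
  (* 100 (/ (count pred? words) (length words))))

;; Hypothetical helper: a two-element list (x y), where x is the
;; percentage of words of seven or more characters and y is the
;; percentage of words of four or fewer characters.
(define (length-point words)
  (list (percent-satisfying (lambda (w) (>= (string-length w) 7)) words)
        (percent-satisfying (lambda (w) (<= (string-length w) 4)) words)))

(length-point '("the" "elephant" "ate" "green"))
; => '(25 50)
```

Note that words of five or six characters count toward neither coordinate, so the two percentages need not add up to 100.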
Problem 3: Subsequent words
Topics: files, strings, lists
One of the more substantive things about books that computers can help us explore is words that the author tends to use together.
Document and write a procedure, (subsequent-words
filename word) that, given the name of a file and a word, makes a
list of all the words that follow one, two, or three words after
word. For example, suppose the file "story.txt" contains the
following.
the cat chased the dog around the cat bowls and the dog dish
We should get output like the following.
> (subsequent-words "/home/username/Desktop/story.txt" "dog")
'("around" "the" "cat" "dish")
> (subsequent-words "/home/username/Desktop/story.txt" "cat")
'("chased" "the" "dog" "bowls" "and" "the")
> (subsequent-words "/home/username/Desktop/story.txt" "the")
'("cat" "chased" "the" "dog" "around" "the" "cat" "bowls" "and" "dog" "dish")
Hint: You may find it easiest to start by building a list of four-tuples like the following.
'(("the" "cat" "chased" "the")
("cat" "chased" "the" "dog")
("chased" "the" "dog" "around")
("the" "dog" "around" "the")
...)
There are many approaches to building those lists. One fairly
straightforward one is to make four lists, each of which is “off by one”
from the previous one, and then to join the elements together with map.
E.g., we’d start with the lists
'("the" "cat" "chased" "the" "dog" "around" "the" "cat" "bowls" ...)
'("cat" "chased" "the" "dog" "around" "the" "cat" "bowls" ... "")
'("chased" "the" "dog" "around" "the" "cat" "bowls" ... "" "")
'("the" "dog" "around" "the" "cat" "bowls" ... "" "" "")
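The "off by one" idea can be sketched as follows. This is one possible way to build the four-tuples, not the only one. The names shift-left and four-tuples are hypothetical, and the sketch assumes the list has at least three words.

```racket
#lang racket

;; Hypothetical helper: drop the first n words and pad the end with
;; n empty strings, so the result is as long as the original list.
;; Assumes words has at least n elements.
(define (shift-left words n)
  (append (drop words n) (make-list n "")))

;; Hypothetical helper: join the original list with its three
;; shifted copies, element by element, using map with list.
(define (four-tuples words)
  (map list
       words
       (shift-left words 1)
       (shift-left words 2)
       (shift-left words 3)))

(four-tuples '("the" "cat" "chased" "the" "dog"))
; => '(("the" "cat" "chased" "the")
;      ("cat" "chased" "the" "dog")
;      ("chased" "the" "dog" "")
;      ("the" "dog" "" "")
;      ("dog" "" "" ""))
```

Because map needs all of its lists to be the same length, the padding matters. It also means that the last few tuples contain empty strings, which you will want to keep out of your final result.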
Problem 4: Common connections
Topics: files, strings, tallying, sorting
Document and write a procedure, (common-connections filename word),
that takes as input a file name and a word and produces a list of the
five words that most commonly follow close after the word (one, two,
or three words away) and the number of times they appear nearby.
> (common-connections "/home/username/Desktop/something-weird.txt" "jabberwock")
'(("alice" 191)
("borogoves" 83)
("vorpal" 23)
("sword" 18)
("wabe" 11))
Next, pick three words that you expect to appear in six of your books and find the most common connections to those words in each of those six books.
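Once you have the list of nearby words, the remaining work is tallying and sorting. Here is a sketch of that part in plain Racket. The names tally-words and take-up-to are hypothetical, and you may prefer tally-all from the course library for the tallying step.

```racket
#lang racket

;; Hypothetical helper: a list of (word count) pairs,
;; ordered from most frequent to least frequent.
(define (tally-words words)
  (sort (map (lambda (w)
               (list w (count (lambda (x) (string=? x w)) words)))
             (remove-duplicates words))
        (lambda (a b) (> (cadr a) (cadr b)))))

;; Hypothetical helper: the first n elements of lst,
;; or all of lst if it has fewer than n elements.
(define (take-up-to lst n)
  (if (< (length lst) n)
      lst
      (take lst n)))

(take-up-to (tally-words '("dog" "cat" "dog" "the" "cat" "dog")) 2)
; => '(("dog" 3) ("cat" 2))
```

The check in take-up-to is there because take signals an error when the list is too short, which can happen if fewer than five distinct words follow the word you chose.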
Problem 5: Categorizing words
Topics: files, strings, conditionals, tallying
We’ve seen a number of ways to categorize words. They may be short or long. They may start with or contain certain letters. They may contain repeated letters. They may be near other words. They may be common. They may be uncommon.
First, pick and describe between six and ten categories. Then, document and
write a procedure, (categorize-word word), that gives the category
for a word as a string. You should use “uncategorized” for words that
do not fit into your categories. For example,
> (categorize-word "aardvark")
"starts-with-vowel"
> (categorize-word "jabberwocky")
"Carrollian"
> (categorize-word "defenestrate")
"uncommon"
> (categorize-word "madam")
"palindrome"
> (categorize-word "Grinnell")
"proper-name"
> (categorize-word "elephant")
"starts-with-vowel"
> (categorize-word "me")
"short"
> (categorize-word "hello")
"uncategorized"
If a word falls into multiple categories, your procedure should return only one of them.
Next, document and write a procedure, categorize-words-in-file, that
takes a file name as input and creates a histogram of the categories
in alphabetical order. For each category, you should indicate the
percentage of words that fall in that category.
Finally, categorize six of the books you chose and see whether the categorization tells you anything about the book.
Note: You will almost certainly use a conditional to write
categorize-word.
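To illustrate the shape such a conditional might take, here is a sketch with only three categories. The predicates and the name categorize-word-sketch are hypothetical, and the categories are just examples. You should choose your own.

```racket
#lang racket

;; Hypothetical predicate: does the word read the same in both directions?
(define (palindrome? word)
  (let ([chars (string->list word)])
    (equal? chars (reverse chars))))

;; Hypothetical predicate: does the word begin with a, e, i, o, or u?
(define (starts-with-vowel? word)
  (and (> (string-length word) 0)
       (member (char-downcase (string-ref word 0))
               '(#\a #\e #\i #\o #\u))
       #t))

;; A three-category sketch. cond uses the first clause whose test
;; holds, so the order of the clauses decides which category wins
;; when a word fits more than one.
(define (categorize-word-sketch word)
  (cond
    [(palindrome? word) "palindrome"]
    [(starts-with-vowel? word) "starts-with-vowel"]
    [(<= (string-length word) 2) "short"]
    [else "uncategorized"]))

(categorize-word-sketch "madam")    ; => "palindrome"
(categorize-word-sketch "elephant") ; => "starts-with-vowel"
(categorize-word-sketch "me")       ; => "short"
(categorize-word-sketch "hello")    ; => "uncategorized"
```

With this ordering, a word such as "a" is reported as a palindrome even though it is also short and starts with a vowel.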
Evaluation
We will primarily evaluate your work on correctness (does your code compute what it’s supposed to, and are your procedure descriptions accurate?); clarity (is it easy to tell what your code does and how it achieves its results? is your writing clear and free of jargon?); and concision (have you kept your work short and clean, rather than long and rambly?).