Assignment 4: Text analysis

Assigned: Wednesday, 6 February 2019
Due: Tuesday, 12 February 2019 by 10:30pm
Summary: In this assignment, you will develop a variety of tools that might help you or someone else analyze texts.
Collaboration: You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting: Email your answer to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151-01] Assignment 4 (Your Names), and the body should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.

Problem 1: From strings to sentences

Topics: Regular expressions, String processing

In the lab on regular expressions, you wrote a variety of procedures that took a string as input and created a list of words in the string.

Write a procedure, (string->sentences str), that takes a string as input and produces a list of all the sentences that appear in the string.

> (string->sentences "Hi!  My name is DocR.  Did you get that?  It's short for Doctor Racket.")
'("Hi" "My name is DocR" "Did you get that" "It's short for Doctor Racket")

For extra credit, include the punctuation in the sentences.

> (string->sentences "Hi!  My name is DocR.  Did you get that?  It's short for Doctor Racket.")
'("Hi!" "My name is DocR." "Did you get that?" "It's short for Doctor Racket.")

Problem 2: Extracting text from HTML

Topics: HTML, Regular expressions, File basics

In the lab on regular expressions, you started to explore how you might take a page created in HTML and extract just the text on that page. In case you’ve forgotten, we can start by using regexp-replace* and a pattern like #px"<[^>]*>" to remove all of the tags.
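
As a minimal sketch of the tag-removal step described above, the following applies regexp-replace* with that pattern to a short string. The helper name remove-tags and the sample input are assumptions made for illustration; they are not part of the assignment.

```racket
#lang racket

;; remove-tags is a hypothetical helper name used only for this sketch.
;; It deletes every substring that starts with < and runs to the next >.
(define (remove-tags str)
  (regexp-replace* #px"<[^>]*>" str ""))

(remove-tags "<p>Hello, <b>world</b>!</p>")
; evaluates to "Hello, world!"
```

Note that this removes only the tags themselves; any whitespace around them is left in place.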

But that’s not enough. We also want to remove extra space at the start of each line, collapse multiple blank lines into a single blank line, and collapse multiple spaces into a single space.

a. Write a procedure, (extract-text str), that takes a string (potentially with HTML tags) as input and returns the same string with the policies above applied. That is, remove HTML tags, remove spaces at the beginning of each line, collapse multiple spaces into a single space, and collapse multiple blank lines into a single blank line.

b. Write a procedure, (html->text htmlfile textfile), that takes two strings that name files as parameters, reads the contents of the first file, extracts the text, and then saves the result in the second.

c. One deficiency of the extract-text procedure we’ve written, at least as it applies to some files, is that it does not rejoin paragraphs that have been broken into many short lines. For example, it may be that after stripping tags and whitespace we end up with something like the following.

For example,
it may be that after
stripping tags and
whitespace
we end up with something like the following.

In 
many cases, the 
best strategy is to combine all 
of the text up to each
blank line into a single line.

In many cases, the best strategy is to combine all of the text up to each blank line into a single line. (Your Web browser may wrap the following, but they are each one line.)

For example, it may be that after stripping tags and whitespace we end up with something like the following. 

In many cases, the best strategy is to combine all of the text up to each blank line into a single line.

Write a procedure, (merge-lines source target), that takes two filenames as parameters, applies that process to the contents of the first file, and writes the result to the second file.

Note: You may find it helpful to recall that when you use (regexp-replace* regexp string replacement), any \\1 in the replacement is replaced by the text that matched the first parenthesized expression in regexp. Similarly, \\2 is replaced by the text that matched the second parenthesized expression, and so on and so forth.
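
To illustrate how \\1 and \\2 behave, here is a small sketch that is unrelated to merging lines: it rewrites "Last, First" as "First Last". The helper name swap-names and the sample string are hypothetical and chosen only for illustration.

```racket
#lang racket

;; swap-names is a hypothetical helper used only for this sketch.
;; The first parenthesized expression captures the word before the comma,
;; the second captures the word after it. The replacement emits them in
;; the opposite order.
(define (swap-names str)
  (regexp-replace* #px"(\\w+), (\\w+)" str "\\2 \\1"))

(swap-names "Rebelsky, Samuel")
; evaluates to "Samuel Rebelsky"
```

The same idea applies whenever you want to keep part of what a pattern matched while changing the text around it.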

Problem 3: Simplified sentiment analysis

Topics: Regular expressions, Text files

The technique of sentiment analysis examines texts and extracts information about the sentiments expressed within that text. Are the statements positive? Negative? Nuanced? Do we see joy expressed? Sorrow?

While there are a variety of complex processes used in sentiment analysis, it is possible to do a simplified version of sentiment analysis by looking at the percentage of times that words that represent a certain sentiment appear in a text.

a. Identify at least a half-dozen words or phrases whose presence you expect to signal joy. “Joy” is an obvious one. “Enthusiasm” is perhaps another (or, more generally, words that begin with “enthus”). Then write a procedure, (analyze-joy str), that takes a string as input and determines what percentage of the words in str signal joy.

b. Do the same thing with words or phrases you expect to signal anger. For example, “shout” or “throw”.

c. Pick one other sentiment and develop a procedure to analyze that sentiment.

d. Using your three procedures, analyze the sentiment of three books you download from Project Gutenberg. Include a commented-out copy of your interactions pane in your code file. You can start a multi-line comment with #| and end it with |#.
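
As a small sketch of what a commented-out interactions transcript might look like, the following file defines a procedure and then records one interaction inside a multi-line comment. The procedure double is a hypothetical stand-in for your own code.

```racket
#lang racket

;; double is a hypothetical stand-in for one of your own procedures.
(define (double x)
  (* 2 x))

#|
Interactions pane, pasted in as a comment:

> (double 21)
42
|#
```

Everything between #| and |# is ignored when the file is run, so the transcript can stay alongside the code it documents.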

Problem 4: Exploring verb usage

Topics: Regular expressions, Text files

As you may recall from the introduction to this course, digital humanists often use their programming tools interactively. That is, they identify possible issues of interest, use programming tools to tease out some details, then return to the text to do more analysis.

Let’s try a simple version of that.

1. Identify a text of interest from Project Gutenberg.
2. Identify three protagonists in the text. (If the text does not have at least three protagonists, pick a different text that does.)
3. Write a regular expression to help you identify some of the verbs associated with those protagonists. Most likely, you will just extract two-word phrases whose first word is the protagonist’s name and then skim through the output.
4. Choose four of those verbs.
5. Identify how often each protagonist is associated with each verb. For example, if your protagonists are Andy, Bo, and Charlie and your verb is “eats”, you would count the number of times “Andy eats”, “Bo eats”, and “Charlie eats” appear in the text.
6. Write a paragraph explaining what you’ve discovered. For example, “Food plays a large role in …. However, while Bo is regularly identified as eating something (a sandwich, a hot dog), Andy and Charlie are more regularly described in terms of what they drink. I expect that this represents ….”
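
The counting step above can be sketched as follows, assuming the text is already available as a string. The helper name count-phrase and the sample sentence are hypothetical. This is one possible approach, not a required one.

```racket
#lang racket

;; count-phrase is a hypothetical helper used only for this sketch.
;; regexp-quote escapes any characters in phrase that have a special
;; meaning in regular expressions, and regexp-match* returns a list of
;; all the matches, so its length is the number of occurrences.
(define (count-phrase phrase text)
  (length (regexp-match* (regexp-quote phrase) text)))

(define sample
  "Andy eats soup. Bo eats a sandwich. Later, Andy eats again.")

(count-phrase "Andy eats" sample)    ; evaluates to 2
(count-phrase "Bo eats" sample)      ; evaluates to 1
(count-phrase "Charlie eats" sample) ; evaluates to 0
```

One limitation of this sketch is that it matches substrings, so “Andy eats” would also be counted inside “Mandy eats”. You may want to account for word boundaries in your own solution.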

Problem 5: Text visualization

Topics: Regular expressions, String processing, Images

You’ve now explored two very different kinds of values in Racket: images and text files. (Well, you’ve explored much more than those two, but those are a good starting point.) Let’s combine those two ideas.

Write a procedure, (visualize filename), that does at least three computations based on the named file (e.g., percentage of joyful words, number of sentences) and then provides an “interesting” visual representation of the results of those computations. You might use larger shapes to represent higher frequencies or counts or more transparent colors to represent smaller frequencies or values.

Use your procedure on at least three different files downloaded from Project Gutenberg.

Note: If you have vision limitations that make this assignment difficult, you may instead choose to write a procedure that provides a textual description of the files.

Documentation

For this assignment, you should document your procedures using the 4P documentation style. However, if you’d like to try to add the additional 2P’s (preconditions and postconditions), you are certainly welcome to do so.

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.

Copyright © Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Fahmida Hamid, Titus Klinge, Samuel A. Rebelsky, and Jerod Weinman. Selected materials are copyright by John David Stone or Henry Walker and are used with permission.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
