Mini-Project 3: Textual analysis

Assigned: Friday, 13 November 2020
Summary: In this assignment, you will develop some basic tools for analyzing texts.
Collaboration: Each student should submit their own responses to this assignment. You may consult other students in the class as you prepare this assignment. If you receive help from anyone, make sure to cite them in your responses. You do not need to cite course pages or classroom comments (but it doesn’t hurt).

One of the many ways in which humanists now employ computers is in analyzing texts. That is, they use programs to identify characteristics of texts as a starting point for deeper reflections on those texts. These initial analyses can be complex—such as identifying sets of words that regularly occur together, which raises issues not only of efficiency but how you segment sets—while others can be more straightforward, such as looking at the frequency of adverbs or adjectives in a text.

Note: Please submit this file as mp3.rkt. (No jokes, please.)

Part the first: Word counts

As you saw in a recent lab, the tools you already know permit you to count appearances of words in a longer text. You wrote code to look for a particular set of characters in the books (no, not the componets of strings). However, as a computer scientist you know that you should write general code.

Document and write a procedure, (count-words list-of-words filename) that takes as input a list of words and produces as output a list of lists, each with a word and its frequency. Note that you should do case-insensitive counting; “Sam”, “sam”, “SAM”, and even “saM” should match “Sam”.

> (count-words (list "Sam" "Amazing" "Evil" "Funny") "sams-course-reviews.txt")
'(("Sam" 132) ("Amazing" 2) ("Evil" 666) ("Funny" 0))

We would recommend that you create some simple sample files so that you can run tests on them.

Part the second: Readability

There are a number of algorithms for computing the readability of prose. One that we will implement now is the Dale-Chall readability formula:

https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula

The formula given in 1948 as:

score = 0.1579 × (PDW × 100) + 0.0496 × ASL

PDW is the “percentage of difficult words”, the number of words in the text not found in Dale-Chall’s list of about 3000 familiar words:

http://countwordsworth.com/download/DaleChallEasyWordList.txt

Furthermore, if the percentage of difficult words is above 5%, then add 3.6365 to the score. Otherwise, keep the score as it is.

a. Document and write a procedure, compute-dale-chall-score, that takes three parameters:

num-difficult-words, the number of difficult words in the text
total-words, the total number of words in the text
num-sentences, the number of sentences in the text

And computes the Dale-Chall score from these statistics. Here is an example execution of the procedure:

> (compute-dale-chall-score 35 200 40)
6.64775
> (compute-dale-chall-score 25 100 18)
7.859555555555556

Experiment with your procedure using a variety of inputs to gain confidence that your procedure is implemented correctly.

This score can be then translated into a grade level for which the text is appropriate. The score-to-grade table from Wikipedia is given below:

9 or lower: "4th grade or lower"
0–5.9:      "5th-6th grade"
0–6.9:      "7th-8th grade"
0–7.9:      "9th-10th grade"
0–8.9:      "11th-12th grade"
0 or above: "13th-15th grade"

b. Document and write a procedure, (score->grade score), that takes a Dale-Chall score and returns the grade level corresponding to the range that score falls under. Your function should return the grade level as the strings given in the above table.

c. Come up with a set of short texts that will be useful for exploring Dale-Chall scores. If you email them to me, I’ll post them to the end of this document for others to use. Feel free to use other sources (with citation) such as novels, articles, etc. or try to write sensible sentences that target these grade levels.

d. Write a procedure (dale-chall-score str) that takes a string that represents a text, computes the various aspects of the string (number of words, number of difficult words, number of sentences), applies the computations above, and gives the Dale-Chall score for the word. You can assume that every period represents a sentence break and that only periods represent sentence breaks. You can assume that every space represents a word break and that only spaces represent word breaks.

Note: When reading the list of easy words, make sure to use file->lines rather than file->words.

Part the third: Improving computation speed

While the computer can search a list of about 3000 words fairly quickly, when it’s doing this searching again and again for each word in longer texts, you will start to see a bit of a lag.

By replicating a text multiple times with (apply string-append (make-list NUM TEXT)), you should be able to find a text long enough that the computer takes about ten seconds to compute the Dale-Chall score. For example,

> (define sentence
    "Twas brillig and the slithy toves did gyre and gimble in the wabe. ")
> (dale-chall-score (apply string-append (make-list 10000 sentence))
; Wait wait wait
11.2 ; An invented number

You can find out precisely how long it took with the time operation.

> (time (dale-chall-score (apply string-append (make-list 10000 sentence))))
cpu time: 320 real time: 364 gc time: 230
11.2 ; Still an invented number

“cpu time” represents how much time the computer’s central processing unit spent on the problem. “real time” represents how much time you would see pass on a wall clock. “gc time” is a strange version of time used at Grinnell College, which we will ignore. (In reality, it’s garbage collection time, but we’ll still ignore it.)

We should be able to improve the procedure by breaking the longer list apart into smaller lists that depend only on the first letter.

(define easy-word?
  (let ([easy-a-words '("a" "able" "aboard" ...)]
        [easy-b-words '("baa" "babe" "babies" ...)]
        ...)
    (lambda (word)
      (let ([first (char-downcase (string-ref word 0))])
        (cond
          [(char=? first #\a)
           ; code to check if word is in easy-a-words
           ]
          [(char=? first #\b)
           ; code to check if word is in easy-b-words
           ]
          ...
          [else
           #f])))))

Finish writing easy-word?. Note that you should not manually write the various sets of words. Rather, you should use a helper to extract them automatically from the Dale-Chall list of easy words.

(define easy-word?
  (let ([easy-a-words (words-starting-with #\a easy-words)]
        [easy-b-words (words-starting-with #\b easy-words)]
        ...
        [easy-z-words (words-starting-with #\z easy-words)])
    (lambda (word)
      (let ([first (char-downcase (string-ref word 0))])
        (cond
          [(char=? first #\a)
           (word-in-list? word easy-a-words)]
          [(char=? first #\b)
           (word-in-list? word easy-b-words)]
          ...
          [(char=? first #\z)
           (word-in-list? word easy-z-words)]
          [else
           #f])))))

a. Write a dale-chall-score2 procedure that uses the new easy-word?.

You do not need to call your helpers words-starting-with and word-in-list?, but you may find it helpful to do so.

If you don’t want to type out all of easy-word?, there’s a full version at the end of these instructions. You can also consider writing equally efficient, but less repetitive code.

b. Determine experimentally how much, if any, improvement this change makes. Enter the results of your experiments along with a bit of analysis as a comment in your Racket file.

Part the fourth: Sentiment analysis

Another common approach to text analysis is called “sentiment analysis”. Broadly, sentiment analysis is intended to determine the writer’s overall sentiment in a piece of writing. Are they happy? Sad? Angry? Enthusiastic? I believe Amazon uses sentiment analysis in selecting reviews to prioritize and for other things, too.

The most straightforward form of sentiment analysis involves looking at word frequencies. A document with more positive than negative words is likely to be interepreted as positivve. A document with more negative than positive words is likely to be interpreted as negative.

Conveniently, there are two long lists of positive and negative words available to you.

http://ptrckprry.com/course/ssd/data/positive-words.txt

http://ptrckprry.com/course/ssd/data/negative-words.txt

Document and write a procedure, (posneg str), that takes a string as an input and returns “positive” if the number of positive words is greater than the number of negative words, “negative” if the number of negative words is greater than the positive words, and “neutral” if the two are the same.

Part the fifth: Freestyle

Document and write a procedure that does some otiher kind of textual analysis that you expect would be fun or useful.

Code for part the third

Here’s the ugly code for the third part. I think I got it right.

(define easy-word?
  (let ([easy-a-words (words-starting-with #\a easy-words)]
        [easy-b-words (words-starting-with #\b easy-words)]
        [easy-c-words (words-starting-with #\c easy-words)]
        [easy-d-words (words-starting-with #\d easy-words)]
        [easy-e-words (words-starting-with #\e easy-words)]
        [easy-f-words (words-starting-with #\f easy-words)]
        [easy-g-words (words-starting-with #\g easy-words)]
        [easy-h-words (words-starting-with #\h easy-words)]
        [easy-i-words (words-starting-with #\i easy-words)]
        [easy-j-words (words-starting-with #\j easy-words)]
        [easy-k-words (words-starting-with #\k easy-words)]
        [easy-l-words (words-starting-with #\l easy-words)]
        [easy-m-words (words-starting-with #\m easy-words)]
        [easy-n-words (words-starting-with #\n easy-words)]
        [easy-o-words (words-starting-with #\o easy-words)]
        [easy-p-words (words-starting-with #\p easy-words)]
        [easy-q-words (words-starting-with #\q easy-words)]
        [easy-r-words (words-starting-with #\r easy-words)]
        [easy-s-words (words-starting-with #\s easy-words)]
        [easy-t-words (words-starting-with #\t easy-words)]
        [easy-u-words (words-starting-with #\u easy-words)]
        [easy-v-words (words-starting-with #\v easy-words)]
        [easy-w-words (words-starting-with #\w easy-words)]
        [easy-x-words (words-starting-with #\x easy-words)]
        [easy-y-words (words-starting-with #\y easy-words)]
        [easy-z-words (words-starting-with #\z easy-words)])
    (lambda (word)
      (let ([first (char-downcase (string-ref word 0))])
        (cond
          [(char=? first #\a)
           (word-in-list? word easy-a-words)]
          [(char=? first #\b)
           (word-in-list? word easy-b-words)]
          [(char=? first #\c)
           (word-in-list? word easy-c-words)]
          [(char=? first #\d)
           (word-in-list? word easy-d-words)]
          [(char=? first #\e)
           (word-in-list? word easy-e-words)]
          [(char=? first #\f)
           (word-in-list? word easy-f-words)]
          [(char=? first #\g)
           (word-in-list? word easy-g-words)]
          [(char=? first #\h)
           (word-in-list? word easy-h-words)]
          [(char=? first #\i)
           (word-in-list? word easy-i-words)]
          [(char=? first #\j)
           (word-in-list? word easy-j-words)]
          [(char=? first #\k)
           (word-in-list? word easy-k-words)]
          [(char=? first #\l)
           (word-in-list? word easy-l-words)]
          [(char=? first #\m)
           (word-in-list? word easy-m-words)]
          [(char=? first #\n)
           (word-in-list? word easy-n-words)]
          [(char=? first #\o)
           (word-in-list? word easy-o-words)]
          [(char=? first #\p)
           (word-in-list? word easy-p-words)]
          [(char=? first #\q)
           (word-in-list? word easy-q-words)]
          [(char=? first #\r)
           (word-in-list? word easy-r-words)]
          [(char=? first #\s)
           (word-in-list? word easy-s-words)]
          [(char=? first #\t)
           (word-in-list? word easy-t-words)]
          [(char=? first #\u)
           (word-in-list? word easy-u-words)]
          [(char=? first #\v)
           (word-in-list? word easy-v-words)]
          [(char=? first #\w)
           (word-in-list? word easy-w-words)]
          [(char=? first #\x)
           (word-in-list? word easy-x-words)]
          [(char=? first #\y)
           (word-in-list? word easy-y-words)]
          [(char=? first #\z)
           (word-in-list? word easy-z-words)]
          [else
           #f])))))

Yes, this is hideous. But it’s more efficient than the original version (or it should be). I would, of course, prover that you found a better way to do this. Nontheless, it’s a starting point. Perhaps the utter awfulness of it will incentivize you to find something equally fast but ore elegant.

Sample definitions for Dale-Chall

; Invented by SamR.  Should be fairly readable.
(define simple-01
  "The lamb says baa.")

; Stolen from Sandra Boynton's _Moo, Baa, La La La_.  Transcribed by SamR (almost from memory).
(define boynton-01
  "A cow says moo.  A sheep says baa.  Three singing pigs say la la la.  No no, you say, that isn't right.  The pigs say oink all day and night.  Rhinoceroses snort and snuff and little dogs go ruff ruff ruff.  Some other dogs go bow wow.  And cats and kittens say meow.  Quack says the duck.  A horse says neigh.  It's quiet now.  What do you say?")

Copyright © Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Fahmida Hamid, Titus Klinge, Peter-Michael Osera, Samuel A. Rebelsky, and Jerod Weinman. Selected materials are copyright by John David Stone or Henry Walker and are used with permission.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

This website was built using Jekyll, Twitter Bootstrap, and the Bootswatch Cosmo Theme.