Mini-Project 3: Textual analysis

Assigned: Wednesday, 22 September 2021
Summary: In this assignment, you will develop some basic tools for analyzing texts.
Collaboration: Each student should submit their own responses to this assignment. You may consult other students in the class as you prepare this assignment. If you receive help from anyone, make sure to cite them in your responses. You do not need to cite course pages or classroom comments (but it doesn’t hurt).

One of the many ways in which humanists now employ computers is in analyzing texts. That is, they use programs to identify characteristics of texts as a starting point for deeper reflections on those texts. Some of these initial analyses are complex, such as identifying sets of words that regularly occur together, which raises issues not only of efficiency but also of how you segment the sets. Others are more straightforward, such as looking at the frequency of adverbs or adjectives in a text.

Note: Please submit your work for this assignment in a file named language.rkt.

In addition, please make sure to submit any text files you rely on with this file, and make sure that you read from files using only the direct file name, not with a full path.

Part the first: Word counts

As you saw in a recent lab, the tools you already know permit you to count appearances of words in a longer text. You wrote code to look for a particular set of characters in the books (no, not the components of strings). However, as a computer scientist, you know that you should write general code.

Document and write a procedure, (count-words list-of-words filename), that takes as input a list of words and the name of a file and produces as output a list of lists, each containing a word and its frequency in that file. Note that you should do case-insensitive counting; “Sam”, “sam”, “SAM”, and even “saM” should match “Sam”.

> (count-words (list "Sam" "Amazing" "Evil" "Funny") "sams-course-reviews.txt")
'(("Sam" 132) ("Amazing" 2) ("Evil" 666) ("Funny" 0))
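The case-insensitive comparison at the heart of this part can be sketched in plain Racket as follows. This is only an illustration, not the required solution: `count-matches` is a hypothetical helper name, and the sketch assumes the text has already been split into a list of words.

```racket
#lang racket

;; Hypothetical helper (not part of the assignment's required interface):
;; count how many strings in `words` match `word`, ignoring case.
(define (count-matches word words)
  (count (lambda (candidate) (string-ci=? candidate word)) words))

;; Evaluates to 4: "Samuel" is a different word, so it is not counted.
(count-matches "Sam" (list "Sam" "sam" "SAM" "saM" "Samuel"))
```

Comparing with `string-ci=?` leaves the original words untouched, so the output can still report each word exactly as it was given.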

We would recommend that you create some simple sample files so that you can run tests or experiments on them.

Part the second: Readability

There are a number of algorithms for computing the readability of prose. One that we will implement now is the Dale-Chall readability formula:

https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula

The formula was given in 1948 as:

score = 0.1579 × (PDW × 100) + 0.0496 × ASL

ASL is the “average sentence length”, the total number of words in the text divided by the number of sentences in the text.

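PDW is the “percentage of difficult words”: the number of words in the text not found in Dale-Chall’s list of about 3000 familiar words, divided by the total number of words in the text. It enters the formula as a fraction, which the formula then multiplies by 100. The list of familiar words is available at: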

http://countwordsworth.com/download/DaleChallEasyWordList.txt

Furthermore, if the percentage of difficult words is above 5%, then add 3.6365 to the score. Otherwise, keep the score as it is.

a. Document and write a procedure, compute-dale-chall-score, that takes three parameters:

  • num-difficult-words, the number of difficult words in the text
  • total-words, the total number of words in the text
  • num-sentences, the number of sentences in the text

and computes the Dale-Chall score from these statistics. Here are two example executions of the procedure:

> (compute-dale-chall-score 35 200 40)
6.64775
> (compute-dale-chall-score 25 100 18)
7.859555555555556
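As a hand check of the first example, here is the arithmetic written out. It assumes, as the sample outputs suggest, that PDW enters the formula as a fraction (35/200 = 0.175), which the formula then scales by 100.

```latex
\begin{aligned}
\mathrm{PDW} &= 35 / 200 = 0.175 \\
\mathrm{ASL} &= 200 / 40 = 5 \\
\text{raw score} &= 0.1579 \times (0.175 \times 100) + 0.0496 \times 5
                  = 2.76325 + 0.248 = 3.01125 \\
\text{score} &= 3.01125 + 3.6365 = 6.64775
               \quad \text{(adjusted, since } 17.5\% > 5\%\text{)}
\end{aligned}
```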

Experiment with your procedure using a variety of inputs to gain confidence that your procedure is implemented correctly.

This score can then be translated into a grade level for which the text is appropriate. The score-to-grade table from Wikipedia is given below:

4.9 or lower: "4th grade or lower"
5.0–5.9:      "5th-6th grade"
6.0–6.9:      "7th-8th grade"
7.0–7.9:      "9th-10th grade"
8.0–8.9:      "11th-12th grade"
9.0 or above: "13th-15th grade"

You should, of course, deal with the scores that the table leaves unclear, such as 4.95. As is normal in our class, please use ranges like 5.0 (inclusive) to 6.0 (exclusive):

less than 5.0:      "4th grade or lower"
5.0 <= score < 6.0: "5th-6th grade"
6.0 <= score < 7.0: "7th-8th grade"
7.0 <= score < 8.0: "9th-10th grade"
8.0 <= score < 9.0: "11th-12th grade"
at least 9.0:       "13th-15th grade"
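A half-open range such as 5.0 (inclusive) to 6.0 (exclusive) can be tested as sketched below. The helper name `in-half-open-range?` is hypothetical and is shown only to make the boundary behavior concrete; it is not a required part of the assignment.

```racket
#lang racket

;; Hypothetical helper: is score in the half-open range [lower, upper)?
(define (in-half-open-range? lower score upper)
  (and (<= lower score) (< score upper)))

(in-half-open-range? 5.0 5.0 6.0)   ; #t: the lower bound is included
(in-half-open-range? 5.0 5.95 6.0)  ; #t: scores between the table's rows are covered
(in-half-open-range? 5.0 6.0 6.0)   ; #f: the upper bound belongs to the next range
```

Because each upper bound is the next range's lower bound, every score falls into exactly one range.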

b. Document and write a procedure, (score->grade score), that takes a Dale-Chall score and returns the grade level corresponding to the range that the score falls in. Your procedure should return the grade level as one of the strings given in the table above.

c. Come up with a set of short texts that will be useful for exploring Dale-Chall scores. If you email them to me, I’ll post them to the end of this document for others to use. Feel free to use other sources (with citation) such as novels, articles, etc. or try to write sensible sentences that target these grade levels.

d. Write a procedure, (dale-chall-score str), that takes a string that represents a text, computes the various statistics for the string (number of words, number of difficult words, number of sentences), applies the computations above, and gives the Dale-Chall score for the text. You can assume that every period represents a sentence break and that only periods represent sentence breaks. You can assume that every space represents a word break and that only spaces represent word breaks.
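Under those simplifying assumptions, the word and sentence counts can be sketched in plain Racket as follows. The names `count-words-simple` and `count-sentences-simple` are hypothetical. Note one deliberate deviation, labeled as an assumption: `string-split` with no separator breaks on runs of whitespace rather than on single spaces, so double spaces after periods do not produce empty words.

```racket
#lang racket

;; Count words, assuming that whitespace (one or more characters) separates words.
(define (count-words-simple str)
  (length (string-split str)))

;; Count sentences, assuming that each period ends exactly one sentence.
(define (count-sentences-simple str)
  (count (lambda (ch) (char=? ch #\.)) (string->list str)))

(define sample "A cow says moo.  A sheep says baa.")
(count-words-simple sample)      ; 8
(count-sentences-simple sample)  ; 2
```

The pieces produced by splitting this way still carry their punctuation (for example, "moo."), which matters when you check them against the list of easy words.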

Note: When reading the list of easy words, please use file->string and the following procedure. (Why not use file->words? Because some of the words in the file have apostrophes, and file->words does not include apostrophes. Why not file->lines? Because some of the lines end with spaces. This approach seems safer.)

;;; (extract-words str)
;;;   str : string?
;;; Extract all of the words from str.  A word is a sequence of
;;; letters and apostrophes.
(define extract-words
  (let ([rex-word
         (rex-repeat (rex-any-of (rex-char-range #\a #\z)
                                 (rex-char-range #\A #\Z)
                                 (rex-string "'")))])
    (lambda (str)
      (rex-find-matches rex-word str))))

(test-equal? "empty string"
             (extract-words "")
             '())
(test-equal? "a few words separated by spaces"
             (extract-words "this and that and the other thing")
             '("this" "and" "that" "and" "the" "other" "thing"))
(test-equal? "a few words with strange punctuation"
             (extract-words "this; and!that    and the?, other thing")
             '("this" "and" "that" "and" "the" "other" "thing"))
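For readers without the course's `rex` procedures at hand, roughly the same behavior can be sketched with Racket's built-in regular expressions. The name `extract-words/rx` is hypothetical, and this is an assumed equivalent based on the documentation above, not the course library's implementation.

```racket
#lang racket

;; One or more letters or apostrophes, mirroring the description of extract-words.
(define (extract-words/rx str)
  (regexp-match* #rx"[a-zA-Z']+" str))

(extract-words/rx "this; and!that    and the?, other thing")
; '("this" "and" "that" "and" "the" "other" "thing")
(extract-words/rx "isn't it")
; '("isn't" "it")
```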

Part the third: Sentiment analysis

Another common approach to text analysis is called “sentiment analysis”. Broadly, sentiment analysis is intended to determine the writer’s overall sentiment in a piece of writing. Are they happy? Sad? Angry? Enthusiastic? Rumor has it that Amazon uses sentiment analysis in selecting reviews to prioritize and for other things, too.

The most straightforward form of sentiment analysis involves looking at word frequencies. A document with more positive than negative words is likely to be interpreted as positive. A document with more negative than positive words is likely to be interpreted as negative.

Conveniently, there are two long lists of positive and negative words available to you.

http://ptrckprry.com/course/ssd/data/positive-words.txt

http://ptrckprry.com/course/ssd/data/negative-words.txt

Please read the notes at the top of those files before you use them.

Document and write a procedure, (posneg str), that takes a string as an input and returns “positive” if the percent of positive words is at least 5% higher than the percent of negative words, “negative” if the percent of negative words is at least 5% higher than the percent of positive words, and “neutral” the rest of the time.
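One building block for this part is computing what percent of a text's words appear in a given word list. The sketch below is one possible approach, not the required design: `percent-in` is a hypothetical helper, it assumes the text has already been split into words, and it leaves the comparison of the two percentages to you.

```racket
#lang racket

;; Hypothetical helper: what percent of `words` appear in `vocabulary`?
;; Comparison is case-insensitive; an empty list of words yields 0.
(define (percent-in words vocabulary)
  (let ([vocab (list->set (map string-downcase vocabulary))])
    (if (null? words)
        0
        (* 100
           (/ (count (lambda (w) (set-member? vocab (string-downcase w))) words)
              (length words))))))

;; Two of the four words are in the vocabulary, so this evaluates to 50.
(percent-in (list "A" "Great" "good" "day") (list "good" "great"))
```

Storing the vocabulary in a set makes each membership check fast, which helps with word lists containing thousands of entries. In a full solution you would likely build that set once rather than on every call.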

Part the fourth: Basic gender analysis

a. Write a procedure, (count-male-pronouns str), that takes a string as input and counts the number of male pronouns (“he”, “him”, “his”) in the string. Be careful not to count words like “history”.

;;; (count-male-pronouns str) -> integer?
;;;   str : string?
;;; Count how many times male pronouns appear as
;;; individual words in str.

b. Write a procedure, (count-female-pronouns str), that takes a string as input and counts the number of female pronouns (“she”, “her”, “hers”) in the string. Once again, you should be careful not to count words like “heroic”.

;;; (count-female-pronouns str) -> integer?
;;;   str : string?
;;; Count how many times female pronouns appear as
;;; individual words in str.
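Both pronoun counters hinge on matching whole words only. One way to do that, sketched here with Racket's built-in regular expressions rather than the course's `rex` procedures, is to use the word-boundary pattern `\b`. The helper `count-cat` and the word "cat" are hypothetical stand-ins chosen for illustration.

```racket
#lang racket

;; Hypothetical helper: count whole-word, case-insensitive occurrences of "cat".
;; \b matches a word boundary, so "category" and "concat" do not count.
(define (count-cat str)
  (length (regexp-match* #px"(?i:\\bcat\\b)" str)))

(count-cat "Cat, category, concat, cat.")  ; 2
```

An alternative that avoids `\b` is to extract the words first and then compare each one to the target words.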

c. Using your two procedures, determine whether the book you chose is more likely to use male pronouns or female pronouns. Insert the results of your experiments as a comment in your code file. If you deem it appropriate, you might also explore the use of non-gendered pronouns.

Part the fifth: Freestyle

Document and write two non-trivial procedures that do some other kind of textual analysis that you expect would be fun or useful.

Both procedures must make use of regular expressions in some way.

Please do not just write small variants of the procedures above.

Partial rubric

In grading this assignment, we will look for the following at each level. We may also identify other characteristics that move your work between levels.

This rubric may be updated closer to the due date.

Redo or above

Submissions that lack any of these characteristics will get an I.

[ ] File named correctly (`language.rkt`).
[ ] File contains introductory comment with name, date, assignment, 
    course, citations.
[ ] File "runs" correctly in DrRacket.
[ ] Reformatted code with Ctrl-I before submitting.
[ ] If the program references other files, those files were submitted.
[ ] If the program references other files, it does so with the base file
    name, rather than a complex path.
[ ] In part two, Dale-Chall scores are close to correct, but need not
    be exactly correct.
[ ] In part three, cites the positive-words and negative-words files.

Meets expectations or above

Submissions that lack any of these characteristics will get an R.

[ ] Good style.
[ ] All procedures are documented and the documentation is (mostly)
    correct.
[ ] Clear variable names. 
[ ] In part two, avoids repeated computation in `compute-dale-chall-score`.
[ ] In part two, avoids reading files more than once.

Exemplary / Exceeds expectations

[ ] Avoids other expensive repeated work. 
[ ] Appropriately decomposes the problem in part three.
[ ] Particularly interesting procedures in part five.
[ ] In computing the Dale-Chall score, finds words with something more
    clever than "break at spaces".
[ ] In computing the Dale-Chall score, finds sentences with something more
    clever than "break at periods".

Sample definitions for Dale-Chall experiments.

; Invented by SamR.  Should be fairly readable.
(define simple-01
  "The lamb says baa.")

; Stolen from Sandra Boynton's _Moo, Baa, La La La_.  Transcribed by SamR (almost from memory).
(define boynton-01
  "A cow says moo.  A sheep says baa.  Three singing pigs say la la la.  No no, you say, that isn't right.  The pigs say oink all day and night.  Rhinoceroses snort and snuff and little dogs go ruff ruff ruff.  Some other dogs go bow wow.  And cats and kittens say meow.  Quack says the duck.  A horse says neigh.  It's quiet now.  What do you say?")