One of the many ways in which humanists now employ computers is in analyzing texts. That is, they use programs to identify characteristics of texts as a starting point for deeper reflections on those texts. These initial analyses can be complex—such as identifying sets of words that regularly occur together, which raises issues not only of efficiency but how you segment sets—while others can be more straightforward, such as looking at the frequency of adverbs or adjectives in a text.
Note: Please submit this file as mp3.rkt. (No jokes, please.)
As you saw in a recent lab, the tools you already know permit you to count appearances of words in a longer text. You wrote code to look for a particular set of characters in the books (no, not the componets of strings). However, as a computer scientist you know that you should write general code.
Document and write a procedure, (count-words list-of-words filename)
that takes as input a list of words and produces as output a list
of lists, each with a word and its frequency. Note that you should
do case-insensitive counting; “Sam”, “sam”, “SAM”, and even “saM”
should match “Sam”.
> (count-words (list "Sam" "Amazing" "Evil" "Funny") "sams-course-reviews.txt")
'(("Sam" 132) ("Amazing" 2) ("Evil" 666) ("Funny" 0))
We would recommend that you create some simple sample files so that you can run tests on them.
There are a number of algorithms for computing the readability of prose. One that we will implement now is the Dale-Chall readability formula:
https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula
The formula given in 1948 as:
score = 0.1579 × (PDW × 100) + 0.0496 × ASL
PDW is the “percentage of difficult words”, the number of words in the
text not found in Dale-Chall’s list of about 3000 familiar words:
http://countwordsworth.com/download/DaleChallEasyWordList.txt
Furthermore, if the percentage of difficult words is above 5%, then add 3.6365 to the score. Otherwise, keep the score as it is.
a. Document and write a procedure, compute-dale-chall-score, that takes three parameters:
And computes the Dale-Chall score from these statistics. Here is an example execution of the procedure:
> (compute-dale-chall-score 35 200 40)
6.64775
> (compute-dale-chall-score 25 100 18)
7.859555555555556
Experiment with your procedure using a variety of inputs to gain confidence that your procedure is implemented correctly.
This score can be then translated into a grade level for which the text is appropriate. The score-to-grade table from Wikipedia is given below:
4.9 or lower: "4th grade or lower"
5.0–5.9: "5th-6th grade"
6.0–6.9: "7th-8th grade"
7.0–7.9: "9th-10th grade"
8.0–8.9: "11th-12th grade"
9.0 or above: "13th-15th grade"
b. Document and write a procedure, (score->grade score), that
takes a Dale-Chall score and returns the grade level corresponding
to the range that score falls under. Your function should return
the grade level as the strings given in the above table.
c. Come up with a set of short texts that will be useful for exploring Dale-Chall scores. If you email them to me, I’ll post them to the end of this document for others to use. Feel free to use other sources (with citation) such as novels, articles, etc. or try to write sensible sentences that target these grade levels.
d. Write a procedure (dale-chall-score str) that takes a string
that represents a text, computes the various aspects of the string
(number of words, number of difficult words, number of sentences),
applies the computations above, and gives the Dale-Chall score for
the word. You can assume that every period represents a sentence
break and that only periods represent sentence breaks. You can
assume that every space represents a word break and that only spaces
represent word breaks.
Note: When reading the list of easy words, make sure to use file->lines
rather than file->words.
While the computer can search a list of about 3000 words fairly quickly, when it’s doing this searching again and again for each word in longer texts, you will start to see a bit of a lag.
By replicating a text multiple times with (apply string-append
(make-list NUM TEXT)), you should be able to find a text long
enough that the computer takes about ten seconds to compute the
Dale-Chall score. For example,
> (define sentence
"Twas brillig and the slithy toves did gyre and gimble in the wabe. ")
> (dale-chall-score (apply string-append (make-list 10000 sentence))
; Wait wait wait
11.2 ; An invented number
You can find out precisely how long it took with the time operation.
> (time (dale-chall-score (apply string-append (make-list 10000 sentence))))
cpu time: 320 real time: 364 gc time: 230
11.2 ; Still an invented number
“cpu time” represents how much time the computer’s central processing unit spent on the problem. “real time” represents how much time you would see pass on a wall clock. “gc time” is a strange version of time used at Grinnell College, which we will ignore. (In reality, it’s garbage collection time, but we’ll still ignore it.)
We should be able to improve the procedure by breaking the longer list apart into smaller lists that depend only on the first letter.
(define easy-word?
(let ([easy-a-words '("a" "able" "aboard" ...)]
[easy-b-words '("baa" "babe" "babies" ...)]
...)
(lambda (word)
(let ([first (char-downcase (string-ref word 0))])
(cond
[(char=? first #\a)
; code to check if word is in easy-a-words
]
[(char=? first #\b)
; code to check if word is in easy-b-words
]
...
[else
#f])))))
Finish writing easy-word?. Note that you should not manually
write the various sets of words. Rather, you should use a helper
to extract them automatically from the Dale-Chall list of easy words.
(define easy-word?
(let ([easy-a-words (words-starting-with #\a easy-words)]
[easy-b-words (words-starting-with #\b easy-words)]
...
[easy-z-words (words-starting-with #\z easy-words)])
(lambda (word)
(let ([first (char-downcase (string-ref word 0))])
(cond
[(char=? first #\a)
(word-in-list? word easy-a-words)]
[(char=? first #\b)
(word-in-list? word easy-b-words)]
...
[(char=? first #\z)
(word-in-list? word easy-z-words)]
[else
#f])))))
a. Write a dale-chall-score2 procedure that uses the new easy-word?.
You do not need to call your helpers words-starting-with and
word-in-list?, but you may find it helpful to do so.
If you don’t want to type out all of easy-word?, there’s a full version at the end of these instructions.
You can also consider writing equally efficient, but less repetitive code.
b. Determine experimentally how much, if any, improvement this change makes. Enter the results of your experiments along with a bit of analysis as a comment in your Racket file.
Another common approach to text analysis is called “sentiment analysis”. Broadly, sentiment analysis is intended to determine the writer’s overall sentiment in a piece of writing. Are they happy? Sad? Angry? Enthusiastic? I believe Amazon uses sentiment analysis in selecting reviews to prioritize and for other things, too.
The most straightforward form of sentiment analysis involves looking at word frequencies. A document with more positive than negative words is likely to be interepreted as positivve. A document with more negative than positive words is likely to be interpreted as negative.
Conveniently, there are two long lists of positive and negative words available to you.
http://ptrckprry.com/course/ssd/data/positive-words.txt
http://ptrckprry.com/course/ssd/data/negative-words.txt
Document and write a procedure, (posneg str), that takes a
string as an input and returns “positive” if the number of
positive words is greater than the number of negative words,
“negative” if the number of negative words is greater than the
positive words, and “neutral” if the two are the same.
Document and write a procedure that does some otiher kind of textual analysis that you expect would be fun or useful.
Here’s the ugly code for the third part. I think I got it right.
(define easy-word?
(let ([easy-a-words (words-starting-with #\a easy-words)]
[easy-b-words (words-starting-with #\b easy-words)]
[easy-c-words (words-starting-with #\c easy-words)]
[easy-d-words (words-starting-with #\d easy-words)]
[easy-e-words (words-starting-with #\e easy-words)]
[easy-f-words (words-starting-with #\f easy-words)]
[easy-g-words (words-starting-with #\g easy-words)]
[easy-h-words (words-starting-with #\h easy-words)]
[easy-i-words (words-starting-with #\i easy-words)]
[easy-j-words (words-starting-with #\j easy-words)]
[easy-k-words (words-starting-with #\k easy-words)]
[easy-l-words (words-starting-with #\l easy-words)]
[easy-m-words (words-starting-with #\m easy-words)]
[easy-n-words (words-starting-with #\n easy-words)]
[easy-o-words (words-starting-with #\o easy-words)]
[easy-p-words (words-starting-with #\p easy-words)]
[easy-q-words (words-starting-with #\q easy-words)]
[easy-r-words (words-starting-with #\r easy-words)]
[easy-s-words (words-starting-with #\s easy-words)]
[easy-t-words (words-starting-with #\t easy-words)]
[easy-u-words (words-starting-with #\u easy-words)]
[easy-v-words (words-starting-with #\v easy-words)]
[easy-w-words (words-starting-with #\w easy-words)]
[easy-x-words (words-starting-with #\x easy-words)]
[easy-y-words (words-starting-with #\y easy-words)]
[easy-z-words (words-starting-with #\z easy-words)])
(lambda (word)
(let ([first (char-downcase (string-ref word 0))])
(cond
[(char=? first #\a)
(word-in-list? word easy-a-words)]
[(char=? first #\b)
(word-in-list? word easy-b-words)]
[(char=? first #\c)
(word-in-list? word easy-c-words)]
[(char=? first #\d)
(word-in-list? word easy-d-words)]
[(char=? first #\e)
(word-in-list? word easy-e-words)]
[(char=? first #\f)
(word-in-list? word easy-f-words)]
[(char=? first #\g)
(word-in-list? word easy-g-words)]
[(char=? first #\h)
(word-in-list? word easy-h-words)]
[(char=? first #\i)
(word-in-list? word easy-i-words)]
[(char=? first #\j)
(word-in-list? word easy-j-words)]
[(char=? first #\k)
(word-in-list? word easy-k-words)]
[(char=? first #\l)
(word-in-list? word easy-l-words)]
[(char=? first #\m)
(word-in-list? word easy-m-words)]
[(char=? first #\n)
(word-in-list? word easy-n-words)]
[(char=? first #\o)
(word-in-list? word easy-o-words)]
[(char=? first #\p)
(word-in-list? word easy-p-words)]
[(char=? first #\q)
(word-in-list? word easy-q-words)]
[(char=? first #\r)
(word-in-list? word easy-r-words)]
[(char=? first #\s)
(word-in-list? word easy-s-words)]
[(char=? first #\t)
(word-in-list? word easy-t-words)]
[(char=? first #\u)
(word-in-list? word easy-u-words)]
[(char=? first #\v)
(word-in-list? word easy-v-words)]
[(char=? first #\w)
(word-in-list? word easy-w-words)]
[(char=? first #\x)
(word-in-list? word easy-x-words)]
[(char=? first #\y)
(word-in-list? word easy-y-words)]
[(char=? first #\z)
(word-in-list? word easy-z-words)]
[else
#f])))))
Yes, this is hideous. But it’s more efficient than the original version (or it should be). I would, of course, prover that you found a better way to do this. Nontheless, it’s a starting point. Perhaps the utter awfulness of it will incentivize you to find something equally fast but ore elegant.
; Invented by SamR. Should be fairly readable.
(define simple-01
"The lamb says baa.")
; Stolen from Sandra Boynton's _Moo, Baa, La La La_. Transcribed by SamR (almost from memory).
(define boynton-01
"A cow says moo. A sheep says baa. Three singing pigs say la la la. No no, you say, that isn't right. The pigs say oink all day and night. Rhinoceroses snort and snuff and little dogs go ruff ruff ruff. Some other dogs go bow wow. And cats and kittens say meow. Quack says the duck. A horse says neigh. It's quiet now. What do you say?")