One of the many ways in which humanists now employ computers is in analyzing texts. That is, they use programs to identify characteristics of texts as a starting point for deeper reflections on those texts. These initial analyses can be complex—such as identifying sets of words that regularly occur together, which raises issues not only of efficiency but how you segment sets—while others can be more straightforward, such as looking at the frequency of adverbs or adjectives in a text.
Note: Please submit this file as language.rkt.
In addition, please make sure to submit any text files you rely on with this file, and make sure that you read from files using only the direct file name, not with a full path.
As you saw in a recent lab, the tools you already know permit you to count appearances of words in a longer text. You wrote code to look for a particular set of characters in the books (no, not the componets of strings). However, as a computer scientist you know that you should write general code.
Document and write a procedure, (count-words list-of-words filename)
that takes as input a list of words and produces as output a list
of lists, each with a word and its frequency. Note that you should
do case-insensitive counting; “Sam”, “sam”, “SAM”, and even “saM”
should match “Sam”.
> (count-words (list "Sam" "Amazing" "Evil" "Funny") "sams-course-reviews.txt")
'(("Sam" 132) ("Amazing" 2) ("Evil" 666) ("Funny" 0))
We would recommend that you create some simple sample files so that you can run tests or experiments on them.
There are a number of algorithms for computing the readability of prose. One that we will implement now is the Dale-Chall readability formula:
https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula
The formula given in 1948 as:
score = 0.1579 × (PDW × 100) + 0.0496 × ASL
ASL is the “average sentence length”, the total number of words
in the text divided by the number of sentences in the text.
PDW is the “percentage of difficult words”, the number of words in the
text not found in Dale-Chall’s list of about 3000 familiar words:
http://countwordsworth.com/download/DaleChallEasyWordList.txt
Furthermore, if the percentage of difficult words is above 5%, then add 3.6365 to the score. Otherwise, keep the score as it is.
a. Document and write a procedure, compute-dale-chall-score, that takes three parameters:
And computes the Dale-Chall score from these statistics. Here is an example execution of the procedure:
> (compute-dale-chall-score 35 200 40)
6.64775
> (compute-dale-chall-score 25 100 18)
7.859555555555556
Experiment with your procedure using a variety of inputs to gain confidence that your procedure is implemented correctly.
This score can be then translated into a grade level for which the text is appropriate. The score-to-grade table from Wikipedia is given below:
4.9 or lower: "4th grade or lower"
5.0–5.9: "5th-6th grade"
6.0–6.9: "7th-8th grade"
7.0–7.9: "9th-10th grade"
8.0–8.9: "11th-12th grade"
9.0 or above: "13th-15th grade"
You should, of course, deal with the numbers that the table makes unclear. As is normal in our class, please use ranges like 5.0 (inclusive) to 6.0 (exclusive).
less than 5.0: "4th grade or lower"
5.0 <= score < 6.0: "5th-6th grade"
6.0 <= score < 7.0: "7th-8th grade"
7.0 <= score < 8.0: "9th-10th grade"
8.0 <= score < 9.0: "11th-12th grade"
at least 9.0: "13th-15th grade"
b. Document and write a procedure, (score->grade score), that
takes a Dale-Chall score and returns the grade level corresponding
to the range that score falls under. Your function should return
the grade level as the strings given in the above table.
c. Come up with a set of short texts that will be useful for exploring Dale-Chall scores. If you email them to me, I’ll post them to the end of this document for others to use. Feel free to use other sources (with citation) such as novels, articles, etc. or try to write sensible sentences that target these grade levels.
d. Write a procedure (dale-chall-score str) that takes a string
that represents a text, computes the various aspects of the string
(number of words, number of difficult words, number of sentences),
applies the computations above, and gives the Dale-Chall score for
the word. You can assume that every period represents a sentence
break and that only periods represent sentence breaks. You can
assume that every space represents a word break and that only spaces
represent word breaks.
Note: When reading the list of easy words, please use file->string
and the following procedure. (Why not use file->words? Because some
of the words in the file have apostrophes, and file->words does not
include apostrophes. Why not file->lines? Because some of the
lines end with spaces. This approach seems safer.)
;;; (extract-words str)
;;; str : string?
;;; Extract all of the words from str. A word is a sequence of
;;; letters and apostrophes.
(define extract-words
(let ([rex-word
(rex-repeat (rex-any-of (rex-char-range #\a #\z)
(rex-char-range #\A #\Z)
(rex-string "'")))])
(lambda (str)
(rex-find-matches rex-word str))))
(test-equal? "empty string"
(extract-words "")
'())
(test-equal? "a few words separated by spaces"
(extract-words "this and that and the other thing")
'("this" "and" "that" "and" "the" "other" "thing"))
(test-equal? "a few words separated by spaces"
(extract-words "this and that and the other thing")
'("this" "and" "that" "and" "the" "other" "thing"))
(test-equal? "a few words with strange punctuation"
(extract-words "this; and!that and the?, other thing")
'("this" "and" "that" "and" "the" "other" "thing"))
Another common approach to text analysis is called “sentiment analysis”. Broadly, sentiment analysis is intended to determine the writer’s overall sentiment in a piece of writing. Are they happy? Sad? Angry? Enthusiastic? Rumor has it that Amazon uses sentiment analysis in selecting reviews to prioritize and for other things, too.
The most straightforward form of sentiment analysis involves looking at word frequencies. A document with more positive than negative words is likely to be interepreted as positivve. A document with more negative than positive words is likely to be interpreted as negative.
Conveniently, there are two long lists of positive and negative words available to you.
http://ptrckprry.com/course/ssd/data/positive-words.txt
http://ptrckprry.com/course/ssd/data/negative-words.txt
Please read the notes at the top of those files before you use them.
Document and write a procedure, (posneg str), that takes a
string as an input and returns “positive” if the percent of
positive words is at least 5% higher than the percent of negative
words, “negative” if the percent of negative words is at least 5%
higher than the percent of positive words, and “neutral” the rest
of the time.
a. Write a procedure, (count-male-pronouns str), that takes a string as input and counts the number of male pronouns (“he”, “him”, “his”) in the string. Be careful not to count words like “history”.
;;; (count-male-pronouns str) -> integer?
;;; str : string
;;; Count how many times that male pronouns appear as
;;; individual words in str.
b. Write a procedure, (count-female-pronouns str), that takes a string as input and counts the number of female pronouns (she, her, hers) in the string. Once again, you should be careful not to count words like “heroic”.
;;; (count-female-pronouns str) -> integer?
;;; str : string
;;; Count how many times that female pronouns appear as
;;; individual words in str.
c. Using your two procedures, determine whether the book you chose is more likely to use male pronouns or female pronouns. Insert the results of your experiments as a comment in your code file. If you deem it appropriate, you might also explore the use of non-gendered pronouns.
Document and write two non-trivial procedures that does some other kind of textual analysis that you expect would be fun or useful.
Both procedures must make use of regular expressions in some way.
Please do not just write small variants of the procedures above.
In grading these assignment, we will look for the following for each level. We may also identify other characteristics that move your work between levels.
This rubric may be updated closer to the due date.
Submissions that lack any of these characteristics will get an I.
[ ] File named correctly (`language.rkt`).
[ ] File contains introductory comment with name, date, assignment,
course, citations.
[ ] File "runs" correctly in DrRacket.
[ ] Reformatted code with Ctrl-I before submitting.
[ ] If the program references other files, those files were submitted.
[ ] If the program references other files, it does so with the base file
name, rather than a complex path.
[ ] In part two, Dale-Chall scores are close to correct, but need not
be exactly correct.
[ ] In part three, cites the posititve-words and negative-words files.
Submissions that lack any of these characteristics will get an R.
[ ] Good style.
[ ] All procedures are documented and the documentation is (mostly)
correct.
[ ] Clear variable names.
[ ] In part two, avoids repeated computation in `compute-dale-chall-score`.
[ ] In part two, avoids reading files more than once.
[ ] Avoids other expensive repeated work.
[ ] Appropriately decomposes the problem in part three.
[ ] Particularly interesting procedures in part five.
[ ] In computing the Dale-Chall score, finds words with something more
clever than "break at spaces".
[ ] In computing the Dale-Chall score, finds sentences with something more
clever than "break at periods".
; Invented by SamR. Should be fairly readable.
(define simple-01
"The lamb says baa.")
; Stolen from Sandra Boynton's _Moo, Baa, La La La_. Transcribed by SamR (almost from memory).
(define boynton-01
"A cow says moo. A sheep says baa. Three singing pigs say la la la. No no, you say, that isn't right. The pigs say oink all day and night. Rhinoceroses snort and snuff and little dogs go ruff ruff ruff. Some other dogs go bow wow. And cats and kittens say meow. Quack says the duck. A horse says neigh. It's quiet now. What do you say?")