Mini-Project 3: Textual analysis

One of the many ways in which humanists now employ computers is in analyzing texts.
That is, they use programs to identify characteristics of texts as a starting point for deeper reflections on those texts. Some of these initial analyses are complex, such as identifying sets of words that regularly occur together, which raises issues not only of efficiency but also of how you segment sets. Others are more straightforward, such as looking at the frequency of adverbs or adjectives in a text.

Submitting your work

Note: Please submit your work on this assignment as text-analysis.rkt.

In addition, please make sure to submit any text files you rely on with this file, and make sure that you read from files using only the direct file name, not with a full path.

Please provide proper citations for those text files. You will find that some of them even include citation guidelines at the start.

Part the first: Word counts

As you saw in a recent lab, the tools you already know permit you to count appearances of words in a longer text. For example, you may have written code to look for a particular set of characters in the books (no, not the components of strings). However, as a computer scientist, you know that you should write general code.

Document and write a procedure, (count-words list-of-words filename), that takes as input a list of words and the name of a file and produces as output a list of lists, each with a word and its frequency in that file. Note that you should do case-insensitive counting; “River”, “river”, “RIVER”, and even “riVer” should all match “River”.

> (count-words (list "Sam" "Amazing" "Evil" "Funny") "sams-course-reviews.txt")
'(("Sam" 132) ("Amazing" 2) ("Evil" 666) ("Funny" 0))
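As a small illustration of the case-insensitive matching described above, here is a minimal sketch built on Racket's string-ci=?. The helper name same-word? is hypothetical and is used only for this illustration; it is not part of the assignment or the csc151 package.

```racket
#lang racket

;;; (same-word? word1 word2) -> boolean?
;;;   word1 : string?
;;;   word2 : string?
;;; Determine whether two words match, ignoring case.
;;; (Hypothetical helper, for illustration only.)
(define same-word?
  (lambda (word1 word2)
    (string-ci=? word1 word2)))

(same-word? "River" "riVer")  ; these differ only in case
(same-word? "River" "Rover")  ; these differ in a letter
```

The first call should produce #t and the second #f, since string-ci=? ignores case but not spelling.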

We would recommend that you create some simple sample files so that you can run tests or experiments on them.

Note: You will find it best to decompose the problem. Hints on doing so are in the question and answer section at/near the end.

Expectation: You should read the file only once in a call to count-words, no matter how many words are in the list. You can find how many times you read a file by using file-to-chars, file-to-words, file-to-lines, and file-to-string in place of file->chars, file->words, file->lines, and file->string. (You may need to update your csc151 package to access those new procedures.)

Useful concepts: Anonymous procedures (probably with lambda or section); decomposition; tally; map; string-ci=?; file-to-words (or other file procedure).

Part the second: Readability

As you may recall from the previous assignment, there are a number of algorithms for computing the readability of prose. In that assignment, we used the Dale-Chall readability formula. Here’s an implementation of that formula.

;;; (dale-chall-computation num-difficult-words total-words num-sentences) -> real?
;;;   num-difficult-words : non-negative-integer?
;;;   total-words : positive-integer?
;;;   num-sentences : positive-integer?
;;; Compute the Dale-Chall score for a text with the given characteristics
;;; (number of difficult words, number of words, number of sentences).
(define dale-chall-computation
  (lambda (num-difficult-words total-words num-sentences)
    (dale-chall-formula (/ num-difficult-words total-words)
                        (/ total-words num-sentences))))

;;; (dale-chall-formula pdw asl) -> real?
;;;   pdw : real?
;;;   asl : real?
;;; Compute the Dale-Chall score for a text given the proportion
;;; of difficult words (a fraction between 0 and 1) and the average
;;; sentence length, using the Dale-Chall formula
;;;   0.1579 × (pdw × 100) + 0.0496 × asl
;;;   + 3.6365 if the percent of difficult words is more than 5%.
(define dale-chall-formula
  (lambda (pdw asl)
    (+ (* 0.1579 pdw 100)
       (* 0.0496 asl)
       (if (> pdw 0.05) 3.6365 0))))
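To make the arithmetic concrete, here is a worked sketch that calls the code above on two made-up sets of counts. The two procedure definitions are restated only so that the block runs on its own; the names example-score-high and example-score-low and the counts themselves are invented for illustration.

```racket
#lang racket

; Restated from the provided code so that this block is self-contained.
(define dale-chall-formula
  (lambda (pdw asl)
    (+ (* 0.1579 pdw 100)
       (* 0.0496 asl)
       (if (> pdw 0.05) 3.6365 0))))

(define dale-chall-computation
  (lambda (num-difficult-words total-words num-sentences)
    (dale-chall-formula (/ num-difficult-words total-words)
                        (/ total-words num-sentences))))

; Made-up text A: 2 difficult words, 20 words, 2 sentences.
;   pdw = 2/20 = 0.1 (that is, 10%), asl = 20/2 = 10
;   0.1579 * 10 + 0.0496 * 10 + 3.6365, or roughly 5.7115
(define example-score-high (dale-chall-computation 2 20 2))

; Made-up text B: 0 difficult words, 10 words, 2 sentences.
;   pdw = 0, asl = 10/2 = 5, so the 3.6365 adjustment does not apply
;   0.0496 * 5, or roughly 0.248
(define example-score-low (dale-chall-computation 0 10 2))
```

Because the computation uses floating-point numbers, the results may differ from the hand-computed values in the last few digits.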

a. To determine whether or not a word is easy, we’ll need a list of easy words. Here’s one.

http://countwordsworth.com/download/DaleChallEasyWordList.txt

Define easy-words as the words in that file.

Note that you may not be able to just use file->words or file->lines on that file because of the formatting. Hence, you may have to do additional work.

b. Write a procedure, (dale-chall-score str), that takes a string as input, computes the various aspects of the string, and calls dale-chall-computation to determine the Dale-Chall score. You can find some sample texts at the end of this assignment.

For this procedure, you can assume that every period represents a sentence break and that only periods represent sentence breaks. For this procedure, you can also assume that every space (or sequence of spaces) represents a word break and that only spaces (or sequences of spaces) represent word breaks.
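Under those two assumptions, the counting can be sketched with procedures from standard Racket. The names count-periods and split-on-whitespace are hypothetical, and this sketch uses string-split and count rather than anything from the csc151 package, so treat it as one possible starting point and not as the expected solution. Note also that string-split with no separator breaks on any whitespace (including tabs and newlines), which is slightly broader than "spaces only".

```racket
#lang racket

;;; (count-periods str) -> non-negative-integer?
;;;   str : string?
;;; Count the periods in str. (Hypothetical helper.)
(define count-periods
  (lambda (str)
    (count (lambda (ch) (char=? ch #\.))
           (string->list str))))

;;; (split-on-whitespace str) -> list-of string?
;;;   str : string?
;;; Break str into pieces at each run of whitespace.
;;; (Hypothetical helper.)
(define split-on-whitespace
  (lambda (str)
    (string-split str)))

(count-periods "The cat is in the hat.")
(split-on-whitespace "The cat is  in the hat.")
```

With this simple split, the last piece is "hat." with the period still attached, which is why the simple approach can treat an easy word as difficult.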

c. Write a procedure, (dale-chall-score-improved str), that takes a string as input, computes the various aspects of the string, and calls dale-chall-computation to determine the Dale-Chall score.

For this procedure, you should use more sophisticated mechanisms for determining sentence breaks and for extracting words. For example, some periods might not represent the end of a sentence because, say, they are used with an initial. And some sentences might end with characters other than periods, such as exclamation points or question marks. Words generally don’t include quotation marks or commas or such. On the other hand, they may include some non-letter values, such as hyphens.
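As a rough sketch of the kind of pattern matching involved, the following uses Racket's built-in regexp-match* rather than the rex procedures from the csc151 package. The names sentence-enders and extract-words and both patterns are assumptions made for illustration. They deliberately do not handle initials or titles, and they recognize only the ASCII apostrophe.

```racket
#lang racket

;;; (sentence-enders str) -> list-of string?
;;;   str : string?
;;; Find each run of periods, exclamation points, or question
;;; marks in str. (Hypothetical helper.)
(define sentence-enders
  (lambda (str)
    (regexp-match* #rx"[.!?]+" str)))

;;; (extract-words str) -> list-of string?
;;;   str : string?
;;; Find each run of letters, apostrophes, and hyphens that
;;; starts with a letter. (Hypothetical helper.)
(define extract-words
  (lambda (str)
    (regexp-match* #rx"[A-Za-z][A-Za-z'-]*" str)))

(sentence-enders "Wait! Is that a well-known fact? Yes.")
(extract-words "Wait! Is that a well-known fact? Yes.")
(sentence-enders "Dr. Smith left.")
```

The last call finds two sentence enders in a one-sentence string, which shows why abbreviations need separate treatment.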

Useful concepts: decomposition, rex-find-matches, tally, length, string-replace, string-downcase, potentially procedures from part one.

Partial rubric

In grading this assignment, we will look for the following. We may also identify other characteristics that move your work between levels. All one-star items are required for a grade of R or above. All two-star items are required for a grade of M or above. All three-star items are required for a grade of E.

This rubric may be updated closer to the due date.

Unit tests

[ ] Passes all the one-star unit tests (*)
[ ] Passes all the two-star unit tests (**)
[ ] Passes all the three-star unit tests (***)

General

[ ] The Racket file is correctly named (`text-analysis.rkt`) (*)
[ ] The Racket file contains an introductory comment with name, date, 
    assignment, course, and citations (*)
[ ] All associated/referenced files are included (*)
[ ] All associated files are referenced by file name, not with a full path (*)
[ ] All the associated files are cited (*)
[ ] The code has been reformatted with Ctrl-I before submitting (*)
[ ] Variables have clear names (**)
[ ] Generally good style, with few stylistic errors (**)
[ ] All procedures are documented and the documentation is (mostly)
    correct (**)
[ ] Efficient code (e.g., does not create unnecessary lists) (***)

Part one

[ ] Appropriately decomposes the problem (**)
[ ] Avoids `let` and local defines (**)
[ ] Avoids reading files more than once (***)
[ ] Does not use `make-list` to make a list of identical values (***)

Part two

[ ] Cites the file of Dale-Chall easy words (*)
[ ] Appropriately decomposes the problem (**)
[ ] Does not change the provided Dale-Chall code (**)
[ ] Has code to handle titles, such as Ms. and Dr. (***)
[ ] Uses nicely concise/decomposed regular expressions (***)

Questions and answers

Do you have hints on part one?

Make sure to decompose.

One natural procedure to write will be one that returns a word and its count. You could then map that over the list of words.

If I have trouble on part one, would it make sense to try the other parts and come back to it?

Yup.

How would you suggest I decompose part one?

Write a procedure, (count-word word list-of-strings), that counts how many times the given word appears in the list of strings and returns a two-element list consisting of the word and the count. If you count with tally, you may need an anonymous procedure.

 > (count-word "one" (list "a" "one" "and" "a" "two" "and" "a" "three"))
 '("one" 1)
 > (count-word "and" (list "a" "one" "and" "a" "two" "and" "a" "three"))
 '("and" 2)
 > (count-word "a" (list "a" "one" "and" "a" "two" "and" "a" "three"))
 '("a" 3)
 > (count-word "four" (list "a" "one" "and" "a" "two" "and" "a" "three"))
 '("four" 0)

Write a procedure, (count-words-in-list list-of-words list-of-strings), that takes each word in list-of-words and determines how many times it appears in list-of-strings, returning a list of two-element lists of the form returned by count-word. You will likely need map and a sectioned version of count-word to write this procedure. Since you can use map, we would prefer that you not use recursion in writing count-words-in-list.
```
 > (count-words-in-list (list "one" "a" "and" "five")
                        (list "a" "one" "and" "a" "two" "and" "a" "three"))
 '(("one" 1) ("a" 3) ("and" 2) ("five" 0))
```
You should now be able to write (count-words list-of-words filename). Its body will likely consist of a call to count-words-in-list.
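The idea of mapping a two-parameter procedure over a list while holding one argument fixed can be illustrated with a toy example. The procedure label-with and the value labeled are hypothetical and unrelated to the assignment, and the sketch uses a plain lambda where you might instead use section.

```racket
#lang racket

;;; (label-with prefix str) -> string?
;;;   prefix : string?
;;;   str : string?
;;; Join prefix and str with a colon and a space.
;;; (Hypothetical toy procedure.)
(define label-with
  (lambda (prefix str)
    (string-append prefix ": " str)))

; Hold the second argument fixed at "river" and map over
; a list of first arguments.
(define labeled
  (map (lambda (prefix) (label-with prefix "river"))
       (list "noun" "verb")))
```

Here labeled is a list with one result for each element of the original list, in the same order.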

I’d like to use list-ref in part one. Do you think that’s a good idea?

No! If you use map, tally, and such correctly, you should not need to reference individual elements.

Can I use recursion?

No. Our goal is that you get used to using map, tally, and such.

Should I let you know when I hit more than four hours?

Yes, I’d appreciate it.

Sample definitions for Dale-Chall experiments.

; Invented by SamR.  Should be fairly readable.
(define simple-01
  "The lamb says baa.")

; Invented by SamR.  All simple words, but the simple word split will
; consider "hat." to be hard.
(define simple-02
  "The cat is in the hat.")

; Invented by SamR.  Should be the same as the prior one, but with
; mixed case.  Also creates a potential problem for the more advanced
; sentence count, since the period in "HAT." follows an uppercase
; letter but still marks the end of a sentence.
(define simple-03
  "ThE cAt IS iN the HAT.")

; Invented by SamR.  Edge case.  All hard words!
(define blah
  "Blah blah blah blah blah. Blah blah blah.")

; Stolen from Sandra Boynton's _Moo, Baa, La La La_.  Transcribed by SamR (almost from memory).
(define boynton-01
  "A cow says moo.  A sheep says baa.  Three singing pigs say la la la.  No no, you say, that isn't right.  The pigs say oink all day and night.  Rhinoceroses snort and snuff and little dogs go ruff ruff ruff.  Some other dogs go bow wow.  And cats and kittens say meow.  Quack says the duck.  A horse says neigh.  It's quiet now.  What do you say?")

Copyright © Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Nicole Eikmeier, Fahmida Hamid, Priscilla Jiménez, Barbara Johnson, Titus Klinge, Peter-Michael Osera, Samuel A. Rebelsky, John David Stone, Henry Walker, and Jerod Weinman.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
