Mini-Project 4 : More textual analysis

One of the many ways in which humanists now employ computers is in analyzing texts.
That is, they use programs to identify characteristics of texts as a starting point for deeper reflections on those texts. These initial analyses can be complex—such as identifying sets of words that regularly occur together, which raises issues not only of efficiency but how you segment sets—while others can be more straightforward, such as looking at the frequency of adverbs or adjectives in a text.

Submitting your work

Note: Please submit your work on this assignment as more-text-analysis.rkt.

In addition, please make sure to submit any text files you rely on with this file, and make sure that you read from files using only the direct file name, not with a full path.

Please provide proper citations for those text files. You will find that some of them even include citation guidelines at thes tart.

Part the first: Sentiment analysis

A common approach to text analysis is called “sentiment analysis”. Broadly, sentiment analysis is intended to determine the writer’s overall sentiment in a piece of writing. Are they happy? Sad? Angry? Enthusiastic? Rumor has it that Amazon uses sentiment analysis in selecting reviews to prioritize and for other things, too.

The most straightforward form of sentiment analysis involves looking at word frequencies. A document with more positive than negative words is likely to be interepreted as positive. A document with more negative than positive words is likely to be interpreted as negative.

Conveniently, there are two long lists of positive and negative words available to you.

http://ptrckprry.com/course/ssd/data/positive-words.txt

http://ptrckprry.com/course/ssd/data/negative-words.txt

Please read the notes at the top of those files before you use them.

a. Define lists of strings positive-words and negative-words by reading from those files. You should be able to get most of the work done using file->lines (or file-to-lines). Note, however, that each file begins with a series of lines that start with semicolons. You should remove those lines.

b. Document and write a procedure, (posneg str), that takes a string as an input and returns

"positive" if the percent of positive words (on a 0 to 1 scale) is at least 0.05 higher than the percent of negative words;
"negative" if the percent of negative words is at least 5% higher than the percent of positive words; and
"neutral" in all other cases.

Useful concepts: decomposition, filter, rex-matches?, anonymous procedures (with section or lambda), rex-find-matches, tally, length, conditionals, potentially procedures from parts one and two.

Part the second: Basic gender analysis

a. Write a procedure, (count-male-pronouns str), that takes a string as input and counts the number of male pronouns (“he”, “him”, “his”) in the string.

Be careful not to count words like “history”.

;;; (count-male-pronouns str) -> integer?
;;;   str : string
;;; Count how many times that male pronouns appear as
;;; individual words in str.

b. Write a procedure, (count-female-pronouns str), that takes a string as input and counts the number of female pronouns (she, her, hers) in the string. Once again, you should be careful not to count words like “heroic”.

;;; (count-female-pronouns str) -> integer?
;;;   str : string
;;; Count how many times that female pronouns appear as
;;; individual words in str.

c. Write a procedure, (count-gender-neutral-pronouns str), that takes a string as input and counts the number of gender-neutral pronouns (they, their, theirs or zi, zir, zirs or …) in the file.

;;; (count-gender-netural-pronouns str) -> integer?
;;;   str : string
;;; Count how many times that gender-neutral pronouns appear as
;;; individual words in str.

d. Choose a book from Project Gutenburg. Using your two procedures, determine whether the book you chose is more likely to use male pronouns, female pronouns, or gender neutral pronouns. Insert the results of your experiments as a comment in your code file. (You should copy and paste the expressions and results from the interactions pane. You might then add a few lines of commentary.)

Useful concepts: decomposition, tally, anonymous procedures (with section or lambda), rex-find-matches (and regular expressions), potentially procedures from parts one, two, and three.

Part the third: Wordle assistants

As you may have heard, the game Wordle has experienced a recent surge in popularity. Wordle bears some similarity to Mastermind. As you play, you learn letters that belong in a word. For some letters, you know both the letter and the position. For other letters, you know that the letter appears somewhere in the string.

As we saw in a recent lab, the file /usr/share/dict/words contains a long list of words in English. We tend to save it as words.txt (yes, that’s a link).

Write a procedure, (wordle-helper first-char last-char middle-char), the finds all the five-letter words in that file that start with first-char, end with last-char, and have middle-char somewhere in the middle. wordle-helper should return a list of strings.

We would recommend that you store the words file in a variable.

(define word-list (file->lines "words.txt"))

Useful concepts: decomposition, rex-match-strings (from the lab on regular expressions; make sure to cite), regular expressions, anonymous procedures.

Part the fourth: Freestyle

Document and write one non-trivial procedure that does some other kind of textual analysis that you expect would be fun or useful.

Do not write a small variant of any of the procedures above.

Make sure that your procedure makes direct use of regular expressions in some way.

The regular expression should involve at least three different rex-??? procedures.

Partial rubric

In grading these assignment, we will look for the following. We may also identify other characteristics that move your work between levels. All one-star items are required for a grade of R or above. All two-star items are required for a grade of M or above. All three-star items are required for a grade of E.

This rubric may be updated closer to the due date.

Unit tests

[ ] Passes all the one-star unit tests (*)
[ ] Passes all the two-star unit tests (**)
[ ] Passes all the three-star unit tests (***)

General

[ ] The Racket file is correctly named (`more-text-analysis.rkt`) (*)
[ ] The Racket file contains an introductory comment with name, date, 
    assignment, course, and citations (*)
[ ] All associated/referenced files are included (*)
[ ] All associated files are done with the file name, not with a full path (*)
[ ] All the associated files are cited (*)
[ ] The code has been reformatted with Ctrl-I before submitting (*)
[ ] Variables have clear names (**)
[ ] Generally good style, with few stylistic errors (**)
[ ] All procedures are documented and the documentation is (mostly)
    correct (**)

Part one: Sentiment analysis

[ ] Cites the positive-words and negative-words files (*)
[ ] Cites the positive-words and negative-words files according to the
    guidance in those files (**)
[ ] Removes the lines at the beginning of those files computationally (**)
[ ] Removes the lines at the beginning of those files using an appropriate 
    call to `filter` (rather than, say, `drop`) (***)

Part two: Gender analysis

[ ] Includes the required analysis (sample runs of the procedures) (**)
[ ] Simple and elegant design (e.g., using general procedures from the
    sentiment analysis part) (***)

Part three: Wordle

[ ] Appropriately elegant or concise regular expression (***)

Part four: Freestyle

[ ] Includes a procedure (*)
[ ] The procedure achieves something nontrivial (**)
[ ] The procedure uses a new regular expression with at least three of
    the rex procedures (**)
[ ] A particularly interesting procedure (***)

Questions and answers

Can I use recursion?

No. But you shold not need recursion.

Acknowledgements

The positive and negative words files are the work of Hu, Liu, and Cheng. See the files themselves and the following papers for additional notes.

Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.

Bing Liu, Minqing Hu and Junsheng Cheng. “Opinion Observer: Analyzing and Comparing Opinions on the Web.” Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

Copyright © Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Nicole Eikmeier, Fahmida Hamid, Priscilla Jiménez, Barbara Johnson, Titus Klinge, Peter-Michael Osera, Samuel A. Rebelsky, John David Stone, Henry Walker, and Jerod Weinman.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

This website was built using Jekyll, Twitter Bootstrap, and the Bootswatch Cosmo Theme.