Mini-Project 5: Language generation

Assigned
Friday, 15 October 2021
Summary
In this assignment, you will write programs that generate (or attempt to generate) different forms of writing. Along the way, you will explore issues pertaining to randomness, conditional behavior, and textual analysis.
Collaboration
Each student should submit their own responses to this assignment. You may consult other students in the class as you develop your responses. You may also consult with the normal host of other folks: Mentors, tutors, SamR, Grinnell CS students, random strangers on the Web, and so on and so forth. If you receive help from anyone, make sure to cite them in your responses. You do not need to cite course pages you were instructed to read for this assignment.

Disclaimer: In this assignment, you will read and generate some utterly horrible “poetry” and other text. Please accept my apologies if any of the work included herein offends your expectations, cultural or otherwise.

Note: In doing this mini-project, you will likely want to refer back to the lab on random language generation.

Please name your file language.rkt.

Background

A syllabic lexicon, or syllax for short, is a collection of words or short phrases arranged by syllables. For the purposes of this class, a syllax is a list of lists, where the listat index i contains only the words with i syllables.

For example, here is a syllax for some words related to CSC-151.

(define csc151-syllax
  (list
   ; 0
   (list)
   ; 1
   (list "cons" "car" "list" "pair" "Scheme" "sort" "match" "string"
         "lab" "map" "fold" "test") 
   ; 2
   (list "cadr" "cdr" "Racket" "jelly" "sandwich" "syllax" 
         "image" "recurse" "eboard" "data" "compose" "lambda" "section"
         "SoLA" "MP")
   ; 3
   (list "recursion" "computer" "digital" "confusing" "programming" 
         "CSC" "abstraction" "decompose" "document" "abstraction"
         "boolean" "binary")
   ; 4
   (list "humanities" "exponential" "collaborate" "one-fifty-one"
         "algorithm" "DrRacket" "dictionary" "generalize" "defenestrate")
   ; 5
   (list "collaborative" "experiential" "decomposition" "generality"
         "defenestration")
   ; 6
   (list)
   ; 7
   (list "triskaidekaphobia")))

Here’s another syllax for some words related to Grinnell college.

(define grinnell-syllax
  (list
   ; 0
   (list)
   ; 1
   (list "Mears" "Noyce" "husk" "train" "corn" "black" "HSSC")
   ; 2
   (list "self-gov" "Stonewall" "The Bear" "first-year" "scarlet"
         "remote" "Webex" "prairie" "need-blind" "soybeans" "Hopkins"
         "Younker" "Dibble" "cluster" "scurry" "Gremlin")
   ; 3
   (list "liberal" "JRC" "CLS" "advisor" "FYE" "laurel leaf" "Honor G"
         "ARH" "North Campus" "Iowa" "semester" "Women's quad"
         "Grinnellian" "pie-on-ears")
   ; 4
   (list "curriculum" "Mary B. James" "Tutorial" "Green alien" 
         "convocation" "education")
   ; 5
   (list)
   ; 6
   (list "Congregationalist")))

Helpful procedures

In doing this assignment, you may find the following procedures of use. You need not cite them; since you’re doing this assignment, we’ll assume that you’re taking ideas from the assignment.

;;; (random-list-element lst) -> any?
;;;   lst : list? (nonempty)
;;; Randomly select an element of `lst`
(define random-list-element
  (lambda (lst)
    (list-ref lst (random (length lst)))))

;;; (list-of-strings? val) -> boolean?
;;;   val : any?
;;; Determines if val is a list of strings.
(define list-of-strings?
  (lambda (val)
    (and (list? val)
         (andmap string? val))))

Part the first: Generating Haiku

Haiku are three-line poems that consist of a line with five syllables, a line with seven syllables, and a line with five syllables.

a. Our principles of decomposition suggest that we should begin by writing a procedure or procedures to generate lines for the Haiku. Our principles of generality suggest that those procedures should accommodate not just five or seven syllables, but an arbitrary number of syllables.

Document, design, and implement a recursive procedure, (phrase n syllax) that randomly generates a string that contains a phrase of n syllables given a syllax of the form above.

Make sure that you accommodate different “patterns” of n syllables. For example, if n is 4, you should randomly choose between

  • a one-syllable word followed by a three-syllable phrase;
  • a two-syllable word followed by a two-syllable phrase;
  • a three-syllable word followed by a one-syllable phrase; and a one-syllable word)
  • a four-syllable word.

Note that there are some complications involved. For example, if n is bigger than the maximum number of syllables in the syllax, it cannot consist of one word.

Note: (random start finish) generates a random integer between start (inclusive) and finish (exclusive).

Note: You may assume that every syllax has at least one one-syllable word. (Without this guarantee, it can be much harder to simplify the problem.)

Hint: You will want to generate a random number between 1 (there are no zero-syllable words) and either n or the most number of syllables in the syllax, whichever is smaller.

Hint: You will want to employ recursion.

b. It is difficult to write tests for random procedures, so you will likely have to conduct experiments instead. Please include a record of your experiments and your analyses of the results of those experiments. (E.g., you should check that the number of syllables is correct.)

> (phrase 7 csc151-syllax)
"CSC binary Scheme"             ; 3 + 3 + 1 = 7
> (phrase 7 csc151-syllax)
"recursion tail recursion"      ; 3 + 1 + 3 = 7 (are we only getting 3/1?)
> (phrase 7 csc151-syllax)
"match recurse one-fifty-one"   ; 1 + 2 + 4 = 7
> (phrase 7 csc151-syllax)
"triskaidekaphobia"             ; 7 (it's good to know we can get all seven)
> (phrase 7 csc151-syllax)
"pair collaborative Scheme"     ; 1 + 5 + 1 = 7
> (phrase 7 csc151-syllax)
"confusing decompose pair"      ; 3 + 3 + 1 = 7

c. Document and write a procedure, (haiku syllax), that takes as input a syllax of the form above and returns a string of the appropriate form, with each line terminated by the strange value "\n", which represents a carriage return.

> (haiku csc151-syllax)
"boolean Scheme list\ntriskaidekaphobia\ncdr fold sandwich\n"
> (haiku csc151-syllax)
"boolean data\nbinary recursion lab\nDrRacket pair\n"
> (haiku csc151-syllax)
"experiential\ntriskaidekaphobia\nabstraction MP\n"
> (haiku csc151-syllax)
"list MP image\nsyllax test jelly SoLA\nvector match lambda\n"

> (display "boolean Scheme list\ntriskaidekaphobia\ncdr fold sandwich\n")
boolean Scheme list
triskaidekaphobia
cdr fold sandwich

> (display (haiku grinnell-syllax))
Mary B. James husk
education corn The Bear
remote Mears husk Noyce

> (display (string-append (haiku grinnell-syllax) "\n" (haiku grinnell-syllax)))
corn education
laurel leaf Webex scarlet
train advisor husk

JRC Noyce Noyce
Younker Tutorial train
corn Mary B. James

> (display (haiku csc151-syllax))
recurse binary
generalize binary
experiential

As you have observed, these are not particularly good examples of Haiku. But you might find that some generate something interesting enough that you’d want to adapt them. For example, I love the “cdr fold sandwich”; perhaps you do, too. And that last Haiku isn’t bad.

d. Create an additional syllax of your choice to use in addition to the syllaxen above. Then provide a few sample Haiku generated by your procedure and your syllax.

Part the second: Extracting words and other utilities

In generating some kinds of text it can be useful to have a large corpus of words. And, in many cases, we achieve “interesting” results by using the words of others. Let’s consider how we might make a list of all the different words that appear in a book.

a. Document, write tests for, and implement a procedure, (extract-words str), that extracts all of the words from a string. Make sure that your words can include both uppercase and lowercase letters and that they can include hyphens and apostrophes.

If you’ve already written or seen such a procedure, you can feel free to grab it from elsewhere, but make sure to include citations.

b. Document, write tests for, and implement a recursive procedure, (list-contains? lst val), that determines whether a list contains a particular value. You may not use filter or index or anything similar.

c. Document, write tests for, and implement a recursive procedure, (dedup lst), that takes a list and removes all the duplicates from the list. You may not use any existing remove procedures provided by DrRacket.

Your procedure should return the elements in the same order that they appear in lst.

Part the third: Slightly Angry Libs

As you may recall, Mad Libs are a type of game in which we work with a piece of text with blanks in which each blank is marked with a particular part of speech. You play Mad Libs by gathering those parts of speech, filling in the blanks, and reading the text aloud. One way computers help us with this game is that they can fill in the blanks randomly, as we saw in a recent lab.

But we can also do a bit more. Instead of using predefined Mad Libs, we can start with “normal” text, identify parts of speech in that text, grab an equivalent part of speech from a list, and replace them in the text.

For example, suppose we have the following text:

Look Frankie! See Jessie play! See Jessie play with Spot. Spot is a wonderful dog. See Jessie throw a ball with Spot. Frankie wants to play, too. Will Jessie play with Frankie?

First, we might select a new set of characters from amongst a larger set, and substitute in those new characters.

Look Prof. Smith! See Scholar Smythe play! See Scholar Smythe play with President Smitty. President Smitty is a wonderful dog. See Scholar Smythe throw a ball with President Smitty. Prof. Smith wants to play, too. Will Scholar Smythe play with Prof. Smith?

Next, we might replace the nouns with new nouns.

Look Prof. Smith! See Scholar Smythe play! See Scholar Smythe play with President Smitty. President Smitty is a wonderful window. See Scholar Smythe throw a table with President Smitty. Prof. Smith wants to play, too. Will Scholar Smythe play with Prof. Smith?

We might then replace some of the first-person, singular, intransitive verbs with other verbs in the same form. Our corpus is currently relatively limited: play, dance, compute, sleep, cough, and program.

Look Prof. Smith! See Scholar Smythe sleep! See Scholar Smythe dance with President Smitty. President Smitty is a wonderful window. See Scholar Smythe throw a table with President Smitty. Prof. Smith wants to program, too. Will Scholar Smythe cough with Prof. Smith?

Wow! Things are going downhill fast. But the potential for something interesting is there. I think.

a. Document and write a procedure, (replace-characters original-characters new-characters words), that takes three lists of strings as parameters and, for each string in the first list (original-characters) chooses one random element of the second list (new-characters) and then replaces each original character in the last list with the new character. As you might guess, this represents the first step in the example above.

b. Document and write a procedure, (replace-words category words), that takes two lists of strings as parameters and, each time one of the words in the first list appears in the second list, replaces the word with a randomly selected element from the first list. (The replacement might be the same word.) As you might guess, this represents the steps of replacing nouns and third-person, singular present-tense, intransitive verbs.

c. Identify a piece of public domain text of about one page. (Alternately, write our own.) Using that text, as well as a list of characters in the text, a list of new character names, and at least three lists of parts of speech you choose, write a procedure, (maddish-libs) that generates a random variant of the original text. Your code will look something like the following.

(define maddish-libs
  (let ([text (extract-words (file->string "my-file.txt"))]
        [original-characters '("Frankie" "Jessie" "Spot")]
        [new-characters '("Sam" "Prof. J"  "Scholar Smythe"
                          "Professor Smith" "President Smitty"
                          "Flornnng The Spotted Dinosaur")]
        [singular-nouns '("dog" "window" "cat" "ball" "table" "computer")]
        [third-person-singular-present-tense-intransitive-verbs
         '("play" "fly" "sleep" "cough" "compute" "program" "design")]
        [third-person-singular-present-tense-transitive-verbs
         '("throw" "defenestrate" "eat")])
    (lambda ()
      (let* ([new-text-1 (replace-characters original-characters
                                             new-characters
                                             text)]
             [new-text-2 (replace-words singular-nouns new-text-1)]
             [new-text-3 (replace-words third-person-singular-present-tense-intransitive-verbs new-text-2)]
             [new-text-4 (replace-words third-person-singular-present-tense-transitive-verbs new-text-3)])
        (reduce (lambda (s1 s2) (string-append s1 " " s2))
                new-text-4)))))

Make sure to include at least two sample runs in your submission, commented out with #| and |#.

Part the fourth: Identifying syllables

In generating some kinds of text, such as those in a previous problem, it is useful to have a large corpus of words in different categories. One set of categories are words with a certain number of syllables.

a. Document and write a procedure, (syllables word), that attempts to determine how many syllables are in the string word. You can assume that word consists of only lowercase letters.

How do you decide how many syllables are in a word? You’ll need to find an appropriate algorithm. (Many of you are creative enough to do so on your own, but if you’re struggling, the Internet is an awesome place.)

b. Include some non-trivial examples of when your procedure works well and some of when it fails to work correctly.

c. Make a copy of The Project Gutenberg version of Jane Eyre, available at http://www.gutenberg.org/files/1260/1260-0.txt. Please name it 1260-0.txt and place it in the same directory as your Racket program.

d. Using syllables, filter, and any other procedures you deem appropriate, generate a syllax that consists of the one-syllable, two-syllable, three-syllable, four-syllable, five-syllable, six-syllable, and seven-syllable words in the first thousand or so lines of Jane Eyre.

e. Use those lists to generate some Haiku. Include examples.

Part the fifth: Rhyming

What makes a poem? While there is no requirement that poetry rhyme, many people associate rhyme with poetry. It is also certainly the case that many forms of poetry, such as a quatrain make use of rhyme.

As we think about generating or analyzing text, it may be useful to to be able to identify rhymes. Of course, we appear to be working in the wonderfully inconsistent language known as English, so precise definition of rhymes are difficult.

a. One possible metric for rhyming is the end of the word. Document and implement a procedure, (might-rhyme? word1 word2), that takes two strings that represent words (e.g., all lowercase letters plus potential apostrophes) and returns true if the two words share the last three characters.

Note: Your procedure should return false if either or both of the words has fewer than three characters.

b. Identify at least six pairs of words that do not rhyme, but pass that test. You might, for example, pick some random words (or the words from Jane Eyre) and then use filter to look through a larger list of words to see which seem to rhyme. Include them as a comment in your submission.

c. Identify at least six pairs of words that do rhyme, but do not pass that test. Include them as a comment in your submission.

d. Describe something you could do to address each of these two cases. That is, how could you avoid the non-rhymes and how could you incorporate the rhymes? Include your analysis as a comment in your submission.

e. Using your additional analysis, write a better (rhymes? word1 word2) procedure. You are free to make this as simple or as complicated as you like, provided you incorporate your ideas from d.

f. Using rhymes?, write a procedure, (rhymes-with word words), that takes a string and a list of strings as input and finds all of the words in words that appear to rhyme with word. (You should, of course, document rhymes-with.)

g. Write a procedure (abab words) that takes as input a corpus (list) of words and returns a string that represents a “random” quatrain of four lines of four words each. The last words of the first and third lines must rhyme, as must the last words of the second and fourth lines.

h. Include a few examples of abab in your submission.

Postscript

Don’t worry if you can’t find the word “syllax” anywhere, except in strange games. We invented it for this assignment. But the plural of syllax is definitely “syllaxen”.

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it acheives its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.

Rubric

Necessary for an R

[ ] File named correctly (`language.rkt`).
[ ] File "runs" correctly in DrRacket.
[ ] Reformatted code with Ctrl-I before submitting. 
[ ] If the program references other files, those files were submitted.
[ ] If the program references other files, it does so with the base file name, rather than a complex path.
[ ] Introductory header comment with name, date, assignment, course, citations.

[ ] 1a. Includes examples.
[ ] 1c. `haiku` procedure is documented.

[ ] 2a. `extract-words` procedure is documented.
[ ] 2b. `dedup` procedure is documented.

[ ] 3a. `replace-characters` procedure is documented.
[ ] 3a. `replace-characters` procedure works for some nontrivial input.
[ ] 3b. `replace-words` procedure is documented.
[ ] 3b. `replace-words` procedure works for some nontrivial input.
[ ] 3c. `maddish-libs` procedure is documented.
[ ] 3c. `maddish-libs` procedure works for some nontrivial input.

Necessary for M

[ ] Correctly formatted/indented code.
[ ] Procedures that are supposed to return strings, like `haiku`, return strings rather than using `display`.
[ ] All procedures, including helpers, are documented.

[ ] 1a. `phrase` procedure passes all tests.
[ ] 1b. Includes a record of experiments, including counting of words.
[ ] 1c. `haiku` includes the "\n" at the end of the third line.
[ ] 1d. Includes another syllax.
[ ] 1d. Includes a few examples.

[ ] 2a. `extract-words` appears to meet the basic requirements.
[ ] 2a. Includes at least two tests for `extract-words` 
[ ] 2b. `list-contains` appears to meet the specifications.
[ ] 2b. Includes at least two tests for `list-contains`
[ ] 2c. `dedup` meets the specs.
[ ] 2c. Includes at least two tests for `dedup` 

[ ] 3a. `replace-characters` meets the specifications.
[ ] 3b. `replace-words` meets the specifications.
[ ] 3c. `maddish-libs` meets the specifications.
[ ] 3c. Includes at least two examples of `maddish-libs`.

[ ] 4a. The `syllables` procedure counts vowel sequences, not just vowels.
[ ] 4b. Includes examples of `syllables` with correct counts.
[ ] 4d. Generates a syllax for _Jane Eyre_ of the requested form.
[ ] 4e. Includes at least two Haiku examples from the _Jane Eyre_ syllax.

[ ] 5a. The `might-rhyme?` procedure accommodates words of fewer than three letters.
[ ] 5d. The narrative provides some suggestion of how to deal with the rhyme problems.
[ ] 5e. `rhymes?` does something different than `might-rhyme?`.
[ ] 5h. Includes examples of `abab`

Necessary for E

[ ] Avoids expensive repeated work.
[ ] Clear variable names.

[ ] 2a. `extract-words` goes beyond the basic requirements in at least one way.
[ ] 2a. Includes at least three edge-case tests for `extract-words` and at least five total tests.
[ ] 2b. Includes at least three edge-case tests for `list-contains` and at least five total tests.
[ ] 2c. Includes at least three edge-case tests for `dedup` and at least five total tests.

[ ] 3c. At least five different parts of speech in `maddish-libs`.

[ ] 4a. The `syllables` procedure has at least two extensions to "count vowel sequences".
[ ] 4b. Includes at least one example of syllables with an incorrect count.

[ ] 5e. The new `rhymes?` procedures addresses some of the non-rhyming pairs.
[ ] 5e. The new `rhymes?` procedures addresses some of the rhyming pairs.
[ ] 5e. The new `rhymes?` procedure explains the things it is checking (e.g., with examples)
[ ] 5g. The `abab` procedure ensures that words have rhymes. 
[ ] 5g. `abab` procedure avoids significant repetition.

[ ] One of `rhymes?` or `syllables` is particularly well designed.