Mini-Project 5: Language generation

Assigned
Wednesday, 2 December 2020
Summary
In this assignment, you will write programs that generate (or attempt to generate) different forms of writing. Along the way, you will explore issues pertaining to randomness, conditional behavior, and textual analysis.
Collaboration
Each student should submit their own responses to this assignment. You may consult other students in the class as you develop your responses. You may also consult with the normal host of other folks: Mentors, tutors, SamR, Grinnell CS students, random strangers on the Web, and so on and so forth. If you receive help from anyone, make sure to cite them in your responses. You do not need to cite course pages you were instructed to read for this assignment.

Disclaimer: In this assignment, you will read and generate some utterly horrible “poetry”. Please accept my apologies if any of the work included herein offends your expectations, cultural or otherwise.

Note: In doing this mini-project, you will likely want to refer back to the lab on random language generation.

Please name your file language.rkt.

Part the first: Generating Haiku

Haiku are three-line poems that consist of a line with five syllables, a line with seven syllables, and a line with five syllables.

a. Create the following lists of words (strings), each of which contains at least five different words of the stated form.

  • one-syllable-words, a list of words with one syllable
  • two-syllable-words, a list of words with two syllables
  • three-syllable-words, a list of words with three syllables
  • four-syllable-words, a list of words with four syllables
  • five-syllable-words, a list of words with five syllables

Combine those into a list or vector in which the value at position i is a list of words with i syllables. Your code will look something like

(define words-by-syllable
  (list '() 
        one-syllable-words 
        two-syllable-words 
        three-syllable-words 
        four-syllable-words 
        five-syllable-words))

b. Document, design, and implement a recursive procedure, (phrase n words) that randomly generates a string containing a phrase of n syllables given a list or vector (your choice) of the form above. Make sure that you accommodate different “patterns” of n syllables. For example, if n is 4, you might use

  • a one-syllable phrase (which is just a one-syllable word) and a three-syllable phrase
  • a two-syllable phrase and a two-syllable phrase
  • a three-syllable phrase and a one-syllable phrase (which is just a one-syllable word)
  • a four-syllable word

It is difficult to write tests for random procedures, so you will likely have to conduct experiments instead. Please include a record of your experiments.

c. Document and write a procedure, (haiku), that generates a Haiku of the appropriate form, with each line terminated by the strange value "\n".

> (haiku)
"exit dog dragon\nbaseball dog television\nelephant eat car\n"
> (display (haiku))
exceeding dog car
ant dog eat car baseball ball
exit ball eat car

d. As you explore your haiku procedure, you may discover that there seems to be a bias toward short words. Write a new procedure (perhaps with some additional helper procedures), (haiku2), that generates Haiku that are more likely to have longer words.

Part the second: Extracting words

In generating some kinds of text it can be useful to have a large corpus of words. And, in many cases, we achieve “interesting” results by using the words of others. Let’s consider how we might make a list of all the different words that appear in a book.

While you may have recently written a procedure that removes duplicates from a list, it’s possible that there were infelicities in that procedure. Here is a procedure that claims to remove duplicates from a sorted list.

;;; (remove-duplicates lst) -> list?
;;;   lst : list?
;;; Remove duplicates from lst.
;;; Precondition: lst must be sorted, with duplicates next to each other.
(define remove-duplicates-x
  (lambda (lst)
    (cond
      [(or (null? lst) (null? (cdr lst)))
       lst]
      [(equal? (car lst) (cadr lst))
       (remove-duplicates-x (cdr lst))]
      [else
       (cons (car lst) (remove-duplicates-x (cdr lst)))])))

Verify that the procedure appears to work as advertised. (There’s nothing to turn in for this part.)

Once you’ve verified that this procedure works, you’re ready for the real work.

a. Write a tail-recursive version of remove-duplicates-x, which you should call remove-duplicates.

b. Document, write tests for, design, and implement a procedure, (unique-words string) that

  • extracts all of the entries from the string that “look like” words in that they consist only of letters and, optionally, an apostrophe (e.g,. for “it’s” or “couldn’t”) [you can use something simple, like string-split, or you can use something fancier]
  • converts them all to lowercase [using string-downcase],
  • sorts the list [using sort], and
  • removes duplicates [using your new version of remove-duplicates].

Part the third: Identifying syllables

In generating some kinds of text, such as those in a previous problem, it is useful to have a large corpus of words in different categories. One set of categories are words with a certain number of syllables.

a. Document and write a procedure, (syllables word), that attempts to determine how many syllables are in the string word. You can assume that word consists of only lowercase letters.

How do you decide how many syllables are in a word? One technique that works in many cases is to identify how many sequences of vowels there are. In many instances, that strategy provides a good rough estimate. However, there are also many cases in which that estimate fails (potentially, it fails for “syllables”, although we could argue that the internal “y” serves as a vowel). So try to be creative in figuring out other special patterns. It is likely that you will need one or more conditionals in your procedure.

It is fine if your procedure does not work perfectly, or even all that well. We’d simply like to see some thought and creativity.

b. Make a copy of The Project Gutenberg version of Jane Eyre, available at http://www.gutenberg.org/files/1260/1260-0.txt. Please name it 1260.txt and place it in the same directory as your Racket program.

c. Using syllables, filter, and any other procedures you deem appropriate, generate lists of the one-syllable, two-syllable, three-syllable, four-syllable, and five-syllable words in Jane Eyre.

d. Use those lists to generate some interesting pattern of text, such as a Haiku.

Part the fourth: Rhyming

What makes a poem? While there is no requirement that poetry rhyme, many people associate rhyme with poetry. It is also certainly the case that many forms of poetry, such as a quatrain make use of rhyme.

As we think about generating or analyzing text, it may be useful to to be able to identify rhymes. Of course, we appear to be working in the wonderfully inconsistent language known as English, so precise definition of rhymes are difficult.

a. One possible metric for rhyming is the end of the word. Write a procedure, (might-rhyme? word1 word2), that takes two strings that represent words (e.g., all lowercase letters plus potential apostrophes) and returns true if the two words share the last three characters.

Note: Your procedure should work correctly if one or both of the words has fewer than three characters.

b. Identify a dozen or so pairs of words that do not rhyme, but pass that test. You might, for example, pick some random words and then use filter to look through a larger list of words to see which seem to rhyme.

c. Identify a dozen or so pairs of words that do rhyme, but do not pass that test.

d. Using your additional analysis, write a better (rhymes? word1 word2) procedure. You are free to make this as simple or as complicated as you like, provided it is at least as successful as might-rhyme?. (You should, of course, document rhymes?.)

e. Using rhymes?, write a procedure, (rhymes-with word words), that takes a string and a list of strings as input and finds all of the words in words that appear to rhyme with word. (You should, of course, document rhymes-with.)

f. Write a procedure (abab words) that takes as input a corpus (list) of words and generates a “random” quatrain of four lines of four words. The last words of the first and third lines must rhyme, as must the last words of the second and fourth lines.

Part the fifth: Nearby words and sentence generation

As you’ve likely realized, generating actual language is hard, and writing programs that “interpret” language is often even harder. One of the legendary challenges of language generation has to do with the differences between two very similar statements.

Time flies like an arrow.

Fruit flies like an apple.

Can you tell why that pair is complex? If not, ask your faculty member or mentor.

In looking for ways to generate somewhat realistic text, one approach that has shown some promise relies on a relatively straightforward analysis of an existing text.

  • You start with some word that you know can start a sentence.
  • You randomly select from among the words that immediately follow that word in the original text.
  • You randomly select from among the words that immediately follow that word in the original text.
  • And so on and so forth, until you reach the end of the sentence.

This approach sometimes works surprising well and sometimes works relatively poorly. We can often improve it by working with pairs or triplets of words. But for now, we’ll stick with single words.

We’re also going to try a variant of this technique, in which we work from the back of a sentence to the front, rather than the front to the back.

a. Document and write a procedure, (sentence-ends str), that finds all of the words in str that end sentences. For example,

> (sentence-ends "The cat ate the hat.  The rat sat.")
'("hat" "sat")
> (sentence-ends "Do you like blue mac and cheese?  No I don't, it makes me sneeze!")
'("cheese" "sneeze")

b. Document and write a procedure, (left-neighbors word str), that finds all of the words that immediately precede word in str. For example,

> (left-neighbors "hat" "The cat sat on the hat.  'Where is my hat?' asked the rat.  It's now a flat hat.  How 'bout that?  Will the fat rat jump on that brat cat?")
'("the" "my" "flat")

With those two procedures, we should be able to generate things that appear to be similar sentences. Let’s see.

  • We randomly pick amongst the ending words. Those are “hat”, “rat”, “hat”, “that”, and “cat”, in this case. Let’s say we pick “hat”.
  • We identify the left neighbors of “hat”. Those are “the”, “my”, and “flat”. Let’s say we pick “the”.
  • We identify the left neighbors of “the”. Those are “on”, “asked”, and “Will”. Let’s say we pick “asked”.
  • We identify the left neighbors of “asked”. There’s only one, it’s “hat”.
  • You know the left neighbors of “hat”. Let’s say we pick “my”.

That’s probably enough. We’ve now generated the phrase “my hat asked the hat”. While it’s not Shakespeare, it is potentially promising.

c. Document and write a procedure, (random-sentence words) that

  • identifies that ending words in words [using sentence-ends],
  • randomly selects one of those [using random-elt],
  • identifies the left neighbors of that word [using left-neighbors],
  • randomly selects one of those [using random-elt] (if there are no left neighbors, uses a random word),
  • identifies the left neighbors of that word [using left-neighbors],
  • randomly selects one of those [using random-elt] (if there are no left neighbors, uses a random word),

After selecting six words, you should then combine them together into a single sentence, using string-append.

d. It may be worth comparing this “backwards” approach to a more forwards approach. To get ready, document and write a procedure (right-neighbors word str) that finds all the words that immediately follow word in str. (We’re not going to have you do the rest of that experiment, but you might find right-neighbors useful elsewhere in this assigment.)

Part the sixth: Freestyle poetry

You’ve explored a variety of issues in analyzing and generating text. It’s now time to explore creative ways to use what you have learned.

Poets.org provides details on a wide variety of poetic forms, such as limericks. (You can see the different poetic forms in hte list of poems.

Pick a non-trivial poetic form and write a program to generate (or approximate) poetry of that form. You may choose to work from fixed sets of words or words you draw from another source.

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it acheives its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.