Assignment 5: Language generation

Assigned
Wednesday, 20 February 2019
Due
Tuesday, 26 February 2019 by 10:30pm
Summary
In this assignment, you will write programs that generate (or attempt to generate) different forms of writing. Along the way, you will explore issues pertaining to randomness, conditional behavior, and textual analysis.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answers to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151-01] Assignment 5 (Your Names), and the body should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.

Problem 1: Generating simplified Haiku

Topics: randomness, strings, language generation

Haiku are three-line poems that consist of a line with five syllables, a line with seven syllables, and a line with five syllables.

a. Create the following lists of words, each of which contains at least five different words of the stated form.

  • one-syllable-words, a list of words with one syllable
  • two-syllable-words, a list of words with two syllables
  • three-syllable-words, a list of words with three syllables
  • four-syllable-words, a list of words with four syllables
  • five-syllable-words, a list of words with five syllables

b. Document and write a procedure, (two-syllable-group), that randomly generates a two-syllable group of words, using either two one-syllable words or a single two-syllable word. For example, if one-syllable-words contains '("ant" "ball" "car" "dog" "eat") and two-syllable-words contains '("aardvark" "baseball" "cocoon" "dragon" "exit"), we might see the following behavior.

> (two-syllable-group)
"car ant"
> (two-syllable-group)
"dragon"
> (two-syllable-group)
"ball eat"
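One way to approach this kind of random choice is sketched below. It is only an illustration under stated assumptions: the word lists hold strings, random-elt is a stand-in for the course's procedure of the same name, and the two forms are chosen with equal probability (the assignment does not require that).

```racket
#lang racket

;; Assumed sample word lists; strings, so that string-append applies directly.
(define one-syllable-words '("ant" "ball" "car" "dog" "eat"))
(define two-syllable-words '("aardvark" "baseball" "cocoon" "dragon" "exit"))

;; Stand-in for the course's random-elt: pick one element of a nonempty list.
(define random-elt
  (lambda (lst)
    (list-ref lst (random (length lst)))))

;; Sketch: flip a fair coin between the two permitted forms.
(define two-syllable-group
  (lambda ()
    (if (zero? (random 2))
        (string-append (random-elt one-syllable-words)
                       " "
                       (random-elt one-syllable-words))
        (random-elt two-syllable-words))))
```

The larger groups can follow the same shape, with (random n) selecting among n forms in a cond.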

c. Document and write a procedure, (three-syllable-group), that randomly generates a three-syllable group of words, using either (i) a one-syllable word followed by a two-syllable group, (ii) a two-syllable group followed by a one-syllable word, or (iii) a three-syllable word.

d. Document and write a procedure, (four-syllable-group), that randomly generates a four-syllable group of words, using either (i) a one-syllable word followed by a three-syllable group, (ii) a three-syllable group followed by a one-syllable word, (iii) two two-syllable groups, or (iv) a four-syllable word.

e. Document and write a procedure, (five-syllable-group), that randomly generates a five-syllable group of words, using either (i) a one-syllable word followed by a four-syllable group, (ii) a two-syllable group followed by a three-syllable group, (iii) a three-syllable group followed by a two-syllable group, (iv) a four-syllable group followed by a one-syllable word, or (v) a five-syllable word.

f. Document and write a procedure, (seven-syllable-group), that randomly generates a seven-syllable group of words using either (i) a two-syllable group followed by a five-syllable group or (ii) a five-syllable group followed by a two-syllable group.

g. Document and write a procedure, (haiku), that generates a Haiku of the appropriate form.

> (haiku)
"exit dog dragon\nbaseball dog television\nelephant eat car\n"
> (display (haiku))
Output! exceeding dog car
Output! ant dog eat car baseball ball
Output! exit ball eat car

h. As you explore your haiku procedure, you may discover that there seems to be a bias toward short words. Write a new procedure (perhaps with some additional helper procedures), (haiku2), that generates Haiku that are more likely to have longer words.

Problem 2: Extracting words

Topics: files, strings, regular expressions

In generating some kinds of text it can be useful to have a large corpus of words. And, in many cases, we achieve “interesting” results by using the words of others. Let’s consider how we might make a list of all the different words that appear in a book.

While you may have recently written a procedure that removes duplicates from a list, it’s possible that there were infelicities in that procedure. Here is a procedure that claims to remove duplicates from a sorted list. (This procedure is another in the category of “procedures for which you may understand the what but not the how”.)

;;; Procedure:
;;;   remove-duplicates
;;; Parameters:
;;;   lst, a sorted list of values
;;; Purpose:
;;;   Remove duplicates from lst.
;;; Produces:
;;;   unique, a sorted list of values
;;; Preconditions:
;;;   [No additional]
;;; Postconditions:
;;;   * Every element in unique appears in lst.
;;;   * Every element in lst is equal to some element in unique.
;;;   * unique is sorted in the same way as lst.
(define remove-duplicates
  (lambda (lst)
    (cond
      [(or (null? lst) (null? (cdr lst)))
       lst]
      [(equal? (car lst) (cadr lst))
       (remove-duplicates (cdr lst))]
      [else
       (cons (car lst) (remove-duplicates (cdr lst)))])))

Verify that the procedure appears to work as advertised. (There’s nothing to turn in for this part.)
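A few quick checks of the sort you might try are sketched here. The procedure is copied from above; the inputs are made-up examples. Note that in a `#lang racket` file this definition shadows Racket's built-in remove-duplicates.

```racket
#lang racket

;; Copied from the assignment text above.
(define remove-duplicates
  (lambda (lst)
    (cond
      [(or (null? lst) (null? (cdr lst)))
       lst]
      [(equal? (car lst) (cadr lst))
       (remove-duplicates (cdr lst))]
      [else
       (cons (car lst) (remove-duplicates (cdr lst)))])))

(remove-duplicates '(1 1 2 3 3 3 4))   ; => '(1 2 3 4)
(remove-duplicates '("a" "a" "b"))     ; => '("a" "b")
;; Only adjacent values are compared, so an unsorted list
;; can keep its duplicates:
(remove-duplicates '(1 2 1))           ; => '(1 2 1)
```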

Once you’ve verified that this procedure works, you’re ready for the real work.

Document and write a procedure, (unique-words file) that

  • reads file as a string [using file->string],
  • extracts all of the entries that “look like” words in that they consist only of letters and, optionally, an apostrophe (e.g., for “it’s” or “couldn’t”) [using regular expressions],
  • converts them all to lowercase [using string-downcase],
  • sorts the list [using sort], and
  • removes duplicates [using remove-duplicates].
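The middle steps can be tried on a plain string before involving a file. The sketch below assumes one possible pattern, a run of ASCII letters and straight apostrophes; words-in is a hypothetical helper name, and the pattern is deliberately loose (it also accepts a stray quotation mark such as the one in 'Where).

```racket
#lang racket

;; words-in (hypothetical): every maximal run of letters and apostrophes.
(define words-in
  (lambda (str)
    (regexp-match* #rx"[a-zA-Z']+" str)))

;; Extract, lowercase, and sort; duplicates are still present at this point.
(sort (map string-downcase
           (words-in "It's a cat. The cat couldn't sit!"))
      string<?)
;; => '("a" "cat" "cat" "couldn't" "it's" "sit" "the")
```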

Problem 3: Identifying syllables

Topics: strings, randomness, text analysis, conditionals

In generating some kinds of text, such as the Haiku in Problem 1, it is useful to have a large corpus of words in different categories. One useful way to categorize words is by the number of syllables they contain.

a. Document and write a procedure, (syllables word), that attempts to determine how many syllables are in the string word. You can assume that word consists of only lowercase letters.

How do you decide how many syllables are in a word? One technique that works in many cases is to identify how many sequences of vowels there are. In many instances, that provides a rough estimate. However, there are also many cases in which that estimate fails (potentially, it fails for “syllables”, although we could argue that the internal “y” serves as a vowel). So try to be creative in figuring out other special patterns. It is likely that you will need one or more conditionals in your procedure.
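The vowel-sequence estimate can be sketched in a line or two. This is only the rough starting point described above, not a finished syllables procedure; vowel-groups is a hypothetical name, and treating only a, e, i, o, and u as vowels is an assumption.

```racket
#lang racket

;; vowel-groups (hypothetical): count maximal runs of a, e, i, o, u.
(define vowel-groups
  (lambda (word)
    (length (regexp-match* #rx"[aeiou]+" word))))

(vowel-groups "dragon")      ; => 2, which is right
(vowel-groups "television")  ; => 4, which is right
(vowel-groups "syllables")   ; => 2, but the word has three syllables
(vowel-groups "cake")        ; => 2, but the silent e adds no syllable
```

The last two results show the kinds of special patterns that conditionals might handle.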

b. As you may recall, the file /home/rebelsky/Desktop/pg1260.txt contains the Project Gutenberg version of Jane Eyre. Using syllables, filter, and any other procedures you deem appropriate, generate lists of the one-syllable, two-syllable, three-syllable, four-syllable, and five-syllable words in Jane Eyre.

c. Use those lists to generate some interesting pattern of text, such as a Haiku.

Problem 4: Identifying rhymes

Topics: strings, text analysis, conditionals, randomness

What makes a poem? While there is no requirement that poetry rhyme, many people associate rhyme with poetry. It is also certainly the case that many forms of poetry, such as the quatrain, make use of rhyme.

As we think about generating or analyzing text, it may be useful to be able to identify rhymes. Of course, we appear to be working in the wonderfully inconsistent language known as English, so precise definitions of rhyme are difficult.

a. One possible metric for rhyming is the end of the word. Write a procedure, (might-rhyme? word1 word2), that takes two strings that represent words (e.g., all lowercase letters plus potential apostrophes) and returns true if the two words share the last three characters.

Note: Your procedure should work correctly if one or both of the words has fewer than three characters.
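The main hazard with short words is that substring signals an error when asked to start before the beginning of a string. One way to guard against that is sketched below; last-chars is a hypothetical helper, and how two short words should compare is a design decision left to you.

```racket
#lang racket

;; last-chars (hypothetical): the last n characters of word,
;; or all of word when it has fewer than n characters.
(define last-chars
  (lambda (word n)
    (substring word (max 0 (- (string-length word) n)))))

(last-chars "nation" 3)   ; => "ion"
(last-chars "at" 3)       ; => "at"
(equal? (last-chars "nation" 3) (last-chars "station" 3))   ; => #t
```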

b. Identify a dozen or so pairs of words that do not rhyme, but pass that test. You might, for example, pick some random words and then use filter to look through a larger list of words to see which seem to rhyme.

c. Identify a dozen or so pairs of words that do rhyme, but do not pass that test.

d. Using your additional analysis, write a better (rhymes? word1 word2) procedure. You are free to make this as simple or as complicated as you like, provided it is at least as successful as might-rhyme?. (You should, of course, document rhymes?.)

e. Using rhymes?, write a procedure, (rhymes-with word words), that finds all of the words in words that appear to rhyme with word. (You should, of course, document rhymes-with.)

f. Write a procedure (abab words) that takes as input a corpus of words and generates a “random” quatrain of four lines of four words. The last words of the first and third lines must rhyme, as must the last words of the second and fourth lines.

Problem 5: Identifying and using nearby words

Topics: strings, text analysis, regular expressions, conditionals, randomness, local bindings

As you’ve likely realized, generating actual language is hard, and writing programs that “interpret” language is often even harder. One of the legendary challenges of language generation has to do with the differences between two very similar statements.

Time flies like an arrow.

Fruit flies like an apple.

Can you tell why that pair is complex? If not, ask your faculty member or mentor.

In looking for ways to generate somewhat realistic text, one approach that has shown some promise relies on a relatively straightforward analysis of an existing text.

  • You start with some word that you know can start a sentence.
  • You randomly select from among the words that immediately follow that word in the original text.
  • You then randomly select from among the words that immediately follow the word you just chose in the original text.
  • And so on and so forth, until you reach the end of the sentence.

This approach sometimes works surprisingly well and sometimes works relatively poorly. We can often improve it by working with pairs or triplets of words. But for now, we’ll stick with single words.
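The forward procedure described above can be sketched over a list of words rather than a string. Everything here is illustrative: followers and forward-walk are hypothetical names, random-elt is a stand-in for the course's procedure, and the walk stops after a fixed number of steps instead of detecting the end of a sentence.

```racket
#lang racket

;; Stand-in for the course's random-elt: pick one element of a nonempty list.
(define random-elt
  (lambda (lst)
    (list-ref lst (random (length lst)))))

;; followers (hypothetical): all words that come right after word in words.
(define followers
  (lambda (word words)
    (cond
      [(or (null? words) (null? (cdr words)))
       '()]
      [(equal? word (car words))
       (cons (cadr words) (followers word (cdr words)))]
      [else
       (followers word (cdr words))])))

;; forward-walk (hypothetical): start at word and take at most n random
;; steps, stopping early if the current word has no followers.
(define forward-walk
  (lambda (word words n)
    (let ([choices (followers word words)])
      (if (or (zero? n) (null? choices))
          (list word)
          (cons word
                (forward-walk (random-elt choices) words (- n 1)))))))

(define sample
  '("the" "cat" "saw" "the" "rat" "and" "the" "rat" "saw" "the" "cat"))

(followers "the" sample)   ; => '("cat" "rat" "rat" "cat")
```

Because followers keeps repeated entries, a word that follows often in the text is proportionally more likely to be chosen.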

We’re also going to try a variant of this technique, in which we work from the back of a sentence to the front, rather than the front to the back.

a. Document and write a procedure, (sentence-ends str), that finds all of the words in str that end sentences. For example,

> (sentence-ends "The cat ate the hat.  The rat sat.")
'("hat" "sat")
> (sentence-ends "Do you like blue mac and cheese?  No I don't, it makes me sneeze!")
'("cheese" "sneeze")
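One pattern that reproduces these two examples uses a lookahead: match a run of letters and apostrophes only when the next character is a period, question mark, or exclamation point. The sketch below is an assumption about how “ends a sentence” might be approximated; end-words is a hypothetical name, and abbreviations or quoted sentences would confuse it.

```racket
#lang racket

;; end-words (hypothetical): words immediately followed by ., ?, or !
;; The lookahead (?=...) checks the punctuation without including it.
(define end-words
  (lambda (str)
    (regexp-match* #px"[a-zA-Z']+(?=[.?!])" str)))

(end-words "The cat ate the hat.  The rat sat.")
;; => '("hat" "sat")
(end-words "Do you like blue mac and cheese?  No I don't, it makes me sneeze!")
;; => '("cheese" "sneeze")
```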

b. Document and write a procedure, (left-neighbors word str), that finds all of the words that immediately precede word in str. For example,

> (left-neighbors "hat" "The cat sat on the hat.  'Where is my hat?' asked the rat.  It's now a flat hat.  How 'bout that?  Will the fat rat jump on that brat cat?")
'("the" "my" "flat")

With those two procedures, we should be able to generate things that appear to be similar sentences. Let’s see.

  • We randomly pick amongst the ending words. Those are “hat”, “rat”, “hat”, “that”, and “cat”, in this case. Let’s say we pick “hat”.
  • We identify the left neighbors of “hat”. Those are “the”, “my”, and “flat”. Let’s say we pick “the”.
  • We identify the left neighbors of “the”. Those are “on”, “asked”, and “Will”. Let’s say we pick “asked”.
  • We identify the left neighbors of “asked”. There’s only one: “hat”.
  • You know the left neighbors of “hat”. Let’s say we pick “my”.

That’s probably enough. We’ve now generated the phrase “my hat asked the hat”. While it’s not Shakespeare, it is potentially promising.

c. Document and write a procedure, (random-sentence words) that

  • identifies the ending words in words [using sentence-ends],
  • randomly selects one of those [using random-elt],
  • identifies the left neighbors of that word [using left-neighbors],
  • randomly selects one of those [using random-elt],
  • identifies the left neighbors of that word [using left-neighbors],
  • randomly selects one of those [using random-elt], and so on, until six words have been selected.

After selecting six words, you should then combine them together into a single sentence, using string-append.

d. It may be worth comparing this “backwards” approach to a more forwards approach. To get ready, document and write a procedure (right-neighbors word str) that finds all the words that immediately follow word in str. (We’re not going to have you do the rest of that experiment, but you might find right-neighbors useful elsewhere in this assignment.)

Problem 6: Generating poetic forms

Topics: text analysis, text generation, creativity

You’ve explored a variety of issues in analyzing and generating text. It’s now time to explore creative ways to use what you have learned.

Poets.org provides details on a wide variety of poetic forms, such as limericks.

Pick a non-trivial poetic form and write a program to generate (or approximate) poetry of that form.

Documentation

For this assignment, you should document your procedures using the 6P documentation style. For procedures that randomly generate outputs, you should specify as much as possible about the output and then add something like “the output is difficult to predict”.

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.