Mini-Project 6: Language generation

Summary
In this assignment, you will write programs that generate (or attempt to generate) different forms of writing. Along the way, you will explore issues pertaining to randomness, conditional behavior, and textual analysis.
Collaboration
Each student should submit their own responses to this assignment. You may consult other students in the class as you develop your responses. You may also consult with the normal host of other folks: Mentors, tutors, SamR, Grinnell CS students, random strangers on the Web, and so on and so forth. If you receive help from anyone, make sure to cite them in your responses. You do not need to cite course pages you were instructed to read for this assignment.

Disclaimer: In this assignment, you will read and generate some utterly horrible “poetry” and other text. Please accept my apologies if any of the work included herein offends your expectations, cultural or otherwise.

Note: In doing this mini-project, you will likely want to refer back to the lab on random language generation.

Please name your file language.rkt.

Background

A syllabic lexicon, or syllax for short, is a collection of words or short phrases arranged by syllables. For the purposes of this class, a syllax is a vector of vectors, where the vector at index i contains only the words with i syllables.

For example, here is a syllax for some words related to CSC-151.

(define csc151-syllax
  (vector 
   ; 0
   (vector)
   ; 1
   (vector "cons" "car" "list" "pair" "Scheme" "sort" "match" "string"
           "lab" "map" "fold" "test") 
   ; 2
   (vector "vector" "cadr" "cdr" "Racket" "jelly" "sandwich" "syllax" 
           "image" "recurse" "eboard" "data" "compose" "lambda" "section"
           "SoLA" "MP")
   ; 3
   (vector "recursion" "computer" "digital" "confusing" "programming" 
           "CSC" "abstraction" "decompose" "document" "abstraction"
           "boolean" "binary")
   ; 4
   (vector "humanities" "exponential" "collaborate" "one-fifty-one"
           "algorithm" "DrRacket" "dictionary" "generalize"
           "tail recursion")
   ; 5
   (vector "collaborative" "experiential" "decomposition" "generality")
   ; 6
   (vector)
   ; 7
   (vector "triskaidekaphobia")))

Here’s another syllax for some words related to Grinnell college.

(define grinnell-syllax
  (vector
   ; 0
   (vector)
   ; 1
   (vector "Mears" "Noyce" "husk" "train" "corn" "black")
   ; 2
   (vector "self-gov" "Stonewall" "The Bear" "first-year" "scarlet"
           "remote" "Webex" "prairie" "need-blind" "soybeans" "Hopkins"
           "Younker" "Dibble" "cluster" "scurry")
   ; 3
   (vector "liberal" "JRC" "CLS" "advisor" "FYE" "laurel leaf" "Honor G"
           "ARH" "North Campus" "Iowa" "semester" "Women's quad"
           "Grinnellian")
   ; 4
   (vector "curriculum" "Mary B. James" "Tutorial" "Green alien" "convocation" "education")
   ; 5
   (vector)
   ; 6
   (vector "Congregationalist")))

Helpful procedures

In doing this assignment, you may find the following procedures of use. You need not cite them; since you’re doing this assignment, we’ll assume that you’re taking ideas from the assignment. You may have to update your csc151 library to ensure that you have vector-andmap, which we just added recently.

;;; (random-vector-element vec) -> any?
;;;   vec : vector? (nonempty)
;;; Randomly select an element of `vec`
(define random-vector-element
  (lambda (vec)
    (vector-ref vec (random (vector-length vec)))))

;;; (strvec? val) -> boolean?
;;;   val : any?
;;; Determines if val is a vector of strings?
(define strvec?
  (lambda (val)
    (and (vector? val)
         (vector-andmap string? val))))

Part the first: Generating Haiku

Haiku are three-line poems that consist of a line with five syllables, a line with seven syllables, and a line with five syllables.

a. Our principles of decomposition suggest that we should begin by writing a procedure or procedures to generate lines for the Haiku. Our principles of generality suggest that those procedures should accommodate not just five or seven syllables, but an arbitrary number of syllables.

Document, design, and implement a recursive procedure, (phrase n syllax) that randomly generates a string that contains a phrase of n syllables given a syllax of the form above.

Make sure that you accommodate different “patterns” of n syllables. For example, if n is 4, you should randomly choose between

  • a one-syllable word followed by a three-syllable phrase;
  • a two-syllable word followed by a two-syllable phrase;
  • a three-syllable word followed by a one-syllable phrase; and a one-syllable word)
  • a four-syllable word.

Note that there are some complications involved. For example, if n is bigger than the maximum number of syllables in the syllax, it cannot consist of one word.

Note: (random start finish) generates a random integer between start (inclusive) and finish (exclusive).

Note: You may assume that every syllax has at least one one-syllable word. (Without this guarantee, it can be much harder to simplify the problem.)

Hint: You will want to generate a random number between 1 (there are no zero-syllable words) and either n or the most number of syllables in the syllax, whichever is smaller.

Hint: You will want to employ recursion.

b. It is difficult to write tests for random procedures, so you will likely have to conduct experiments instead. Please include a record of your experiments and your analyses of the results of those experiments. (E.g., you should check that the number of syllables is correct.)

> (phrase 7 csc151-syllax)
"CSC binary Scheme"             ; 3 + 3 + 1 = 7
> (phrase 7 csc151-syllax)
"recursion tail recursion"      ; 3 + 1 + 3 = 7 (are we only getting 3/1?)
> (phrase 7 csc151-syllax)
"match recurse one-fifty-one"   ; 1 + 2 + 4 = 7
> (phrase 7 csc151-syllax)
"triskaidekaphobia"             ; 7 (it's god to know we can get all seven)
> (phrase 7 csc151-syllax)
"pair collaborative Scheme"     ; 1 + 5 + 1 = 7
> (phrase 7 csc151-syllax)
"confusing decompose pair"      ; 3 + 3 + 1 = 7

c. Document and write a procedure, (haiku syllax), that takes as input a syllax of the form above and returns a string of the appropriate form, with each line terminated by the strange value "\n", which represents a carriage return.

> (haiku csc151-syllax)
"boolean Scheme list\ntriskaidekaphobia\ncdr fold sandwich\n"
> (haiku csc151-syllax)
"boolean data\nbinary recursion lab\nDrRacket pair\n"
> (haiku csc151-syllax)
"experiential\ntriskaidekaphobia\nabstraction MP\n"
> (haiku csc151-syllax)
"list MP image\nsyllax test jelly SoLA\nvector match lambda\n"

> (display "boolean Scheme list\ntriskaidekaphobia\ncdr fold sandwich\n")
boolean Scheme list
triskaidekaphobia
cdr fold sandwich

> (display (haiku grinnell-syllax))
Mary B. James husk
education corn The Bear
remote Mears husk Noyce

> (display (string-append (haiku grinnell-syllax) "\n" (haiku grinnell-syllax)))
corn education
laurel leaf Webex scarlet
train advisor husk

JRC Noyce Noyce
Younker Tutorial train
corn Mary B. James

> (display (haiku csc151-syllax))
recurse binary
generalize binary
experiential

As you have observed, these are not particularly good examples of Haiku. But you might find that some generate something interesting enough that you’d want to adapt them. For example, I love the “cdr fold sandwich”; perhaps you do, too. And that last Haiku isn’t bad.

d. Create an additional syllax of your choice to use in addition to the syllaxen above. Then provide a few sample Haiku generated by your procedure and your syllax.

Part the second: Extracting words

In generating some kinds of text it can be useful to have a large corpus of words. And, in many cases, we achieve “interesting” results by using the words of others. Let’s consider how we might make a list of all the different words that appear in a book.

a. Document, write tests for, and implement a procedure, (extract-words str), that extracts all of the words from a string. Make sure that your words can include both uppercase and lowercase letters and that they can include hyphens and apostrophes.

If you’ve already written or seen such a procedure, you can feel free to grab it from elsewhere, but make sure to include citations.

b. Document, write tests for, and implement a tail-recursive procedure, (dedup lst), that takes a list and removes all the duplicates from the list. You may not use any existing remove procedures provided by DrRacket.

Your procedure should return the elements in the same order that they appear in lst. Since you’re using tail recursion, you’ll probably need to reverse so-far at the end of the recursion.

Part the third: Identifying syllables

In generating some kinds of text, such as those in a previous problem, it is useful to have a large corpus of words in different categories. One set of categories are words with a certain number of syllables.

a. Document and write a procedure, (syllables word), that attempts to determine how many syllables are in the string word. You can assume that word consists of only lowercase letters.

How do you decide how many syllables are in a word? You’ll need to find an appropriate algorithm. (Many of you are creative enough to do so on your own, but if you’re struggling, the Internet is an awesome place.)

b. Include some non-trivial examples of when your procedure works well and some of when it fails to work correctly.

c. Make a copy of The Project Gutenberg version of Jane Eyre, available at http://www.gutenberg.org/files/1260/1260-0.txt. Please name it 1260-0.txt and place it in the same directory as your Racket program.

d. Using syllables, filter, and any other procedures you deem appropriate, generate a syllax for Jane Eyre that consists of the one-syllable, two-syllable, three-syllable, four-syllable, five-syllable, six-syllable, and seven-syllable words in Jane Eyre.

e. Use those lists to generate some Haiku. Include examples.

Part the fourth: Rhyming

What makes a poem? While there is no requirement that poetry rhyme, many people associate rhyme with poetry. It is also certainly the case that many forms of poetry, such as a quatrain make use of rhyme.

As we think about generating or analyzing text, it may be useful to to be able to identify rhymes. Of course, we appear to be working in the wonderfully inconsistent language known as English, so precise definition of rhymes are difficult.

a. One possible metric for rhyming is the end of the word. Document and implement a procedure, (might-rhyme? word1 word2), that takes two strings that represent words (e.g., all lowercase letters plus potential apostrophes) and returns true if the two words share the last three characters.

Note: Your procedure should return false if either or both of the words has fewer than three characters.

b. Identify at least six pairs of words that do not rhyme, but pass that test. You might, for example, pick some random words (or the words from _Jane Eyre__ and then use filter to look through a larger list of words to see which seem to rhyme. Include them as a comment in your submission.

c. Identify at least six pairs of words that do rhyme, but do not pass that test. Include them as a comment in your submission.

d. Describe something you could do to address each of these two cases. That is, how could you avoid the non-rhymes and how could you incorporate the rhymes? Include your analysis as a comment in your submission.

e. Using your additional analysis, write a better (rhymes? word1 word2) procedure. You are free to make this as simple or as complicated as you like, provided you incorporate your ideas from d.

f. Using rhymes?, write a procedure, (rhymes-with word words), that takes a string and a list of strings as input and finds all of the words in words that appear to rhyme with word. (You should, of course, document rhymes-with.)

g. Write a procedure (abab words) that takes as input a corpus (list) of words and returns a string that represents a “random” quatrain of four lines of four words each. The last words of the first and third lines must rhyme, as must the last words of the second and fourth lines.

h. Include a few examples of abab in your submission.

Part the fifth: Nearby words and sentence generation

As you’ve likely realized, generating actual language is hard, and writing programs that “interpret” language is often even harder. One of the legendary challenges of language generation has to do with the differences between two very similar statements.

Time flies like an arrow.

Fruit flies like an apple.

Can you tell why that pair is complex? If not, ask your faculty member or mentor.

In looking for ways to generate somewhat realistic text, one approach that has shown some promise relies on a relatively straightforward analysis of an existing text.

  • You start with some word that you know can start a sentence.
  • You randomly select from among the words that immediately follow that word in the original text.
  • You randomly select from among the words that immediately follow that word in the original text.
  • And so on and so forth, until you reach the end of the sentence.

This approach sometimes works surprising well and sometimes works relatively poorly. We can often improve it by working with pairs or triplets of words. But for now, we’ll stick with single words.

a. Document, write tests for, and implement a procedure, (sentence-ends str), that identifies all (or most of) the words that end the sentences in str. The words that end sentences typically come before periods, question marks, and exclamation points.

> (sentence-ends "The cat ate the hat.  The rat sat.")
'("hat" "sat")
> (sentence-ends "Do you like blue mac and cheese?  No I don't, it makes me sneeze!")
'("cheese" "sneeze")
> (sentence-ends ""The cat sat on the hat.  \"Where is my hat?\" asked the rat.  It's now a flat hat.  How 'bout that?  Will the fat rat jump on that brat cat who sat?")
'("hat" "rat" "that" "sat")

b. Document, write tests for, and implement a procedure, (sentence-starts str), that identifies all (or most of) the words that start sentences in str and returns them as a list (without duplicates). Typically, the start of a sentence is given by (a) the start of the string and (b) any word that comes after end punctuation (space, question mark, exclmation point) and a sequence of whitespace (space, newline, tab).

> (sentence-starts "The cat ate the hat.  The rat sat.")
'("The")
> (sentence-starts "Do you like blue mac and cheese?  No I don't, it makes me sneeze!")
'("Do" "No")
> (sentence-starts "The cat sat on the hat.  \"Where is my hat?\" asked the rat.  It's now a flat hat.  How 'bout that?  Will the fat rat jump on that brat cat who sat?")
'("The" "Where" "asked" "It's" "How" "Will") ; or maybe not "asked"

c. Document, write tests for, and implement a procedure, (right-neighbors word words), that finds all of the words that immediately follow word in words, which is a list of strings. For example,

> (define cat-thing (extract-words "The cat sat on the hat.  \"Where is my hat?\" asked the rat.  It's now a flat hat.  How 'bout that?  Will the fat rat jump on that brat cat who sat?"))
> (right-neighbors "hat" cat-thing)
'("Where" "asked" "how")
> (right-neighbors "The" cat-thing)
'("cat")
> (right-neighbors "the" cat-thing)
'("hat" "rat" "fat") 
> (right-neighbors "will" cat-thing)
'("the")
> (right-neighbors "computer" cat-thing)
'()

d. With those three procedures, we should be able to generate things that appear to be similar to sentences. Let’s see.

  • We pick among the starting words. Let’s say we pick “The”.
  • We look at all the right neighbors of “The”. Unfortunately, there is only one: “cat”. (There’s a reason we normally do this with longer texts.) We have generated “The cat”.
  • What words follow “cat”? “sat” and “who”. Let’s say we randomly chose “who”. We are now at “The cat who”. Exciting, isn’t it?
  • What words follow “who”? Only “sat”. We do need a bigger corpus. We’re now at “The cat who sat”.
  • “sat” ends a sentence, so we could stop now. Perhaps we should, even though it’s not really a sentence. But we’re going to go on. What follows “sat”? “on”. We’re now up to “The cat sat on”.
  • “on” can be followed by “the” or “that”. Let’s suppose we choose “the”. Our sentence-like object has grown to “The cat sat on the”.
  • “the” can be followed by “hat” or “fat” or “rat”. We’ll choose “rat”. “The cat sat on the rat.”
  • “rat” is an ending word. Let’s stop now.

Is that promising? I’m not sure. We definitely need a larger corpus. But we’re going to move ahead anyway.

Document and write a procedure, (random-sentence all-words start-words end-words) that takes the lists or vectors we generate from the procedures you’ve written already, and uses them to build a sentence according to an algorithm something like the following.

1. Start with one of the start words.
2. Call that the current word.
3. Repeatedly apply the following process.
    a. If the current word is one of the last words, stop with
       some probability (say 50%).
    b. Generate the list of following words.
    c. If that list or vector is empty, stop.
    d. Pick randomly from the list of following words.
    e. Add that to the sentence and call it the current word.

Part the sixth: Freestyle

There are two options for your freestyle.

a. You’ve explored a variety of issues in analyzing and generating text. It’s now time to explore creative ways to use what you have learned. Choose any combination of techniques you’ve learned (random sentence generation, random phrase generation with words by syllable, rhyming, Mad libs, etc.) and document and write a procedure that uses them to generate a kind of text of your choice.

Make sure to include some “successful” examples of your procedure.

b. Alternately, document and write a procedure that determines (or, more precisely, attempts to determine) how closely a piece of writing matches one of the central poetic forms. You’ll need to examine syllables per line, rhyming scheme, and such.

Make sure to include some examples.

Postscript

Don’t worry if you can’t find the word “syllax” anywhere, except in strange games. We invented it for this assignment. But the plural of syllax is definitely “syllaxen”.

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it acheives its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.

Partial rubric

This rubric is based on an old version of the assignment. The new rubric should appear in the somewhat near future.

These are some of the issues we will be looking at. If you do not address the one-star items, you will not achieve R. If you do not address the two-star items, you will not achieve M. If you do not address the three-star items, you will not achieve E.

General
-------

[ ] File named correctly (`language.rkt`). *
[ ] File "runs" correctly in DrRacket. *
[ ] Reformatted code with Ctrl-I before submitting. *
[ ] If the program references other files, those files were submitted. *
[ ] If the program references other files, it does so with the base file
    name, rather than a complex path. *

[ ] Introductory comment with name, date, assignment, course, citations. **
[ ] Correctly formatted/indented code. **
[ ] All procedures are documented and the documentation is (mostly)
    correct. **
[ ] Procedures that are supposed to return strings, like `haiku`,
    return strings rather than using `display`. **

[ ] Avoids expensive repeated work. ***
[ ] Clear variable names. ***

Part one
--------

[ ] Includes examples. **
[ ] Includes the "\n" at the end of the third line. **
[ ] Works correctly for different numbers of words in the subvectors in
    `words-by-syllable`. **

[ ] Works correctly for different numbers of subvectors in 
    `words-by-syllable`, provided the subvector at index `i` contains 
    only words of `i` syllables.  (E.g., if we add six-syllable words,
    it will, on occasion, include those.) ***

Part two
--------

[ ] Includes tests for the new `remove-neighboring-duplicates` (at least
    two normal cases and at least three edge cases) **
[ ] Includes tests for `unique-words` (at least two normal cases and at least
    three edge cases) **

Part three
----------

[ ] The `syllables` procedure counts vowel sequences, not just vowels.  **

[ ] The `syllables` procedure has at least two extensions to "count 
    vowel sequences". ***

Part four
---------

[ ] At least ten non-rhyming pairs whose last three letters match. *
[ ] At least ten rhyming pairs whose last three letters do not match. *

[ ] Accommodates words of fewer than three letters. **
[ ] `rhymes?` procedure is documented. **
[ ] Includes examples of `abab` **

[ ] New `rhymes?` procedures addresses some of the non-rhyming pairs. ***
[ ] New `rhymes?` procedures addresses some of the rhyming pairs. ***
[ ] New `rhymes?` procedure explains the things it is checking (e.g.,
    with examples) ***
[ ] `abab` procedure ensures that words have rhymes. ***
[ ] `abab` procedure avoids significant repetition. ***

Part five
---------

[ ] Includes appropriate tests of `sentence-ends` (at least two normal
    and at least three edge). **
[ ] Handles the situation in which a word lacks a predecessor. **

Part six
--------

[ ] Includes examples of procedure's results. **