
Assignment 8: Topic modeling

Assigned
Wednesday, 10 April 2019
Due
Tuesday, 16 April 2019 by 10:30pm
Summary
For this assignment, you will combine the various things you’ve learned in the class to build a simple topic modeling application.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answers to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151 01] Assignment 8 and should contain your answers to all parts of the assignment. Please send your scheme code in an attached .rkt file.
Note
For this assignment, you need only provide the four P’s for each of the procedures you write.

Note: We will be using paragraphs from *Alice in Wonderland* for some of the examples. See the end of the assignment for how we obtained those paragraphs. Depending on how you design some of your algorithms, you may get somewhat different results for the examples.

Introduction

Topic modeling is a core technique in text-based digital humanities. The basic goal of topic modeling is to take a collection of multiple texts (potentially small texts, potentially large texts) and find groups of words that tend to appear together. Each group of words is a potential “topic” within the texts. For example, consider the following sentences, each of which we will consider one text.

  1. Grinnell College has a number of initiatives related to the digital humanities.

  2. This semester, CSC151, Grinnell’s introductory computer science course is using the digital humanities as a source of problems.

  3. Like most computer science courses, CSC151, Functional Problem Solving, is a workshop-style course that builds problem-solving skills.

  4. Grinnell’s other initiatives in the digital humanities include a forthcoming concentration, a variety of experimental courses, and Project Vivero.

Two of the sentences (1 and 4) are primarily about Grinnell’s work in the digital humanities. Words like “Grinnell”, “digital”, “humanities”, and “initiatives” appear in those sentences. Two of the sentences (2 and 3) are primarily about computer science (and, perhaps, problem solving). The words “computer”, “science”, “problem”, and “CSC151” all appear in those two sentences. (“Grinnell”, “digital”, and “humanities” also appear in sentence 2, so they might end up grouped there; we might even say that that sentence is partially about the digital humanities and partially about computer science.)

While humans can often identify such word clusters (perhaps better than computers, for small sets of texts), the DH community has seen good success in using computer algorithms to identify such clusters. Often, the computer algorithm discovers “surprising” connections. Note that the clusters, themselves, are not the final product. Rather, each cluster of words provides a potential topic for the digital humanist to explore with close reading of the texts. For example, if topic modeling suggests that “heart” and “death” tend to group together in an author’s works, one might consider exploring why.

Approaches to topic modeling

There are a variety of approaches to topic modeling. Most start with a collection of texts. The texts are often relatively small: Tweets, paragraphs from a long text, perhaps even sentences.

We then preprocess the texts to identify the important words. One might remove common words (e.g., “and”, “the”, “an”), convert plurals to singular, and even identify the common form of similar words (e.g., to reduce “argues”, “argue”, “arguing”, and “argued” to the same form). Some algorithms also remove short words. We might also remove particularly uncommon words, words that occur only a few times across the corpus of texts.

We also gather statistics, such as the number of times each word appears in each text, as well as the number of times each word appears across the whole corpus.

Many algorithms, including those that we will use on this assignment, then randomly assign words to topics (sets of words) and repeatedly improve those assignments until they reach an acceptable state. How you improve those assignments and how you determine an acceptable state vary from algorithm to algorithm. (How you prove that an algorithm improves the state is complex, and not something we will cover.)

Finally, we present the results to the user.

A simplified topic modeling algorithm

You shall know a word by the company it keeps. – J.R. Firth

How do we “improve” our assignments of words to topics? Doing so requires that we think a bit about the state of the system. Suppose we have four texts (for convenience, we’ll call them “Text 1”, “Text 2”, “Text 3”, and “Text 4”) and three topics (which we’ll call “Topic A”, “Topic B”, and “Topic C”). We can characterize a text by the percentage of its words that fall in each topic. For example, suppose that Text 1 is “Henry the heroic frog heard that a ferocious dragon visited a happy and handsome frog”, that Topic A contains “handsome”, “happy”, “heard”, “henry”, and “heroic”, that Topic B contains “ferocious” and “visited”, and that Topic C contains “dragon” and “frog”. There are ten non-trivial words in that sentence. (We’ll ignore “a”, “and”, “that”, “the”, and similar words.) 50% of the words in the text are in Topic A, 20% are in Topic B, and 30% are in Topic C (one “dragon” and two “frogs”). We can do a similar analysis for each text. Once we do so, we might conceive of the results of that analysis as a table.

          Text 1    Text 2    Text 3    Text 4
Topic A   50%       10%       30%       10%
Topic B   20%       70%       35%       10%
Topic C   30%       20%       35%       80%

You’ll note that each column totals 100%. That is, all the words (or at least all of the meaningful words) in each text are assigned to one of the topics. It appears that text 2 is primarily about topic B and that text 4 is primarily about topic C.

We can use this table to update the assignment of words to topics. In doing so, we will find that we also update the table. Let’s consider the word “frog” and suppose that “frog” appears five times in text 2, twice in text 3, and once in text 4. (We already said that it appears twice in text 1.)

We can use the statistics about which topic a text is associated with to update our knowledge of which topic to associate “frog” with. If a text is mostly about a topic, it is likely that most of the words in the text belong in that topic. If a text has little association with a topic (e.g., text 4 and topic B), it is unlikely that the words in the text belong in the topic. We can therefore use that information to calculate a probability that a word belongs in a particular topic. (Ideally, in calculating that probability, we would remove the word from the table and update the percentages; for this example, we will not do so.)

A word in text 1 has a 50% probability of being in topic A. Since 20% of the instances of “frog” are in text 1, we would say that text 1 contributes .10 (50% x 20%) to the probability that “frog” belongs in topic A. A word in text 2 has a 10% chance of being in topic A. Since half of the appearances of “frog” are in text 2, we would say that text 2 contributes .05 (10% x 50%) to the probability that “frog” belongs in topic A. Similarly, text 3 contributes .06 (30% x 20%) and text 4 contributes .01 (10% x 10%) to that probability. The overall probability that “frog” belongs in topic A is therefore .22.

We can do a similar analysis for topic B. Text 1 contributes .04 (20% x 20%). Text 2 contributes .35 (70% x 50%). Text 3 contributes .07 (35% x 20%). And text 4 contributes .01 (10% x 10%). That gives us an overall probability of .47.

For topic C, text 1 contributes .06 (30% x 20%), text 2 contributes .10 (20% x 50%), text 3 contributes .07 (35% x 20%), and text 4 contributes .08 (80% x 10%). That gives an overall probability of .31.

Fortunately, those probabilities add up to 1. (Yes, it’s possible to prove that they always will.)
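
To make that arithmetic concrete, here is the topic A computation at a Racket prompt. (The names frog-fractions and topic-a-probs are just for this illustration; the computations for topics B and C work the same way.)

> (define frog-fractions '(2/10 5/10 2/10 1/10))        ; fraction of the appearances of "frog" in each text
> (define topic-a-probs '(50/100 10/100 30/100 10/100)) ; the topic A row of the table
> (apply + (map * topic-a-probs frog-fractions))
11/50
> (exact->inexact 11/50)
0.22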

We can then use those percentages to probabilistically reassign “frog”. We generate a random number between 0 and 1. If it is .22 or less, we reassign “frog” to topic A. If it’s between .22 and .69 (.22 + .47), we reassign “frog” to topic B. And the rest of the time, we keep “frog” in topic C.

Let’s suppose the random number tells us to reassign “frog” to topic B. Since we’ve moved “frog”, we also need to update the table. Let’s suppose text 2 has twenty-five words. If there were five instances of “frog” in text 2, that accounts for all of the 20% of text 2 that was assigned to topic C. We also know that “frog” accounts for two of the ten words in text 1, both of which were assigned to topic C. We’ll pretend that it accounts for two of the twenty words in text 3 and one of the ten words in text 4. Our updated table will be as follows.

          Text 1    Text 2    Text 3    Text 4
Topic A   50%       10%       30%       10%
Topic B   40%       90%       45%       20%
Topic C   10%       0%        25%       70%

If we repeat this process again and again and again, we are likely to end up with a “steady state”, one in which we tend to reassign words to the same topic.

Assignment

Part a. We will be representing the topics as a hash table that maps each word (a string) to a corresponding topic (“A”, “B”, “C”, ….).

Write a procedure, (randomly-assign-topics words n), that takes a list of words as input and creates a hash table that associates each word with one of n different topics. You may assume that n is a number between 1 and 26, inclusive.

> (randomly-assign-topics '("ferocious" "dragon" "frog" "hippo" "happy") 3)
'#hash(("dragon" . "C") ("ferocious" . "B") ("frog" . "C") ("happy" . "A") ("hippo" . "C"))
> (randomly-assign-topics '("ferocious" "dragon" "frog" "hippo" "happy") 3)
'#hash(("dragon" . "B") ("ferocious" . "A") ("frog" . "C") ("happy" . "B") ("hippo" . "A"))
> (randomly-assign-topics '("ferocious" "dragon" "frog" "hippo" "happy") 3)
'#hash(("dragon" . "C") ("ferocious" . "A") ("frog" . "B") ("happy" . "B") ("hippo" . "A"))
> (randomly-assign-topics '("ferocious" "dragon" "frog" "hippo" "happy") 2)
'#hash(("dragon" . "A") ("ferocious" . "B") ("frog" . "B") ("happy" . "A") ("hippo" . "A"))

You may find the following procedure helpful.

;;; Procedure:
;;;   letter
;;; Parameters:
;;;   n, a non-negative integer
;;; Purpose:
;;;   Convert a number to a letter (such as for the name of a topic)
;;; Produces:
;;;   let, a string
(define letter
  (lambda (n)
    (string (integer->char (+ (char->integer #\A) n)))))
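
For example, here is one possible sketch of randomly-assign-topics that uses letter. Since (random n) produces an integer between 0 and n-1, each word receives one of the n topic names with roughly equal probability. The sketch builds a mutable hash table so that later parts of the assignment can update it with hash-set!.

(define randomly-assign-topics
  (lambda (words n)
    (let ([topics (make-hash)])
      ; Give each word a randomly chosen topic name.
      (for-each (lambda (word)
                  (hash-set! topics word (letter (random n))))
                words)
      topics)))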

Part b. As you may recall, we need to work with text in a more convenient form. In particular, we will find it useful to convert a text to a list of strings, doing the various kinds of cleanup described earlier.

Write a procedure, (cleanup str), that takes a string as a parameter, extracts the words, removes any of the less interesting words, and converts everything to lowercase. (If you’d also like to try a form of stemming, you can do so.)

> (cleanup "Henry the heroic frog heard that a ferocious dragon visited a happy and handsome frog.")
'("henry" "heroic" "frog" "heard" "ferocious" "dragon" "visited" "happy" "handsome" "frog")
> (cleanup alice01)
'("alice" "beginning" "very" "tired" "sitting" "sister" "bank" "having"
  "nothing" "once" "twice" "peeped" "into" "book" "sister" "reading"
  "pictures" "conversations" "what" "book" "thought" "alice" "without"
  "pictures" "conversations")

We will refer to the result of cleanup as a “text”.
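
Here is a rough sketch of one approach to cleanup. The regular expression, the stop-word list, and the minimum word length below are all design choices (and the stop-word list is deliberately incomplete), which is one reason your results on the examples may differ.

; A small, incomplete list of words to ignore; extend it as you see fit.
(define stop-words
  '("a" "an" "and" "the" "that"))

(define cleanup
  (lambda (str)
    (filter (lambda (word)
              (and (> (string-length word) 3)         ; drop very short words
                   (not (member word stop-words))))   ; drop common words
            (map string-downcase
                 (regexp-split #px"[^a-zA-Z']+" str)))))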

Part c. At some point, we’ll probably need a list of every possible word, with no duplicates. That will allow us to randomly select words, to generate the initial topics, and so on and so forth.

Write a procedure, (remove-duplicates list-of-words), that removes all duplicates from a list of words.

Write a procedure, (unique-words list-of-texts), that takes a list of texts (that is, a list of lists of words) as a parameter and produces a list of all of the words in the list of texts, with each word appearing once. (This should be a simple procedure.)

> (remove-duplicates (cleanup "Once upon a time, there lived a princess and a frog.  The princess was very happy.  And when I say that they lived, I don't mean that they lived together.  The princess knew not the frog, and the frog knew not the princess."))
'("when" "mean" "very" "knew" "princess" "frog" "there" "once" "don't" "upon" "that" "together" "happy" "they" "lived" "time")
> (remove-duplicates (cleanup alice01))
'("beginning" "alice" "sitting" "very" "pictures" "thought" "into" "reading" "peeped" "without" "what" "bank" "twice" "having" "book" "sister" "conversations" "tired" "nothing" "once")
> (remove-duplicates (cleanup alice02))
'("could" "whether" "rabbit" "chain" "stupid" "pleasure" "made" "when" "considering" "white" "would" "sleepy" "getting" "mind" "feel" "making" "daisy" "daisies" "very" "close" "eyes" "well" "trouble" "picking" "pink" "suddenly" "with" "worth") 
> (remove-duplicates (cleanup alice03))
'("seemed" "seen" "late" "then" "field" "shall" "itself" "looked" "watch" "hear" "curiosity" "fortunately" "never" "just" "waistcoat" "took" "natural" "alice" "feet" "occurred" "actually" "remarkable" "down" "burning" "with" "rabbit" "hurried" "pocket" "thought" "have" "hole" "across" "take" "this" "when" "think" "either" "after" "dear" "under" "mind" "wondered" "started" "there" "before" "very" "afterwards" "over" "ought" "flashed" "that" "large" "hedge" "quite" "nothing" "much" "time")
> (unique-words (map cleanup (list alice01 alice02 alice03)))
'("seemed" "seen" "without" "itself" "suddenly" "would" "sleepy" "getting" "just" "tired" "natural" "alice" "sitting" "pink" "occurred" "twice" "pictures" "with" "daisy" "rabbit" "pocket" "thought" "chain" "into" "remarkable" "take" "when" "either" "after" "under" "mind" "wondered" "started" "making" "considering" "afterwards" "well" "actually" "once" "flashed" "that" "hole" "much" "time" "worth" "book" "whether" "late" "then" "shall" "made" "looked" "watch" "hear" "curiosity" "fortunately" "never" "sister" "waistcoat" "feel" "picking" "feet" "reading" "there" "over" "hurried" "what" "could" "down" "burning" "stupid" "pleasure" "across" "this" "bank" "think" "white" "large" "dear" "took" "beginning" "before" "daisies" "very" "close" "eyes" "trouble" "peeped" "have" "ought" "nothing" "field" "having" "hedge" "quite" "conversations")
; How many total words do we have in the excerpt?
> (reduce + (map (o length cleanup) alice-excerpt))
550
; How many unique words in each section?
> (map (o length remove-duplicates cleanup) alice-excerpt)
'(4 20 28 57 11 24 59 27 56 56 74 46)
> (reduce + (map (o length remove-duplicates cleanup) alice-excerpt))
462
; How many unique words overall?
> (length (unique-words (map cleanup alice-excerpt)))
297
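
Here is one possible sketch of the two procedures. (Recent versions of Racket also provide remove-duplicates in racket/list; depending on your setup, you may be able to use that directly.) Note that this version keeps the last copy of each repeated word, so the order of your results may differ from the samples; any order is fine.

(define remove-duplicates
  (lambda (list-of-words)
    ; Keep a word only if it does not appear later in the list.
    (cond
      [(null? list-of-words)
       null]
      [(member (car list-of-words) (cdr list-of-words))
       (remove-duplicates (cdr list-of-words))]
      [else
       (cons (car list-of-words)
             (remove-duplicates (cdr list-of-words)))])))

(define unique-words
  (lambda (list-of-texts)
    ; Append all of the texts together, then drop the duplicates.
    (remove-duplicates (apply append list-of-texts))))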

Part d. We will need to represent the columns of the table in some way. The most convenient way is a hash table which maps topic names to probabilities.

Write a procedure, (text-probs text topics), that takes as input a list of words and the type of hash table produced by randomly-assign-topics, and produces a hash table that correctly maps each topic name to the percentage of words in the list that fall into that topic.

> topics
'#hash(("computer" . "C") ("dragon" . "B") ("ferocious" . "B") ("fish" . "B") ("frog" . "C") ("handsome" . "B") ("happy" . "C") ("pizza" . "A") ("science" . "B"))
> (text-probs '("computer" "science" "pizza" "ferocious" "computer" "pizza") topics)
'#hash(("A" . 1/3) ("B" . 1/3) ("C" . 1/3))
> (text-probs '("fish" "fish" "frog" "ferocious" "handsome") topics)
'#hash(("B" . 4/5) ("C" . 1/5))
> (define topics 
    (randomly-assign-topics (unique-words (map cleanup alice-excerpt)) 5))
> (hash-ref topics "alice")
"C"
> (take (cleanup alice01) 10)
'("alice" "beginning" "very" "tired" "sitting" "sister" "bank" "having" "nothing" "once")
> (map (section hash-ref topics <>) (take (cleanup alice01) 10))
'("C" "A" "D" "A" "A" "D" "D" "A" "A" "E")
> (text-probs (take (cleanup alice01) 10) topics)
'#hash(("A" . 1/2) ("C" . 1/10) ("D" . 3/10) ("E" . 1/10))
> (text-probs (cleanup alice01) topics)
'#hash(("A" . 2/5) ("C" . 1/5) ("D" . 7/25) ("E" . 3/25))

Part e. We will need to keep track of the percentage of times each word appears in each text.

First, write a procedure, (word-appearances word texts), that takes a word and a list of texts as parameters and produces a list of the number of times the word appears in each text.

> (word-appearances "frog"
                    '(("frog" "frog" "funny" "frog")
                      ("happy" "frog")
                      ("pizza" "computer" "pizza")
                      ("ferocious" "frog")))
'(3 1 0 1)
> (word-appearances "alice" (map cleanup alice-excerpt))
'(0 2 0 2 1 1 0 1 2 0 2 2)

Next, write a procedure, (word-percentages word texts), that works like the previous procedure, but returns the fraction of the total number of appearances of the word in each text. (You can assume that the word appears in at least one text.)

> (word-percentages "frog"
                    '(("frog" "frog" "funny" "frog")
                      ("happy" "frog")
                      ("pizza" "computer" "pizza")
                      ("ferocious" "frog")))
'(3/5 1/5 0 1/5)
> (reduce + (word-appearances "alice" (map cleanup alice-excerpt)))
13
> (word-percentages "alice" (map cleanup alice-excerpt))
'(0 2/13 0 2/13 1/13 1/13 0 1/13 2/13 0 2/13 2/13)
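
Here is one possible sketch of the two procedures, using the tally-value procedure from the course library (the same procedure that appears in the transcripts for parts g and h).

(define word-appearances
  (lambda (word texts)
    ; Count the occurrences of the word in each text.
    (map (lambda (text) (tally-value text word)) texts)))

(define word-percentages
  (lambda (word texts)
    ; Divide each count by the total number of appearances.
    (let* ([appearances (word-appearances word texts)]
           [total (apply + appearances)])
      (map (lambda (count) (/ count total)) appearances))))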

Part f. We’re getting close, believe it or not. Your next goal is to use the word percentages and a list of text probabilities (created by text-probs) to generate a list of probabilities that a particular word belongs in each of the different topics.

Write a procedure, (word-probs word-percents list-of-text-probs), that takes two equal-length lists as parameters, with the first list having the form of word-percentages and the second having the form of a list of hash tables returned by text-probs, and that returns a list of the form '(("A" prob) ("B" prob) ....). For the table at the beginning of this assignment, the return value would be '(("A" 0.22) ("B" 0.47) ("C" 0.31)). The topics do not have to appear in alphabetical order.

If you’d like to make the list of topic letters a parameter to word-probs, you can do so.

> (define excerpt-topics (randomly-assign-topics (unique-words (map cleanup alice-excerpt)) 5))
> (define excerpt-probs (map (section text-probs <> excerpt-topics)
                             (map cleanup alice-excerpt)))
> excerpt-probs
'(#hash(("A" . 1/4) ("B" . 1/2) ("C" . 1/4))
  #hash(("A" . 12/25) ("C" . 1/5) ("D" . 3/25) ("E" . 1/5))
  #hash(("A" . 2/7) ("B" . 2/7) ("C" . 1/7) ("D" . 1/4) ("E" . 1/28))
  #hash(("A" . 19/72) ("B" . 23/72) ("C" . 1/6) ("D" . 1/6) ("E" . 1/12))
  #hash(("A" . 3/11) ("B" . 2/11) ("C" . 4/11) ("E" . 2/11))
  #hash(("A" . 4/27) ("B" . 1/9) ("C" . 2/9) ("D" . 11/27) ("E" . 1/9))
  #hash(("A" . 9/68) ("B" . 15/68) ("C" . 13/68) ("D" . 9/34) ("E" . 13/68))
  #hash(("A" . 5/28) ("B" . 1/7) ("C" . 3/14) ("D" . 2/7) ("E" . 5/28))
  #hash(("A" . 4/35) ("B" . 17/70) ("C" . 11/70) ("D" . 5/14) ("E" . 9/70))
  #hash(("A" . 15/64) ("B" . 7/32) ("C" . 3/16) ("D" . 7/32) ("E" . 9/64))
  #hash(("A" . 19/101) ("B" . 24/101) ("C" . 23/101) ("D" . 20/101) ("E" . 15/101))
  #hash(("A" . 9/52) ("B" . 1/4) ("C" . 3/13) ("D" . 1/4) ("E" . 5/52)))
> (define alice-excerpt-percentages
    (word-percentages "alice" (map cleanup alice-excerpt)))
> alice-excerpt-percentages
'(0 2/13 0 2/13 1/13 1/13 0 1/13 2/13 0 2/13 2/13)
> (word-probs alice-excerpt-percentages excerpt-probs)
'(("E" . 32477339/236576340) ("C" . 5031337/23657634) ("D" . 30173/136350) ("A" . 207335563/887161275) ("B" . 394393/2022020))
> (reduce + (map cdr (word-probs alice-excerpt-percentages excerpt-probs)))
1
> (map (o exact->inexact cdr) (word-probs alice-excerpt-percentages excerpt-probs))
'(0.1372805877375565 0.21267287337355884 0.2212907957462413 0.23370673274709833 0.19504901039554504)
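
Here is one possible sketch of word-probs. It returns the (list topic prob) pairs described above, which is also the form that biased-select (part g) expects. The transcript above happens to print its pairs with dots because that version built them with cons and extracted probabilities with cdr; either representation is fine as long as you are consistent with biased-select.

(define word-probs
  (lambda (word-percents list-of-text-probs)
    ; For each topic, sum over the texts the product of (a) the fraction
    ; of the word's appearances that fall in that text and (b) the
    ; probability that a word in that text belongs to the topic.
    (let ([totals (make-hash)])
      (for-each (lambda (percent probs)
                  (for-each (lambda (topic)
                              (hash-update! totals
                                            topic
                                            (lambda (so-far)
                                              (+ so-far (* percent (hash-ref probs topic))))
                                            0))
                            (hash-keys probs)))
                word-percents
                list-of-text-probs)
      (map (lambda (topic) (list topic (hash-ref totals topic)))
           (hash-keys totals)))))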

Part g. You can use the following procedure to randomly select from such a list.

;;; Procedure:
;;;   biased-select
;;; Parameters:
;;;   lst, a non-empty list of value/probability lists
;;; Purpose:
;;;   Select one of the elements in the list, choosing
;;;   the element according to probability.  (This is
;;;   called "biased selection" in the literature.)
;;; Produces:
;;;   value, a value
;;; Preconditions:
;;;   * Each element of lst has the form (val prob).
;;;   * Each probability is a real number.
;;;     That is (all (o real? cadr) lst)
;;;   * Each probability is between 0 and 1, inclusive.
;;;   * The sum of all the probabilities is 1.
;;;     That is, (reduce + (map cadr lst)) = 1.
;;; Postconditions:
;;;   * value is one of the values in the list.  That is
;;;     (member? value (map car lst)).
;;;   * It is difficult to predict which value we get.
;;;   * Suppose the list is of the form ((val1 prob1)
;;;     (val2 prob2) ... (valn probn)).  Over a long
;;;     series of calls, we'll see val1 about prob1
;;;     of the time, val2 about prob2 of the time, and so
;;;     on and so forth.
(define biased-select
  (lambda (lst)
    (let kernel ([r (random)]
                 [remaining lst])
      (let* ([entry (car remaining)]
             [value (car entry)]
             [prob (cadr entry)])
        (cond
          [(null? (cdr remaining))
           value]
          [(< r prob)
           value]
          [else
           (kernel (- r prob)
                   (cdr remaining))])))))

Here’s an instance of that procedure in action.

> (define frog-probs '(("A" 0.22) ("B" 0.47) ("C" 0.31)))
> (define frog-topics (map (lambda (x) (biased-select frog-probs)) (make-list 1000 null)))
> (tally-value frog-topics "A")
224
> (tally-value frog-topics "B")
474
> (tally-value frog-topics "C")
302

Write a procedure, (select-topic ...) that uses the preceding procedures to probabilistically select a topic for a word given the word percentages and the list of text probabilities. You can choose the parameters to this procedure; you will likely want to use values you’ve precomputed. One example appears below.

> (define excerpt-topics (randomly-assign-topics (unique-words (map cleanup alice-excerpt)) 5))
> (define excerpt-probs (map (section text-probs <> excerpt-topics)
                             (map cleanup alice-excerpt)))
> (define alice-excerpt-percentages
    (word-percentages "alice" (map cleanup alice-excerpt)))
> (select-topic alice-excerpt-percentages excerpt-probs)
"B"
> (select-topic alice-excerpt-percentages excerpt-probs)
"D"
> (select-topic alice-excerpt-percentages excerpt-probs)
"D"
> (select-topic alice-excerpt-percentages excerpt-probs)
"B"
> (select-topic alice-excerpt-percentages excerpt-probs)
"B"
> (select-topic alice-excerpt-percentages excerpt-probs)
"C"

Part h. Write a procedure, (update-probs! text probs word oldtopic newtopic), that updates the probabilities for a text based on the change of a word from an old topic to a new topic. You will likely need the text (a list of words), the old probabilities, the word you are moving, the old topic, and the new topic. For example, if we move “frog” from “C” to “B” and we know that “frog” represents 2/7 of the words in a text, we would do something like the following.

  (hash-set probs "C" (- (hash-ref probs "C") 2/7))
  (hash-set probs "B" (+ (hash-ref probs "B") 2/7))

Here’s an example using our text for Alice in Wonderland.

; Set up the topics and probabilities
> (define excerpt-topics (randomly-assign-topics (unique-words (map cleanup alice-excerpt)) 6))
> (define alice01-probs (text-probs (cleanup alice01) excerpt-topics))
> alice01-probs
'#hash(("A" . 3/25) ("B" . 1/5) ("C" . 4/25) ("D" . 9/25) ("E" . 2/25) ("F" . 2/25))
; What topic does "alice" have?
> (hash-ref excerpt-topics "alice")
"D"
; How many times does it appear in the first text?
> (tally-value (cleanup alice01) "alice")
2
; Update the probabilities
> (update-probs! (cleanup alice01) alice01-probs "alice" "D" "C")
; We should see C increase by 2/25 and D decrease by 2/25.
> alice01-probs
'#hash(("A" . 3/25) ("B" . 1/5) ("C" . 6/25) ("D" . 7/25) ("E" . 2/25) ("F" . 2/25))
; In the larger procedure, we would update "alice" to use topic C.
> (hash-set! excerpt-topics "alice" "C")
> (define alice03-probs (text-probs (cleanup alice03) excerpt-topics))
> alice03-probs
'#hash(("A" . 13/72) ("B" . 13/72) ("C" . 1/8) ("D" . 11/72) ("E" . 1/6) ("F" . 7/36))
> (hash-ref excerpt-topics "rabbit")
"B"
> (tally-value (cleanup alice03) "rabbit")
4
> (length (cleanup alice03))
72
> (update-probs! (cleanup alice03) alice03-probs "rabbit" "B" "E")
> alice03-probs
'#hash(("A" . 13/72) ("B" . 1/8) ("C" . 1/8) ("D" . 11/72) ("E" . 2/9) ("F" . 7/36))
> (hash-set! excerpt-topics "rabbit" "E")

Part i. Write a procedure, (improve-model! topics list-of-texts list-of-probs) that randomly selects a word and uses the procedures above to reassign it to a new topic and update the list of probabilities.

> (define excerpt-texts (map cleanup alice-excerpt))
> (define excerpt-topics (randomly-assign-topics (unique-words excerpt-texts) 6))
> (define excerpt-probs (map (section text-probs <> excerpt-topics) excerpt-texts))
> excerpt-probs
'(#hash(("B" . 1/4) ("E" . 1/4) ("F" . 1/2))
  #hash(("A" . 1/5) ("B" . 6/25) ("C" . 1/5) ("D" . 2/25) ("E" . 4/25) ("F" . 3/25))
  #hash(("A" . 1/7) ("B" . 2/7) ("C" . 3/14) ("D" . 3/28) ("E" . 1/14) ("F" . 5/28))
  #hash(("A" . 13/72) ("B" . 5/18) ("C" . 13/72) ("D" . 1/18) ("E" . 11/72) ("F" . 11/72))
  #hash(("B" . 1/11) ("C" . 4/11) ("D" . 2/11) ("E" . 2/11) ("F" . 2/11))
  #hash(("A" . 2/9) ("B" . 7/27) ("C" . 4/27) ("D" . 1/9) ("E" . 2/27) ("F" . 5/27))
  #hash(("A" . 7/34) ("B" . 7/34) ("C" . 7/34) ("D" . 3/34) ("E" . 5/68) ("F" . 15/68))
  #hash(("A" . 3/28) ("B" . 1/4) ("C" . 2/7) ("D" . 1/14) ("E" . 1/7) ("F" . 1/7))
  #hash(("A" . 6/35) ("B" . 2/7) ("C" . 4/35) ("D" . 11/70) ("E" . 3/70) ("F" . 8/35))
  #hash(("A" . 3/32) ("B" . 5/32) ("C" . 7/32) ("D" . 5/32) ("E" . 3/16) ("F" . 3/16))
  #hash(("A" . 17/101) ("B" . 21/101) ("C" . 15/101) ("D" . 12/101) ("E" . 15/101) ("F" . 21/101))
  #hash(("A" . 7/26) ("B" . 3/13) ("C" . 5/52) ("D" . 5/26) ("E" . 3/52) ("F" . 2/13)))
> (improve-model! excerpt-topics excerpt-texts excerpt-probs)
; This moved "upon" from "C" to "E"
> (map (section tally-value <> "upon") excerpt-texts)
'(0 0 0 0 0 0 1 0 0 0 1 0)
; It appears that we will only make changes to the distribution in elements
; 6 and 10.
> (hash-ref excerpt-topics "upon")
"E"
> excerpt-probs
'(#hash(("B" . 1/4) ("C" . 0) ("E" . 1/4) ("F" . 1/2))
  #hash(("A" . 1/5) ("B" . 6/25) ("C" . 1/5) ("D" . 2/25) ("E" . 4/25) ("F" . 3/25))
  #hash(("A" . 1/7) ("B" . 2/7) ("C" . 3/14) ("D" . 3/28) ("E" . 1/14) ("F" . 5/28))
  #hash(("A" . 13/72) ("B" . 5/18) ("C" . 13/72) ("D" . 1/18) ("E" . 11/72) ("F" . 11/72))
  #hash(("B" . 1/11) ("C" . 4/11) ("D" . 2/11) ("E" . 2/11) ("F" . 2/11))
  #hash(("A" . 2/9) ("B" . 7/27) ("C" . 4/27) ("D" . 1/9) ("E" . 2/27) ("F" . 5/27))
  #hash(("A" . 7/34) ("B" . 7/34) ("C" . 13/68) ("D" . 3/34) ("E" . 3/34) ("F" . 15/68)) ; C this has changed from 7/34 to 13/68; E has changed from 5/68 to 3/38
  #hash(("A" . 3/28) ("B" . 1/4) ("C" . 2/7) ("D" . 1/14) ("E" . 1/7) ("F" . 1/7))
  #hash(("A" . 6/35) ("B" . 2/7) ("C" . 4/35) ("D" . 11/70) ("E" . 3/70) ("F" . 8/35))
  #hash(("A" . 3/32) ("B" . 5/32) ("C" . 7/32) ("D" . 5/32) ("E" . 3/16) ("F" . 3/16))
  #hash(("A" . 17/101) ("B" . 21/101) ("C" . 14/101) ("D" . 12/101) ("E" . 16/101) ("F" . 21/101)) ; C from 15/101->14/101, E from 15/101->16/101
  #hash(("A" . 7/26) ("B" . 3/13) ("C" . 5/52) ("D" . 5/26) ("E" . 3/52) ("F" . 2/13)))

Part j. Write a procedure, (display-topics topics) that takes the topics table (which, you may recall, maps words to their topics) and prints it out in a human-readable form. That is, it should print all the words for topic A, then all the words for topic B, and so on and so forth.

> (define topics
    (randomly-assign-topics
     '("alice" "rabbit" "frog" "princess" "castle" "happy" "wonder")
     3))
> topics
'#hash(("alice" . "C") ("castle" . "A") ("frog" . "A") ("happy" . "A") ("princess" . "B") ("rabbit" . "C") ("wonder" . "A"))
> (display-topics topics)
Topic A: (castle frog happy wonder)
Topic B: (princess)
Topic C: (alice rabbit)

You may find it helpful to first write a procedure, (invert-topics topics) that inverts a topics table, creating a new table whose keys are the topics and whose values are the lists of words in the topic.

> (define topics
    (randomly-assign-topics
     '("alice" "rabbit" "frog" "princess" "castle" "happy" "wonder")
     3))
> topics
'#hash(("alice" . "A") ("castle" . "B") ("frog" . "C") ("happy" . "B") ("princess" . "A") ("rabbit" . "C") ("wonder" . "B"))
> (invert-topics topics)
'#hash(("A" . ("alice" "princess")) ("B" . ("happy" "wonder" "castle")) ("C" . ("rabbit" "frog")))

Part k. Write a procedure, (topic-model list-of-strings num-topics num-iterations) that builds and returns a topic model (hash table) for the list of strings.

You will need to

  • Convert each string in the list of strings to a text (a list of words).
  • Identify the unique words.
  • Build a random model.
  • Build the list of probabilities for each text.
  • Repeatedly call the improve-model! procedure.
> (define model (time (topic-model alice-excerpt 10 1000)))
cpu time: 203 real time: 207 gc time: 25
> (display-topics model)
Topic A: (afraid aloud anything bats begun cupboards curtsey daisy dark dinah disappointment empty even fallen falling fortunately grand hope hurried ignorant itself killing lamps listen many maps milk nice noticed occurred opportunity pegs presently question seem seemed some somebody somewhere soon sort sound suddenly take they tunnel well words would)
Topic B: (about among come idea knowledge learnt listening over showing sometimes walking)
Topic C: (across actually alice answer began conversations curtseying didn distance dozing dreamy earnestly else fancy field hedge just know late leaves long longitude matter mice near rather said saucer should sitting things think thousand watch which)
Topic D: (country downward felt funny glad several there through time)
Topic E: (after fell make true wonder)
Topic F: (beginning book catch coming could going here hurt jumped like looked ought pleasure remarkable roof shelves sight still tired wind)
Topic G: (again antipathies away brave burning cats chain curiosity eyes fall fear from girl good great hall heap hear house into likely miss mouse name next orange people pink please practice sides sleepy spoke sticks straight talking tell them tried walk went whiskers wish with wondered word zealand)
Topic H: (afterwards centre corner dear drop ears earth either feet flashed found hole home little longer made marmalade must nothing overhead passed past perhaps picking quite rabbit remember right schoolroom seen shall stairs started this took tumbling twice under waistcoat what when white wouldn written)
Topic I: (behind close large peeped)
Topic J: (another asking australia bank before came chapter considering couldn daisies deep dipped down dream ever feel filled first four getting hand hanging happen have having heads herself hung hurrying labelled latitude lessons look lost making manage managed might miles mind moment much natural never night once passage pictures plenty pocket reading saying sister slowly stopping stupid such that their then though thought thump trouble truth turned upon very were whether without world worth)
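
Here is one possible sketch of topic-model, following the steps listed above and assuming the procedures from the earlier parts.

(define topic-model
  (lambda (list-of-strings num-topics num-iterations)
    (let* ([texts (map cleanup list-of-strings)]
           [topics (randomly-assign-topics (unique-words texts) num-topics)]
           [probs (map (lambda (text) (text-probs text topics)) texts)])
      ; Repeatedly reassign randomly chosen words to new topics.
      (let kernel ([i 0])
        (when (< i num-iterations)
          (improve-model! topics texts probs)
          (kernel (+ i 1))))
      topics)))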

Part l. Pick three sets of a dozen or so texts (paragraphs from a book you like, related Tweets, etc.) and show the results of running topic-model on each. You should try 100, 1000, and (if it’s not too slow) 10000 iterations of improve-model!.

Part m. Write a paragraph or two summarizing what you’ve discovered through your topic modeling or other aspects of this assignment.

Variants

You may find it easier to address some aspects of this assignment using vectors, rather than lists. You should feel free to do so.

As you can tell from the example, we’ll often get relatively large topics. If you want to find a way to filter out the less common words, that might make things more manageable. (That can also be an approach for the project.)

Sample texts

Some sample paragraphs were generated using the following code.

;;; Procedure:
;;;   file->paragraphs
;;; Parameters:
;;;   fname, a string
;;; Purpose:
;;;   Extract all of the paragraphs from a file.
;;; Produces:
;;;   paragraphs, a list of strings
(define file->paragraphs
  (lambda (fname)
    (regexp-split #px"\n\n+"
                  (file->string fname))))

;;; Identifier:
;;;   alice
;;; Type:
;;;   List of strings
;;; Contents:
;;;   The paragraphs of Alice in Wonderland.
(define alice
  (file->paragraphs "/home/rebelsky/Desktop/alice.txt"))

;;; Identifier:
;;;   alice-excerpt
;;; Type:
;;;   List of strings
;;; Contents:
;;;   Twelve paragraphs from the start of Alice in Wonderland.
;;; Note:
;;;   The first eleven or so paragraphs of the version we use are
;;;   the Project Gutenberg meta-data and the title.  We start with
;;;   the heading for chapter 1.
(define alice-excerpt
  (take (drop alice 11) 12))

;;; Identifiers:
;;;   alice01 ... alice10
;;; Type:
;;;   string
;;; Contents:
;;;   Paragraphs from alice
(define alice01 (list-ref alice-excerpt 1))
(define alice02 (list-ref alice-excerpt 2))
(define alice03 (list-ref alice-excerpt 3))
(define alice04 (list-ref alice-excerpt 4))
(define alice05 (list-ref alice-excerpt 5))
(define alice06 (list-ref alice-excerpt 6))
(define alice07 (list-ref alice-excerpt 7))
(define alice08 (list-ref alice-excerpt 8))
(define alice09 (list-ref alice-excerpt 9))
(define alice10 (list-ref alice-excerpt 10))

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly).

Acknowledgements

I took the examples of stemming from The Wikipedia page on stemming. (Yes, faculty members sometimes use Wikipedia.)

I found the quote from J. R. Firth on his Wikipedia page.