Skip to main content

Assignment 6: More text processing

Assigned
Wednesday, 27 February 2019
Due
Tuesday, 5 March 2019 by 10:30pm
Summary
In this assignment, you will write a variety of procedures that let us explore texts, including a structured corpus (collection) of texts.
Collaboration
You must work with your assigned partner(s) on this assignment. You may discuss this assignment with anyone, provided you credit such discussions when you submit the assignment.
Submitting
Email your answer to csc151-01-grader@grinnell.edu. The subject of your email should be [CSC151-01] Assignment 6 (Your Names) and should contain your answers to all parts of the assignment. Scheme code should be in the body of the message, not in an attachment.

Note: This assignment requires that you’ve understood both hash tables and structured types. You will also need to continue to use regular expressions and a variety of core Racket types, procedures, and keywords. You can probably do problem 1 without learning structured types, so that’s a good starting point.

Problem 1: Counting characters

Topics: Hash tables and tallying, Files

As you’ve seen, it is often useful to learn about text by counting things, such as the number of times each word occurs. In the lab on hash tables, we learned how to use a hash table to help us count.

a. Write a procedure, (tally-letters str), that takes a string as input and returns a list of how many times each letter occurs. For this procedure, you should treat lowercase and uppercase as equivalent. (You can convert using char-downcase or string-downcase).

For example,

> (tally-letters "Hello World!")
'(("a" 0)
  ("b" 0)
  ("c" 0)
  ("d" 1)
  ("e" 1)
  ("f" 0)
  ("g" 0)
  ("h" 1)
  ("i" 0)
  ("k" 0)
  ("l" 3)
  ("m" 0)
  ("n" 0)
  ("o" 2)
  ("p" 0)
  ("q" 0)
  ("r" 1)
  ("s" 0)
  ("t" 0)
  ("u" 0)
  ("v" 0)
  ("w" 1)
  ("x" 0)
  ("y" 0)
  ("z" 0))

You may also produce a list of character counts.

> (tally-letters "Hello World!")
'((#\a 0)
  (#\b 0)
  ...)

You will likely want to

  • Create a hash table.
  • Turn the string into a list of characters.
  • Iterate through the list, using for-each. If a character is a letter, count it in the hash table. If not, don’t.
  • Use the hash table to make a list of character/count lists.

b. Write a procedure, (visualize-letter-counts lst), that takes a list produced by tally-letters and provides some useful visualization.

c. Often, the use of “digrams” (two-letter sequences) provides some interesting or useful information about a text. Write a procedure, (tally-digrams str), that tallies all of the digrams in a string. In contrast to tally-letters, which includes letters even when they do not appear, tally-digrams should only list the digrams that appear, preferably in alphabetical order.

You may assume that no more than ten letters ever appear in a row. (That may be important as you design your regular expressions.)

> (telly-digrams "Hello mellow World!")
'(("el" 2)
  ("he" 1)
  ("ld" 1)
  ("ll" 2)
  ("lo" 2)
  ("me" 1)
  ("or" 1)
  ("ow" 1)
  ("rl" 1)
  ("wo" 1))

Problem 2: Dates and times

Topics: Structures, Predicates, Conditionals

In our initial explorations of Racket’s structures, we created a structure for a date and a structure for a time. In many InterWeb applications, the two pieces of data are tied together. this set of exercises, we will create a variety of tools for working with a single structure that stores date and time.

a. Create a structure, datetime-kernel, that has eight fields: year, month, day, hour, minute, second, weekday, offset. The first six should be stored as numbers and have the obvious meanings. The weekday field should be a string, that represents the day of the week. The offset is the offset of the local time from GMT. (We will likely ignore this field for now, but it’s useful to include it in our structure.)

b. Document and write an (is-leap-year? year) procedure that determines whether a year is a leap year. (If you haven’t memorized the rules, most years divisible by four are leap years. Most years divisible by 100 are not leap years. Years divisible by 400 are leap years. Isn’t that an amazing rule?)

c. Document and write a (days-in-month year month) procedure that takes a year and a month (a number between 1 and 12) as parameters and returns the number of days in that month.

d. Document and write a (valid-date? year month day) procedure that takes a year, month, and day of the month as parameters and determines whether or not that represents a valid date.

e. Document and write a (valid-time? hours minutes seconds) procedure that takes the obvious parameters and determines whether or not it represents a valid time. Times run from 00:00:00 to 23:59:59.

f. Create appropriate husk methods for datetime-kernel. These include

  • (datetime year month day hour minute second weekday offset), which checks preconitions and then creates one of those objects (you need not verify that it’s the correct weekday).
  • (datetime? value), which determines whether a Scheme value is a valid wrapped datetime-kernel.
  • Accessors for each of the eight fields.

g. More frequently, you’ll find datetime values represented in a form like "Wed Sep 28 22:54:59 +0000 2016". Write procedures (string->datetime str) and (datetime->string dt) that convert from and to the string representation.

h. Write a procedure, (datetime-before? dt1 dt2) that determines whether one datetime value comes before another. (You need not take the offset from GMT into account.)

Problem 3: Traversing and tallying Trump Tweets

Topics: Text processing, Structs, Regular expressions, Hash tables

The file trump-tweets-2016.txt (also available as /home/rebelsky/Desktop/trump-tweets-2016.txt) contains information on over 4000 Tweets posted by Donald Trump in 2016. Your goal in this problem is to develop some tools to analyze those Tweets. Here’s a sample.

{"source": "Twitter for iPhone", "id_str": "781247791473451008", "text": "RT @TeamTrump: \"She put the office of Sec of State up for sale. If she ever got the chance, she\u2019d put the Oval Office up for sale too.\" #Fo\u2026", "created_at": "Wed Sep 28 21:42:43 +0000 2016", "retweet_count": 10034, "in_reply_to_user_id_str": null, "favorite_count": 0, "is_retweet": true}

As you can see, this representation has a collection of field names, which are surround by quotation marks and followed by a values. String values are in quotation marks. Numeric and Boolean values are not. The Twitter system will gives access to other information, such as geolocation and author, which has been elided from these entries

a. Create a structured type, tweet, that you can use to represent these tweets. You need not worry about creating a husk for this type.

b. Write a procedure, (string->tweet str), that turns one of these strings into a tweet object. Once you have this procedure, you should be able to create a long list of Tweets with a command like the following.

> (define tweets (map string->tweet (file->lines "/home/rebelsky/Desktop/trump-tweets-2016.txt")))

c. Write a procedure, (tweet->string str), that turns a tweet backinto one of these strings.

d. Write a procedure, (tweets-between tweets start end), that takes a list of tweets and two datetimes as parameters, and returns a list of all the Tweets made between those two datetimes. (You need not include the starting time or the ending time.)

e. Write a procedure, (most-liked tweets n), that extracts the n most retweeted Tweets. You will need to sort the tweets by the number of likes and then take the first n.

f. Write a procedure, (tweet-sources tweets) that produces a list of lists, each of which contains the type a source (e.g. "Twitter for iPhone") and the corresponding count of the number of times that source is used. Note that you will likely want to use a hash table to do this counting.

> (tweet-sources some-random-list-of-tweets)
'(("Twitter for Android" 23)
  ("Twitter for iPhone" 152)
  ("Twitter Web Client" 18)
  ("Alexa Spyware" 2))

g. Do something else interesting with the list of Tweets.

Acknowledgements

The Trump Tweets were found at https://github.com/bpb27/trump-tweet-archive/blob/master/data/realdonaldtrump/2016.json. I reformatted the Tweets using something like the following procedure.

(define cleanup
  (o (regexp-replace* #px"(\"is_retweet\": [a-z]*\\})," <> "\\1\n")
     (regexp-replace* "^\[" <> "")
     (regexp-replace* "]$" <> "")))

That is: Remove the starting and ending square brackets, then add a newline in place of a comma when you see "is_retweet": true}, or "is_retweet": false},.

Documentation

For this assignment, you should document your procedures using the 6P documentation style. For procedures that randomly generate outputs, you should specify as much as possible about the output and then add something like “the output is difficult to predict”.

Evaluation

We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it acheives its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.