Note: This assignment requires that you’ve understood both hash tables and structured types. You will also need to continue to use regular expressions and a variety of core Racket types, procedures, and keywords. You can probably do problem 1 without learning structured types, so that’s a good starting point.
Topics: Hash tables and tallying, Files
As you’ve seen, it is often useful to learn about text by counting things, such as the number of times each word occurs. In the lab on hash tables, we learned how to use a hash table to help us count.
a. Write a procedure, (tally-letters str)
, that takes a string
as input and returns a list of how many times each letter occurs.
For this procedure, you should treat lowercase and uppercase as
equivalent. (You can convert using char-downcase
or string-downcase
).
For example,
> (tally-letters "Hello World!")
'(("a" 0)
("b" 0)
("c" 0)
("d" 1)
("e" 1)
("f" 0)
("g" 0)
("h" 1)
("i" 0)
("k" 0)
("l" 3)
("m" 0)
("n" 0)
("o" 2)
("p" 0)
("q" 0)
("r" 1)
("s" 0)
("t" 0)
("u" 0)
("v" 0)
("w" 1)
("x" 0)
("y" 0)
("z" 0))
You may also produce a list of character counts.
> (tally-letters "Hello World!")
'((#\a 0)
(#\b 0)
...)
You will likely want to
for-each
. If a character is a
letter, count it in the hash table. If not, don’t.b. Write a procedure, (visualize-letter-counts lst)
, that takes a list
produced by tally-letters
and provides some useful visualization.
c. Often, the use of “digrams” (two-letter sequences) provides some
interesting or useful information about a text. Write a procedure,
(tally-digrams str)
, that tallies all of the digrams in a string.
In contrast to tally-letters
, which includes letters even when
they do not appear, tally-digrams
should only list the digrams
that appear, preferably in alphabetical order.
You may assume that no more than ten letters ever appear in a row. (That may be important as you design your regular expressions.)
> (telly-digrams "Hello mellow World!")
'(("el" 2)
("he" 1)
("ld" 1)
("ll" 2)
("lo" 2)
("me" 1)
("or" 1)
("ow" 1)
("rl" 1)
("wo" 1))
Topics: Structures, Predicates, Conditionals
In our initial explorations of Racket’s structures, we created a structure for a date and a structure for a time. In many InterWeb applications, the two pieces of data are tied together. this set of exercises, we will create a variety of tools for working with a single structure that stores date and time.
a. Create a structure, datetime-kernel
, that has eight fields:
year
, month
, day
, hour
, minute
, second
, weekday
,
offset
. The first six should be stored as numbers and have the
obvious meanings. The weekday
field should be a string, that
represents the day of the week. The offset
is the offset of
the local time from GMT. (We will likely ignore this field for
now, but it’s useful to include it in our structure.)
b. Document and write an (is-leap-year? year)
procedure that
determines whether a year is a leap year. (If you haven’t memorized
the rules, most years divisible by four are leap years. Most years
divisible by 100 are not leap years. Years divisible by 400 are
leap years. Isn’t that an amazing rule?)
c. Document and write a (days-in-month year month)
procedure that
takes a year and a month (a number between 1 and 12) as parameters
and returns the number of days in that month.
d. Document and write a (valid-date? year month day)
procedure
that takes a year, month, and day of the month as parameters and
determines whether or not that represents a valid date.
e. Document and write a (valid-time? hours minutes seconds)
procedure
that takes the obvious parameters and determines whether or not it
represents a valid time. Times run from 00:00:00
to 23:59:59
.
f. Create appropriate husk methods for datetime-kernel
. These
include
(datetime year month day hour minute second weekday offset)
,
which checks preconitions and then creates one of those objects
(you need not verify that it’s the correct weekday).(datetime? value)
, which determines whether a Scheme value is
a valid wrapped datetime-kernel
.g. More frequently, you’ll find datetime values represented in
a form like "Wed Sep 28 22:54:59 +0000 2016"
. Write procedures
(string->datetime str)
and (datetime->string dt)
that convert
from and to the string representation.
h. Write a procedure, (datetime-before? dt1 dt2)
that determines
whether one datetime value comes before another. (You need not
take the offset from GMT into account.)
Topics: Text processing, Structs, Regular expressions, Hash tables
The file trump-tweets-2016.txt
(also available as /home/rebelsky/Desktop/trump-tweets-2016.txt
) contains
information on over 4000 Tweets posted by Donald Trump in 2016.
Your goal in this problem is to develop some tools to analyze those
Tweets. Here’s a sample.
{"source": "Twitter for iPhone", "id_str": "781247791473451008", "text": "RT @TeamTrump: \"She put the office of Sec of State up for sale. If she ever got the chance, she\u2019d put the Oval Office up for sale too.\" #Fo\u2026", "created_at": "Wed Sep 28 21:42:43 +0000 2016", "retweet_count": 10034, "in_reply_to_user_id_str": null, "favorite_count": 0, "is_retweet": true}
As you can see, this representation has a collection of field names, which are surround by quotation marks and followed by a values. String values are in quotation marks. Numeric and Boolean values are not. The Twitter system will gives access to other information, such as geolocation and author, which has been elided from these entries
a. Create a structured type, tweet
, that you can use to represent
these tweets. You need not worry about creating a husk for this
type.
b. Write a procedure, (string->tweet str)
, that turns one of these
strings into a tweet
object. Once you have this procedure, you
should be able to create a long list of Tweets with a command like
the following.
> (define tweets (map string->tweet (file->lines "/home/rebelsky/Desktop/trump-tweets-2016.txt")))
c. Write a procedure, (tweet->string str)
, that turns a tweet backinto one of these strings.
d. Write a procedure, (tweets-between tweets start end)
, that takes
a list of tweets and two datetimes as parameters, and returns a list of all
the Tweets made between those two datetimes. (You need not include the
starting time or the ending time.)
e. Write a procedure, (most-liked tweets n)
, that extracts the
n
most retweeted Tweets. You will need to sort the tweets by the
number of likes and then take the first n
.
f. Write a procedure, (tweet-sources tweets)
that produces a
list of lists, each of which contains the type a source
(e.g. "Twitter for iPhone"
) and the corresponding count of
the number of times that source is used. Note that you will
likely want to use a hash table to do this counting.
> (tweet-sources some-random-list-of-tweets)
'(("Twitter for Android" 23)
("Twitter for iPhone" 152)
("Twitter Web Client" 18)
("Alexa Spyware" 2))
g. Do something else interesting with the list of Tweets.
The Trump Tweets were found at https://github.com/bpb27/trump-tweet-archive/blob/master/data/realdonaldtrump/2016.json
. I reformatted the Tweets
using something like the following procedure.
(define cleanup
(o (regexp-replace* #px"(\"is_retweet\": [a-z]*\\})," <> "\\1\n")
(regexp-replace* "^\[" <> "")
(regexp-replace* "]$" <> "")))
That is: Remove the starting and ending square brackets, then add a
newline in place of a comma when you see "is_retweet": true},
or
"is_retweet": false},
.
For this assignment, you should document your procedures using the 6P documentation style. For procedures that randomly generate outputs, you should specify as much as possible about the output and then add something like “the output is difficult to predict”.
We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it acheives its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.