Topics: Regular expressions, String processing
In the lab on regular expressions, you wrote a variety of procedures that took a string as input and created a list of words in the string.
Write a procedure, (string->sentences str), that takes a string as input and produces a list of all the sentences that appear in the string.
> (string->sentences "Hi! My name is DocR. Did you get that? It's short for Doctor Racket.")
'("Hi" "My name is DocR" "Did you get that" "It's short for Doctor Racket")
For extra credit, include the punctuation in the sentences.
> (string->sentences "Hi! My name is DocR. Did you get that? It's short for Doctor Racket.")
'("Hi!" "My name is DocR." "Did you get that?" "It's short for Doctor Racket.")
Topics: HTML, Regular expressions, File basics
In the lab on regular expressions, you started to explore how you might take a page created in HTML and extract just the text on that page. In case you’ve forgotten, we can start by using regexp-replace* and a pattern like #px"<[^>]*>" to remove all of the tags.
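As a reminder of how that tag-stripping step behaves, here is a minimal sketch. The names sample and stripped and the sample string are made up for illustration; the sketch only removes tags and does none of the whitespace cleanup described next.

```racket
#lang racket

;; Hypothetical sample input; any string containing HTML tags would do.
(define sample "<p>Hello, <b>world</b>!</p>")

;; Replace every tag (a "<", any run of characters other than ">",
;; then ">") with the empty string.
(define stripped (regexp-replace* #px"<[^>]*>" sample ""))

stripped ; evaluates to "Hello, world!"
```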
But that’s not enough. We also want to remove extra space at the start of each line, collapse multiple blank lines into a single blank line, and collapse multiple spaces into a single space.
a. Write a procedure, (extract-text str), that takes a string (potentially with HTML tags) as input and returns the same string with the policies above applied. That is, remove HTML tags, remove spaces at the beginning of each line, collapse multiple spaces into a single space, and collapse multiple blank lines into a single blank line.
b. Write a procedure, (html->text htmlfile textfile), that takes two strings that name files as parameters, reads the contents of the first file, extracts the text, and then saves the result in the second.
c. One deficiency of the extract-text procedure we’ve written, at least as it applies to some files, is that it leaves short lines as they are rather than joining them. For example, it may be that after stripping tags and whitespace we end up with something like the following.
For example,
it may be that after
stripping tags and
whitespace
we end up with something like the following.
In
many cases, the
best strategy is to combine all
of the text up to each
blank line into a single line.
In many cases, the best strategy is to combine all of the text up to each blank line into a single line. (Your Web browser may wrap the following, but they are each one line.)
For example, it may be that after stripping tags and whitespace we end up with something like the following.
In many cases, the best strategy is to combine all of the text up to each blank line into a single line.
Write a procedure, (merge-lines source target), that takes two filenames as parameters, applies that process to the contents of the first file, and writes the result to the second file.
Note: You may find it helpful to recall that when you use (regexp-replace* regexp string replacement), if you put \\1 in the replacement, it stands for the text matched by the first parenthesized expression in regexp. Similarly, \\2 stands for the text matched by the second parenthesized expression, and so on.
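To illustrate the note, here is a small sketch of \\1 and \\2 in a replacement string. The pattern, the input string, and the name swapped are hypothetical and unrelated to the merge-lines problem itself.

```racket
#lang racket

;; The pattern has two parenthesized expressions, each matching one word.
;; In the replacement, \\2 stands for the second word and \\1 for the first,
;; so the two words trade places.
(define swapped
  (regexp-replace* #px"(\\w+) (\\w+)" "Doctor Racket" "\\2 \\1"))

swapped ; evaluates to "Racket Doctor"
```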
Topics: Regular expressions, Text files
The technique of sentiment analysis examines texts and extracts information about the sentiments expressed within that text. Are the statements positive? Negative? Nuanced? Do we see joy expressed? Sorrow?
While there are a variety of complex processes used in sentiment analysis, it is possible to do a simplified version of sentiment analysis by looking at the percentage of times that words that represent a certain sentiment appear in a text.
a. Identify at least a half-dozen words or phrases whose presence you expect to signal joy. “Joy” is an obvious one. “Enthusiasm” is perhaps another (or, more generally, words that begin with “enthus”). Then write a procedure, (analyze-joy str), that takes a string as input and determines what percentage of the words in str signal joy.
b. Do the same thing with words or phrases you expect to signal anger. For example, “shout” or “throw”.
c. Pick one other sentiment and develop a procedure to analyze that sentiment.
d. Using your three procedures, analyze the sentiment of three books
you download from Project Gutenberg.
Include a commented-out copy of your interactions pane in your code file. You start a multi-line comment with #| and end it with |#.
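For instance, a commented-out transcript might look like the following sketch. The procedure double and the interaction shown are hypothetical stand-ins for your own code and output.

```racket
#lang racket

;; A hypothetical procedure, standing in for your own work.
(define (double x) (* 2 x))

#|
> (double 21)
42
|#
```

Everything between #| and |# is ignored when the file runs, so the pasted transcript does not affect your definitions.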
Topics: Regular expressions, Text files
As you may recall from the introduction to this course, digital humanists often use their programming tools interactively. That is, they identify possible issues of interest, use programming tools to tease out some details, then return to the text to do more analysis.
Let’s try a simple version of that.
Topics: Regular expressions, String processing, Images
You’ve now explored two very different kinds of values in Racket: images and text files. (Well, you’ve explored much more than those two, but those are a good starting point.) Let’s combine those two ideas.
Write a procedure, (visualize filename), that does at least three
computations based on the named file (e.g., percentage of joyful
words, number of sentences) and then provides an “interesting”
visual representation of the results of those computations. You
might use larger shapes to represent higher frequencies or counts
or more transparent colors to represent smaller frequencies or
values.
Use your procedure on at least three different files downloaded from Project Gutenberg.
Note: If you have vision limitations that make this assignment difficult, you may instead choose to write a procedure that provides a textual description of the files.
For this assignment, you should document your procedures using the 4P documentation style. However, if you’d like to try to add the additional 2P’s (preconditions and postconditions), you are certainly welcome to do so.
We will primarily evaluate your work on correctness (does your code compute what it’s supposed to and are your procedure descriptions accurate); clarity (is it easy to tell what your code does and how it achieves its results; is your writing clear and free of jargon); and concision (have you kept your work short and clean, rather than long and rambly). In a few cases, we will also consider the creativity of your result.