Extracting sublists: drop, take
Counting elements: length
Reading files: file->chars, file->words, file->lines, file->string
Writing files: string->file, lines->file
Notation: #px"EXPRESSION"
Any single character: .
Character sets and ranges: [abc], [c-f]
Character classes: \d for digits, \D for non-digits, \w for “word characters”, \W for non-word characters, \s for whitespace characters, and \S for non-whitespace characters
Repetition: EXP* or EXP+
Alternation: EXP1|EXP2
Grouping: (EXP)
Useful operations: (string-split string regexp), (regexp-match regexp string), (regexp-match* regexp string), (regexp-replace regexp string replacement), (regexp-replace* regexp string replacement)
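As a quick illustration of the operations listed above, here is a minimal sketch that uses only procedures built into #lang racket, so it does not need loudhum. The string demo is made up for this example. Note that inside a Racket string literal each backslash is doubled, so the class \s is written "\\s".

```racket
#lang racket
;; A made-up string, used only for illustration.
(define demo "ten cats, 2 dogs")

;; Split at each nonempty run of whitespace.
(define parts (string-split demo #px"\\s+"))               ; '("ten" "cats," "2" "dogs")

;; regexp-match returns a list holding the first match (or #f if none).
(define first-digit (regexp-match #px"\\d" demo))          ; '("2")

;; regexp-match* returns every match.
(define all-words (regexp-match* #px"[a-z]+" demo))        ; '("ten" "cats" "dogs")

;; regexp-replace changes only the first match; regexp-replace* changes all.
(define one-swap (regexp-replace #px"[a-z]+" demo "X"))    ; "X cats, 2 dogs"
(define all-swap (regexp-replace* #px"[a-z]+" demo "X"))   ; "X X, 2 X"
```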
a. Open a terminal window and type /home/rebelsky/bin/csc151/update to make sure that you have the latest version of the class software.
b. Don’t forget to put (require loudhum) in the Definitions pane immediately after #lang racket.
Project Gutenberg provides an extensive collection of public domain books in a variety of forms, including “plain text”.
a. Navigate to the Project Gutenberg Web site and download one or two books in plain text format. Strive for short- to medium-length books. Jane Eyre is okay. The Complete Works of William Shakespeare is not.
b. Pick one of the books you’ve downloaded and open it in gedit (aka “Text Editor”). (You’re doing this primarily to see that you got the appropriate contents.)
a. Using the book, write instructions in the definitions pane to read the characters, words, lines, and complete contents from the book. Call the results book-letters, book-words, book-lines, and book-contents. For example,
(define book-letters (file->chars "/home/rebelsky/Desktop/pg1260.txt"))
b. Write instructions to extract the first 20 characters, 10 words, and 5 lines from the book.
c. Determine how many letters appear in the book.
d. Write instructions to extract lines 100 through 120 from the book.
e. Write instructions to determine how many times the letter “a” appears in the book. (You need to deal only with lowercase “a”.)
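If take and drop are not fresh in your mind, the following sketch may help. It uses a short, made-up list named letters rather than a book, and shows how the two procedures can be combined to pull a run of elements out of the middle of a list.

```racket
#lang racket
;; A made-up list, used only for illustration.
(define letters '("a" "b" "c" "d" "e" "f" "g"))

(define first-three (take letters 3))         ; '("a" "b" "c")
(define after-two (drop letters 2))           ; '("c" "d" "e" "f" "g")

;; Skip two elements, then keep the next three.
(define middle (take (drop letters 2) 3))     ; '("c" "d" "e")

(define how-many (length letters))            ; 7
```

Because (drop letters 2) skips the first two elements, the result begins with the third element of the list. Keep that in mind when you decide which number to give drop for a particular starting line.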
As you may recall, the procedure (string->file str fname) saves a string to the named file. There’s also a procedure, (lines->file lines fname), that saves a list of strings to the named file, one string per line.
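The write-then-read-back pattern can be sketched without loudhum. The example below is only an approximation: it assumes that the standard Racket procedures display-to-file and display-lines-to-file are reasonable stand-ins for string->file and lines->file, and the loudhum versions may differ in details. It writes to a scratch file created by make-temporary-file, so it does not touch your Desktop.

```racket
#lang racket
;; Assumption: display-to-file and display-lines-to-file (standard Racket)
;; stand in for loudhum's string->file and lines->file.
(define tmp (make-temporary-file))    ; a scratch file, created empty

;; Write one string, then read it back.
(display-to-file "one line of text" tmp #:exists 'truncate)
(define as-string (file->string tmp))          ; "one line of text"

;; Write a list of strings, one per line, then read the lines back.
(display-lines-to-file '("first" "second" "third") tmp #:exists 'truncate)
(define as-lines (file->lines tmp))            ; '("first" "second" "third")

(delete-file tmp)                              ; clean up the scratch file
```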
a. Save line 100 of your book to the file /home/username/Desktop/line100.txt. (Please substitute your own user name.)
b. Verify that you were successful by using file->string with that same file name.
c. Save lines 100 through 120 of your book to the file /home/username/Desktop/excerpt.txt. (Once again, please substitute your own user name.)
d. Verify that you were successful by using file->string with that same file name.
e. Add a line to your definitions pane that reads as follows.
(define excerpt (file->string "/home/username/Desktop/excerpt.txt"))
a. What do you expect to happen if you use file->string to read a file that you do not have permission to read, as in the following?
> (file->string "/home/rebelsky/Desktop/TOP-SECRET")
b. Check your answer experimentally.
c. What do you expect to happen if you use file->string to read a file that does not exist, as in the following?
> (file->string "/home/nobody/Desktop/ydobon")
d. Check your answer experimentally.
e. What do you expect to happen if you try to write a file to someone else’s directory, as in the following?
> (string->file "I'm a H4x0r" "/home/rebelsky/Desktop/info")
f. Check your answer experimentally.
g. In a prior exercise, you created the file /home/username/Desktop/line100.txt. Using file->string, check the contents of that file.
h. What do you expect to happen if you try to write to that file, as in the following?
> (string->file "line 100" "/home/username/Desktop/line100.txt")
i. Check your answer experimentally.
a. Add the following to your definitions pane and click “Run”.
(define sample
"fishy: one cat, one hat, two things, \none fish, two fish, red fish, blue fish, green and yellow fish \nred books \n\n\none and two\tor\tthree and four\nthat is flat\n")
b. Suppose we create a file with (string->file sample "/home/username/Desktop/sample.txt"). What do you expect the contents of that file to look like?
c. Check your answer experimentally.
d. One way to break up that string is at each space. Write an expression to do so. (You should not need regular expressions, at least not yet.)
e. Another way to break up that string is at each newline character. Write an expression to do so. (You still should not need regular expressions, at least not yet.)
f. The word “and” appears a few times in that string. Split it at that word.
As you may have noted in the previous exercise, it seems insufficient to split at a space, or a newline, or even a tab (which we didn’t try yet).
a. Write an expression that splits sample at any whitespace character (space, tab, or newline).
> (string-split sample #px"???")
'("fishy:" "one" "cat," "one" "hat," "two" "things," "" "one" "fish," "two" "fish," "red" "fish," "blue" "fish," "green" "and" "yellow" "fish" "" "red" "books" "" "" "" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")
b. As you may have noted, the previous example includes a lot of empty strings. That’s because we’re splitting at a single whitespace character but the string contains sequences of whitespace characters, such as a space and a newline, or multiple newlines in a row. Write an expression that splits sample at any nonempty sequence of whitespace characters.
> (string-split sample #px"???")
'("fishy:" "one" "cat," "one" "hat," "two" "things," "one" "fish," "two" "fish," "red" "fish," "blue" "fish," "green" "and" "yellow" "fish" "red" "books" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")
c. As you may have noted, the previous example includes characters in “words” that are not alphabetical, such as the colon in "fishy:" and the comma in "hat,". Write an expression that splits sample at any nonempty sequence of non-alphabetical characters.
> (string-split sample #px"???")
'("fishy" "one" "cat" "one" "hat" "two" "things" "one" "fish" "two" "fish" "red" "fish" "blue" "fish" "green" "and" "yellow" "fish" "red" "books" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")
d. Write a procedure, (string->words str), that takes a string as input and splits it into the “words” (sequences of alphabetical characters).
> (string->words sample)
'("fishy" "one" "cat" "one" "hat" "two" "things" "one" "fish" "two" "fish" "red" "fish" "blue" "fish" "green" "and" "yellow" "fish" "red" "books" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")
> (string->words "hello+goodbye, ph33r")
'("hello" "goodbye" "ph" "r")
As you may recall, the procedure (regexp-match* regexp string) returns a list of all the substrings of string that match regexp.
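To see what that list looks like, here is a small sketch on a made-up string named inventory (not the sample from this lab). It also uses \b, which in #px patterns matches a word boundary; \b is not in the summary at the top of this lab, so treat it as an extra you may or may not need.

```racket
#lang racket
;; A made-up string, used only for illustration.
(define inventory "3 apples, 12 pears, 100 plums, 19 figs")

;; Every nonempty run of digits.
(define numbers (regexp-match* #px"\\d+" inventory))             ; '("3" "12" "100" "19")

;; Any two adjacent digits, even inside a longer number.
(define digit-pairs (regexp-match* #px"\\d\\d" inventory))       ; '("12" "10" "19")

;; Two digits with a word boundary on each side: whole two-digit numbers only.
(define two-digit (regexp-match* #px"\\b\\d\\d\\b" inventory))   ; '("12" "19")

;; When nothing matches, the result is the empty list.
(define none (regexp-match* #px"z" inventory))                   ; '()
```

Matches never overlap: once "10" has been taken from "100", the remaining "0" is too short to match \d\d.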
a. Write an expression that identifies all of the times that two vowels appear in sequence in sample. Your expression should only return a list of the vowel pairs, not the context.
b. Write an expression that identifies all of the times that a three-letter sequence of characters that ends with “at” appears in sample. Your expression should return a list of those three-letter sequences.
c. Write an expression that identifies all of the times that a three-letter word that ends with “at” appears in sample.
d. Repeat the two prior experiments using the excerpt from your book.
e. Repeat those two experiments using the whole book. How long does it seem to take?
f. Write an expression that identifies all of the times that an adjective and the word “fish” appear together in sample.
g. Write an expression that identifies all of the two-word sequences in sample that begin with the word “one”.
h. Check similar uses of “one” in your sample.
We often use #px"<[^>]*>" as a reasonable, but not perfect, regular expression for matching an HTML tag. Let’s see what we can do with that as a pattern.
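Before turning to a file, it may help to see the pattern at work on short strings. The two strings below, snippet and tricky, are made up for this sketch. The second one shows one way the pattern is “not perfect”: a > inside an attribute value ends the match early.

```racket
#lang racket
;; Made-up strings, used only for illustration.
(define snippet "<p>Hi <b>there</b></p>")
(define tricky "<a title=\"x>y\">link</a>")

;; Each match starts at < and runs to the nearest >.
(define snippet-tags (regexp-match* #px"<[^>]*>" snippet))
;; '("<p>" "<b>" "</b>" "</p>")

;; The > inside the title attribute cuts the first tag short.
(define tricky-tags (regexp-match* #px"<[^>]*>" tricky))
;; '("<a title=\"x>" "</a>")
```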
a. If you don’t have a Web site from the lab on HTML and CSS, create one by opening a terminal window and typing the following.
$ /home/rebelsky/bin/csc151-setup-web
b. As you may recall, your HTML directory contains a file called thingy.html. (The full path name is /home/username/public_html/thingy.html.) Write an expression that extracts all of the tags from that file.
c. Write a procedure, (tags filename), that extracts all the tags from the given file. Check it using the same file you worked with in the prior step.
d. The procedure (regexp-replace* regexp string replacement) replaces all instances of a regular expression with a replacement. Write an expression to read thingy.html and remove all tags from the resulting text.
e. Write a procedure, (notags filename), that extracts the contents of the given file, but without any tags.
You may recall that when you use (regexp-replace* regexp string replacement), a \\1 in the replacement stands for the text matched by the first parenthesized expression in regexp. Similarly, \\2 stands for the text matched by the second parenthesized expression, and so on and so forth. Let’s try that out.
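As a warm-up, here is a small sketch of group references on made-up strings. Each pattern below has exactly as many parenthesized groups as the replacement refers to; the exercises that follow explore what happens in less tidy cases.

```racket
#lang racket
;; Two groups separated by a hyphen; the replacement swaps them.
(define swapped
  (regexp-replace* #px"([a-z]+)-([a-z]+)" "left-right up-down" "\\2-\\1"))
;; "right-left down-up"

;; One group; the replacement wraps each match in angle brackets.
(define wrapped
  (regexp-replace* #px"([0-9]+)" "room 12, floor 3" "<\\1>"))
;; "room <12>, floor <3>"
```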
a. What output do you expect for the following?
> (regexp-replace* #px"([a-z][a-z][a-z])" "the cat" "\\1\\2\\3\\2\\1")
?
b. Check your answer experimentally.
c. What output do you expect for the following?
> (regexp-replace* #px"([a-z][a-z][a-z])" "catastrophe" "\\1\\2\\3\\2\\1 ")
?
d. Check your answer experimentally.
e. Earlier, you identified all of the words that were prefaced by the word “one”. Write an expression that replaces “one THING” by “some THINGs”. For example,
> (regexp-replace* #px"???" "one cat and one hat" "???")
"some cats and some hats"
a. Write a procedure, (count-male-pronouns str), that takes a string as input and counts the number of male pronouns (“he”, “him”, “his”) in the string. (Be careful not to count words like “history”.)
b. Write a procedure, (count-female-pronouns str), that takes a string as input and counts the number of female pronouns (“she”, “her”, “hers”) in the string.
c. Using your two procedures, determine whether the book you chose is more likely to use male pronouns or female pronouns.
d. Write a procedure, (this-and-that str), that finds all sequences of the form “WORD and WORD” in the string.
If you find that you have extra time, consider trying one or more of the following exercises.
At times, we want to restrict our pattern to the start or end of a string (or start or end of a line). Check the Racket reference on regular expressions to determine how to match start and end.
You may have observed that our regular expressions are biased toward alphabetic characters from American English. In this instance, as in too many situations related to computers, the original designers did not adequately consider the needs of those who speak other languages.
> (regexp-match* #px"\\w" "a b c á ñ d")
'("a" "b" "c" "d")
> (regexp-match* #px"[a-z]" "a b c á ñ d")
'("a" "b" "c" "d")
Fortunately, Racket’s regular expressions do provide ways to match according to more general Unicode classifications. Explore the Racket reference on regular expressions and figure out how to do so.
> (regexp-match* #px"???" "a b c á ñ d")
'("a" "b" "c" "á" "ñ" "d")
Rewrite the string->words procedure so that it works with any sequence of letters, not just US letters.