
Lab: Regular expressions and pattern matching

Held: Wednesday, 6 February 2019
Writeup due: Friday, 8 February 2019
Summary: We explore pattern matching in Racket, particularly the use of what are called regular expressions to express general patterns.
Disclaimer: This lab is new for spring 2019. It may contain some infelicities.

Useful procedures and notation

List operations

Extracting sublists: drop, take

Counting elements: length
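As a quick reminder of how these list operations behave, here is a minimal sketch. The list colors is made up for illustration and is not part of the lab.

```racket
#lang racket
;; A small made-up list, used only for illustration.
(define colors (list "red" "orange" "yellow" "green" "blue"))

(length colors)            ; number of elements: 5
(take colors 2)            ; the first two elements: '("red" "orange")
(drop colors 2)            ; everything after the first two: '("yellow" "green" "blue")
(take (drop colors 1) 3)   ; three elements starting at index 1: '("orange" "yellow" "green")
```

Combining drop and take, as in the last line, is one way to pull a run of elements out of the middle of a list.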

Text files

Reading files: file->chars, file->words, file->lines, file->string

Writing files: string->file, lines->file

Regular expressions

Notation: #px"EXPRESSION"

  • any one character (.)
  • sets of characters (e.g., [abc], [c-f])
  • anti-sets of characters (e.g., [^abc])
  • specific sets (\d for digits, \D for non-digits, \w for “word characters”, \W for non-word characters, \s for space characters, and \S for non-space characters)
  • repetition (EXP* or EXP+)
  • alternation (EXP1|EXP2)
  • grouping ((EXP))

Useful operations:

  • (string-split string regexp), (regexp-match regexp string), (regexp-match* regexp string), (regexp-replace regexp string replacement), (regexp-replace* regexp string replacement)
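The following sketch shows each of these operations on a short made-up string. The string text and the patterns are chosen only for illustration. The sketch uses the versions of these procedures that come with #lang racket and assumes the class versions behave the same way.

```racket
#lang racket
;; A made-up string, used only for illustration.
(define text "call 555-1234 or 555-9876")

;; Inside a Racket string, a backslash must itself be escaped,
;; so the pattern \d+ is written "\\d+".
(regexp-match #px"\\d+" text)        ; first match only: '("555")
(regexp-match* #px"\\d+" text)       ; all matches: '("555" "1234" "555" "9876")
(regexp-replace #px"\\d" text "#")   ; first match only: "call #55-1234 or 555-9876"
(regexp-replace* #px"\\d" text "#")  ; every match: "call ###-#### or ###-####"
(string-split text #px"[ -]+")       ; '("call" "555" "1234" "or" "555" "9876")
```

Note the difference between the starred and unstarred forms: regexp-match and regexp-replace stop after the first match, while regexp-match* and regexp-replace* process every match.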

Preparation

a. Open a terminal window and type /home/rebelsky/bin/csc151/update to make sure that you have the latest version of the class software.

b. Don’t forget to put (require loudhum) in the Definitions pane immediately after #lang racket.

Exercises

Exercise 1: Working with texts

Project Gutenberg provides an extensive collection of public domain books in a variety of forms, including “plain text”.

a. Navigate to the Project Gutenberg Web site and download one or two books in plain text format. Strive for short- to medium-length books. Jane Eyre is okay. The Complete Works of William Shakespeare is not.

b. Pick one of the books you’ve downloaded and open it in gedit (aka “Text Editor”). (You’re doing this primarily to see that you got the appropriate contents.)

Exercise 2: Working with texts, revisited

a. Using the book, write instructions in the definitions pane to read the characters, words, lines, and complete contents from the book. Call the results book-letters, book-words, book-lines, and book-contents. For example,

(define book-letters (file->chars "/home/rebelsky/Desktop/pg1260.txt"))

b. Write instructions to extract the first 20 characters, 10 words, and 5 lines from the book.

c. Determine how many letters appear in the book.

d. Write instructions to extract lines 100 through 120 from the book.

e. Write instructions to determine how many times the letter “a” appears in the book. (You need deal only with lowercase “a”.)

Exercise 3: Creating files

As you may recall, the procedure (string->file str fname) saves a string to the named file. There’s also a procedure, (lines->file lines fname), that saves a list of strings to the named file, one string per line.

a. Save line 100 of your book to the file /home/username/Desktop/line100.txt. (Please substitute your own user name.)

b. Verify that you were successful by using file->string with that same file name.

c. Save lines 100 through 120 of your book to the file /home/username/Desktop/excerpt.txt. (Once again, please substitute your own user name.)

d. Verify that you were successful by using file->string with that same file name.

e. Add a line to your definitions pane that reads as follows.

(define excerpt (file->string "/home/username/Desktop/excerpt.txt"))

Exercise 4: More file experiments

a. What do you expect to happen if you try to read a file that you do not have permission to read using file->string, as in the following?

> (file->string "/home/rebelsky/Desktop/TOP-SECRET")

b. Check your answer experimentally.

c. What do you expect to happen if you try to read a file that does not exist using file->string, as in the following?

> (file->string "/home/nobody/Desktop/ydobon")

d. Check your answer experimentally.

e. What do you expect to happen if you try to write a file to someone else’s directory, as in the following?

> (string->file "I'm a H4x0r" "/home/rebelsky/Desktop/info")

f. Check your answer experimentally.

g. In a prior exercise, you created the file /home/username/Desktop/line100.txt. Using file->string, check the contents of that file.

h. What do you expect to happen if you try to write to that file, as in the following?

> (string->file "line 100" "/home/username/Desktop/line100.txt")

i. Check your answer experimentally.

Exercise 5: Exploring a sample string

a. Add the following to your definitions pane and click “Run”.

(define sample
  "fishy: one cat, one hat, two things, \none fish, two fish, red fish, blue fish, green and yellow fish \nred books \n\n\none and two\tor\tthree and four\nthat is flat\n")

b. Suppose we create a file with (string->file sample "/home/username/Desktop/sample.txt"). What do you expect the contents of that file to look like?

c. Check your answer experimentally.

d. One way to break up that string is at each space. Write an expression to do so. (You should not need regular expressions, at least not yet.)

e. Another way to break up that string is at each newline character. Write an expression to do so. (You still should not need regular expressions, at least not yet.)

f. The word “and” appears a few times in that string. Split it at that word.

Exercise 6: Splitting strings, revisited

As you may have noted in the previous exercise, it seems insufficient to split at a space, or a newline, or even a tab (which we haven’t tried yet).

a. Write an expression that splits sample at any whitespace character (space, tab, or newline).

> (string-split sample #px"???")
'("fishy:" "one" "cat," "one" "hat," "two" "things," "" "one" "fish," "two" "fish," "red" "fish," "blue" "fish," "green" "and" "yellow" "fish" "" "red" "books" "" "" "" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")

b. As you may have noted, the previous example includes a lot of empty strings. That’s because we’re splitting at a single whitespace character but the string contains sequences of whitespace characters, such as a space and a newline, or multiple newlines in a row. Write an expression that splits sample at any nonempty sequence of whitespace characters.

> (string-split sample #px"???")
'("fishy:" "one" "cat," "one" "hat," "two" "things," "one" "fish," "two" "fish," "red" "fish," "blue" "fish," "green" "and" "yellow" "fish" "red" "books" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")

c. As you may have noted, the previous example includes characters in “words” that are not alphabetical, such as the colon in "fishy:" and the comma in "hat,". Write an expression that splits sample at any nonempty sequence of non-alphabetical characters.

> (string-split sample #px"???")
'("fishy" "one" "cat" "one" "hat" "two" "things" "one" "fish" "two" "fish" "red" "fish" "blue" "fish" "green" "and" "yellow" "fish" "red" "books" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")

d. Write a procedure, (string->words str), that takes a string as input and splits it into the “words” (sequences of alphabetical characters).

> (string->words sample)
'("fishy" "one" "cat" "one" "hat" "two" "things" "one" "fish" "two" "fish" "red" "fish" "blue" "fish" "green" "and" "yellow" "fish" "red" "books" "one" "and" "two" "or" "three" "and" "four" "that" "is" "flat")
> (string->words "hello+goodbye, ph33r")
'("hello" "goodbye" "ph" "r")

Exercise 7: Searching strings

As you may recall, the procedure (regexp-match* regexp string) returns a list of all the substrings of string that match the pattern.

a. Write an expression that identifies all of the times that two vowels appear in sequence in sample. Your expression should only return a list of the vowel pairs, not the context.

b. Write an expression that identifies all of the times that a three-letter sequence of characters that ends with “at” appears in sample. Your expression should return a list of those three-letter sequences.

c. Write an expression that identifies all of the times that a three-letter word that ends with “at” appears in sample.

d. Repeat the two prior experiments using the excerpt from your book.

e. Repeat those two experiments using the whole book. How long does it seem to take?

f. Write an expression that identifies all of the times that an adjective and the word “fish” appear together in sample.

g. Write an expression that identifies all of the two-word sequences in sample that begin with the word “one”.

h. Check similar uses of “one” in your sample.

Exercise 8: Exploring tags

We often use #px"<[^>]*>" as a reasonable, but not perfect, regular expression for matching an HTML tag. Let’s see what we can do with that as a pattern.
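To see the pattern at work before applying it to a file, here is a minimal sketch on two made-up HTML fragments. The names html, found, and cut-short are hypothetical and used only for illustration.

```racket
#lang racket
;; A made-up HTML fragment, used only for illustration.
(define html "<p>Hello, <em>world</em>!</p>")

;; Each match starts at a < and runs through the next >.
(define found (regexp-match* #px"<[^>]*>" html))
found      ; '("<p>" "<em>" "</em>" "</p>")

;; One reason the pattern is not perfect: a > inside an
;; attribute value ends the match too early.
(define cut-short
  (regexp-match* #px"<[^>]*>" "<img alt=\"a > b\" src=\"x.png\">"))
cut-short  ; '("<img alt=\"a >")
```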

a. If you don’t have a Web site from the lab on HTML and CSS, create one by opening a terminal window and typing the following.

$ /home/rebelsky/bin/csc151-setup-web

b. As you may recall, your HTML directory contains a file called thingy.html. (The full path name is /home/username/public_html/thingy.html.) Write an expression that extracts all of the tags from that file.

c. Write a procedure, (tags filename), that extracts all the tags from the given file. Check it using the same file you worked with in the prior step.

d. The procedure (regexp-replace* regexp string replacement) replaces all instances of a regular expression with a replacement. Write an expression to read thingy.html and remove all tags from the resulting text.

e. Write a procedure, (notags filename), that extracts the contents of the given file, but without any tags.

Exercise 9: Fancier replacements

You may recall that when you use (regexp-replace* regexp string replacement), a \\1 in the replacement stands for the text matched by the first parenthesized expression in regexp. Similarly, \\2 stands for the text matched by the second parenthesized expression, and so on. Let’s try that out.
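As a small warm-up, the following sketch uses a pattern with two parenthesized expressions and swaps them in the replacement. The pattern and the input string are made up for illustration, and the name swapped is hypothetical.

```racket
#lang racket
;; Group 1 is the word before the @; group 2 is the word after it.
;; In the replacement, \\2 and \\1 refer back to those groups.
(define swapped
  (regexp-replace* #px"(\\w+)@(\\w+)" "alice@example bob@test" "\\2 at \\1"))
swapped  ; "example at alice test at bob"
```

Because this uses regexp-replace*, the replacement is applied separately to each match, with \\1 and \\2 taking their values from that match.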

a. What output do you expect for the following?

> (regexp-replace* #px"([a-z][a-z][a-z])" "the cat" "\\1\\2\\3\\2\\1")
?

b. Check your answer experimentally.

c. What output do you expect for the following?

> (regexp-replace* #px"([a-z][a-z][a-z])" "catastrophe" "\\1\\2\\3\\2\\1 ")
?

d. Check your answer experimentally.

e. Earlier, you identified all of the words that were preceded by the word “one”. Write an expression that replaces “one THING” by “some THINGs”. For example,

> (regexp-replace* #px"???" "one cat and one hat" "???")
"some cats and some hats"

Exercise 10: Analyzing texts

a. Write a procedure, (count-male-pronouns str), that takes a string as input and counts the number of male pronouns (“he”, “him”, “his”) in the string. (Be careful not to count words like “history”.)

b. Write a procedure, (count-female-pronouns str), that takes a string as input and counts the number of female pronouns (“she”, “her”, “hers”) in the string.

c. Using your two procedures, determine whether the book you chose is more likely to use male pronouns or female pronouns.

d. Write a procedure (this-and-that str) that finds all sequences of the form “WORD and WORD” in the string.

For those with extra time

If you find that you have extra time, consider trying one or more of the following exercises.

Extra 1: Matching the start or end of a string

At times, we want to restrict our pattern to the start or end of a string (or start or end of a line).
Check the Racket reference on regular expressions to determine how to match start and end.

Extra 2: Matching unicode categories

You may have observed that our regular expressions are biased toward alphabetic characters from American English. In this instance, as in too many situations related to computers, the original designers did not adequately consider the needs of those who speak other languages.

> (regexp-match* #px"\\w" "a b c á ñ d")
'("a" "b" "c" "d")
> (regexp-match* #px"[a-z]" "a b c á ñ d")
'("a" "b" "c" "d")

Fortunately, Racket’s regular expressions do provide ways to match according to more general Unicode classifications.
Explore the Racket reference on regular expressions and figure out how to do so.

> (regexp-match* #px"???" "a b c á ñ d")
'("a" "b" "c" "á" "ñ" "d")

Extra 3: Extracting words, revisited

Rewrite the string->words procedure so that it works with any sequence of letters, not just US letters.