As we’ve already discovered, Racket provides a variety of tools for working with strings and lists of strings. We can extract substrings, split strings, count strings, and more.
But we’ve encountered one significant obstacle: Sometimes instead of a particular string, we want to work with a “pattern” of sorts, such as “any vowel” rather than “a”, or “any sequence of non-alphanumeric characters” rather than “a space”.
As you might expect, needing to express patterns in strings is a fairly common task in computings. Some years ago, the Mathematician Stephen Kleene invented a notation, which he called “Regular Expressions”. (Fun fact: Kleene is the great-grand-advisor of Professor Rebelsky in CS.) Most modern programming languages now provide some version of regular expressions.
We can use regular expressions in Racket in a wide variety of ways,
including splitting strings (with string-split
), identifying
parts of a string (with regexp-match
and regexp-match*
), and
replacing parts of a string (with regexp-replace
and
regexp-replace*
).
Racket provides two kinds of regular expressions, some denoted by
#px"EXPRESSION"
and some by #rx"EXPRESSION"
. The details of
the differences are not important at this stage and we will focus
primarily on those that use #px
syntax.
There are a wide variety of kinds of regular expressions available to you.
Almost any letter or sequence of letters can be used as a regular expression. For example, the following expression extracts all instances of “or” from a string.
> (regexp-match* #px"or" "The vorpal sword went snicker-snack.")
'("or" "or")
A period matches any letter. We can use that with sequences of
exact letters to match all four-letter sequences that end in ck
with the pattern #px"..ck"
(any letter, any letter, c, k).
> (regexp-match* #px"..ck" "The vorpal sword went snicker-snack.")
'("nick" "nack")
There are a few “backslash patterns” that match certain groups of
letters. \d
is digits (0, 1, 2, …, 9). \D
is anything other
than digits. \w
is letters, digits, and underscore. \W
is
anything not in \w
. \s
is a whitespace character (space, tab,
newline, formfeed, or return) and S
is any non-space character.
You’ll explore most of these in lab. When you write these in a
regular expression, you use two backslashes.
> (regexp-match* #px".sn..." "The vorpal sword went snicker-snack")
'(" snick" "-snack")
> (regexp-match* #px"\\ssn..." "The vorpal sword went snicker-snack")
'(" snick")
The simplest non-trival regular expressions are sets of characters,
denoted by a square left bracket, the characters, and a square right
bracket. For example, the regular expression #px"[aeiou]"
is a
shorthand for “a or e or i or o or u” and therefore matches any
lowercase vowel.
> (string-split "The vorpal sword went snicker-snack" #px"[aeiou]")
'("Th" " v" "rp" "l sw" "rd w" "nt sn" "ck" "r-sn" "ck")
One can also use ranges in expressing sets of characters, with a
dash in-between the start and end of the range. #px"[a-e]"
represents
“a or b or c or d or e”.
> (string-split "The vorpal sword went snicker-snack" #px"[a-e]")
'("Th" " vorp" "l swor" " w" "nt sni" "k" "r-sn" "" "k")
One common use of sets is to split strings at punctuation marks. For
example, we might use #px"[,.;:?!]"
to represent any of comma or
period or semicolon or colon or question mark or exclamation point.
> (string-split "I am Sam. Sam I am. Do I, perhaps, like green eggs and ham?" #px"[,.;:?!]")
'("I am Sam" " Sam I am" " Do I" " perhaps" " like green eggs and ham")
If you put a caret (^
) at the start of the set, the meaning changes
to “anything other than these characters”. Since #px"[aeiou ]"
means
“a or e or i or o or u or space”, #px"[^aeiou ]"
means “any single
letter other than a, e, i, o, u , or space”.
> (regexp-replace* #px"[^aeiou ]" "The vorpal sword went snicker-snack" "_")
"__e _o__a_ __o__ _e__ __i__e____a__"
In some cases, we want expressions that repeat a pattern.
A regular expression with a star (*
) at
the end represents “0 or more repetitions of the expression”.
A regular expression with a plus sign (+
) at the end
represents “1 or more repetitions of the expression”.
So #px"a+"
means “a or aa or aaa or aaaa or aaaaa or ….”.
Similarly, #px"a*"
means “the empty string or a or aa or aaa or
aaaa or aaaaa or ….”.
> (regexp-match* #px"a+" "aaah! amazing aardvarks alphabetize")
'("aaa" "a" "a" "aa" "a" "a" "a")
Since the asterisk can match no copies, it can produce some very strange results.
> (regexp-match* #px"a*" "aaah! amazing aardvarks alphabetize")
'("aaa" "" "" "" "" "a" "" "a" "" "" "" "" "" "aa" "" "" "" "a" "" "" "" "" "a" "" "" "" "a" "" "" "" "" "" "" "")
> (string-split "aaah! amazing aardvarks alphabetize" #px"a*")
'("" "h" "!" " " " " "" "m" "" "z" "i" "n" "g" " " "" "r" "d" "v" "" "r" "k" "s" " " "" "l" "p" "h" "" "b" "e" "t" "i" "z" "e" "")
Remember: #px"a*"
means “0 or more a’s in row”. As the second
example suggests since there are, say, no a’s between “i”, “n”, and
“g” in “amazing”, it needs to split between each of those.
More frequently, we will use +
and *
with more complex regular
expressions. For example, #px"[a-z]+"
means “a sequence of one
or more lowercase letters”, #px"a[a-z]*a"
means “a sequence
of lowercase letters that starts with an a and ends with an a”,
and #px"a[a-z]+a"
means “a sequence of at least three lowercase
letters that starts with an a and ends with an a”.
> (regexp-match* #px"[a-z]+" "aah! three (3) amazing aardvarks alphabetize")
'("aah" "three" "amazing" "aardvarks" "alphabetize")
> (regexp-match* #px"a[a-z]*a" "aah! three (3) amazing aardvarks alphabetize")
'("aa" "ama" "aardva" "alpha")
> (regexp-match* #px"a[a-z]+a" "aah! three (3) amazing aardvarks alphabetize")
'("ama" "aardva" "alpha")
Sometimes we want ot match any of a few words, such as eggs, ham,
and fish. The alternation symbol, |
, allows us to write express
“either of these regular expressions”.
> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "thing")
"One thing, two thing, red thing, blue thing eat thing and thing, or so says the hatted cat."
You can safely put parentheses around a regular expression for clear
grouping and for things like the *
expression and +
expressions. We’ll
see some other uses of parentheses in a bit. For example,
while #px"a[a-z]+"
means “a sequence of lowercase letters that
starts with an a and has at least one more letter”, #px"(a[a-z])+"
means “a sequence of lowercase letters in which a appears every-other
letter”.
> (regexp-match* #px"a[a-z]+" "aah! amazing aardvarks alphabetize with abandon")
'("aah" "amazing" "aardvarks" "alphabetize" "abandon")
> (regexp-match* #px"(a[a-z])+" "aah! amazing aardvarks alphabetize with abandon")
'("aa" "amaz" "aa" "ar" "al" "ab" "aban")
In the examples above, we’ve used three of the five basic operations we can perform with regular expressions. Let’s review the structure of each
The expression (string-split string regexp)
splits a string at each
instance of the regular expression, resulting in a list of strings.
> (string-split "One and two and three" #px" +")
'("One" "and" "two" "and" "three")
> (string-split "One and two and three" #px"[aeiou]")
'("On" " " "nd tw" " " "nd thr" "")
The expression (regexp-match* regexp string)
identifies each instance
of the regular expression in the string and returns them in list form.
> (regexp-match* #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp" "word" "morb" "worl")
The expression (regexp-match regexp string)
identifies the first
instance of the regular expression in the string and returns it as a
list of one string.
> (regexp-match #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp")
The expression (regexp-replace* regexp string replacement)
identifies each copy of regexp
in the given string and replaces
it with replacement
. If replacement
is a string that contains
\\1
(or \\2
or …), we use the appropriate parenthesized part
of the the original. If replacement
is a procedure, that procedure
is called on the matches. (If the matches include parentheses, it’s
a bit more complicated.)
> (regexp-replace* #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1")
"One fishfish, two fishfish, red fishfish, blue fishfish eat eggseggs and hamham, or so says the hatted cat."
> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two FISH, red FISH, blue FISH eat EGGS and HAM, or so says the hatted cat."
The expression (regexp-replace regexp string replacement)
behaves
much like regexp-replace*
, except that it only replaces the first
instance of the pattern. For this version, the replacement can also be
a procedure, in which case the procedure is applied to the matching
text.
> (regexp-replace #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1\\1")
"One fishfishfish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."
> (regexp-replace #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."
People with the name “Jared” often encounter various “alternate spellings” of their name, including “Jered”, “Jerod”, and “Jarod”. Write two regular expressions we could use to match any of those four spellings.
One common analysis of a text is the ratio of male pronouns to female pronouns in a text.
Suppose we want to identify the number of times that the words “she” and “her” appear in a string and we do not want to count the number of times they appear in other words (e.g., “sheet” and “there”).
How could you use regular expressions to count those appearances?
Ideas and information were taken from a section of the Racket reference on regular expressions and a section of the Racket guideon Regular Expressions.