As we’ve already discovered, Racket provides a variety of tools for working with strings and lists of strings. We can extract substrings, split strings, count strings, and more.
But we’ve encountered one significant obstacle: sometimes instead of a particular string, we want to work with patterns of characters potentially contained within a string. For example:
As you might expect, needing to express patterns in strings is a fairly common task in computing. Some years ago, the Mathematician Stephen Kleene invented a notation, which he called “Regular Expressions”. (Fun fact: Kleene is the great-grand-advisor of Professor Rebelsky in CS.) Most modern programming languages now provide some version of regular expressions.
We can use regular expressions in Racket in a wide variety of ways, including splitting strings (with string-split), identifying parts of a string (with regexp-match and regexp-match*), and replacing parts of a string (with regexp-replace and regexp-replace*).
In this reading, we cover the basics of regular expressions and their associated library support in Racket.
For reference, the Racket documentation for regular expressions can be found here:
Regular expressions constitute a domain-specific language (DSL). In contrast to a general-purpose programming language like Racket, a DSL is a smaller language built for a particular task. Here, the task is specifying patterns of characters that appear in strings.
Because of this, it is natural to express such patterns using strings themselves.
For example, the following string expresses the pattern of characters #\h, #\e, #\l, #\l, #\o in sequence:
"hello"
Of course, this is just the string hello!
If we wish to express the pattern of seeing this series of characters in sequence, we can just replicate the string:
"hellohello"
But what if we want to capture the pattern of zero or more repetitions of the word "hello"?
We clearly can’t specify that with just our literal string syntax!
Regular expressions enrich strings by introducing a syntax and semantics for expressing these patterns.
For example, the following string captures the pattern of zero or more repetitions of "hello".
"(hello)*"
As we shall see shortly, * in a regular expression doesn’t mean the literal #\* character; it means “zero or more occurrences!”
In this sense, regular expressions are a little language embedded within Racket.
Like learning a new language, you’ll need to become acquainted with this syntax in addition to the quirks involved with embedding this language within Racket’s string type.
However, what is important to begin digesting as you read through these basics is the sorts of primitive patterns that regular expressions give you.
Like Racket’s basic types for data, these patterns form the basis for any sort of complicated character pattern you wish to capture.
From a regular expression string, we can create a genuine regular expression using the pregexp function:
> (pregexp "(hello)*")
#px"(hello)*"
The resulting value has type regular expression.
In fact, rather than using pregexp, you will likely find it more convenient to use the pregexp literal syntax instead:
> #px"(hello)*"
#px"(hello)*"
In other words, rather than calling pregexp, you can prepend a regular expression string with #px to specify a regular expression value directly.
regexp versus pregexp
It turns out that Racket provides an alternative variant of regular expression using the regexp function and #rx literal form.
This version of regular expressions is simpler, but missing some useful features that greatly simplify the regular expressions that we write.
Because of this, we will use pregexp/#px exclusively in this course.
If you peruse the documentation, you will note that most of the examples use #rx rather than #px regular expressions.
In most cases, you can safely substitute #px for #rx in the examples.
The only times you will likely encounter errors are when the regular expression string contains special characters such as curly braces (‘{‘), (‘}’) and backslash (‘').
In these cases, you will need to escape the characters with a backslash which is problematic because strings themselves use backslash as an escape character!
We won’t dive further into these complexities in this introductory reading, but you should be aware of them as you move forward with regular expressions.
We’ll explore these difficulties some more in our associated lab.
Almost any letter or sequence of letters can be used as a regular expression. For example, the following expression extracts all instances of “or” from a string.
> (regexp-match* #px"or" "The vorpal sword went snicker-snack.")
'("or" "or")
A period matches any letter. We can use that with sequences of
exact letters to match all four-letter sequences that end in ck
with the pattern #px"..ck" (any letter, any letter, c, k).
> (regexp-match* #px"..ck" "The vorpal sword went snicker-snack.")
'("nick" "nack")
There are a few “backslash patterns” that match certain groups of
letters. \d is digits (0, 1, 2, …, 9). \D is anything other
than digits. \w is letters, digits, and underscore. \W is
anything not in \w. \s is a whitespace character (space, tab,
newline, formfeed, or return) and S is any non-space character.
You’ll explore most of these in lab. When you write these in a
regular expression, you use two backslashes.
> (regexp-match* #px".sn..." "The vorpal sword went snicker-snack")
'(" snick" "-snack")
> (regexp-match* #px"\\ssn..." "The vorpal sword went snicker-snack")
'(" snick")
The simplest non-trival regular expressions are sets of characters,
denoted by a square left bracket, the characters, and a square right
bracket. For example, the regular expression #px"[aeiou]" is a
shorthand for “a or e or i or o or u” and therefore matches any
lowercase vowel.
> (string-split "The vorpal sword went snicker-snack" #px"[aeiou]")
'("Th" " v" "rp" "l sw" "rd w" "nt sn" "ck" "r-sn" "ck")
One can also use ranges in expressing sets of characters, with a
dash in-between the start and end of the range. #px"[a-e]" represents
“a or b or c or d or e”.
> (string-split "The vorpal sword went snicker-snack" #px"[a-e]")
'("Th" " vorp" "l swor" " w" "nt sni" "k" "r-sn" "" "k")
One common use of sets is to split strings at punctuation marks. For
example, we might use #px"[,.;:?!]" to represent any of comma or
period or semicolon or colon or question mark or exclamation point.
> (string-split "I am Sam. Sam I am. Do I, perhaps, like green eggs and ham?" #px"[,.;:?!]")
'("I am Sam" " Sam I am" " Do I" " perhaps" " like green eggs and ham")
If you put a caret (^) at the start of the set, the meaning changes
to “anything other than these characters”. Since #px"[aeiou ]" means
“a or e or i or o or u or space”, #px"[^aeiou ]" means “any single
letter other than a, e, i, o, u , or space”.
> (regexp-replace* #px"[^aeiou ]" "The vorpal sword went snicker-snack" "_")
"__e _o__a_ __o__ _e__ __i__e____a__"
In some cases, we want expressions that repeat a pattern.
A regular expression with a star (*) at
the end represents “0 or more repetitions of the expression”.
A regular expression with a plus sign (+) at the end
represents “1 or more repetitions of the expression”.
So #px"a+" means “a or aa or aaa or aaaa or aaaaa or ….”.
Similarly, #px"a*" means “the empty string or a or aa or aaa or
aaaa or aaaaa or ….”.
> (regexp-match* #px"a+" "aaah! amazing aardvarks alphabetize")
'("aaa" "a" "a" "aa" "a" "a" "a")
Since the asterisk can match no copies, it can produce some very strange results.
> (regexp-match* #px"a*" "aaah! amazing aardvarks alphabetize")
'("aaa" "" "" "" "" "a" "" "a" "" "" "" "" "" "aa" "" "" "" "a" "" "" "" "" "a" "" "" "" "a" "" "" "" "" "" "" "")
> (string-split "aaah! amazing aardvarks alphabetize" #px"a*")
'("" "h" "!" " " " " "" "m" "" "z" "i" "n" "g" " " "" "r" "d" "v" "" "r" "k" "s" " " "" "l" "p" "h" "" "b" "e" "t" "i" "z" "e" "")
Remember: #px"a*" means “0 or more a’s in row”. As the second
example suggests since there are, say, no a’s between “i”, “n”, and
“g” in “amazing”, it needs to split between each of those.
More frequently, we will use + and * with more complex regular
expressions. For example, #px"[a-z]+" means “a sequence of one
or more lowercase letters”, #px"a[a-z]*a" means “a sequence
of lowercase letters that starts with an a and ends with an a”,
and #px"a[a-z]+a" means “a sequence of at least three lowercase
letters that starts with an a and ends with an a”.
> (regexp-match* #px"[a-z]+" "aah! three (3) amazing aardvarks alphabetize")
'("aah" "three" "amazing" "aardvarks" "alphabetize")
> (regexp-match* #px"a[a-z]*a" "aah! three (3) amazing aardvarks alphabetize")
'("aa" "ama" "aardva" "alpha")
> (regexp-match* #px"a[a-z]+a" "aah! three (3) amazing aardvarks alphabetize")
'("ama" "aardva" "alpha")
Sometimes we want ot match any of a few words, such as eggs, ham,
and fish. The alternation symbol, |, allows us to write express
“either of these regular expressions”.
> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "thing")
"One thing, two thing, red thing, blue thing eat thing and thing, or so says the hatted cat."
You can safely put parentheses around a regular expression for clear
grouping and for things like the * expression and + expressions. We’ll
see some other uses of parentheses in a bit. For example,
while #px"a[a-z]+" means “a sequence of lowercase letters that
starts with an “a” and has at least one more letter”, #px"(a[a-z])+"
means “a sequence of lowercase letters in which a appears every-other
letter”.
> (regexp-match* #px"a[a-z]+" "aah! amazing aardvarks alphabetize with abandon")
'("aah" "amazing" "aardvarks" "alphabetize" "abandon")
> (regexp-match* #px"(a[a-z])+" "aah! amazing aardvarks alphabetize with abandon")
'("aa" "amaz" "aa" "ar" "al" "ab" "aban")
In the examples above, we’ve used three of the five basic operations we can perform with regular expressions. Let’s review the structure of each
The expression (string-split string regexp) splits a string at each
instance of the regular expression, resulting in a list of strings.
> (string-split "One and two and three" #px" +")
'("One" "and" "two" "and" "three")
> (string-split "One and two and three" #px"[aeiou]")
'("On" " " "nd tw" " " "nd thr" "")
The expression (regexp-match* regexp string) identifies each instance
of the regular expression in the string and returns them in list form.
> (regexp-match* #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp" "word" "morb" "worl")
The expression (regexp-match regexp string) identifies the first
instance of the regular expression in the string and returns it as a
list of one string.
> (regexp-match #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp")
The expression (regexp-replace* regexp string replacement)
identifies each copy of regexp in the given string and replaces
it with replacement. If replacement is a string that contains
\\1 (or \\2 or …), we use the appropriate parenthesized part
of the the original. If replacement is a procedure, that procedure
is called on the matches. (If the matches include parentheses, it’s
a bit more complicated.)
> (regexp-replace* #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1")
"One fishfish, two fishfish, red fishfish, blue fishfish eat eggseggs and hamham, or so says the hatted cat."
> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two FISH, red FISH, blue FISH eat EGGS and HAM, or so says the hatted cat."
The expression (regexp-replace regexp string replacement) behaves
much like regexp-replace*, except that it only replaces the first
instance of the pattern. For this version, the replacement can also be
a procedure, in which case the procedure is applied to the matching
text.
> (regexp-replace #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1\\1")
"One fishfishfish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."
> (regexp-replace #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."
People with the name “Jared” often encounter various “alternate spellings” of their name, including “Jered”, “Jerod”, and “Jarod”. Write two regular expressions we could use to match any of those four spellings.
One common analysis of a text is the ratio of male pronouns to female pronouns in a text.
Suppose we want to identify the number of times that the words “she” and “her” appear in a string and we do not want to count the number of times they appear in other words (e.g., “sheet” and “there”).
How could you use regular expressions to count those appearances?
Ideas and information were taken from a section of the Racket reference on regular expressions and a section of the Racket guideon Regular Expressions.