Regular expressions

Summary
We explore regular expressions in Racket which allows us to search for patterns of text in a structured, concise manner.

Introduction

As we’ve already discovered, Racket provides a variety of tools for working with strings and lists of strings. We can extract substrings, split strings, count strings, and more.

But we’ve encountered one significant obstacle: sometimes instead of a particular string, we want to work with patterns of characters potentially contained within a string. For example:

  • We want to determine whether a string contains a book title which will be written using title case (first letter capitalized of each word).
  • We want to find all occurrences of “??” embedded within a string literal (i.e., in a block of text surrounded by quotes) and replace them with a string of our choice.
  • We want to break up a string that represents a date into chunks, noting that sometimes people use “-“, “/”, or even “.” as the delimiter between day, month, and year.

As you might expect, needing to express patterns in strings is a fairly common task in computing. Some years ago, the Mathematician Stephen Kleene invented a notation, which he called “Regular Expressions”. (Fun fact: Kleene is the great-grand-advisor of Professor Rebelsky in CS.) Most modern programming languages now provide some version of regular expressions.

We can use regular expressions in Racket in a wide variety of ways, including splitting strings (with string-split), identifying parts of a string (with regexp-match and regexp-match*), and replacing parts of a string (with regexp-replace and regexp-replace*). In this reading, we cover the basics of regular expressions and their associated library support in Racket.

For reference, the Racket documentation for regular expressions can be found here:

Regular expression basics

Regular expressions constitute a domain-specific language (DSL). In contrast to a general-purpose programming language like Racket, a DSL is a smaller language built for a particular task. Here, the task is specifying patterns of characters that appear in strings.

Because of this, it is natural to express such patterns using strings themselves. For example, the following string expresses the pattern of characters #\h, #\e, #\l, #\l, #\o in sequence:

"hello"

Of course, this is just the string hello! If we wish to express the pattern of seeing this series of characters in sequence, we can just replicate the string:

"hellohello"

But what if we want to capture the pattern of zero or more repetitions of the word "hello"? We clearly can’t specify that with just our literal string syntax! Regular expressions enrich strings by introducing a syntax and semantics for expressing these patterns. For example, the following string captures the pattern of zero or more repetitions of "hello".

"(hello)*"

As we shall see shortly, * in a regular expression doesn’t mean the literal #\* character; it means “zero or more occurrences!” In this sense, regular expressions are a little language embedded within Racket. Like learning a new language, you’ll need to become acquainted with this syntax in addition to the quirks involved with embedding this language within Racket’s string type. However, what is important to begin digesting as you read through these basics is the sorts of primitive patterns that regular expressions give you. Like Racket’s basic types for data, these patterns form the basis for any sort of complicated character pattern you wish to capture.

Constructing regular expressions

From a regular expression string, we can create a genuine regular expression using the pregexp function:

> (pregexp "(hello)*")
#px"(hello)*"

The resulting value has type regular expression. In fact, rather than using pregexp, you will likely find it more convenient to use the pregexp literal syntax instead:

> #px"(hello)*"
#px"(hello)*"

In other words, rather than calling pregexp, you can prepend a regular expression string with #px to specify a regular expression value directly.

regexp versus pregexp

It turns out that Racket provides an alternative variant of regular expression using the regexp function and #rx literal form. This version of regular expressions is simpler, but missing some useful features that greatly simplify the regular expressions that we write. Because of this, we will use pregexp/#px exclusively in this course.

If you peruse the documentation, you will note that most of the examples use #rx rather than #px regular expressions. In most cases, you can safely substitute #px for #rx in the examples. The only times you will likely encounter errors are when the regular expression string contains special characters such as curly braces (‘{‘), (‘}’) and backslash (‘'). In these cases, you will need to escape the characters with a backslash which is problematic because strings themselves use backslash as an escape character! We won’t dive further into these complexities in this introductory reading, but you should be aware of them as you move forward with regular expressions. We’ll explore these difficulties some more in our associated lab.

Exact letters

Almost any letter or sequence of letters can be used as a regular expression. For example, the following expression extracts all instances of “or” from a string.

> (regexp-match* #px"or" "The vorpal sword went snicker-snack.")
'("or" "or")

Any letter

A period matches any letter. We can use that with sequences of exact letters to match all four-letter sequences that end in ck with the pattern #px"..ck" (any letter, any letter, c, k).

> (regexp-match* #px"..ck" "The vorpal sword went snicker-snack.")
'("nick" "nack")

Groups of letters

There are a few “backslash patterns” that match certain groups of letters. \d is digits (0, 1, 2, …, 9). \D is anything other than digits. \w is letters, digits, and underscore. \W is anything not in \w. \s is a whitespace character (space, tab, newline, formfeed, or return) and S is any non-space character. You’ll explore most of these in lab. When you write these in a regular expression, you use two backslashes.

> (regexp-match* #px".sn..." "The vorpal sword went snicker-snack")
'(" snick" "-snack")
> (regexp-match* #px"\\ssn..." "The vorpal sword went snicker-snack")
'(" snick")

Sets

The simplest non-trival regular expressions are sets of characters, denoted by a square left bracket, the characters, and a square right bracket. For example, the regular expression #px"[aeiou]" is a shorthand for “a or e or i or o or u” and therefore matches any lowercase vowel.

> (string-split "The vorpal sword went snicker-snack" #px"[aeiou]")
'("Th" " v" "rp" "l sw" "rd w" "nt sn" "ck" "r-sn" "ck")

One can also use ranges in expressing sets of characters, with a dash in-between the start and end of the range. #px"[a-e]" represents “a or b or c or d or e”.

> (string-split "The vorpal sword went snicker-snack" #px"[a-e]")
'("Th" " vorp" "l swor" " w" "nt sni" "k" "r-sn" "" "k")

One common use of sets is to split strings at punctuation marks. For example, we might use #px"[,.;:?!]" to represent any of comma or period or semicolon or colon or question mark or exclamation point.

> (string-split "I am Sam.  Sam I am.  Do I, perhaps, like green eggs and ham?" #px"[,.;:?!]")
'("I am Sam" "  Sam I am" "  Do I" " perhaps" " like green eggs and ham")

Anti-sets

If you put a caret (^) at the start of the set, the meaning changes to “anything other than these characters”. Since #px"[aeiou ]" means “a or e or i or o or u or space”, #px"[^aeiou ]" means “any single letter other than a, e, i, o, u , or space”.

> (regexp-replace* #px"[^aeiou ]" "The vorpal sword went snicker-snack" "_")
"__e _o__a_ __o__ _e__ __i__e____a__"

Repeating patterns

In some cases, we want expressions that repeat a pattern. A regular expression with a star (*) at the end represents “0 or more repetitions of the expression”. A regular expression with a plus sign (+) at the end represents “1 or more repetitions of the expression”. So #px"a+" means “a or aa or aaa or aaaa or aaaaa or ….”. Similarly, #px"a*" means “the empty string or a or aa or aaa or aaaa or aaaaa or ….”.

> (regexp-match* #px"a+" "aaah!  amazing aardvarks alphabetize")
'("aaa" "a" "a" "aa" "a" "a" "a")

Since the asterisk can match no copies, it can produce some very strange results.

> (regexp-match* #px"a*" "aaah!  amazing aardvarks alphabetize")
'("aaa" "" "" "" "" "a" "" "a" "" "" "" "" "" "aa" "" "" "" "a" "" "" "" "" "a" "" "" "" "a" "" "" "" "" "" "" "")
> (string-split  "aaah!  amazing aardvarks alphabetize" #px"a*")
'("" "h" "!" " " " " "" "m" "" "z" "i" "n" "g" " " "" "r" "d" "v" "" "r" "k" "s" " " "" "l" "p" "h" "" "b" "e" "t" "i" "z" "e" "")

Remember: #px"a*" means “0 or more a’s in row”. As the second example suggests since there are, say, no a’s between “i”, “n”, and “g” in “amazing”, it needs to split between each of those.

More frequently, we will use + and * with more complex regular expressions. For example, #px"[a-z]+" means “a sequence of one or more lowercase letters”, #px"a[a-z]*a" means “a sequence of lowercase letters that starts with an a and ends with an a”, and #px"a[a-z]+a" means “a sequence of at least three lowercase letters that starts with an a and ends with an a”.

> (regexp-match* #px"[a-z]+" "aah! three (3) amazing aardvarks alphabetize")
'("aah" "three" "amazing" "aardvarks" "alphabetize")
> (regexp-match* #px"a[a-z]*a" "aah! three (3) amazing aardvarks alphabetize")
'("aa" "ama" "aardva" "alpha")
> (regexp-match* #px"a[a-z]+a" "aah! three (3) amazing aardvarks alphabetize")
'("ama" "aardva" "alpha")

Allowing multiple patterns

Sometimes we want ot match any of a few words, such as eggs, ham, and fish. The alternation symbol, |, allows us to write express “either of these regular expressions”.

> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "thing")
"One thing, two thing, red thing, blue thing eat thing and thing, or so says the hatted cat."

Parenthesized expressions

You can safely put parentheses around a regular expression for clear grouping and for things like the * expression and + expressions. We’ll see some other uses of parentheses in a bit. For example, while #px"a[a-z]+" means “a sequence of lowercase letters that starts with an “a” and has at least one more letter”, #px"(a[a-z])+" means “a sequence of lowercase letters in which a appears every-other letter”.

> (regexp-match* #px"a[a-z]+" "aah! amazing aardvarks alphabetize with abandon")
'("aah" "amazing" "aardvarks" "alphabetize" "abandon")
> (regexp-match* #px"(a[a-z])+" "aah! amazing aardvarks alphabetize with abandon")
'("aa" "amaz" "aa" "ar" "al" "ab" "aban")

Procedures that use regular expressions, revisited

In the examples above, we’ve used three of the five basic operations we can perform with regular expressions. Let’s review the structure of each

The expression (string-split string regexp) splits a string at each instance of the regular expression, resulting in a list of strings.

> (string-split "One   and two   and    three" #px" +")
'("One" "and" "two" "and" "three")
> (string-split "One   and two   and    three" #px"[aeiou]")
'("On" "   " "nd tw" "   " "nd    thr" "")

The expression (regexp-match* regexp string) identifies each instance of the regular expression in the string and returns them in list form.

> (regexp-match* #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp" "word" "morb" "worl")

The expression (regexp-match regexp string) identifies the first instance of the regular expression in the string and returns it as a list of one string.

> (regexp-match #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp")

The expression (regexp-replace* regexp string replacement) identifies each copy of regexp in the given string and replaces it with replacement. If replacement is a string that contains \\1 (or \\2 or …), we use the appropriate parenthesized part of the the original. If replacement is a procedure, that procedure is called on the matches. (If the matches include parentheses, it’s a bit more complicated.)

> (regexp-replace* #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1")
"One fishfish, two fishfish, red fishfish, blue fishfish eat eggseggs and hamham, or so says the hatted cat."
> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two FISH, red FISH, blue FISH eat EGGS and HAM, or so says the hatted cat."

The expression (regexp-replace regexp string replacement) behaves much like regexp-replace*, except that it only replaces the first instance of the pattern. For this version, the replacement can also be a procedure, in which case the procedure is applied to the matching text.

> (regexp-replace #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1\\1")
"One fishfishfish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."
> (regexp-replace #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."

Self checks

Check 1: Alternate spellings (‡)

People with the name “Jared” often encounter various “alternate spellings” of their name, including “Jered”, “Jerod”, and “Jarod”. Write two regular expressions we could use to match any of those four spellings.

Check 2: Counting gendered words

One common analysis of a text is the ratio of male pronouns to female pronouns in a text.

Suppose we want to identify the number of times that the words “she” and “her” appear in a string and we do not want to count the number of times they appear in other words (e.g., “sheet” and “there”).

How could you use regular expressions to count those appearances?

Acknowledgements

Ideas and information were taken from a section of the Racket reference on regular expressions and a section of the Racket guideon Regular Expressions.