Regular expressions and pattern matching

Due: Wednesday, 6 February 2019
Summary: We explore pattern matching in Racket, particularly the use of what are called regular expressions to express general patterns.

Introduction

As we’ve already discovered, Racket provides a variety of tools for working with strings and lists of strings. We can extract substrings, split strings, count strings, and more.

But we’ve encountered one significant obstacle: Sometimes instead of a particular string, we want to work with a “pattern” of sorts, such as “any vowel” rather than “a”, or “any sequence of non-alphanumeric characters” rather than “a space”.

As you might expect, needing to express patterns in strings is a fairly common task in computings. Some years ago, the Mathematician Stephen Kleene invented a notation, which he called “Regular Expressions”. (Fun fact: Kleene is the great-grand-advisor of Professor Rebelsky in CS.) Most modern programming languages now provide some version of regular expressions.

We can use regular expressions in Racket in a wide variety of ways, including splitting strings (with string-split), identifying parts of a string (with regexp-match and regexp-match*), and replacing parts of a string (with regexp-replace and regexp-replace*).

Expressing regular expression basics

Racket provides two kinds of regular expressions, some denoted by #px"EXPRESSION" and some by #rx"EXPRESSION". The details of the differences are not important at this stage and we will focus primarily on those that use #px syntax.

There are a wide variety of kinds of regular expressions available to you.

Exact letters

Almost any letter or sequence of letters can be used as a regular expression. For example, the following expression extracts all instances of “or” from a string.

> (regexp-match* #px"or" "The vorpal sword went snicker-snack.")
'("or" "or")

Any letter

A period matches any letter. We can use that with sequences of exact letters to match all four-letter sequences that end in ck with the pattern #px"..ck" (any letter, any letter, c, k).

> (regexp-match* #px"..ck" "The vorpal sword went snicker-snack.")
'("nick" "nack")

Groups of letters

There are a few “backslash patterns” that match certain groups of letters. \d is digits (0, 1, 2, …, 9). \D is anything other than digits. \w is letters, digits, and underscore. \W is anything not in \w. \s is a whitespace character (space, tab, newline, formfeed, or return) and S is any non-space character. You’ll explore most of these in lab. When you write these in a regular expression, you use two backslashes.

> (regexp-match* #px".sn..." "The vorpal sword went snicker-snack")
'(" snick" "-snack")
> (regexp-match* #px"\\ssn..." "The vorpal sword went snicker-snack")
'(" snick")

Sets

The simplest non-trival regular expressions are sets of characters, denoted by a square left bracket, the characters, and a square right bracket. For example, the regular expression #px"[aeiou]" is a shorthand for “a or e or i or o or u” and therefore matches any lowercase vowel.

> (string-split "The vorpal sword went snicker-snack" #px"[aeiou]")
'("Th" " v" "rp" "l sw" "rd w" "nt sn" "ck" "r-sn" "ck")

One can also use ranges in expressing sets of characters, with a dash in-between the start and end of the range. #px"[a-e]" represents “a or b or c or d or e”.

> (string-split "The vorpal sword went snicker-snack" #px"[a-e]")
'("Th" " vorp" "l swor" " w" "nt sni" "k" "r-sn" "" "k")

One common use of sets is to split strings at punctuation marks. For example, we might use #px"[,.;:?!]" to represent any of comma or period or semicolon or colon or question mark or exclamation point.

> (string-split "I am Sam.  Sam I am.  Do I, perhaps, like green eggs and ham?" #px"[,.;:?!]")
'("I am Sam" "  Sam I am" "  Do I" " perhaps" " like green eggs and ham")

Anti-sets

If you put a caret (^) at the start of the set, the meaning changes to “anything other than these characters”. Since #px"[aeiou ]" means “a or e or i or o or u or space”, #px"[^aeiou ]" means “any single letter other than a, e, i, o, u , or space”.

> (regexp-replace* #px"[^aeiou ]" "The vorpal sword went snicker-snack" "_")
"__e _o__a_ __o__ _e__ __i__e____a__"

Repeating patterns

In some cases, we want expressions that repeat a pattern. A regular expression with a star (*) at the end represents “0 or more repetitions of the expression”. A regular expression with a plus sign (+) at the end represents “1 or more repetitions of the expression”. So #px"a+" means “a or aa or aaa or aaaa or aaaaa or ….”. Similarly, #px"a*" means “the empty string or a or aa or aaa or aaaa or aaaaa or ….”.

> (regexp-match* #px"a+" "aaah!  amazing aardvarks alphabetize")
'("aaa" "a" "a" "aa" "a" "a" "a")

Since the asterisk can match no copies, it can produce some very strange results.

> (regexp-match* #px"a*" "aaah!  amazing aardvarks alphabetize")
'("aaa" "" "" "" "" "a" "" "a" "" "" "" "" "" "aa" "" "" "" "a" "" "" "" "" "a" "" "" "" "a" "" "" "" "" "" "" "")
> (string-split  "aaah!  amazing aardvarks alphabetize" #px"a*")
'("" "h" "!" " " " " "" "m" "" "z" "i" "n" "g" " " "" "r" "d" "v" "" "r" "k" "s" " " "" "l" "p" "h" "" "b" "e" "t" "i" "z" "e" "")

Remember: #px"a*" means “0 or more a’s in row”. As the second example suggests since there are, say, no a’s between “i”, “n”, and “g” in “amazing”, it needs to split between each of those.

More frequently, we will use + and * with more complex regular expressions. For example, #px"[a-z]+" means “a sequence of one or more lowercase letters”, #px"a[a-z]*a" means “a sequence of lowercase letters that starts with an a and ends with an a”, and #px"a[a-z]+a" means “a sequence of at least three lowercase letters that starts with an a and ends with an a”.

> (regexp-match* #px"[a-z]+" "aah! three (3) amazing aardvarks alphabetize")
'("aah" "three" "amazing" "aardvarks" "alphabetize")
> (regexp-match* #px"a[a-z]*a" "aah! three (3) amazing aardvarks alphabetize")
'("aa" "ama" "aardva" "alpha")
> (regexp-match* #px"a[a-z]+a" "aah! three (3) amazing aardvarks alphabetize")
'("ama" "aardva" "alpha")

Allowing multiple patterns

Sometimes we want ot match any of a few words, such as eggs, ham, and fish. The alternation symbol, |, allows us to write express “either of these regular expressions”.

> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "thing")
"One thing, two thing, red thing, blue thing eat thing and thing, or so says the hatted cat."

Parenthesized expressions

You can safely put parentheses around a regular expression for clear grouping and for things like the * expression and + expressions. We’ll see some other uses of parentheses in a bit. For example, while #px"a[a-z]+" means “a sequence of lowercase letters that starts with an a and has at least one more letter”, #px"(a[a-z])+" means “a sequence of lowercase letters in which a appears every-other letter”.

> (regexp-match* #px"a[a-z]+" "aah! amazing aardvarks alphabetize with abandon")
'("aah" "amazing" "aardvarks" "alphabetize" "abandon")
> (regexp-match* #px"(a[a-z])+" "aah! amazing aardvarks alphabetize with abandon")
'("aa" "amaz" "aa" "ar" "al" "ab" "aban")

Procedures that use regular expressions, revisited

In the examples above, we’ve used three of the five basic operations we can perform with regular expressions. Let’s review the structure of each

The expression (string-split string regexp) splits a string at each instance of the regular expression, resulting in a list of strings.

> (string-split "One   and two   and    three" #px" +")
'("One" "and" "two" "and" "three")
> (string-split "One   and two   and    three" #px"[aeiou]")
'("On" "   " "nd tw" "   " "nd    thr" "")

The expression (regexp-match* regexp string) identifies each instance of the regular expression in the string and returns them in list form.

> (regexp-match* #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp" "word" "morb" "worl")

The expression (regexp-match regexp string) identifies the first instance of the regular expression in the string and returns it as a list of one string.

> (regexp-match #px".or." "Is there a vorpal sword in this morbid world?")
'("vorp")

The expression (regexp-replace* regexp string replacement) identifies each copy of regexp in the given string and replaces it with replacement. If replacement is a string that contains \\1 (or \\2 or …), we use the appropriate parenthesized part of the the original. If replacement is a procedure, that procedure is called on the matches. (If the matches include parentheses, it’s a bit more complicated.)

> (regexp-replace* #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1")
"One fishfish, two fishfish, red fishfish, blue fishfish eat eggseggs and hamham, or so says the hatted cat."
> (regexp-replace* #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two FISH, red FISH, blue FISH eat EGGS and HAM, or so says the hatted cat."

The expression (regexp-replace regexp string replacement) behaves much like regexp-replace*, except that it only replaces the first instance of the pattern. For this version, the replacement can also be a procedure, in which case the procedure is applied to the matching text.

> (regexp-replace #px"(eggs|ham|fish)" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." "\\1\\1\\1")
"One fishfishfish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."
> (regexp-replace #px"eggs|ham|fish" "One fish, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat." string-upcase)
"One FISH, two fish, red fish, blue fish eat eggs and ham, or so says the hatted cat."

Self checks

Check 1: Alternate spellings

People with the name “Jared” often encounter various “alternate spellings” of their name, including “Jered”, “Jerod”, and “Jarod”. Write two regular expressions we could use to match any of those four spellings.

Check 2: Counting gendered words

One common analysis of a text is the ratio of male pronouns to female pronouns in a text.

Suppose we want to identify the number of times that the words “she” and “her” appear in a string and we do not want to count the number of times they appear in other words (e.g., “sheet” and “there”).

How could you use regular expressions to count those appearances?

Acknowledgements

Ideas and information were taken from a section of the Racket reference on regular expressions and a section of the Racket guideon Regular Expressions.

Copyright © Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Fahmida Hamid, Titus Klinge, Samuel A. Rebelsky, and Jerod Weinman. Selected materials are copyright by John David Stone or Henry Walker and are used with permission.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

This website was built using Jekyll, Twitter Bootstrap, and the Bootswatch Cosmo Theme.