# Class 07: Regular Expressions, Continued

Held: Wednesday, 4 February 2004

Summary: Today we continue our investigation of regular expressions.

Related Pages:

Overview:

• Simplifying Regular Expressions
• Common Shorthands
• Some Sample Regular Expressions
• Limitations
• Using Regular Expressions for Describing Tokens

## Simplifying Regular Expressions

• As you may have noted, we need a lot of parentheses in writing regular expressions.
• At one level, these parentheses are necessary to resolve ambiguities.
• For example, is RS|T really ((R)(S))|(T) or (R)((S)|(T))?
• Does it matter?
• Similarly, is RST really ((R)(S))(T) or (R)((S)(T))?
• Does it matter?
• Is RT* really ((R)(T))* or (R)((T)*)?
• Does it matter?
• However, we can use the same techniques as we use with simple algebra to avoid these parentheses: Assign precedence to operations.
• Kleene star has the highest precedence.
• Concatenation has the next highest precedence.
• Alternation has the lowest precedence.
• We also tend to elide the concatenations symbol, just as we elide the multiplication symbol.
• Some examples
• ab* is (a)((b)*) and not ((a)(b))*
• a|bc is (a)|((b)(c)) and not ((a)|(b))(c)
• When it's appropriate, we assume that the operations are left-associative.
• In RST, R and S bind first.

## Regular Expression Shorthands

• There are a number of shorthands for regular expression.
• None add any expressive power and all can be compiled into normal regular expressions.
• A postfix plus sign (+) for at least one instance. a+ is (aa*).
• Brackets for alternation of larger sets of symbols. [abc] is a|b|c. [a-c] is also a|b|c.
• Negation of sets of symbols (which should work only if the set of symbols is finite). [^abc] is "anything except a, b, or c.". If sigma is "lowercase letters", [^abc] is shorthand for d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
• A postfix question mark for optional. a? is shorthand for (a|epsilon).
• A power represents repeated concatentation. an is shorthand for aa...a.
• Naming regular expressions and then using names within other regular expressions (provided the inclusion is not recursive).
• We can translate any regular expression with names to a normal regular expression by substituting the named expression for the name.
• Some of the shorthands available in Perl and such really extend the power of regular expressions, and are not permitted in standard regular expressions.
• The thing matched in an earlier part of the pattern.
• Not this regular expression
• ...

## Sample Regular Expressions

• All strings of a's and b's: (a|b)*
• Strings of a's and b's with exactly one b: a*ba*
• Strings of one or more a's: aa* (also a*a, also a*aa*)

## Limitations

• Regular expressions, while powerful, also have some limitations.
• In particular, there are relative simple languages you cannot describe with regular expressions.
• Here are a few:
• Strings of a's and b's with equal numbers of a's and b's.
• Palindromes

## Using Regular Expressions for Lexical Analysis

• How do we use regular expressions for lexical analysis?
• We start by writing a regular expression for each nonterminal in the language.
• For example:
```BEGIN: [Bb][Ee][Gg][Ii][Nn]
NUMBER: (+|-|epsilon)([1-9][0-9]*)|(0x[0-9]+)|([1-9].[0-9]+)|...
IDENTIFIER: [a-zA-Z_][a-zA-Z_0-9]*
OPEN: \(
CLOSE: \)
```
• You may have noted that we need to be careful to distingish the symbols used in building regular expressions from the similar characters.
• Most regular expression implementations provide some mechanism for distinguishing the two.
• Note that we have only discussed how to describe tokens, not how a typical lexical analyzer uses those descriptions.
• In particular, what happens if multiple regular expressions apply?
• Two general rules apply:
• Use the longest possible match.
• If two different patterns have an equally long match, use the one that appears first.
• Note that the first rule can make our matching less efficient, since we may have to look ahead a great deal to determine whether or not something matches a longer pattern and find that it doesn't.
• In practice, this problem rarely happens.
• In the next few classes, we'll learn how to turn regular expression specifications into programs that match regular expressions or tokenize.

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Wed May 5 11:46:58 2004.
The source to the document was last modified on Tue Jan 20 23:06:45 2004.
This document may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2004S/Outlines/outline.07.html`.

You may wish to validate this document's HTML ; ; Check with Bobby

Samuel A. Rebelsky, rebelsky@grinnell.edu