CS362 2002F : Class 04: Regular Expressions, Continued

Class 04: Regular Expressions, Continued

Held Friday, September 6, 2002

Summary

Today we continue our investigation of regular expressions.

Notes

We'll have class Monday in 2428 so that Marc C. can let his Calculus class play with our computers.

I've added links to the previous versions of this class to the links on most of the pages in the course, in case you want to see what we did the last two times I taught this course.

I finally remembered to update the class news. Sorry for the delay.

Overview

Simplifying Regular Expressions

Common Shorthands

Some Sample Regular Expressions

Limitations

Using Regular Expressions for Describing Tokens

Simplifying Regular Expressions

As you may have noted, we need a lot of parentheses in writing regular expressions.

At one level, these parentheses are necessary to resolve ambiguities.

For example, as R.S|T really ((R.S)|T) or (R.(S|T))?
Does it matter?
Similarly, is R.S.T really ((R.S).T) or (R.(S.T))?
Does it matter?
Is R.T* really ((R.T)*) or (R.(T*))?
Does it matter?

However, we can use the same techniques as we use with simple algebra to avoid these parentheses: Assign precedence to operations.

Kleene star has the highest precedence.
Concatenation has the next highest precedence.
Alternation has the lowest precedence.

We also tend to elide the concatenations symbol, just as we elide the multiplication symbol.

Some examples

ab* is (a.(b*)) and not ((a.b)*)
a|bc is (a|(b.c)) and not ((a|b).c)

When it's appropriate, we assume that the operations are left-associative.

RST is ((R.S).T)) and not (R.(S.T))

Regular Expression Shorthands

There are a number of shorthands for regular expression.

None add any expressive power and all can be compiled into normal regular expressions.

A postfix plus sign (+) for at least one instance. a+ is (a.a*).

Brackets for alternation of larger sets of symbols. [abc] is a|b|c. [a-c] is also a|b|c.

Negation of sets of symbols (which should work only if the set of symbols is finite). [^abc] is "anything except a, b, or c.". If sigma is "lowercase letters", [^abc] is shorthand for d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z

A postfix question mark for optional. a? is shorthand for (a|epsilon).

A power represents repeated concatentation. aⁿ is shorthand for aa...a.

Naming regular expressions and then using names within other regular expressions (provided the inclusion is not recursive).

We can translate any regular expression with names to a normal regular expression by substituting the named expression for the name.

Some of the shorthands available in Perl and such really extend the power of regular expressions, and are not permitted in standard regular expressions.

"The thing matched in an earlier part of the pattern."
"Not this regular expression"
...

Using Regular Expressions for Lexical Analysis

How do we use regular expressions for lexical analysis?

We start by writing a regular expression for each nonterminal in the language.

For example:

BEGIN: [Bb][Ee][Gg][Ii][Nn]
NUMBER: (+|-|epsilon)([1-9][0-9]*)|(0x[0-9]+)|([1-9].[0-9]+)|...
IDENTIFIER: [a-zA-Z_][a-zA-Z_0-9]*
OPEN: \(
CLOSE: \)

You may have noted that we need to be careful to distingish the symbols used in building regular expressions from the similar characters.

Most regular expression implementations provide some mechanism for distinguishing the two.

Note that we have only discussed how to describe tokens, not how a typical lexical analyzer uses those descriptions.

In particular, what happens if multiple regular expressions apply?

Two general rules apply:

Use the longest possible match.
If two different patterns have an equally long match, use the one that appears first.

Note that the first rule can make our matching less efficient, since we may have to "look ahead" a great deal to determine whether or not something matches a longer pattern and find that it doesn't.

In practice, this problem rarely happens.

In the next few classes, we'll learn how to turn regular expression specifications into programs that match regular expressions or tokenize.

History

Thursday, 29 August 2002

First version, based somewhat on outlines from CS362 2001S.

Friday, 6 September 2002

Filled in the body. Some is taken from the previous outline. Most is new.

Primary: [Skip To Body] [Front Door] [Current] [Glance] [Honesty] [Instructions] [Links] [Search] [Syllabus]
Sets: [EBoards] [Examples] [Exams] [Handouts] [Homework] [Labs] [Outlines] [Project] [Readings] [Reference]
ECA: [About] [Grades] [Quizzes] [Submit Work] [Change Password] [Reset Password]
Miscellaneous: [2001S] [98F] [SamR] [Glimmer Labs]

Disclaimer: I usually create these pages on the fly, which means that I rarely proofread them and they may contain bad grammar and incorrect details. It also means that I tend to update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This document was generated by Siteweaver on Fri Dec 6 10:37:53 2002.
The source to the document was last modified on Fri Sep 6 10:38:04 2002.
This document may be found at http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2002F/Outlines/outline.04.html.

You may wish to validate this document's HTML ; ; Check with Bobby

Glimmer Labs: The Grinnell Laboratory for Interactive Multimedia Experimentation & Research
glimmer@grinnell.edu