**Primary:**
[Skip To Body]
[Front Door]
[Current]
[Glance]
[Honesty]
[Instructions]
[Links]
[Search]
[Syllabus]

**Sets:**
[EBoards]
[Examples]
[Exams]
[Handouts]
[Homework]
[Labs]
[Outlines]
[Project]
[Readings]
[Reference]

**ECA:**
[About]
[Grades]
[Quizzes]
[Submit Work]
[Change Password]
[Reset Password]

**Miscellaneous:**
[2001S]
[98F]
[SamR]
[Glimmer Labs]

**Held** Friday, September 6, 2002

**Summary**

Today we continue our investigation of regular expressions.

**Notes**

- We'll have class Monday in 2428 so that Marc C. can let his Calculus class play with our computers.
- I've added links to the previous versions of this class to the links on most of the pages in the course, in case you want to see what we did the last two times I taught this course.
- I finally remembered to update the class news. Sorry for the delay.

**Overview**

- Simplifying Regular Expressions
- Common Shorthands
- Some Sample Regular Expressions
- Limitations
- Using Regular Expressions for Describing Tokens

- As you may have noted, we need a lot of parentheses in writing regular expressions.
- At one level, these parentheses are necessary to resolve ambiguities.
- For example, as R.S|T really ((R.S)|T) or (R.(S|T))?
- Does it matter?
- Similarly, is R.S.T really ((R.S).T) or (R.(S.T))?
- Does it matter?
- Is R.T* really ((R.T)*) or (R.(T*))?
- Does it matter?

- However, we can use the same techniques as we use with simple algebra
to avoid these parentheses: Assign precedence to operations.
- Kleene star has the highest precedence.
- Concatenation has the next highest precedence.
- Alternation has the lowest precedence.

- We also tend to elide the concatenations symbol, just as we elide the multiplication symbol.
- Some examples
- ab* is (a.(b*)) and not ((a.b)*)
- a|bc is (a|(b.c)) and not ((a|b).c)

- When it's appropriate, we assume that the operations are left-associative.
- RST is ((R.S).T)) and not (R.(S.T))

- There are a number of shorthands for regular expression.
- None add
any expressive power and all can be
compiled

into normal regular expressions. - A postfix plus sign (+) for
at least one instance

. a+ is (a.a*). - Brackets for alternation of larger sets of symbols. [abc] is a|b|c. [a-c] is also a|b|c.
- Negation of sets of symbols (which should work only if the set of symbols is finite). [^abc] is "anything except a, b, or c.". If sigma is "lowercase letters", [^abc] is shorthand for d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
- A postfix question mark for
optional

. a? is shorthand for (a|*epsilon*). - A power represents repeated concatentation. a
^{n}is shorthand for aa...a. - Naming regular expressions and then using names within other regular
expressions (provided the inclusion is not recursive).
- We can translate any regular expression with names to a normal regular expression by substituting the named expression for the name.

- Some of the shorthands available in Perl and such really extend the
power of regular expressions, and are not permitted in standard
regular expressions.
- "The thing matched in an earlier part of the pattern."
- "Not this regular expression"
- ...

- All strings of a's and b's: (a+b)*
- Strings of a's and b's with exactly one b: a*ba*
- Strings of one or more a's: aa* (also a*a, also a*aa*)

- Regular expressions, while powerful, also have some limitations.
- In particular, there are relative simple languages you cannot describe with regular expressions.
- Here are a few:
- Strings of a's and b's with equal numbers of a's and b's.
- Palindromes

- How do we use regular expressions for lexical analysis?
- We start by writing a regular expression for each nonterminal in the language.
- For example:
BEGIN: [Bb][Ee][Gg][Ii][Nn] NUMBER: (+|-|epsilon)([1-9][0-9]*)|(0x[0-9]+)|([1-9].[0-9]+)|... IDENTIFIER: [a-zA-Z_][a-zA-Z_0-9]* OPEN: \( CLOSE: \)

- You may have noted that we need to be careful to distingish the symbols
used in building regular expressions from the similar characters.
- Most regular expression implementations provide some mechanism for distinguishing the two.

- Note that we have only discussed how to describe tokens, not how a
typical lexical analyzer uses those descriptions.
- In particular, what happens if multiple regular expressions apply?

- Two general rules apply:
- Use the longest possible match.
- If two different patterns have an equally long match, use the one that appears first.

- Note that the first rule can make our matching less efficient, since we
may have to "look ahead" a great deal to determine whether or not
something matches a longer pattern and find that it doesn't.
- In practice, this problem rarely happens.

- In the next few classes, we'll learn how to turn regular expression specifications into programs that match regular expressions or tokenize.

Thursday, 29 August 2002

- First version, based somewhat on outlines from CS362 2001S.

Friday, 6 September 2002

- Filled in the body. Some is taken from the previous outline. Most is new.

**Primary:**
[Skip To Body]
[Front Door]
[Current]
[Glance]
[Honesty]
[Instructions]
[Links]
[Search]
[Syllabus]

**Sets:**
[EBoards]
[Examples]
[Exams]
[Handouts]
[Homework]
[Labs]
[Outlines]
[Project]
[Readings]
[Reference]

**ECA:**
[About]
[Grades]
[Quizzes]
[Submit Work]
[Change Password]
[Reset Password]

**Miscellaneous:**
[2001S]
[98F]
[SamR]
[Glimmer Labs]

**Disclaimer**:
I usually create these pages on the fly

, which means that I rarely
proofread them and they may contain bad grammar and incorrect details.
It also means that I tend to update them regularly (see the history for
more details). Feel free to contact me with any suggestions for changes.

This document was generated by
Siteweaver on Fri Dec 6 10:37:53 2002.

The source to the document was last modified on Fri Sep 6 10:38:04 2002.

This document may be found at `http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2002F/Outlines/outline.04.html`

.

You may wish to validate this document's HTML ; ; Check with Bobby

Glimmer Labs: The Grinnell Laboratory for Interactive Multimedia Experimentation & Researchglimmer@grinnell.edu