[Current] [Discussions] [Glance] [Honesty] [Instructions] [Links] [Search] [Syllabus] Primary
[Examples] [Exams] [Handouts] [Homework] [Labs] [Outlines] [Project] [Quizzes] [Readings] [Reference] Sets
[Blackboard] [98F] Links

Class 04: Introduction to Lexical Analysis and Regular Expressions

Back to Compilation Steps. On to Regular Expressions, Continued.

Held Monday, January 29, 2001

Summary

Today we begin our consideration of lexical analysis through a visit to the regular expressions that are typically used to denote lexemes.

Notes

I've finally set up the blackboard discussion board for posting your sample programs (and comments on them).
I probably won't be available this afternoon (day care was cancelled because of the snow and Michelle needs help with the kids).
Rable rousing: Grinnell entrepreneurs are violating basic Grinnell precepts.

Overview

The process of lexical analysis
Lexical analyzer generators
Regular expressions

An introduction to lexical analysis

Lexical analysis is traditionally the first ``real'' step in compilation. During lexical analysis, one identifies the simple tokens (also called lexemes) that make up a program.
What are these tokens? Things like identifiers, particular keywords, symbols, and such.
- Some tokens have associated semantic values, such as the name of an identifier or the value of an integer.
During the tokenizing phase, a compiler converts a sequence of characters (sometimes thought of as a sequence of bytes) to a sequence of tokens.
During this conversion we often eliminate ``unnecessary'' components, such as whitespace (spaces, tabs, newlines) and comments.
- However, there are reasons to preserve some related information, such as the line number (used in printing error messages).
There are many techniques for writing lexical analyzers (sometimes called lexers), including
- Hand coding (as we did in class)
- Relying on lexical analyzer generators
- Relying on parser generators
In general, it's best to use a lexical analyzer generator because it gives you the best performance.

Generating Lexical Analyzers

Lexical analyzer generators typically take as input a description of the various tokens in the language.
They output code that will read characters from some input and then generate tokens.
Often, lexers operate on a demand-driven basis: when you ask for the next token, they read enough input to get it.
They may also provide a peek operation.

Regular Expressions

Regular expressions can only describe relatively simple languages, but they permit very efficient membership tests and matching.
Provlem: need a language to describe regular expressions, but we're only now learning how to describe languages, which leads to an interesting problem in recursion.
- We'll use an informal description of regular expressions.
Like all languages, the language of regular expressions is based on an underlying alphabet, Sigma.
We'll denote the set of utterances (i.e., the langauge) a regular expression denotes by L(exp).
There is a special symbol, epsilon, not in Sigma.
- epsilon is written as the greek letter epsilon
- L(epsilon) is the set of the empty string, { "" }.
Any single symbol in Sigma is a regular expression.
- The regular expression for that symbol is that symbol
- For each symbol s in Sigma, L(s) = { s }.
The concatenation of any two regular expressions is a regular expression denoting the combinations of strings from those two regular expressions.
- If R and S are regular expressions, the concatenation of R and S is written (R.S)
- L((R.S)) = { concat(x,y) | x in L(R), y in L(S) }
The concatenation of any two strings is the two strings written in sequence with no intervening space.
- concat("hello", "world") = "helloworld"
epsilon is the identity string for concatenation
- concat(epsilon,s) = concat(s,epsilon) = s
We sometimes will want shorthand
- (R.R) may also be written as (R)²; ((R.R).(R)) as (R)³; and so on and so forth.
The alternation of any two regular expressions is a regular expression denoting strings in either language.
- If R and S are regular expressions, the alternation of R and S is written (R|S).
- L((R|S)) = L(R) union L(S)
If R is a regular expression, then the Kleene star of R is a regular expression.
- If R is a regular expression, the Kleene star of R is written (R*).
- L((R*)) = L(epsilon) union L((R)¹) union L((R)²) ...
Because the parentheses are cumbersome, you may remove them if the meaning is obvious without them. In addition, we often don't write the concatenation symbol.
- Kleene star has highest precedence; concatenation next; alternation least.
- ab* is (a.(b*)) and not ((a.b)*)
- a|bc is (a|(b.c)) and not ((a|b).c)
There are a number of shorthands for regular expression. None add any expressive power and all can be ``compiled'' into normal regular expressions. These include
- A postfix plus sign (+) for ``at least one instance''. a+ is (a.a*).
- Brackets for alternation of larger sets. [abc] is a|b|c. [a-c] is also a|b|c.
- A postfix question mark for ``optional''. a? is shorthand for (a|epsilon).
What are some sample regular expressions?
- All strings of a's and b's: (a+b)*
- Strings of a's and b's with exactly one b: a*ba*
- Strings of one or more a's: aa* (also a*a, also a*aa*)

History

Monday, 22 January 2001

Created as a blank outline.

Monday, 29 January 2001

Filled in the details.

Back to Compilation Steps. On to Regular Expressions, Continued.

Disclaimer: I usually create these pages on the fly. This means that they are rarely proofread and may contain bad grammar and incorrect details. It also means that I may update them regularly (see the history for more details). Feel free to contact me with any suggestions for changes.

This page was generated by Siteweaver on Mon Apr 30 10:51:45 2001.
This page may be found at http://www.cs.grinnell.edu/~rebelsky/Courses/CS362/2001S/outline.04.html.
You may validate this page's HTML.
The source was last modified Mon Jan 29 10:49:54 2001.